CloudWatch and Alerts
Introduction¶
This documentation outlines the workflow for collecting metrics from your internal systems, including data obtained from Temporal workflows, AWS services, and Kubernetes. The collected metrics are visualized using AWS CloudWatch Dashboards. Additionally, we cover setting up CloudWatch alarms, integrating them with Opsgenie for alerting, and how critical alerts are propagated to Slack channels and phone calls.
Note
- Temporal Repo : Pheonix System Status
- Branch : Master
- Visit Dashboard
Temporal Workflows and Metrics Collection¶
Temporal Workflows and Activities to collect metrics¶
- Temporal is a workflow orchestration framework; workflows are implemented here to perform periodic metric collection.
- Temporal workflows can execute tasks that collect metrics from your internal system.
- Metrics can include CPU usage, memory utilization, response times, and more.
- Ensure appropriate authentication and permissions for accessing internal systems.
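To make the shape of such a check concrete, here is a minimal sketch of a website health check in plain Python. This is illustrative only: the Temporal `@activity.defn` decorator is omitted so the function stands alone, and the timeout value is an assumption.

```python
import urllib.request
import urllib.error


def check_website(url: str, timeout: float = 5.0) -> int:
    """Return 1 if the site answers with an HTTP success status, else 0.

    Mirrors the Check_website activity's contract: emit 1 when up, 0 when down.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 1 if resp.status < 400 else 0
    except (urllib.error.URLError, OSError):
        # DNS failure, connection refused, timeout, or HTTP error: site is down.
        return 0
```

The returned 0/1 value is what ultimately gets published to CloudWatch as the `Website Status` metric.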
Activities in the workflow:¶
| Check Name | Description | Metrics Emitted | Unit | Namespace |
|---|---|---|---|---|
| Check_website | Checks the status of the website. If down, emits 0; if up, emits 1. | Website Status | Count | Grepsr/CheckWebsite |
| Check_webapp_v2 | Checks the status of the web application. If down, emits 0; if up, emits 1. | Webapp Status | Count | Grepsr/CheckWebapp |
| Check_api | Checks the status of the API. If down, emits 0; if up, emits 1. | API Status | Count | Grepsr/CheckAPI |
| Check_crawler_api | Checks the status of the crawler API. If down, emits 0; if up, emits 1. | Crawler API Status | Count | Grepsr/CheckCrawlerAPI |
| Check_influxdb | Checks the status of InfluxDB. If down, emits 0; if up, emits 1. | InfluxDB Health | Count | Grepsr/InfluxDB |
| Check_export_queue | Emits metrics for pending, queuing, and mismatched exports. | PendingExports, QueuingExportsDelay, WaitingMismatchExports | Count | Grepsr/Data/Processing |
| Check_data_delivery_history | Emits metrics for undelivered and missed exports. | UndeliveredExports, MissedExports | Count | Grepsr/Data/Processing |
| Check_k8s_tasks | Emits metrics for pending, active, and queued tasks in Kubernetes. | PendingTasks, ActiveTasks, ActiveProcessingScheduled, ActiveProcessingTaskQueued | Count | Grepsr/Crawler/Infrastructure/K8s |
| Check_archived_histories | Emits metrics for archived and non-archived histories. | Archived, NotArchived | Count | Grepsr/Data/Archival |
| Check_proxy_usage | Emits metrics for proxy usage costs across various zones. | OverallCost, OLA_ZONE-grepsr, HOLA_ZONE_dc_us, HOLA_ZONE_dc_shared, HOLA_ZONE_unblocker, HOLA_ZONEsov_walmart_com_placement, HOLA_ZONE_google_ads, HOLA_ZONE_walmart_com_evd_assortment | Count | Grepsr/Proxy/Cost |
| Check_selenium_slots | Emits metrics for total and free Selenium slots. | SeleniumTotalSlots, SeleniumFreeSlots | Count | Grepsr/Selenium |
| Check_workflows | Monitors workflows in the Temporal namespace and emits metrics for running, failed, and timed-out workflows. | RunningWorkflows, FailedWorkflows, TimeoutWorkflows | Count | Grepsr/TemporalStatus |
| Check_workflows_grepsr_crawlers | Monitors workflows in the Temporal namespace for crawlers and emits metrics for running, failed, and timed-out workflows. | RunningWorkflows, FailedWorkflows, TimeoutWorkflows | Count | Grepsr/TemporalStatusCrawlers |
| Check_delayed_runs | Emits metrics for delayed scheduled runs. | DelayedScheduledCount | Count | Grepsr/ScheduleRun |
| Check_parallel_running_reports | Emits metrics for parallel running reports and total parallel runs. | ParallelRunningReports, TotalParallelRuns | Count | Grepsr/Data/Processing |
Workflow Name: CloudWatchMetricWorkflow
These activities run every 10 minutes (the workflow's max-sleep-time-minutes is set to 10). Within one workflow execution, this cycle repeats 10 times (workflow.max-cycle-count is set to 10). When the cycle completes, the workflow uses Continue-As-New, so a fresh execution of the same workflow starts and repeats the process.
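The cycle above can be modeled in a few lines of plain Python. This is a sketch of the control flow only, not Temporal SDK code; `run_activities`, `sleep`, and `continue_as_new` stand in for the corresponding SDK calls.

```python
MAX_CYCLE_COUNT = 10     # workflow.max-cycle-count
MAX_SLEEP_MINUTES = 10   # max-sleep-time-minutes


def run_cycles(run_activities, sleep, continue_as_new):
    """Model of CloudWatchMetricWorkflow's loop: run every check,
    sleep 10 minutes, repeat 10 times, then hand off via Continue-As-New."""
    for _ in range(MAX_CYCLE_COUNT):
        run_activities()                 # execute all Check_* activities
        sleep(MAX_SLEEP_MINUTES * 60)    # wait until the next collection tick
    continue_as_new()                    # start a fresh execution, same logic
```

Continue-As-New keeps each workflow execution's event history bounded, which is why the loop restarts rather than running forever.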
Publishing Metrics to CloudWatch¶
- Temporal workflows can publish collected metrics to AWS CloudWatch using the CloudWatch PutMetricData API.
- Organize metrics with meaningful namespaces, dimensions, and metric names.
- Configure appropriate timestamping for the metrics.
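As an illustration, the `MetricData` entries that `PutMetricData` expects can be assembled as plain dicts. The helper name and dimension values below are illustrative; in practice the resulting list is passed to `boto3.client("cloudwatch").put_metric_data(Namespace=..., MetricData=[...])`.

```python
from datetime import datetime, timezone


def build_metric_datum(name: str, value: float, dimensions: dict) -> dict:
    """Assemble one MetricDatum for the CloudWatch PutMetricData API."""
    return {
        "MetricName": name,
        "Value": value,
        "Unit": "Count",
        # Explicit UTC timestamp so the datapoint lands at the right time.
        "Timestamp": datetime.now(timezone.utc),
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
    }


datum = build_metric_datum("Website Status", 1, {"Environment": "production"})
```

The `Environment` dimension here is an example; choose dimensions that match how you want to slice the metric in dashboards and alarms.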
Metrics from AWS Systems and Kubernetes¶
- Set up CloudWatch agents on AWS instances to automatically collect system-level metrics.
- Configure metric forwarding to CloudWatch using the CloudWatch agent or custom scripts.
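A minimal CloudWatch agent configuration that forwards CPU and memory metrics might look like the following. The namespace and the selection of measurements are assumptions; adjust them to your hosts.

```json
{
  "metrics": {
    "namespace": "Grepsr/Infra",
    "metrics_collected": {
      "cpu": {
        "measurement": ["usage_user", "usage_idle"]
      },
      "mem": {
        "measurement": ["mem_used_percent"]
      }
    }
  }
}
```

The agent reads this file (typically via `amazon-cloudwatch-agent-ctl -a fetch-config`) and publishes the listed measurements to the given namespace on a fixed interval.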
CloudWatch Dashboard and Graphs¶
- Create an AWS CloudWatch Dashboard to visualize the collected metrics.
- Design widgets to display various metrics using line charts, bar graphs, and more.
- Organize the dashboard layout to provide an overview of system health and performance.
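A dashboard body is a JSON document of positioned widgets. The sketch below builds one with the standard library only; the region, titles, and widget layout are illustrative, and the resulting string would be passed to `boto3.client("cloudwatch").put_dashboard(DashboardName=..., DashboardBody=...)`.

```python
import json


def line_widget(title, namespace, metric, x=0, y=0, width=12, height=6):
    """One time-series widget for a CloudWatch dashboard body."""
    return {
        "type": "metric",
        "x": x, "y": y, "width": width, "height": height,
        "properties": {
            "title": title,
            "view": "timeSeries",
            "region": "us-east-1",  # assumption: set to your region
            "metrics": [[namespace, metric]],
        },
    }


dashboard_body = json.dumps({
    "widgets": [
        line_widget("Website up/down", "Grepsr/CheckWebsite", "Website Status"),
        line_widget("Pending exports", "Grepsr/Data/Processing", "PendingExports", y=6),
    ]
})
```

Widgets are laid out on a 24-column grid via `x`/`y`/`width`/`height`, which is how the dashboard overview is organized.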
Setting Up Alarms and Thresholds¶
Creating CloudWatch Alarms¶
- Define CloudWatch Alarms based on selected metrics.
- Set threshold values that trigger alarms when metric values breach predefined limits.
- Configure alarm actions to perform specific tasks when an alarm is triggered.
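As a sketch, the arguments for one such alarm can be assembled like this. The alarm name, thresholds, and SNS topic are assumptions for illustration; the dict would be passed to `boto3.client("cloudwatch").put_metric_alarm(**kwargs)`.

```python
def website_down_alarm(sns_topic_arn: str) -> dict:
    """Kwargs for CloudWatch PutMetricAlarm: fire when the Check_website
    metric reports 0 (site down) for two consecutive periods."""
    return {
        "AlarmName": "website-down",
        "Namespace": "Grepsr/CheckWebsite",
        "MetricName": "Website Status",
        "Statistic": "Minimum",
        "Period": 600,                    # matches the 10-minute check cadence
        "EvaluationPeriods": 2,
        "Threshold": 1,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "breaching",  # no data = treat the site as down
        "AlarmActions": [sns_topic_arn],  # e.g. an Opsgenie-integrated SNS topic
    }


alarm = website_down_alarm("arn:aws:sns:us-east-1:123456789012:example-alerts")
```

Pointing `AlarmActions` at an SNS topic is one common way to hand alarms to a downstream incident-management tool.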
Integrating with Opsgenie¶
- Use the Amazon CloudWatch Alarms integration with Opsgenie for incident management.
- Set up alert policies in Opsgenie to manage how alerts are handled.
- Define escalation policies and routing rules for different types of incidents.
Alerting and Notification¶
Sending Alerts to Slack¶
- Configure Opsgenie to send alerts to Slack channels.
- Define the "Critical system notifications" Slack channel as a target for critical alerts.
Phone Call Alerts for Critical Incidents¶
- Utilize Opsgenie's capabilities to trigger phone call alerts for critical incidents.
- Configure on-call schedules and phone numbers to receive calls for urgent alerts.
Conclusion¶
This documentation has covered the workflow for collecting metrics from internal systems, AWS resources, and Kubernetes clusters. The collected metrics are visualized through AWS CloudWatch Dashboards, and alerting mechanisms are established using CloudWatch Alarms and Opsgenie integration. This setup ensures that critical incidents are promptly identified, communicated, and managed to maintain the overall health and performance of your system.