CloudWatch and Alerts
Introduction¶
This documentation outlines the workflow for collecting metrics from your internal systems, including data obtained from Temporal workflows, AWS services, and Kubernetes. The collected metrics are visualized using AWS CloudWatch Dashboards. Additionally, we cover setting up CloudWatch alarms, integrating them with Opsgenie for alerting, and how critical alerts are propagated to Slack channels and phone calls.
Note
- Temporal Repo : Pheonix System Status
- Branch : Master
- Visit Dashboard
Temporal Workflows and Metrics Collection¶
Temporal Workflows and Activities to collect metrics¶
- Temporal is a workflow orchestration framework; workflows are implemented here to perform periodic metric collection.
- Temporal workflows can execute tasks that collect metrics from your internal system.
- Metrics can include CPU usage, memory utilization, response times, and more.
- Ensure appropriate authentication and permissions for accessing internal systems.
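To make the shape of such a check concrete, here is a minimal sketch of a website health check in plain Python. This is illustrative only: the Temporal `@activity.defn` decorator is omitted so the function stands alone, and the timeout value is an assumption.

```python
import urllib.request
import urllib.error


def check_website(url: str, timeout: float = 5.0) -> int:
    """Return 1 if the site answers with an HTTP success status, else 0.

    Mirrors the Check_website activity's contract: emit 1 when up, 0 when down.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 1 if resp.status < 400 else 0
    except (urllib.error.URLError, OSError):
        # DNS failure, connection refused, timeout, or HTTP error: site is down.
        return 0
```

The returned 0/1 value is what ultimately gets published to CloudWatch as the `Website Status` metric.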
Activities in the workflow:¶
| Check Name | Description | Metrics Emitted | Unit | Namespace |
|---|---|---|---|---|
| Check_website | Checks the status of the website. If down, emits 0; if up, emits 1. | Website Status | Count | Grepsr/CheckWebsite |
| Check_webapp_v2 | Checks the status of the web application. If down, emits 0; if up, emits 1. | Webapp Status | Count | Grepsr/CheckWebapp |
| Check_api | Checks the status of the API. If down, emits 0; if up, emits 1. | API Status | Count | Grepsr/CheckAPI |
| Check_crawler_api | Checks the status of the crawler API. If down, emits 0; if up, emits 1. | Crawler API Status | Count | Grepsr/CheckCrawlerAPI |
| Check_influxdb | Checks the status of InfluxDB. If down, emits 0; if up, emits 1. | InfluxDB Health | Count | Grepsr/InfluxDB |
| Check_export_queue | Emits metrics for pending, queuing, and mismatched exports. | PendingExports, QueuingExportsDelay, WaitingMismatchExports | Count | Grepsr/Data/Processing |
| Check_data_delivery_history | Emits metrics for undelivered and missed exports. | UndeliveredExports, MissedExports | Count | Grepsr/Data/Processing |
| Check_k8s_tasks | Emits metrics for pending, active, and queued tasks in Kubernetes. | PendingTasks, ActiveTasks, ActiveProcessingScheduled, ActiveProcessingTaskQueued | Count | Grepsr/Crawler/Infrastructure/K8s |
| Check_archived_histories | Emits metrics for archived and non-archived histories. | Archived, NotArchived | Count | Grepsr/Data/Archival |
| Check_proxy_usage | Emits metrics for proxy usage costs across various zones. | OverallCost, OLA_ZONE-grepsr, HOLA_ZONE_dc_us, HOLA_ZONE_dc_shared, HOLA_ZONE_unblocker, HOLA_ZONEsov_walmart_com_placement, HOLA_ZONE_google_ads, HOLA_ZONE_walmart_com_evd_assortment | Count | Grepsr/Proxy/Cost |
| Check_selenium_slots | Emits metrics for total and free Selenium slots. | SeleniumTotalSlots, SeleniumFreeSlots | Count | Grepsr/Selenium |
| Check_workflows | Monitors workflows in the Temporal namespace and emits metrics for running, failed, and timed-out workflows. | RunningWorkflows, FailedWorkflows, TimeoutWorkflows | Count | Grepsr/TemporalStatus |
| Check_workflows_grepsr_crawlers | Monitors workflows in the Temporal namespace for crawlers and emits metrics for running, failed, and timed-out workflows. | RunningWorkflows, FailedWorkflows, TimeoutWorkflows | Count | Grepsr/TemporalStatusCrawlers |
| Check_delayed_runs | Emits metrics for delayed scheduled runs. | DelayedScheduledCount | Count | Grepsr/ScheduleRun |
| Check_parallel_running_reports | Emits metrics for parallel running reports and total parallel runs. | ParallelRunningReports, TotalParallelRuns | Count | Grepsr/Data/Processing |
Workflow Name: CloudWatchMetricWorkflow
These activities run every 10 minutes (the workflow's max-sleep-time-minutes is set to 10). Within one workflow execution, this cycle repeats 10 times (workflow.max-cycle-count is set to 10). When the cycle completes, the workflow uses Continue-As-New, so a fresh execution of the same workflow starts and repeats the process.
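The cycle above can be modeled in a few lines of plain Python. This is a sketch of the control flow only, not Temporal SDK code; `run_activities`, `sleep`, and `continue_as_new` stand in for the corresponding SDK calls.

```python
MAX_CYCLE_COUNT = 10     # workflow.max-cycle-count
MAX_SLEEP_MINUTES = 10   # max-sleep-time-minutes


def run_cycles(run_activities, sleep, continue_as_new):
    """Model of CloudWatchMetricWorkflow's loop: run every check,
    sleep 10 minutes, repeat 10 times, then hand off via Continue-As-New."""
    for _ in range(MAX_CYCLE_COUNT):
        run_activities()                 # execute all Check_* activities
        sleep(MAX_SLEEP_MINUTES * 60)    # wait until the next collection tick
    continue_as_new()                    # start a fresh execution, same logic
```

Continue-As-New keeps each workflow execution's event history bounded, which is why the loop restarts rather than running forever.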
Publishing Metrics to CloudWatch¶
- Temporal workflows can publish collected metrics to AWS CloudWatch using the CloudWatch PutMetricData API.
- Organize metrics with meaningful namespaces, dimensions, and metric names.
- Configure appropriate timestamping for the metrics.
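As an illustration, the `MetricData` entries that `PutMetricData` expects can be assembled as plain dicts. The helper name and dimension values below are illustrative; in practice the resulting list is passed to `boto3.client("cloudwatch").put_metric_data(Namespace=..., MetricData=[...])`.

```python
from datetime import datetime, timezone


def build_metric_datum(name: str, value: float, dimensions: dict) -> dict:
    """Assemble one MetricDatum for the CloudWatch PutMetricData API."""
    return {
        "MetricName": name,
        "Value": value,
        "Unit": "Count",
        # Explicit UTC timestamp so the datapoint lands at the right time.
        "Timestamp": datetime.now(timezone.utc),
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
    }


datum = build_metric_datum("Website Status", 1, {"Environment": "production"})
```

The `Environment` dimension here is an example; choose dimensions that match how you want to slice the metric in dashboards and alarms.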
Metrics from AWS Systems and Kubernetes¶
- Set up CloudWatch agents on AWS instances to automatically collect system-level metrics.
- Configure metric forwarding to CloudWatch using the CloudWatch agent or custom scripts.
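A minimal CloudWatch agent configuration that forwards CPU and memory metrics might look like the following. The namespace and the selection of measurements are assumptions; adjust them to your hosts.

```json
{
  "metrics": {
    "namespace": "Grepsr/Infra",
    "metrics_collected": {
      "cpu": {
        "measurement": ["usage_user", "usage_idle"]
      },
      "mem": {
        "measurement": ["mem_used_percent"]
      }
    }
  }
}
```

The agent reads this file (typically via `amazon-cloudwatch-agent-ctl -a fetch-config`) and publishes the listed measurements to the given namespace on a fixed interval.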
CloudWatch Dashboard and Graphs¶
- Create an AWS CloudWatch Dashboard to visualize the collected metrics.
- Design widgets to display various metrics using line charts, bar graphs, and more.
- Organize the dashboard layout to provide an overview of system health and performance.
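A dashboard body is a JSON document of positioned widgets. The sketch below builds one with the standard library only; the region, titles, and widget layout are illustrative, and the resulting string would be passed to `boto3.client("cloudwatch").put_dashboard(DashboardName=..., DashboardBody=...)`.

```python
import json


def line_widget(title, namespace, metric, x=0, y=0, width=12, height=6):
    """One time-series widget for a CloudWatch dashboard body."""
    return {
        "type": "metric",
        "x": x, "y": y, "width": width, "height": height,
        "properties": {
            "title": title,
            "view": "timeSeries",
            "region": "us-east-1",  # assumption: set to your region
            "metrics": [[namespace, metric]],
        },
    }


dashboard_body = json.dumps({
    "widgets": [
        line_widget("Website up/down", "Grepsr/CheckWebsite", "Website Status"),
        line_widget("Pending exports", "Grepsr/Data/Processing", "PendingExports", y=6),
    ]
})
```

Widgets are laid out on a 24-column grid via `x`/`y`/`width`/`height`, which is how the dashboard overview is organized.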
Setting Up Alarms and Thresholds¶
Creating CloudWatch Alarms¶
- Define CloudWatch Alarms based on selected metrics.
- Set threshold values that trigger alarms when metric values breach predefined limits.
- Configure alarm actions to perform specific tasks when an alarm is triggered.
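As a sketch, the arguments for one such alarm can be assembled like this. The alarm name, thresholds, and SNS topic are assumptions for illustration; the dict would be passed to `boto3.client("cloudwatch").put_metric_alarm(**kwargs)`.

```python
def website_down_alarm(sns_topic_arn: str) -> dict:
    """Kwargs for CloudWatch PutMetricAlarm: fire when the Check_website
    metric reports 0 (site down) for two consecutive periods."""
    return {
        "AlarmName": "website-down",
        "Namespace": "Grepsr/CheckWebsite",
        "MetricName": "Website Status",
        "Statistic": "Minimum",
        "Period": 600,                    # matches the 10-minute check cadence
        "EvaluationPeriods": 2,
        "Threshold": 1,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "breaching",  # no data = treat the site as down
        "AlarmActions": [sns_topic_arn],  # e.g. an Opsgenie-integrated SNS topic
    }


alarm = website_down_alarm("arn:aws:sns:us-east-1:123456789012:example-alerts")
```

Pointing `AlarmActions` at an SNS topic is one common way to hand alarms to a downstream incident-management tool.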
Integrating with Opsgenie¶
- Use the Amazon CloudWatch Alarms integration with Opsgenie for incident management.
- Set up alert policies in Opsgenie to manage how alerts are handled.
- Define escalation policies and routing rules for different types of incidents.
Alerting and Notification¶
Sending Alerts to Slack¶
- Configure Opsgenie to send alerts to Slack channels.
- Define the "Critical system notifications" Slack channel as a target for critical alerts.
Phone Call Alerts for Critical Incidents¶
- Utilize Opsgenie's capabilities to trigger phone call alerts for critical incidents.
- Configure on-call schedules and phone numbers to receive calls for urgent alerts.
Conclusion¶
This documentation has covered the workflow for collecting metrics from internal systems, AWS resources, and Kubernetes clusters. The collected metrics are visualized through AWS CloudWatch Dashboards, and alerting mechanisms are established using CloudWatch Alarms and Opsgenie integration. This setup ensures that critical incidents are promptly identified, communicated, and managed to maintain the overall health and performance of your system.