
CloudWatch and Alerts

Introduction

This documentation outlines the workflow for collecting metrics from internal systems, including data obtained from Temporal workflows, AWS systems, and Kubernetes. The collected metrics are visualized in AWS CloudWatch Dashboards. It also covers setting up CloudWatch alarms, integrating them with Opsgenie for alerting, and propagating critical alerts to Slack channels and phone calls.

Temporal Workflows and Metrics Collection

Temporal Workflows and Activities to Collect Metrics

  • Temporal is a workflow orchestration framework. Implement workflows to perform periodic metric collection.
  • Temporal workflows can execute tasks that collect metrics from your internal system.
  • Metrics can include CPU usage, memory utilization, response times, and more.
  • Ensure appropriate authentication and permissions for accessing internal systems.

Activities in the workflow:

| Check Name | Description | Metrics Emitted | Unit | Namespace |
|---|---|---|---|---|
| Check_website | Checks the status of the website. If down, emits 0; if up, emits 1. | Website Status | Count | Grepsr/CheckWebsite |
| Check_webapp_v2 | Checks the status of the web application. If down, emits 0; if up, emits 1. | Webapp Status | Count | Grepsr/CheckWebapp |
| Check_api | Checks the status of the API. If down, emits 0; if up, emits 1. | API Status | Count | Grepsr/CheckAPI |
| Check_crawler_api | Checks the status of the crawler API. If down, emits 0; if up, emits 1. | Crawler API Status | Count | Grepsr/CheckCrawlerAPI |
| Check_influxdb | Checks the status of InfluxDB. If down, emits 0; if up, emits 1. | InfluxDB Health | Count | Grepsr/InfluxDB |
| Check_export_queue | Emits metrics for pending, queuing, and mismatched exports. | PendingExports, QueuingExportsDelay, WaitingMismatchExports | Count | Grepsr/Data/Processing |
| Check_data_delivery_history | Emits metrics for undelivered and missed exports. | UndeliveredExports, MissedExports | Count | Grepsr/Data/Processing |
| Check_k8s_tasks | Emits metrics for pending, active, and queued tasks in Kubernetes. | PendingTasks, ActiveTasks, ActiveProcessingScheduled, ActiveProcessingTaskQueued | Count | Grepsr/Crawler/Infrastructure/K8s |
| Check_archived_histories | Emits metrics for archived and non-archived histories. | Archived, NotArchived | Count | Grepsr/Data/Archival |
| Check_proxy_usage | Emits metrics for proxy usage costs across various zones. | OverallCost, OLA_ZONE-grepsr, HOLA_ZONE_dc_us, HOLA_ZONE_dc_shared, HOLA_ZONE_unblocker, HOLA_ZONEsov_walmart_com_placement, HOLA_ZONE_google_ads, HOLA_ZONE_walmart_com_evd_assortment | Count | Grepsr/Proxy/Cost |
| Check_selenium_slots | Emits metrics for total and free Selenium slots. | SeleniumTotalSlots, SeleniumFreeSlots | Count | Grepsr/Selenium |
| Check_workflows | Monitors workflows in the Temporal namespace and emits metrics for running, failed, and timed-out workflows. | RunningWorkflows, FailedWorkflows, TimeoutWorkflows | Count | Grepsr/TemporalStatus |
| Check_workflows_grepsr_crawlers | Monitors workflows in the Temporal namespace for crawlers and emits metrics for running, failed, and timed-out workflows. | RunningWorkflows, FailedWorkflows, TimeoutWorkflows | Count | Grepsr/TemporalStatusCrawlers |
| Check_delayed_runs | Emits metrics for delayed scheduled runs. | DelayedScheduledCount | Count | Grepsr/ScheduleRun |
| Check_parallel_running_reports | Emits metrics for parallel running reports and total parallel runs. | ParallelRunningReports, TotalParallelRuns | Count | Grepsr/Data/Processing |

Workflow Name: CloudWatchMetricWorkflow

These activities run every 10 minutes, because the workflow's max-sleep-time-minutes setting is 10 minutes.

Within a single workflow run, the activities repeat 10 times (workflow.max-cycle-count is set to 10). Once those cycles complete, the workflow continues as new: a fresh run of the same workflow starts and repeats the same cycles.
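The cycle behaviour described above can be sketched in plain Python. This is an illustration of the loop only, not the actual CloudWatchMetricWorkflow code: the function name run_metric_cycles, the injected sleep callable, and the choice to sleep after every cycle (including the last) are assumptions made for this example.

```python
from typing import Callable, Iterable

# Values taken from the settings described above.
MAX_CYCLE_COUNT = 10          # workflow.max-cycle-count
MAX_SLEEP_TIME_MINUTES = 10   # max-sleep-time-minutes


def run_metric_cycles(
    activities: Iterable[Callable[[], None]],
    sleep: Callable[[float], None],
    max_cycle_count: int = MAX_CYCLE_COUNT,
    sleep_minutes: int = MAX_SLEEP_TIME_MINUTES,
) -> str:
    """Run every activity once per cycle, pause between cycles,
    then report that the run should continue as new.

    `sleep` is injected (hypothetical design choice) so the loop can be
    exercised without actually waiting.
    """
    activities = list(activities)
    for _ in range(max_cycle_count):
        for activity in activities:
            activity()
        # Assumption: the workflow sleeps after each cycle, including the last.
        sleep(sleep_minutes * 60)
    # A real Temporal workflow would request continue-as-new here.
    return "continue_as_new"
```

With the default settings, one run covers 10 cycles of 10 minutes each, so under these assumptions a run lasts roughly 100 minutes plus activity time before it continues as new.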

Publishing Metrics to CloudWatch

  • Temporal workflows can publish collected metrics to AWS CloudWatch using the CloudWatch PutMetricData API.
  • Organize metrics with meaningful namespaces, dimensions, and metric names.
  • Configure appropriate timestamping for the metrics.
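As a minimal sketch of the points above, the following builds a PutMetricData request for one of the up/down checks and hands it to a CloudWatch client. The helper names build_status_metric and publish are hypothetical, and using "Website Status" as the literal metric name is an assumption based on the activities table. The client is passed in, so any object exposing put_metric_data (such as a boto3 CloudWatch client) would work.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def build_status_metric(
    namespace: str,
    metric_name: str,
    is_up: bool,
    timestamp: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Build PutMetricData keyword arguments for an up/down check.

    Emits 1 when the target is up and 0 when it is down.
    """
    return {
        "Namespace": namespace,
        "MetricData": [
            {
                "MetricName": metric_name,
                # Explicit UTC timestamp for the data point.
                "Timestamp": timestamp or datetime.now(timezone.utc),
                "Value": 1.0 if is_up else 0.0,
                "Unit": "Count",
            }
        ],
    }


def publish(client: Any, payload: Dict[str, Any]) -> None:
    """Send the payload through a CloudWatch client's put_metric_data call."""
    client.put_metric_data(**payload)


# Usage (requires boto3 and AWS credentials; not executed here):
#   import boto3
#   payload = build_status_metric("Grepsr/CheckWebsite", "Website Status", is_up=True)
#   publish(boto3.client("cloudwatch"), payload)
```

Keeping payload construction separate from the API call makes the metric shape easy to check without AWS access.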

Metrics from AWS Systems and Kubernetes

  • Set up CloudWatch agents on AWS instances to automatically collect system-level metrics.
  • Configure metric forwarding to CloudWatch using the CloudWatch agent or custom scripts.
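For illustration, a CloudWatch agent configuration that collects basic CPU and memory metrics might be generated as below. This is a sketch under assumptions: the helper name build_agent_config, the namespace passed to it, and the chosen measurements are examples only, not the configuration actually deployed.

```python
import json
from typing import Any, Dict


def build_agent_config(namespace: str, interval_seconds: int = 60) -> Dict[str, Any]:
    """Return a minimal CloudWatch agent configuration as a dict.

    Collects overall CPU usage and memory utilization only.
    """
    return {
        "agent": {"metrics_collection_interval": interval_seconds},
        "metrics": {
            "namespace": namespace,
            "metrics_collected": {
                "cpu": {
                    "measurement": ["cpu_usage_idle", "cpu_usage_user"],
                    "totalcpu": True,
                },
                "mem": {"measurement": ["mem_used_percent"]},
            },
        },
    }


# Hypothetical namespace; write the result to the agent's JSON config file.
config_json = json.dumps(build_agent_config("Example/Hosts"), indent=2)
```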

CloudWatch Dashboard and Graphs

  • Create an AWS CloudWatch Dashboard to visualize the collected metrics.
  • Design widgets to display various metrics using line charts, bar graphs, and more.
  • Organize the dashboard layout to provide an overview of system health and performance.
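A dashboard can also be defined programmatically as a JSON body. The sketch below lays out one line-chart widget per metric, two per row, on CloudWatch's 24-column grid. The helper name build_dashboard_body, the region, the "Maximum" statistic, and the widget sizes are assumptions for illustration; the 600-second period matches the 10-minute collection interval.

```python
import json
from typing import Iterable, Tuple


def build_dashboard_body(metrics: Iterable[Tuple[str, str]], region: str) -> str:
    """Return a CloudWatch dashboard body (JSON string) with one
    time-series widget per (namespace, metric_name) pair."""
    widgets = []
    for index, (namespace, metric_name) in enumerate(metrics):
        widgets.append(
            {
                "type": "metric",
                # Two widgets per row on the 24-column dashboard grid.
                "x": (index % 2) * 12,
                "y": (index // 2) * 6,
                "width": 12,
                "height": 6,
                "properties": {
                    "title": metric_name,
                    "view": "timeSeries",
                    "region": region,
                    "stat": "Maximum",
                    "period": 600,  # matches the 10-minute collection cycle
                    "metrics": [[namespace, metric_name]],
                },
            }
        )
    return json.dumps({"widgets": widgets})


# Usage (requires boto3 and AWS credentials; not executed here):
#   import boto3
#   body = build_dashboard_body([("Grepsr/Selenium", "SeleniumFreeSlots")], "us-east-1")
#   boto3.client("cloudwatch").put_dashboard(DashboardName="example", DashboardBody=body)
```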

Setting Up Alarms and Thresholds

Creating CloudWatch Alarms

  • Define CloudWatch Alarms based on selected metrics.
  • Set threshold values that trigger alarms when metric values breach predefined limits.
  • Configure alarm actions to perform specific tasks when triggered.
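As one possible shape for such an alarm, the sketch below builds PutMetricAlarm parameters for an up/down metric: the alarm fires when the minimum value over one 10-minute period drops below 1. The helper name build_down_alarm, the alarm name, the threshold choices, the missing-data handling, and the action ARN are all illustrative assumptions rather than the alarms actually configured.

```python
from typing import Any, Dict


def build_down_alarm(
    alarm_name: str, namespace: str, metric_name: str, action_arn: str
) -> Dict[str, Any]:
    """Build PutMetricAlarm keyword arguments for a 0/1 status metric."""
    return {
        "AlarmName": alarm_name,
        "Namespace": namespace,
        "MetricName": metric_name,
        "Statistic": "Minimum",
        "Period": 600,             # one 10-minute collection cycle
        "EvaluationPeriods": 1,
        "Threshold": 1.0,
        "ComparisonOperator": "LessThanThreshold",
        # Assumption: a missing data point is treated like an outage.
        "TreatMissingData": "breaching",
        # Hypothetical action target, for example an SNS topic ARN.
        "AlarmActions": [action_arn],
    }


# Usage (requires boto3 and AWS credentials; not executed here):
#   import boto3
#   params = build_down_alarm(
#       "example-website-down", "Grepsr/CheckWebsite", "Website Status",
#       "arn:aws:sns:us-east-1:123456789012:example-topic",
#   )
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```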

Integrating with Opsgenie

  • Use the Amazon CloudWatch Alarms integration with Opsgenie for incident management.
  • Set up alert policies in Opsgenie to manage how alerts are handled.
  • Define escalation policies and routing rules for different types of incidents.

Alerting and Notification

Sending Alerts to Slack

  • Configure Opsgenie to send alerts to Slack channels.
  • Define the "Critical system notifications" Slack channel as a target for critical alerts.

Phone Call Alerts for Critical Incidents

  • Utilize Opsgenie's capabilities to trigger phone call alerts for critical incidents.
  • Configure on-call schedules and phone numbers to receive calls for urgent alerts.

Conclusion

This documentation has covered the workflow for collecting metrics from internal systems, AWS resources, and Kubernetes clusters. The collected metrics are visualized through AWS CloudWatch Dashboards, and alerting mechanisms are established using CloudWatch Alarms and Opsgenie integration. This setup ensures that critical incidents are promptly identified, communicated, and managed to maintain the overall health and performance of your system.