
Crawler Lifecycle Workflow using Temporal

Background

The current model for reflecting runs and their statuses is based on an event-push-action-capture architecture. This model has stood the test of time, but numerous issues have been reported against it, and we have come to understand that, for our use case, it has a fundamental flaw: the action that binds the Kubernetes cluster to our database cannot be replayed if it is missed. These misses have left many runs stuck with stale statuses. In search of a resilient, long-term solution (nothing in tech is truly permanent), we have built this workflow on Temporal. We intend to use this workflow heavily to resolve the issues we have been having, and to bring near-real-time visibility to run statuses.

This document describes the operational flow of a crawler from initiation to completion. The workflow integrates with various services and micro-services, including Kubernetes, the Scheduler micro-service, and databases such as MySQL, orchestrated by a central system potentially governed by Amazon EventBridge.

Workflow Stages

The workflow is divided into several main stages: initiation, deployment, reconciliation, and completion.

1. Initiation

Trigger: The workflow can be initiated via three different triggers:

Scheduled Run: Initiated by a time-based trigger, likely managed by a Scheduler Micro-service.
Manual Run: Initiated by a user or process manually.
API Run: Initiated by an external API call.

Upon initiation, a "Run Request" is generated.
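To make the trigger types concrete, here is a minimal sketch of a run request. The `TriggerType` enum values and `RunRequest` fields are illustrative assumptions, not the actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum
import uuid

class TriggerType(Enum):
    """The three ways a run can be initiated."""
    SCHEDULED = "scheduled"  # time-based trigger from the Scheduler micro-service
    MANUAL = "manual"        # initiated by a user or process manually
    API = "api"              # initiated by an external API call

@dataclass
class RunRequest:
    """Illustrative shape of the "Run Request" generated at initiation;
    field names here are assumptions, not the real payload."""
    trigger: TriggerType
    run_id: str = field(default_factory=lambda: str(uuid.uuid4()))

req = RunRequest(trigger=TriggerType.API)
```

Whatever the real payload looks like, attaching a run identifier at initiation is what later allows a terminated workflow to be resumed (see the `run_scheduled_crawler` activity below).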

2. Deployment

Start Workflow: A "Start Workflow" process begins, which may encompass the allocation of resources and the setup of the environment for the crawler to run.

Deploy Job Process: The system then deploys the job processes, which likely involves container orchestration using Kubernetes. A unique "Job ID" is assigned to the process.

Fetch Job Status: A delay of 6 seconds is observed before the system checks the job status using the "Job ID." If the job has not started, the workflow loops until the job begins.
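The polling step can be sketched in plain Python. `wait_for_job_start` and the `"not_started"` status value are hypothetical names; a real Temporal workflow would use a durable timer rather than `time.sleep`.

```python
import time

def wait_for_job_start(fetch_status, job_id, delay_seconds=6):
    """Poll the job status (via a caller-supplied fetch_status callable)
    until the job leaves the not-started state, sleeping between checks."""
    while True:
        time.sleep(delay_seconds)
        status = fetch_status(job_id)
        if status != "not_started":  # assumed status value
            return status

# Stubbed example: the job starts on the third poll.
statuses = iter(["not_started", "not_started", "running"])
final = wait_for_job_start(lambda _job_id: next(statuses), "job-123", delay_seconds=0)
```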

3. Reconciliation

Once the job has started, the system fetches detailed job execution data ("Job Details").

Update Run Status: The system periodically signals the pusher or updates the task history and run status at intervals of 1, 2, 3, 5, 8, or 15 minutes.
Task Synchronization: The task history statuses are synchronized, and the system fetches the count of active or queued tasks.
Task Completion Check: A check is performed to determine whether all tasks are complete. If not, the workflow loops and continues to monitor task statuses.
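The reconciliation cadence above (1, 2, 3, 5, 8, 15 minutes) can be expressed as a simple generator. Repeating 15 minutes indefinitely after the ramp-up is an assumption, not stated in the document.

```python
def reconciliation_intervals():
    """Yield the delay (in minutes) before each reconciliation pass:
    the documented ramp of 1, 2, 3, 5, 8, then 15. Repeating 15
    indefinitely afterwards is an assumption."""
    for minutes in (1, 2, 3, 5, 8):
        yield minutes
    while True:
        yield 15

gen = reconciliation_intervals()
first_seven = [next(gen) for _ in range(7)]
```

A ramp like this checks frequently early in a run, when most failures surface, while keeping the steady-state polling load low.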

4. Completion

Update Run Status: The system updates the run status once all tasks are confirmed as complete.
Initiate Export: The workflow then initiates data export, likely to a database such as MySQL for storage or to Amazon EventBridge for event handling.
End Workflow: Finally, the "End Workflow" process is triggered, marking the conclusion of the crawler's lifecycle.
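Stages 3 and 4 together form a monitor-then-finish loop, sketched here with stubbed callables. The activity names mirror the list below; the signatures and the `"COMPLETED"` status value are illustrative assumptions.

```python
def finish_run(count_alive_tasks, update_run_status, initiate_export_signal):
    """Loop until no tasks are alive, then update the run status,
    initiate the export, and end the workflow."""
    while count_alive_tasks() > 0:
        pass  # in the real workflow, a reconciliation pass and a delay sit here
    update_run_status("COMPLETED")  # assumed status value
    initiate_export_signal()
    return "workflow_ended"

# Stubbed example: two reconciliation passes before all tasks finish.
remaining = iter([2, 1, 0])
events = []
outcome = finish_run(
    count_alive_tasks=lambda: next(remaining),
    update_run_status=events.append,
    initiate_export_signal=lambda: events.append("export_initiated"),
)
```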

Activities Description

create_history_id: Adds a history record to the database. The same record is reused to retry job deployment should it fail.
run_scheduled_crawler / run_onetime_crawler: Calls the scheduler-service to deploy the job to Kubernetes. Should this activity fail, it retries after an interval. If a RunID is passed in the request payload, the workflow is treated as a resumption of a previously terminated workflow.
add_workflow_id: Registers the workflow's ID as a history property in the database. Used to stop the run through a signal or to resume a terminated workflow.
get_main_job_status: Fetches the status of the job after deployment to Kubernetes to ensure it has spawned a pod and started working. Determines whether runs are stuck due to Kubernetes saturation.
update_run_started: Updates the history status from INITIALIZING to PROCESSING once the run has started.
signal_pusher_started: Triggers the report-run-started event and dispatches it to automatically close the dialog box in the front-end interface.
get_active_running_tasks: Fetches the list of tasks with an active status in the database.
get_history_statuses_to_sync: Takes a list of active tasks, fetches their statuses from Kubernetes, and compares them for any changes.
get_bulk_task_history_data: Fetches debugging information for each active task.
reconcile_tasks: Updates the database with the debugging information gathered.
sync_statuses: Updates the status of active tasks to their actual status in Kubernetes.
count_alive_tasks: Checks whether any tasks are in active or pending status in the database. If no tasks are alive, the run is deemed complete and the workflow moves to completion mode.
update_run_status: Updates the history status to signify successful completion or failure. If a run is manually stopped, its status is set to failed.
cancel_queued_tasks: Cancels all child tasks queued for deployment to Kubernetes when a stop signal is received.
k8_stop_task: Requests Kubernetes to stop individual jobs. Triggered once for single-process runs, or once per active task in multi-process runs.
add_cancellation_log_to_db: Adds a record to the database signifying that the run has been manually stopped.
signal_pusher_stopped: Triggers the report-run-stopped event and dispatches it to automatically close the dialog box in the front-end interface.
k8_stop_pod: Requests Kubernetes to delete the pods of running tasks. Triggered only if the workflow reaches force-stopping mode.
initiate_export_signal: Triggers an event signaling that the run has completed and the system can begin exporting the data collected from the data pipeline.
export_supporting_files: Zips the supporting files collected during the run and uploads them to AWS S3.
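The cancellation-related activities above imply a stop-signal path. The ordering in this sketch is an assumption inferred from the descriptions (not a confirmed execution order), and the activities are passed in as stubbed callables.

```python
def handle_stop_signal(activities, active_tasks):
    """Run the manual-stop sequence: cancel queued tasks, stop each
    active task in Kubernetes, record the cancellation, and notify
    the front end. Returns the activity names invoked, in order."""
    calls = []
    activities["cancel_queued_tasks"]()
    calls.append("cancel_queued_tasks")
    for task_id in active_tasks:  # once per active task in multi-process runs
        activities["k8_stop_task"](task_id)
        calls.append("k8_stop_task")
    activities["add_cancellation_log_to_db"]()
    calls.append("add_cancellation_log_to_db")
    activities["signal_pusher_stopped"]()
    calls.append("signal_pusher_stopped")
    return calls

# Stubbed example with two active tasks.
noop = lambda *args: None
stubs = {name: noop for name in (
    "cancel_queued_tasks", "k8_stop_task",
    "add_cancellation_log_to_db", "signal_pusher_stopped")}
calls = handle_stop_signal(stubs, ["task-a", "task-b"])
```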