
Crawler Lifecycle Workflow using Temporal

Background

The current model for reflecting runs and their statuses is based on an event-push-action-capture architecture. This model has stood the test of time, but numerous issues have been reported against it, and we have come to understand that, for our use case, it has a fundamental flaw: the action that binds the Kubernetes cluster to our database cannot be replayed if it is missed. These misses have left many runs stuck with stale statuses. In search of a resilient, long-term solution (nothing in tech is truly permanent), we have built this workflow on Temporal. We intend to use this workflow heavily to resolve the issues we have been having, and to bring near-real-time visibility to run statuses.

This document describes the operational flow of a crawler from initiation to completion. The workflow integrates with various services and micro-services, including Kubernetes, the Scheduler micro-service, and databases such as MySQL, orchestrated by a central system potentially governed by Amazon EventBridge.

Workflow Stages

The workflow is divided into several main stages: initiation, deployment, reconciliation, and completion.

1. Initiation

Trigger: The workflow can be initiated via three different triggers:

Scheduled Run: Initiated by a time-based trigger, likely managed by a Scheduler Micro-service.
Manual Run: Initiated by a user or process manually.
API Run: Initiated by an external API call.

Upon initiation, a "Run Request" is generated.
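To make the trigger types concrete, here is a minimal sketch of a run request. The `TriggerType` enum values and `RunRequest` fields are illustrative assumptions, not the actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum
import uuid

class TriggerType(Enum):
    """The three ways a run can be initiated."""
    SCHEDULED = "scheduled"  # time-based trigger from the Scheduler micro-service
    MANUAL = "manual"        # initiated by a user or process manually
    API = "api"              # initiated by an external API call

@dataclass
class RunRequest:
    """Illustrative shape of the "Run Request" generated at initiation;
    field names here are assumptions, not the real payload."""
    trigger: TriggerType
    run_id: str = field(default_factory=lambda: str(uuid.uuid4()))

req = RunRequest(trigger=TriggerType.API)
```

Whatever the real payload looks like, attaching a run identifier at initiation is what later allows a terminated workflow to be resumed (see the `run_scheduled_crawler` activity below).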

2. Deployment

Start Workflow: A "Start Workflow" process begins, which may encompass the allocation of resources and the setup of the environment for the crawler to run.

Deploy Job Process: The system then deploys the job processes, which likely involves container orchestration using Kubernetes. A unique "Job ID" is assigned to the process.

Fetch Job Status: A delay of 6 seconds is observed before the system checks the job status using the "Job ID." If the job has not started, the workflow loops until the job begins.
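The polling step can be sketched in plain Python. `wait_for_job_start` and the `"not_started"` status value are hypothetical names; a real Temporal workflow would use a durable timer rather than `time.sleep`.

```python
import time

def wait_for_job_start(fetch_status, job_id, delay_seconds=6):
    """Poll the job status (via a caller-supplied fetch_status callable)
    until the job leaves the not-started state, sleeping between checks."""
    while True:
        time.sleep(delay_seconds)
        status = fetch_status(job_id)
        if status != "not_started":  # assumed status value
            return status

# Stubbed example: the job starts on the third poll.
statuses = iter(["not_started", "not_started", "running"])
final = wait_for_job_start(lambda _job_id: next(statuses), "job-123", delay_seconds=0)
```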

3. Reconciliation

Once the job has started, the system fetches detailed job execution data ("Job Details").

Update Run Status: The system periodically signals the pusher or updates the task history and run status at intervals of 1, 2, 3, 5, 8, or 15 minutes.
Task Synchronization: The task history statuses are synchronized, and the system fetches the count of active or queued tasks.
Task Completion Check: A check is performed to determine whether all tasks are complete. If not, the workflow loops and continues to monitor task statuses.
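The reconciliation cadence above (1, 2, 3, 5, 8, 15 minutes) can be expressed as a simple generator. Repeating 15 minutes indefinitely after the ramp-up is an assumption, not stated in the document.

```python
def reconciliation_intervals():
    """Yield the delay (in minutes) before each reconciliation pass:
    the documented ramp of 1, 2, 3, 5, 8, then 15. Repeating 15
    indefinitely afterwards is an assumption."""
    for minutes in (1, 2, 3, 5, 8):
        yield minutes
    while True:
        yield 15

gen = reconciliation_intervals()
first_seven = [next(gen) for _ in range(7)]
```

A ramp like this checks frequently early in a run, when most failures surface, while keeping the steady-state polling load low.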

4. Completion

Update Run Status: The system updates the run status once all tasks are confirmed as complete.
Initiate Export: The workflow then initiates data export, likely to a database such as MySQL for storage or to Amazon EventBridge for event handling.
End Workflow: Finally, the "End Workflow" process is triggered, marking the conclusion of the crawler's lifecycle.
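Stages 3 and 4 together form a monitor-then-finish loop, sketched here with stubbed callables. The activity names mirror the list below; the signatures and the `"COMPLETED"` status value are illustrative assumptions.

```python
def finish_run(count_alive_tasks, update_run_status, initiate_export_signal):
    """Loop until no tasks are alive, then update the run status,
    initiate the export, and end the workflow."""
    while count_alive_tasks() > 0:
        pass  # in the real workflow, a reconciliation pass and a delay sit here
    update_run_status("COMPLETED")  # assumed status value
    initiate_export_signal()
    return "workflow_ended"

# Stubbed example: two reconciliation passes before all tasks finish.
remaining = iter([2, 1, 0])
events = []
outcome = finish_run(
    count_alive_tasks=lambda: next(remaining),
    update_run_status=events.append,
    initiate_export_signal=lambda: events.append("export_initiated"),
)
```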

Activities Description

create_history_id: Adds a history record to the database. The same record is reused to retry job deployment should it fail.
run_scheduled_crawler / run_onetime_crawler: Calls the scheduler-service to deploy the job to Kubernetes. Should this activity fail, it retries after an interval. If a RunID is passed in the request payload, the workflow is treated as a resumption of a previously terminated workflow.
add_workflow_id: Registers the workflow's ID as a history property in the database. Used to stop the run through a signal or to resume a terminated workflow.
get_main_job_status: Fetches the status of the job after deployment to Kubernetes to ensure it has spawned a pod and started working. Determines whether runs are stuck due to Kubernetes saturation.
update_run_started: Updates the history status from INITIALIZING to PROCESSING once the run has started.
signal_pusher_started: Triggers the report-run-started event and dispatches it to automatically close the dialog box in the front-end interface.
get_active_running_tasks: Fetches the list of tasks with an active status in the database.
get_history_statuses_to_sync: Takes a list of active tasks, fetches their statuses from Kubernetes, and compares them for any changes.
get_bulk_task_history_data: Fetches debugging information for each active task.
reconcile_tasks: Updates the database with the debugging information gathered.
sync_statuses: Updates the status of active tasks to their actual status in Kubernetes.
count_alive_tasks: Checks whether any tasks are in active or pending status in the database. If no tasks are alive, the run is deemed complete and the workflow moves to completion mode.
update_run_status: Updates the history status to signify successful completion or failure. If a run is manually stopped, its status is set to failed.
cancel_queued_tasks: Cancels all child tasks queued for deployment to Kubernetes when a stop signal is received.
k8_stop_task: Requests Kubernetes to stop individual jobs. Triggered once for single-process runs, or once per active task in multi-process runs.
add_cancellation_log_to_db: Adds a record to the database signifying that the run has been manually stopped.
signal_pusher_stopped: Triggers the report-run-stopped event and dispatches it to automatically close the dialog box in the front-end interface.
k8_stop_pod: Requests Kubernetes to delete the pods of running tasks. Triggered only if the workflow reaches force-stopping mode.
initiate_export_signal: Triggers an event signaling that the run has completed and the system can begin exporting the data collected from the data pipeline.
export_supporting_files: Zips the supporting files collected during the run and uploads them to AWS S3.
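The cancellation-related activities above imply a stop-signal path. The ordering in this sketch is an assumption inferred from the descriptions (not a confirmed execution order), and the activities are passed in as stubbed callables.

```python
def handle_stop_signal(activities, active_tasks):
    """Run the manual-stop sequence: cancel queued tasks, stop each
    active task in Kubernetes, record the cancellation, and notify
    the front end. Returns the activity names invoked, in order."""
    calls = []
    activities["cancel_queued_tasks"]()
    calls.append("cancel_queued_tasks")
    for task_id in active_tasks:  # once per active task in multi-process runs
        activities["k8_stop_task"](task_id)
        calls.append("k8_stop_task")
    activities["add_cancellation_log_to_db"]()
    calls.append("add_cancellation_log_to_db")
    activities["signal_pusher_stopped"]()
    calls.append("signal_pusher_stopped")
    return calls

# Stubbed example with two active tasks.
noop = lambda *args: None
stubs = {name: noop for name in (
    "cancel_queued_tasks", "k8_stop_task",
    "add_cancellation_log_to_db", "signal_pusher_stopped")}
calls = handle_stop_signal(stubs, ["task-a", "task-b"])
```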