Crawler Scheduler

Introduction

The Grepsr platform's report scheduling feature lets users automate the execution of reports within their projects. Each report is associated with a dedicated crawler, mapped by service code, and has its own data structure, which can be viewed under the Datasets menu in the platform. Users can run reports manually or schedule them to run automatically at specific dates and times, either once or on a recurring basis. This document describes the workflow and components of the Grepsr Platform Scheduler.

Scheduler Components

The scheduler workflow comprises several key components, each serving specific roles and responsibilities:

  1. Schedule Calculator Daemon: An always-running process that continuously calculates the runs due in the next three days. It scans the vt_extractor_schedule table for active schedules that are eligible to execute within that window, then inserts the relevant report information into the vt_scheduled_runs table, ready for execution.

  2. Schedule Processor Daemon: Starts the execution of a scheduled report when its scheduled time arrives. It does this by making a gRPC call to the Scheduler service's OneTimeRun method, which triggers deployment of the associated crawler image through the Kubernetes API.

  3. Phoenix Scheduler Service: The Scheduler service manages and orchestrates the execution of scheduled reports. It handles the configuration and management of report schedules and deploys crawler images through the Kubernetes API. It exposes the following gRPC methods:

    OneTimeTestRun: Allows for the immediate execution of a report as a one-time test run.

    StopRun: Stops the execution of a running report. 

    Schedule: Sets up a new schedule for a report to be executed at specific dates and times. 

    StopSchedule: Stops a scheduled report from being executed. 

    UpdateSchedule: Modifies the configuration of an existing schedule. 

    AddToTaskQueue: Used for reports that run with multiple processor instances.

    GetTaskStatus: Retrieves the status of a specific task. 

    MultiProcessRun: Initiates the execution of a report with multiple processor instances. 

    GetTasksStatus: Retrieves the status of multiple tasks. 

    GetActiveTasksV2: Retrieves information about active tasks. 

    GetScheduledTasks: Retrieves information about scheduled tasks. 

    GetCompletedTasks: Retrieves information about completed tasks. 

    GetChildTasks: Retrieves information about child tasks associated with a parent task. 

    StopTask: Stops the execution of a specific task. 

    StopChildTasks: Stops the execution of all child tasks associated with a parent task. 

    GetSchedules: Retrieves information about all existing schedules. 

    UnpauseSchedule: Resumes the execution of a paused schedule. 

    UpdateTaskHistory: Updates the historical records of a task. 

    UpdateMetaByJob: Updates the metadata of a specific job. 

    UpdateTaskHistoryByBatch: Updates the historical records of multiple tasks in a batch.

    GetTaskInfo: Retrieves detailed information about a specific task. 

    GetStats: Retrieves statistics related to task execution and performance. 

    GetTaskHistoryBySchedulerId: Retrieves the historical records of tasks associated with a specific scheduler ID.

    GetTaskHistoryInfo: Retrieves detailed information about the historical records of a specific task. 

    UpdateAllTaskHistoryByStatus: Updates the historical records of tasks based on their status. 

    GetMissedScheduleRuns: Retrieves information about missed schedule runs. 

    GetUpcomingAccountSchedules: Retrieves information about upcoming schedules for an account. 

    GetAllSchedules: Retrieves information about all existing schedules. 

    GetAllMissedRuns: Retrieves information about all missed schedule runs.

    GetNextSchedules: Retrieves information about the next set of schedules to be executed.

Together, these gRPC methods provide control over, and visibility into, the scheduling and execution of reports within the Grepsr platform.
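The interplay between the Schedule Calculator and Schedule Processor daemons described above can be illustrated with a small in-memory sketch. It is not the platform's implementation: the Schedule and ScheduledRun classes are hypothetical stand-ins for rows of vt_extractor_schedule and vt_scheduled_runs (whose real columns are not documented here), recurrence is reduced to a fixed interval, and start_run is a placeholder for the gRPC call to the Scheduler service.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Iterable, Iterator, List, Optional

LOOKAHEAD = timedelta(days=3)  # the three-day window used by the calculator


@dataclass(frozen=True)
class Schedule:
    """Hypothetical stand-in for a row of vt_extractor_schedule."""
    report_id: int
    first_run: datetime
    interval: Optional[timedelta]  # None means a one-time schedule
    active: bool = True


@dataclass
class ScheduledRun:
    """Hypothetical stand-in for a row of vt_scheduled_runs."""
    report_id: int
    run_at: datetime
    dispatched: bool = False


def occurrences(s: Schedule, start: datetime, end: datetime) -> Iterator[datetime]:
    """Yield the run times of one schedule that fall in [start, end)."""
    t = s.first_run
    if s.interval is None:
        if start <= t < end:
            yield t
        return
    if t < start:
        # Skip ahead to the first occurrence at or after `start`.
        steps = -((t - start) // s.interval)  # ceiling division
        t += steps * s.interval
    while t < end:
        yield t
        t += s.interval


def calculate(schedules: Iterable[Schedule], now: datetime,
              runs: List[ScheduledRun]) -> None:
    """Calculator pass: queue every active occurrence due within LOOKAHEAD."""
    queued = {(r.report_id, r.run_at) for r in runs}
    for s in schedules:
        if not s.active:
            continue
        for run_at in occurrences(s, now, now + LOOKAHEAD):
            if (s.report_id, run_at) not in queued:  # repeated passes add nothing new
                runs.append(ScheduledRun(s.report_id, run_at))
                queued.add((s.report_id, run_at))


def process(runs: List[ScheduledRun], now: datetime,
            start_run: Callable[[int], None]) -> List[int]:
    """Processor pass: trigger every queued run whose time has arrived."""
    started = []
    for r in sorted(runs, key=lambda r: r.run_at):
        if not r.dispatched and r.run_at <= now:
            start_run(r.report_id)  # placeholder for the gRPC call
            r.dispatched = True
            started.append(r.report_id)
    return started
```

In this sketch the calculator can be run repeatedly without queuing duplicates, and the processor marks each run as dispatched so it is triggered only once. A real processor would presumably also distinguish runs that are long overdue (the service exposes GetMissedScheduleRuns for missed runs); the sketch simply starts anything that is due.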

Events Emitted by Scheduler Service

  1. ReportScheduleCreatedEvent
  2. ReportScheduleStoppedEvent
  3. ReportScheduleDeletedEvent
  4. ReportScheduleUnpausedEvent

These events are consumed only by the Activity Pipeline, which stores them in DynamoDB for logging purposes; they are displayed on the activity feed page in the platform.
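As a rough illustration of how such events might become activity feed entries, the sketch below flattens an event into a key-value item and keeps the items in memory. Only the four event names come from the list above; the ScheduleEvent fields, the item layout, and the ActivityLog class (an in-memory stand-in for the DynamoDB-backed store) are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Event names as listed above, mapped to a verb for the feed message.
SCHEDULE_EVENTS = {
    "ReportScheduleCreatedEvent": "created",
    "ReportScheduleStoppedEvent": "stopped",
    "ReportScheduleDeletedEvent": "deleted",
    "ReportScheduleUnpausedEvent": "unpaused",
}


@dataclass(frozen=True)
class ScheduleEvent:
    """Hypothetical event payload; the real fields are not documented here."""
    name: str
    report_id: int
    occurred_at: datetime  # expected to be timezone-aware


def to_activity_item(event: ScheduleEvent) -> dict:
    """Flatten an event into a key-value item suitable for a log store."""
    if event.name not in SCHEDULE_EVENTS:
        raise ValueError(f"unknown scheduler event: {event.name}")
    verb = SCHEDULE_EVENTS[event.name]
    return {
        "report_id": event.report_id,
        "occurred_at": event.occurred_at.astimezone(timezone.utc).isoformat(),
        "event": event.name,
        "message": f"Schedule {verb} for report {event.report_id}",
    }


class ActivityLog:
    """In-memory stand-in for the activity store (assumed, not the real one)."""

    def __init__(self) -> None:
        self._items = []

    def record(self, event: ScheduleEvent) -> None:
        self._items.append(to_activity_item(event))

    def feed(self, report_id: int) -> list:
        """Return a report's activity items, newest first."""
        items = [i for i in self._items if i["report_id"] == report_id]
        return sorted(items, key=lambda i: i["occurred_at"], reverse=True)
```

Storing the timestamp as a UTC ISO 8601 string keeps the items sortable as plain text, which is a common choice for key-value stores, though the actual item layout may differ.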