Data Profiler

Introduction

Data profile generation can be run on any exported dataset (CSV or XLSX). The generated report helps Data QA personnel because it contains both a brief overview and detailed column-wise analysis. Any ADMIN user can trigger profile generation manually from the datasets page in the platform. The process usually takes about 2-10 minutes, depending on the size of the dataset.

Link: Data Profiler

Lifecycle: Temporal Workflow

Profile generation is orchestrated by a Temporal workflow. A few checks must pass before the workflow deploys the generation job.

Activities

  • get_dataset_meta: Calls the GetPageInfo RPC in crawldata-query-service to fetch page information for the dataset. The response is used by the later checks in this workflow.

  • check_cell_count_threshold: The profile generation job has a cell-count constraint: the total cell count of each page in the dataset cannot exceed 10,000,000. Using the page information, this activity checks whether profile generation may proceed.

  • get_file_ident: Fetches the file identifier of the latest exported dataset, either CSV or XLSX. If both are present, CSV is chosen because it is the simpler format.

  • get_file_url: Uses the file identifier to fetch the URL of the exported dataset by calling the get_s3_url RPC in delivery-service. The returned URL is signed so that it is publicly accessible.

  • run_report_profiler: Calls the GenerateProfile RPC in data-profiler-service with the appropriate payload. This RPC prepares and deploys a Kubernetes job that generates the profile report.

  • get_main_job_status: After the profile generation job is deployed, this activity is invoked periodically to monitor it. It queries Kubernetes for the real-time status of the job.

  • get_profile_report_url: After the Kubernetes job for profile generation completes successfully, this activity fetches the public URLs of the generated reports in S3. It builds the URLs from a standard format.

  • mark_report_generation: Invoked once the report generation job reaches a terminal state. It calls the MarkReportGeneration RPC in data-profiler-service, which marks the run as either failed or completed in the database according to the supplied payload. On successful completion, the URLs of the reports generated and uploaded to S3 are also stored in the database.
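
The two pre-deployment decisions described in the activities above (the per-page cell-count check and the CSV-over-XLSX preference) can be illustrated with a minimal Python sketch. This is not the service's actual code. The PageInfo and ExportedFile classes, their field names, and the function names are hypothetical. The sketch also assumes that a page's cell count is rows multiplied by columns, and it reads "latest exported dataset" as the latest export of each format. Only the 10,000,000 limit and the CSV preference come from the description above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Per-page limit stated in the check_cell_count_threshold description.
MAX_CELLS_PER_PAGE = 10_000_000


@dataclass(frozen=True)
class PageInfo:
    """Hypothetical shape of one page's metadata (field names are assumptions)."""
    name: str
    rows: int
    columns: int

    @property
    def cell_count(self) -> int:
        # Assumption: cell count is rows times columns.
        return self.rows * self.columns


@dataclass(frozen=True)
class ExportedFile:
    """Hypothetical record of one exported file (field names are assumptions)."""
    ident: str
    fmt: str          # "CSV" or "XLSX"
    exported_at: int  # any sortable timestamp, e.g. epoch seconds


def within_cell_count_threshold(
    pages: Iterable[PageInfo], limit: int = MAX_CELLS_PER_PAGE
) -> bool:
    """Return True only if every page stays within the cell-count limit."""
    return all(page.cell_count <= limit for page in pages)


def pick_file_ident(exports: Iterable[ExportedFile]) -> Optional[str]:
    """Pick the latest export per format, preferring CSV over XLSX."""
    latest = {}
    for export in exports:
        current = latest.get(export.fmt)
        if current is None or export.exported_at > current.exported_at:
            latest[export.fmt] = export
    for fmt in ("CSV", "XLSX"):  # CSV is checked first, so it wins when both exist
        if fmt in latest:
            return latest[fmt].ident
    return None  # no usable export found
```

In a workflow, a False result from the threshold check or a None file identifier would stop the run before run_report_profiler is called. How the real workflow reports such a stop is not covered here.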