Crawler Logs

Introduction

The Crawler Logs System provides a centralized logging solution for Grepsr's web scraping service. Crawlers running as Kubernetes jobs generate logs, which are collected, stored, and made available via a dedicated portal for debugging and monitoring.

flowchart LR
    A[Crawler] -->|Sends logs| B[FluentBit]
    B -->|Processes & forwards| C[(S3 Storage)]
    C <-->|Query data| D[Athena]
    D <-->|Fetch/display results| E[Platform]

Key Components

| Component | Description |
|---|---|
| Crawlers | Kubernetes pods/jobs that log events (info, errors, debug) |
| FluentBit | Log processor & forwarder (sends logs to S3) |
| S3 Bucket | grepsr-crawler-logs-prod (stores logs as GZIP-compressed JSON) |
| Athena | Query engine that reads logs from S3 |
| Crawler Logs Service | Microservice that fetches logs via Athena |
| Platform | Portal UI for delivery teams to view/search logs |

Logging Pipeline

Log Generation & Collection

  • Crawlers (Kubernetes jobs) emit structured JSON log events (e.g., via a PHP logger).
  • Logs are streamed to FluentBit (running as a DaemonSet in the cluster).
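The crawlers themselves are PHP-based, but the shape of a structured log event can be sketched in Python (field names follow the Athena table schema later in this document; the example values are illustrative):

```python
import json
from datetime import datetime, timezone

def make_log_event(level, message, crawler_id, job_id, **metadata):
    """Build one structured log event in the JSON shape the pipeline stores."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": message,
        "crawler_id": crawler_id,
        "job_id": job_id,
        # metadata is stored as MAP<STRING, STRING>, so values are stringified
        "metadata": {k: str(v) for k, v in metadata.items()},
    }

# Crawlers write one JSON object per line; FluentBit tails the container output.
event = make_log_event("ERROR", "HTTP 503 from target", "crawler-42", "job-7",
                       url="https://example.com")
print(json.dumps(event))
```

One JSON object per line keeps the stream trivially parseable by FluentBit and, downstream, by Athena's JSON SerDe.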

Log Processing & Storage

  • FluentBit processes logs and forwards them to S3 (grepsr-crawler-logs-prod).
  • Storage format:
      • Compression: GZIP
      • Format: JSON (structured logs)
      • Partitioning: by date (e.g., year=YYYY/month=MM/day=DD/hour=HH)
  • Flush interval: every 30 seconds (configurable in FluentBit).
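The Hive-style partition layout above can be derived directly from the event timestamp. A minimal sketch (the exact object key layout depends on the FluentBit S3 output configuration):

```python
from datetime import datetime, timezone

def s3_partition_prefix(ts: datetime) -> str:
    """Hive-style partition prefix under the bucket root, matching
    year=YYYY/month=MM/day=DD/hour=HH partitioning."""
    return ts.strftime("year=%Y/month=%m/day=%d/hour=%H/")

print(s3_partition_prefix(datetime(2024, 6, 15, 13, 5, tzinfo=timezone.utc)))
# → year=2024/month=06/day=15/hour=13/
```

Using the `key=value` form (rather than bare `2024/06/15/`) lets Athena discover partitions automatically with MSCK REPAIR TABLE.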

Log Retention Policy

  • Active logs: available for 30 days in S3.
  • Expiration: logs are deleted automatically after 30 days via an S3 Lifecycle rule; there is no long-term archive.
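The 30-day retention can be expressed as a single S3 Lifecycle rule. The sketch below builds the rule as a Python dict (the rule ID is illustrative; in practice it would be applied with boto3's put_bucket_lifecycle_configuration or infrastructure-as-code):

```python
# Sketch of the Lifecycle rule implementing 30-day retention on the log bucket.
lifecycle_rule = {
    "ID": "expire-crawler-logs-after-30-days",   # illustrative name
    "Status": "Enabled",
    "Filter": {"Prefix": ""},                    # applies to the whole bucket
    "Expiration": {"Days": 30},                  # delete 30 days after creation
}
```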

Log Querying & Access

Athena Integration

Table Schema:

CREATE EXTERNAL TABLE `crawler_logs` (
  `timestamp` STRING,
  `level` STRING,
  `message` STRING,
  `crawler_id` STRING,
  `job_id` STRING,
  `metadata` MAP<STRING, STRING>
)
PARTITIONED BY (`year` STRING, `month` STRING, `day` STRING, `hour` STRING)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://grepsr-crawler-logs-prod/'

Query Example:

SELECT "timestamp", level, message
FROM crawler_logs
WHERE level = 'ERROR'
  AND year = '2024' AND month = '06' AND day = '15' AND hour = '13'
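Queries like the one above can be assembled programmatically. A minimal sketch (column and partition names follow the query above; a production service should prefer Athena's parameterized queries over string interpolation):

```python
from typing import Optional

def build_error_query(level: str, year: str, month: str, day: str,
                      hour: Optional[str] = None) -> str:
    """Build a partition-pruned Athena query over crawler_logs.
    Filtering on year/month/day(/hour) limits the S3 scan to those partitions."""
    clauses = [
        f"level = '{level}'",
        f"year = '{year}'",
        f"month = '{month}'",
        f"day = '{day}'",
    ]
    if hour is not None:
        clauses.append(f"hour = '{hour}'")
    return (
        'SELECT "timestamp", level, message, crawler_id, job_id\n'
        "FROM crawler_logs\n"
        "WHERE " + "\n  AND ".join(clauses)
    )

print(build_error_query("ERROR", "2024", "06", "15", hour="13"))
```

Note that `timestamp` is a reserved word in Athena's query engine, so it is double-quoted in SELECT statements.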

Crawler Logs Service

  • A microservice that:
      • accepts log queries (e.g., by crawler_id, job_id, log level);
      • uses Athena to fetch logs from S3;
      • returns structured JSON responses to the Platform.

Platform Portal (UI)

  • Filter logs by time range, pod name, job ID, log level.
  • Search logs using keywords.
  • Download logs in CSV/JSON format.
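The CSV export can be sketched as a small helper that flattens the structured JSON rows returned by the Crawler Logs Service (function name and row shape are illustrative):

```python
import csv
import io

def logs_to_csv(rows):
    """Serialize structured log rows (a list of flat dicts) to CSV text
    for the portal's download feature."""
    if not rows:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

rows = [
    {"timestamp": "2024-06-15T13:00:01Z", "level": "ERROR", "message": "HTTP 503"},
]
print(logs_to_csv(rows))
```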

Troubleshooting & Debugging

Common Issues & Solutions

| Issue | Possible Cause | Resolution |
|---|---|---|
| Logs not appearing in S3 | FluentBit misconfiguration | Check FluentBit logs (kubectl logs -l app=fluent-bit) |
| Athena query timeout | Large, unpartitioned scan | Restrict the query to partitions (WHERE year = ...) |
| "Access Denied" in S3 | IAM permissions | Verify the Athena/S3 access policies |

Debugging Steps

1. Check FluentBit logs:

   kubectl logs -l app=fluent-bit -n logging

2. Verify S3 files (note the Hive-style partition prefixes):

   aws s3 ls s3://grepsr-crawler-logs-prod/year=2024/month=06/day=15/

3. Test an Athena query:

   SELECT * FROM crawler_logs LIMIT 10;


Conclusion

This system ensures reliable log collection, storage, and retrieval, enabling the delivery team to debug crawler issues efficiently. For further details, refer to:

  • FluentBit Configuration
  • Athena Query Guide
  • Platform Logs UI Documentation

Last Updated: May 2025
Owner: Product Team @ Grepsr