Crawler Logs

Introduction

The Crawler Logs System provides a centralized logging solution for Grepsr's web scraping service. Crawlers running as Kubernetes jobs generate logs, which are collected, stored, and made available via a dedicated portal for debugging and monitoring.

flowchart LR
    A[Crawler] -->|Sends logs| B[FluentBit]
    B -->|Processes & forwards| C[(S3 Storage)]
    C <-->|Query data| D[Athena]
    D <-->|Fetch/display results| E[Platform]

Key Components

| Component | Description |
|---|---|
| Crawlers | Kubernetes pods/jobs that log events (info, errors, debug) |
| FluentBit | Log processor & forwarder (sends logs to S3) |
| S3 Bucket | grepsr-crawler-logs-prod (stores logs as GZIP-compressed JSON) |
| Athena | Query engine that reads logs from S3 |
| Crawler Logs Service | Microservice that fetches logs via Athena |
| Platform | Portal UI for delivery teams to view/search logs |

Logging Pipeline

Log Generation & Collection

  • Crawlers (Kubernetes jobs) emit structured JSON log events (e.g., via a PHP logger).
  • Logs are streamed to FluentBit (running as a DaemonSet in the cluster).
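The crawlers themselves are PHP-based, but the shape of a structured log event can be sketched in Python (field names follow the Athena table schema later in this document; the example values are illustrative):

```python
import json
from datetime import datetime, timezone

def make_log_event(level, message, crawler_id, job_id, **metadata):
    """Build one structured log event in the JSON shape the pipeline stores."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": message,
        "crawler_id": crawler_id,
        "job_id": job_id,
        # metadata is stored as MAP<STRING, STRING>, so values are stringified
        "metadata": {k: str(v) for k, v in metadata.items()},
    }

# Crawlers write one JSON object per line; FluentBit tails the container output.
event = make_log_event("ERROR", "HTTP 503 from target", "crawler-42", "job-7",
                       url="https://example.com")
print(json.dumps(event))
```

One JSON object per line keeps the stream trivially parseable by FluentBit and, downstream, by Athena's JSON SerDe.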

Log Processing & Storage

  • FluentBit processes logs and forwards them to S3 (grepsr-crawler-logs-prod).
  • Storage format:
      • Compression: GZIP
      • Format: JSON (structured logs)
      • Partitioning: by date (e.g., year=YYYY/month=MM/day=DD/hour=HH)
  • Flush interval: every 30 seconds (configurable in FluentBit).
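The Hive-style partition layout above can be derived directly from the event timestamp. A minimal sketch (the exact object key layout depends on the FluentBit S3 output configuration):

```python
from datetime import datetime, timezone

def s3_partition_prefix(ts: datetime) -> str:
    """Hive-style partition prefix under the bucket root, matching
    year=YYYY/month=MM/day=DD/hour=HH partitioning."""
    return ts.strftime("year=%Y/month=%m/day=%d/hour=%H/")

print(s3_partition_prefix(datetime(2024, 6, 15, 13, 5, tzinfo=timezone.utc)))
# → year=2024/month=06/day=15/hour=13/
```

Using the `key=value` form (rather than bare `2024/06/15/`) lets Athena discover partitions automatically with MSCK REPAIR TABLE.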

Log Retention Policy

  • Active logs: available for 30 days in S3.
  • Expiration: logs are deleted automatically after 30 days via an S3 Lifecycle rule; there is no long-term archive.
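The 30-day retention can be expressed as a single S3 Lifecycle rule. The sketch below builds the rule as a Python dict (the rule ID is illustrative; in practice it would be applied with boto3's put_bucket_lifecycle_configuration or infrastructure-as-code):

```python
# Sketch of the Lifecycle rule implementing 30-day retention on the log bucket.
lifecycle_rule = {
    "ID": "expire-crawler-logs-after-30-days",   # illustrative name
    "Status": "Enabled",
    "Filter": {"Prefix": ""},                    # applies to the whole bucket
    "Expiration": {"Days": 30},                  # delete 30 days after creation
}
```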

Log Querying & Access

Athena Integration

Table Schema:

CREATE EXTERNAL TABLE `crawler_logs` (
  `timestamp` STRING,
  `level` STRING,
  `message` STRING,
  `crawler_id` STRING,
  `job_id` STRING,
  `metadata` MAP<STRING, STRING>
)
PARTITIONED BY (`year` STRING, `month` STRING, `day` STRING, `hour` STRING)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://grepsr-crawler-logs-prod/'

Query Example:

SELECT "timestamp", level, message
FROM crawler_logs
WHERE level = 'ERROR'
  AND year = '2024' AND month = '06' AND day = '15' AND hour = '13'
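Queries like the one above can be assembled programmatically. A minimal sketch (column and partition names follow the query above; a production service should prefer Athena's parameterized queries over string interpolation):

```python
from typing import Optional

def build_error_query(level: str, year: str, month: str, day: str,
                      hour: Optional[str] = None) -> str:
    """Build a partition-pruned Athena query over crawler_logs.
    Filtering on year/month/day(/hour) limits the S3 scan to those partitions."""
    clauses = [
        f"level = '{level}'",
        f"year = '{year}'",
        f"month = '{month}'",
        f"day = '{day}'",
    ]
    if hour is not None:
        clauses.append(f"hour = '{hour}'")
    return (
        'SELECT "timestamp", level, message, crawler_id, job_id\n'
        "FROM crawler_logs\n"
        "WHERE " + "\n  AND ".join(clauses)
    )

print(build_error_query("ERROR", "2024", "06", "15", hour="13"))
```

Note that `timestamp` is a reserved word in Athena's query engine, so it is double-quoted in SELECT statements.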

Crawler Logs Service

  • A microservice that:
      • accepts log queries (e.g., by crawler_id, job_id, log level);
      • uses Athena to fetch logs from S3;
      • returns structured JSON responses to the Platform.

Platform Portal (UI)

  • Filter logs by time range, pod name, job ID, log level.
  • Search logs using keywords.
  • Download logs in CSV/JSON format.
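The CSV export can be sketched as a small helper that flattens the structured JSON rows returned by the Crawler Logs Service (function name and row shape are illustrative):

```python
import csv
import io

def logs_to_csv(rows):
    """Serialize structured log rows (a list of flat dicts) to CSV text
    for the portal's download feature."""
    if not rows:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

rows = [
    {"timestamp": "2024-06-15T13:00:01Z", "level": "ERROR", "message": "HTTP 503"},
]
print(logs_to_csv(rows))
```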

Troubleshooting & Debugging

Common Issues & Solutions

| Issue | Possible Cause | Resolution |
|---|---|---|
| Logs not appearing in S3 | FluentBit misconfiguration | Check FluentBit logs (kubectl logs -l app=fluent-bit) |
| Athena query timeout | Large, unpartitioned scan | Restrict the query to partitions (WHERE year = ...) |
| "Access Denied" in S3 | IAM permissions | Verify the Athena/S3 access policies |

Debugging Steps

1. Check FluentBit logs:

   kubectl logs -l app=fluent-bit -n logging

2. Verify S3 files (note the Hive-style partition prefixes):

   aws s3 ls s3://grepsr-crawler-logs-prod/year=2024/month=06/day=15/

3. Test an Athena query:

   SELECT * FROM crawler_logs LIMIT 10;


Conclusion

This system ensures reliable log collection, storage, and retrieval, enabling the delivery team to debug crawler issues efficiently. For further details, refer to:

  • FluentBit Configuration
  • Athena Query Guide
  • Platform Logs UI Documentation

Last Updated: May 2025
Owner: Product Team @ Grepsr