# Crawler Logs

## Introduction
The Crawler Logs System provides a centralized logging solution for Greper's web scraping service. Crawlers running as Kubernetes jobs generate logs, which are collected, stored, and made available via a dedicated portal for debugging and monitoring.
```mermaid
flowchart LR
    A[Crawler] -->|Sends logs| B[FluentBit]
    B -->|Processes & forwards| C[(S3 Storage)]
    C <-->|Query data| D[Athena]
    D <-->|Fetch/display results| E[Platform]
```
## Key Components
| Component | Description |
|---|---|
| Crawlers | Kubernetes pods/jobs that log events (info, errors, debug) |
| FluentBit | Log processor & forwarder (sends logs to S3) |
| S3 Bucket | `grepsr-crawler-logs-prod` (stores logs as GZIP-compressed JSON) |
| Athena | Query engine to fetch logs from S3 |
| Crawler Logs Service | Microservice that fetches logs via Athena |
| Platform Portal | UI for delivery teams to view/search logs |
## Logging Pipeline

### Log Generation & Collection

- Crawlers (Kubernetes jobs) log events using a structured logger (e.g., PHP logging).
- Logs are streamed to FluentBit (running as a DaemonSet in the cluster).
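A structured log event in this pipeline might look like the following sketch. Field names mirror the Athena `crawler_logs` schema; the exact record shape and the ID values are illustrative assumptions, not the production format:

```python
import json
from datetime import datetime, timezone

# Illustrative crawler log event; field names follow the Athena
# crawler_logs schema (timestamp, level, message, crawler_id, job_id, metadata).
event = {
    "timestamp": datetime(2024, 6, 15, 13, 0, 0, tzinfo=timezone.utc).isoformat(),
    "level": "ERROR",
    "message": "HTTP 503 from target site",
    "crawler_id": "crawler-42",        # hypothetical crawler ID
    "job_id": "job-20240615-001",      # hypothetical job ID
    "metadata": {"attempt": "3", "url": "https://example.com"},
}

# FluentBit forwards one JSON object per line to S3.
line = json.dumps(event)
print(line)
```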
### Log Processing & Storage

- FluentBit processes logs and forwards them to S3 (`grepsr-crawler-logs-prod`).
- Storage format:
    - Compression: GZIP
    - Format: JSON (structured logs)
    - Partitioning: by date (e.g., `year=YYYY/month=MM/day=DD/hour=HH`)
- Flush interval: every 30 seconds (configurable in FluentBit).
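The hourly date partitioning above can be sketched as a key-prefix builder (the bucket name comes from this document; the function name is illustrative):

```python
from datetime import datetime, timezone

def partition_prefix(ts: datetime) -> str:
    """Build the hourly Hive-style S3 prefix: year=YYYY/month=MM/day=DD/hour=HH."""
    return (
        f"year={ts.year:04d}/month={ts.month:02d}/"
        f"day={ts.day:02d}/hour={ts.hour:02d}"
    )

ts = datetime(2024, 6, 15, 13, 30, tzinfo=timezone.utc)
print(f"s3://grepsr-crawler-logs-prod/{partition_prefix(ts)}/")
# → s3://grepsr-crawler-logs-prod/year=2024/month=06/day=15/hour=13/
```

The `key=value` form lets Athena map each prefix directly to a table partition.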
### Log Retention Policy
- Active Logs: Available for 30 days in S3.
- Archival: Automatically deleted after 30 days (via S3 Lifecycle Policy).
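The 30-day expiry could be implemented with an S3 lifecycle rule along these lines (a minimal sketch; the rule ID is illustrative and the bucket's actual configuration may differ):

```json
{
  "Rules": [
    {
      "ID": "expire-crawler-logs-after-30-days",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Expiration": { "Days": 30 }
    }
  ]
}
```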
## Log Querying & Access

### Athena Integration
Table schema (since the logs are stored as GZIP-compressed JSON rather than Parquet, the table uses a JSON SerDe; the partition columns match the hourly `year=/month=/day=/hour=` prefix layout):

```sql
CREATE EXTERNAL TABLE `crawler_logs` (
  `timestamp`  STRING,
  `level`      STRING,
  `message`    STRING,
  `crawler_id` STRING,
  `job_id`     STRING,
  `metadata`   MAP<STRING, STRING>
)
PARTITIONED BY (`year` STRING, `month` STRING, `day` STRING, `hour` STRING)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://grepsr-crawler-logs-prod/';
```
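Because the table is partitioned, Athena only sees partitions that have been registered in the metastore. Assuming partition discovery is not automated (e.g., via a Glue crawler or partition projection), newly written Hive-style prefixes can be loaded with:

```sql
-- Scan the table LOCATION and register any new year=/month=/... partitions:
MSCK REPAIR TABLE crawler_logs;
```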
Query example (column names follow the table schema; the partition filters keep the scan to a single hour):

```sql
SELECT "timestamp", level, message
FROM crawler_logs
WHERE level = 'ERROR'
  AND year = '2024' AND month = '06' AND day = '15' AND hour = '13';
```
### Crawler Logs Service

A microservice that:

- Accepts log queries (e.g., by `crawler_id`, `job_id`, or log level).
- Uses Athena to fetch logs from S3.
- Returns structured JSON responses to the Platform.
### Platform Portal (UI)
- Filter logs by time range, pod name, job ID, log level.
- Search logs using keywords.
- Download logs in CSV/JSON format.
## Troubleshooting & Debugging

### Common Issues & Solutions
| Issue | Possible Cause | Resolution |
|---|---|---|
| Logs not appearing in S3 | FluentBit misconfiguration | Check FluentBit logs (`kubectl logs -l app=fluent-bit`) |
| Athena query timeout | Large dataset scanned | Prune the query with partition filters (`WHERE year = ...`) |
| "Access Denied" in S3 | Missing IAM permissions | Verify the Athena/S3 access policies |
### Debugging Steps

1. Check FluentBit logs:

    ```shell
    kubectl logs -l app=fluent-bit -n logging
    ```

2. Verify that log files are landing in S3 (Hive-style prefixes):

    ```shell
    aws s3 ls s3://grepsr-crawler-logs-prod/year=2024/month=06/day=15/
    ```

3. Test an Athena query:

    ```sql
    SELECT * FROM crawler_logs LIMIT 10;
    ```
## Conclusion
This system ensures reliable log collection, storage, and retrieval, enabling the delivery team to debug crawler issues efficiently. For further details, refer to:
- FluentBit Configuration
- Athena Query Guide
- Platform Logs UI Documentation
Last Updated: May 2025
Owner: Product Team @ Grepsr