High Level Architecture

Introduction

The web scraping platform offered by our company consists of several components that work together to extract and process data. The core of the platform is a web application built with Next.js, which serves as the user interface for our clients. Alongside it, a Kubernetes cluster hosts a set of microservices, each responsible for a different task.

To collect data from the web, we deploy crawlers within the Kubernetes cluster. The crawlers rely on Kubernetes for scalability and reliability as they fetch web data and pass it to Fluentd, which we use as a messaging queue. From there, the data is transferred to AWS Kinesis, a fully managed streaming data platform provided by Amazon Web Services. The data stream in Kinesis is processed by Apache Flink applications. Apache Flink is an open-source stream processing framework designed for large-scale, high-throughput, fault-tolerant data processing. It supports both stream and batch processing, which makes it suitable for building robust and scalable data pipelines.

Once the data has been processed, it is stored in a MongoDB database and an S3 bucket for later retrieval. For communication between the microservices deployed in our Kubernetes cluster, we use gRPC, a high-performance, language-agnostic remote procedure call framework.

Our platform is built on an event-driven architecture. Various actions within the system emit events to AWS EventBridge, a fully managed event bus service. These events are then processed by Temporal workflows. Temporal is an open-source, distributed, and scalable workflow framework. The workflows act as the event processors: they execute the business logic and perform the necessary actions based on the received events.

Overall, our platform provides a resilient and scalable solution for web scraping and data processing. It combines Kubernetes, AWS Kinesis, Fluentd, Apache Flink, MongoDB, gRPC, Temporal, and an event-driven architecture to extract, transform, and store data for our clients.

Major Components

1. Web Platform

The Grepsr platform is built with the Next.js framework. Built on top of React, Next.js offers server-side rendering (SSR), which enables faster initial page loads and improved SEO performance. The platform serves both internal and external users. It offers a user-friendly interface for creating projects within an organization. A project can contain multiple reports, each of which is associated with a service code that identifies a customized crawler image tailored to a specific use case. Users can also manage project members and roles, schedule automated report runs, manually trigger report executions, and export datasets. Delivery options are configurable: users specify the data format and the medium through which they wish to receive the data. Internal users additionally use the platform to generate billing and manage configurations.
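The relationship between projects, reports, and service codes described above can be pictured with a small data model. This is only an illustrative sketch: the class names, field names, and default values are hypothetical and are not taken from the actual Grepsr schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Report:
    """A report inside a project (hypothetical fields)."""
    name: str
    service_code: str               # identifies the customized crawler image
    output_format: str = "csv"      # assumed default delivery format
    delivery_medium: str = "email"  # assumed default delivery medium


@dataclass
class Project:
    """A project groups reports and the members who may work on them."""
    name: str
    organization: str
    reports: List[Report] = field(default_factory=list)
    members: Dict[str, str] = field(default_factory=dict)  # member email -> role

    def add_report(self, report: Report) -> None:
        self.reports.append(report)

    def service_codes(self) -> List[str]:
        """Distinct service codes used by this project, in insertion order."""
        return list(dict.fromkeys(r.service_code for r in self.reports))
```

Several reports may share one service code, which is why `service_codes()` removes duplicates before returning.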

2. Auth0

The Grepsr platform integrates with Auth0, a third-party Identity and Access Management (IAM) solution, and relies on it for authentication and authorization. Users log in to the Grepsr platform through Auth0, so that only authorized individuals can access sensitive data and perform actions within the system. Grepsr also uses Auth0's user management capabilities, including user registration, password resets, and role-based access control. The integration allows Grepsr to provide a secure and user-friendly environment for organizations, safeguarding their data without building these features in-house.

3. Microservices

Grepsr adopts a microservices architecture, utilizing multiple microservices predominantly written in Python. This architectural approach allows for the creation of modular, independent services that work together to form the comprehensive functionality of the platform. Each microservice is dedicated to a specific domain, such as the account service, schedule service, stats service, project service, payment service, and more.

Microservices offer several key features that enhance scalability and flexibility. Firstly, they enable individual services to be developed, deployed, and scaled independently, promoting agility and faster iteration cycles. This decentralized structure allows teams to focus on specific services without impacting the entire system. Additionally, microservices facilitate fault isolation, as issues within one service can be contained and resolved without affecting the overall functionality.

Moreover, microservices promote maintainability by decoupling different components, making it easier to understand, update, and modify individual services without affecting others. This separation of concerns enhances scalability, allowing organizations to scale specific services based on demand, rather than scaling the entire application.

In summary, Grepsr leverages microservices to create a highly scalable and modular architecture. By utilizing individual services written in Python, the platform achieves flexibility, fault isolation, maintainability, and efficient scalability, enabling it to meet the diverse needs of its users while ensuring optimal performance.

4. AWS Kinesis

Kinesis plays a crucial role in the data streaming infrastructure of the Grepsr platform, handling high volumes of data on a daily basis. Its scalable architecture lets the platform accommodate increasing data volumes without compromising performance. Kinesis also provides durability and fault tolerance, so that data is reliably stored and protected against failures. To work around the write throughput limits of Kinesis shards, the platform uses Fluentd as a messaging queue. Fluentd acts as a buffer between the crawlers and the Kinesis shards, smoothing out bursts in the data flow.
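To illustrate the kind of buffering that sits in front of Kinesis, here is a minimal sketch that groups records into batches bounded by record count and total size. It is not Fluentd's implementation and makes no AWS calls. The function name is hypothetical, and the default limits (500 records and 5 MiB per batch) are assumptions modeled on the commonly documented `PutRecords` request limits; check the current AWS documentation before relying on them.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, bytes]  # (partition_key, payload)


def batch_records(
    records: Iterable[Record],
    max_records: int = 500,                   # assumed per-request record limit
    max_batch_bytes: int = 5 * 1024 * 1024,   # assumed per-request size limit
) -> Iterator[List[Record]]:
    """Yield batches that stay within the record-count and byte limits."""
    batch: List[Record] = []
    batch_bytes = 0
    for key, payload in records:
        # The partition key counts toward the size of a record.
        size = len(key.encode("utf-8")) + len(payload)
        if size > max_batch_bytes:
            raise ValueError(f"record with key {key!r} exceeds the batch size limit")
        # Flush the current batch if adding this record would break a limit.
        if batch and (len(batch) >= max_records or batch_bytes + size > max_batch_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append((key, payload))
        batch_bytes += size
    if batch:
        yield batch
```

Each yielded batch could then be handed to a single bulk write, which keeps the number of requests low when crawlers produce many small records.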

Kinesis serves two main purposes within the platform. First, it is the data pipeline that carries large volumes of streaming data in real time, which allows Grepsr to process and analyze that data as it arrives. Second, Kinesis holds events before they are handled by the event processors. The legacy system used SNS and SQS for this; the migration to Kinesis provides enhanced capabilities and scalability. As Grepsr transitions to Temporal for event processing, Kinesis continues to store events until that migration is complete.

5. Data Pipelines

Grepsr leverages Apache Flink as its data pipeline framework for processing real-time data efficiently. Apache Flink offers a powerful and scalable solution for stream processing, enabling Grepsr to handle large volumes of data in real-time with low latency. Flink's key feature is its ability to process streaming data in a fault-tolerant manner, ensuring data integrity and system reliability.

One notable feature of Apache Flink is its checkpointing mechanism, which plays a vital role in achieving fault tolerance. Checkpointing allows Flink to take periodic snapshots of the streaming state, ensuring that in the event of a failure or restart, the processing can resume from a consistent state. This feature guarantees data consistency and fault tolerance, reducing the risk of data loss and ensuring uninterrupted data processing.
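The idea behind checkpointing can be shown with a toy, single-process simulation. This is not Flink code and does not reflect how Flink coordinates distributed snapshots. It only demonstrates the principle that state and input position are saved together, so a restart resumes from a consistent point. The class and its parameters are hypothetical.

```python
import copy
from typing import Dict, List, Optional, Tuple

Event = Tuple[str, int]  # (key, value)


class CountingJob:
    """Toy stateful job: keeps a running sum per key plus its input offset."""

    def __init__(self, checkpoint_every: int = 3) -> None:
        self.state: Dict[str, int] = {}
        self.offset = 0  # index of the next event to read
        self.checkpoint_every = checkpoint_every
        self._checkpoint: Tuple[int, Dict[str, int]] = (0, {})

    def _snapshot(self) -> None:
        # Offset and state are captured together so they stay consistent.
        self._checkpoint = (self.offset, copy.deepcopy(self.state))

    def restore(self) -> None:
        """Roll back to the last checkpoint after a failure."""
        self.offset, saved = self._checkpoint
        self.state = copy.deepcopy(saved)

    def run(self, events: List[Event], fail_at: Optional[int] = None) -> None:
        while self.offset < len(events):
            if fail_at is not None and self.offset == fail_at:
                raise RuntimeError("simulated crash")
            key, value = events[self.offset]
            self.state[key] = self.state.get(key, 0) + value
            self.offset += 1
            if self.offset % self.checkpoint_every == 0:
                self._snapshot()
```

After a simulated crash, calling `restore()` and then `run()` again replays only the events read since the last snapshot, and the final sums match a run that never failed.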

Additionally, Flink provides advanced capabilities for event time processing, windowing, and stream transformations, allowing Grepsr to perform complex data processing operations on real-time data streams. Flink's support for various data sources and sinks enables seamless integration with other components of the Grepsr ecosystem, facilitating data ingestion, processing, and delivery.
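As a rough illustration of event-time windowing, the sketch below assigns each event to a fixed-size (tumbling) window based on the timestamp carried by the event rather than its arrival order. It is plain Python, not the Flink API, and it ignores watermarks and late-data handling. The function name and event shape are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def tumbling_window_counts(
    events: Iterable[Tuple[int, str]],  # (event_time_ms, key)
    window_ms: int,
) -> Dict[Tuple[int, str], int]:
    """Count events per (window_start_ms, key) using event time."""
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    for event_time_ms, key in events:
        # Align the timestamp down to the start of its window.
        window_start = event_time_ms - (event_time_ms % window_ms)
        counts[(window_start, key)] += 1
    return dict(counts)
```

Because the window is derived from the event's own timestamp, events that arrive out of order still land in the correct window.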

By utilizing Apache Flink as its data pipeline framework, Grepsr benefits from its robust stream processing capabilities, fault tolerance through checkpointing, and versatility in handling real-time data. This integration ensures that Grepsr can efficiently process and analyze streaming data while maintaining data consistency and reliability even in the face of failures.

6. Event Processors

Grepsr is currently undergoing a migration to Temporal.io as its chosen event processing solution. Previously, Grepsr relied on Apache Flink and Node.js applications for event processing. However, due to a lack of visibility into the event-driven architecture and the need for improved consistency, the decision was made to migrate to Temporal.io.

Temporal.io offers comprehensive capabilities for event-driven architecture, aligned with the core concept of eventual consistency. By adopting Temporal.io, Grepsr gains access to essential features for managing events and ensuring consistency, all while providing enhanced visibility through its user interface.

As part of the migration process, events generated by various services are now sent to AWS EventBridge. From there, a Lambda function is triggered, which executes the Temporal.io workflow. This integration allows for seamless event processing and workflow execution within the Temporal.io ecosystem.
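The EventBridge-to-Lambda-to-Temporal flow can be sketched as a Lambda-style handler that reads the standard EventBridge envelope fields (`id`, `detail-type`, `detail`) and starts a workflow. Everything specific here is an assumption made for illustration: the detail types, the workflow names, and the `start_workflow` callable, which stands in for whatever Temporal client call the real function makes.

```python
from typing import Any, Callable, Dict

# Hypothetical mapping from EventBridge "detail-type" to a workflow name.
WORKFLOWS_BY_DETAIL_TYPE: Dict[str, str] = {
    "report.run.requested": "RunReportWorkflow",
    "report.run.completed": "DeliverDatasetWorkflow",
}

StartWorkflow = Callable[[str, str, Dict[str, Any]], None]


def make_handler(start_workflow: StartWorkflow):
    """Build a Lambda-style handler around an injected workflow starter."""

    def handler(event: Dict[str, Any], context: Any = None) -> Dict[str, str]:
        detail_type = event["detail-type"]
        workflow = WORKFLOWS_BY_DETAIL_TYPE.get(detail_type)
        if workflow is None:
            # Unknown events are skipped rather than failing the invocation.
            return {"status": "ignored", "detail_type": detail_type}
        # Deriving the workflow ID from the event ID gives every delivery of
        # the same event the same ID, which a starter can use to deduplicate.
        workflow_id = f"{workflow}-{event['id']}"
        start_workflow(workflow, workflow_id, event.get("detail", {}))
        return {"status": "started", "workflow_id": workflow_id}

    return handler
```

Injecting `start_workflow` keeps the routing logic testable without a running Temporal cluster. In a deployment it would wrap the actual client call.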

The adoption of Temporal.io as the event processing solution allows Grepsr to ensure reliable and consistent event-driven processes, while also providing valuable visibility through the user interface. This migration signifies Grepsr's commitment to enhancing its event processing capabilities and aligning with a scalable and efficient architecture that supports its evolving needs.

7. Databases

The Grepsr platform relies on a trio of databases to fulfill its various data storage and management needs. Firstly, MySQL (RDS) serves as the primary database, responsible for storing application data critical to the platform's functionality. MySQL offers a reliable and scalable solution for managing and querying structured data efficiently.

In addition, Grepsr utilizes MongoDB to store the crawled data. MongoDB's flexible and schema-less nature makes it an ideal choice for storing unstructured or semi-structured data, enabling efficient storage and retrieval of the vast amount of data collected through crawling activities.

For storing events and logs, Grepsr leverages DynamoDB, a highly scalable and fully managed NoSQL database service provided by AWS. DynamoDB's seamless scalability and low latency ensure optimal performance when handling event and log data, supporting Grepsr's high-volume data processing requirements.

Furthermore, the platform utilizes an S3 bucket to store historical data in Avro format, enabling future data extraction and analysis. S3's object storage capability provides a reliable and cost-effective solution for long-term data archival, ensuring the data remains accessible for future needs.

With MySQL, MongoDB, DynamoDB, and S3, the Grepsr platform matches each type of data to a suitable store: application data in MySQL, crawled data in MongoDB, events and logs in DynamoDB, and historical data in S3. This combination of storage services gives Grepsr the flexibility and scalability to handle diverse data requirements while keeping the data accessible.

8. Kubernetes

The Grepsr platform relies on Kubernetes as its container orchestration platform to streamline the deployment and management of its microservices, web applications, data pipelines, and crawlers. Kubernetes serves as a powerful and efficient tool for automating containerized application deployments, scaling resources, and ensuring high availability. With Kubernetes, Grepsr can leverage the benefits of containerization, allowing for easy scalability and portability across different environments.
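As a hedged illustration of how a crawler could be described to Kubernetes, the sketch below builds a `batch/v1` Job manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts. Running crawlers as Jobs, the registry URL, the label names, the image naming scheme, and the `REPORT_ID` environment variable are all assumptions for this example, not details of the actual deployment.

```python
import json
from typing import Any, Dict


def crawler_job_manifest(
    report_id: str,
    service_code: str,
    registry: str = "registry.example.com/crawlers",  # hypothetical registry
) -> Dict[str, Any]:
    """Return a Kubernetes Job manifest for one crawler run (illustrative)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": f"crawl-{report_id}",
            "labels": {"app": "crawler", "service-code": service_code},
        },
        "spec": {
            "backoffLimit": 2,  # retry a failed pod at most twice
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "crawler",
                            # Assumes one image per service code.
                            "image": f"{registry}/{service_code}:latest",
                            "env": [{"name": "REPORT_ID", "value": report_id}],
                        }
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(crawler_job_manifest("1042", "news-articles"), indent=2))
```

Generating manifests from code like this is one way to tie a report's service code to the container image that runs it. Templating tools such as Helm are a common alternative.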

In the past, Grepsr deployed its containerized workloads using Singularity, which is built on the open-source Apache Mesos platform. To improve its container orchestration capabilities, Grepsr decided to migrate to AWS EKS (Elastic Kubernetes Service). The move to a managed Kubernetes service brings simplified cluster management, scalability, and closer integration with other AWS services. The migration is going well, with approximately 85% of deployments already transitioned. It reflects Grepsr's commitment to industry-standard, cloud-native technologies for the performance, scalability, and maintainability of its containerized workloads.

By using EKS as its managed Kubernetes service, Grepsr benefits from the infrastructure provided by Amazon Web Services. EKS simplifies the management of Kubernetes clusters by automating tasks such as patching, scaling, and node provisioning. This lets Grepsr focus on its core functionality rather than on the underlying infrastructure.

Kubernetes provides Grepsr with a reliable and scalable platform for managing its complex ecosystem of microservices, web applications, data pipelines, and crawlers. It ensures efficient resource utilization, fault tolerance, and ease of deployment, making it an ideal choice for running and orchestrating containerized workloads in a cloud-native environment.