Steps of Big Data Pipeline

Andre Yai
2 min read · Dec 28, 2023


With the increase in computational and storage power, companies have been collecting more data than ever. This has created a need for new tasks and job opportunities. To extract value from that data, companies rely on data pipelines. These pipelines consist of stages such as collecting, storing, processing, and analyzing data.

Collection

This step is responsible for ingesting data from different sources for later analysis. The data comes mainly from real-time and batch sources.

In real-time platforms, we have those who produce data (Producers) and those who consume data (Consumers). A typical example is how Netflix and Spotify stream data to millions of users. Streaming services include Kafka, AWS Kinesis, GCP Pub/Sub, and Azure Event Hubs.
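The producer/consumer pattern behind these streaming platforms can be sketched in plain Python. This is an illustration only, with an in-memory queue standing in for a broker like Kafka; the event names are made up:

```python
import queue
import threading

# In-memory "broker" standing in for a streaming platform's topic.
broker = queue.Queue()

def producer(events):
    """Publish each event to the topic, then a sentinel to signal the end."""
    for event in events:
        broker.put(event)
    broker.put(None)

def consumer(results):
    """Pull events off the topic until the sentinel arrives."""
    while True:
        event = broker.get()
        if event is None:
            break
        results.append(f"processed:{event}")

results = []
t = threading.Thread(target=consumer, args=(results,))
t.start()
producer(["play", "pause", "skip"])   # hypothetical playback events
t.join()
print(results)  # ['processed:play', 'processed:pause', 'processed:skip']
```

Real brokers add durability, partitioning, and many consumers per topic, but the publish/subscribe flow is the same.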

The batch collection step may involve migrating data from an existing database, for example ingesting data from a transactional database like PostgreSQL, MySQL, or Oracle into a data lake or a data warehouse like AWS Redshift. On AWS, you can use the AWS Database Migration Service (DMS) for that.
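A batch ingestion job of this kind can be sketched as extract-and-land. Here SQLite stands in for the transactional database and a local directory stands in for the data lake; table and column names are invented for the example:

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# Extract: read rows from a transactional database (SQLite as a stand-in).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.99), (2, 24.5)])

# Load: land the extracted rows as JSON in a directory acting as the lake.
lake = Path(tempfile.mkdtemp())
rows = db.execute("SELECT id, amount FROM orders").fetchall()
out = lake / "orders.json"
out.write_text(json.dumps([{"id": i, "amount": a} for i, a in rows]))

print(json.loads(out.read_text()))
```

A managed service like AWS DMS does the same extract-and-land at scale, with change-data-capture for ongoing replication.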

Storage

Once collected, the data needs a place to be stored. By knowing how frequently each dataset is accessed, we can manage its lifecycle: keeping frequently accessed data readily available, and archiving or deleting data that is rarely used.

Some services that help with that would be AWS S3, GCP BigQuery, and Azure Blob Storage.
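The lifecycle idea above can be sketched as a simple tiering rule. The thresholds (30 and 365 days) are illustrative, not prescriptive; real object stores let you configure such rules as lifecycle policies:

```python
from datetime import date, timedelta

def lifecycle_tier(last_access: date, today: date) -> str:
    """Pick a storage action based on how long ago an object was accessed."""
    age = (today - last_access).days
    if age <= 30:
        return "hot"        # keep in frequent-access storage
    if age <= 365:
        return "archive"    # move to cheaper, cold storage
    return "delete"         # past retention: remove

today = date(2023, 12, 28)
print(lifecycle_tier(today - timedelta(days=5), today))    # hot
print(lifecycle_tier(today - timedelta(days=90), today))   # archive
print(lifecycle_tier(today - timedelta(days=400), today))  # delete
```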

Process

This step deals with ETL, which involves cleaning, enriching, and transforming raw data into a more refined, curated layer.

Some services that help with that would be AWS Glue, AWS EMR, AWS Lambda, GCP Dataflow, GCP Cloud Functions, Azure Data Factory, and Azure Functions.
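The transform part of an ETL job can be sketched like this. The records, field names, and the "tier" enrichment are invented for the example:

```python
raw = [
    {"user": " alice ", "amount": "19.90"},
    {"user": "bob", "amount": None},        # incomplete record: dropped
    {"user": "carol", "amount": "5.00"},
]

def transform(records):
    """Clean raw records (filter, normalize types) and enrich them."""
    cleaned = []
    for r in records:
        if r["amount"] is None:             # data-quality filter
            continue
        amount = float(r["amount"])         # normalize type
        cleaned.append({
            "user": r["user"].strip(),      # normalize formatting
            "amount": amount,
            "tier": "high" if amount > 10 else "low",  # enrichment
        })
    return cleaned

print(transform(raw))
```

Services like Glue or Dataflow run logic of this shape, but distributed across many workers and datasets.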

Governance

Data governance consists of data management, data quality, and data stewardship. It helps manage data access policies, data discovery, and data accuracy, validation, and completeness.

Some services that help with them are AWS Glue Data Catalog, AWS Lake Formation, GCP Data Catalog, and GCP Data Loss Prevention.
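The data-quality side of governance can be sketched as automated checks. The rules here, completeness (no missing required fields) and validity (values within an expected range), are illustrative:

```python
def check_quality(records, required, valid_range):
    """Return a list of (row_index, field, issue) data-quality findings."""
    issues = []
    lo, hi = valid_range
    for i, r in enumerate(records):
        for field in required:              # completeness check
            if r.get(field) is None:
                issues.append((i, field, "missing"))
        amount = r.get("amount")            # validity check
        if amount is not None and not (lo <= amount <= hi):
            issues.append((i, "amount", "out_of_range"))
    return issues

records = [
    {"id": 1, "amount": 10.0},
    {"id": None, "amount": 99999.0},
]
print(check_quality(records, required=["id", "amount"], valid_range=(0, 1000)))
# [(1, 'id', 'missing'), (1, 'amount', 'out_of_range')]
```

A governance layer runs rules like these continuously and surfaces the findings to data stewards.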

Analyze

This part is responsible for extracting value from data through data analysis, machine learning, and data visualization: showing how the data is organized, grouping it, and making predictions from it.

Some services that help with that would be AWS SageMaker, AWS QuickSight, GCP BigQuery, GCP Vertex AI, GCP Looker, GCP Data Studio, Azure ML, Azure Databricks, and Azure Data Studio.
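Grouping and predicting can be sketched together on toy data. The events and the simple least-squares trend line are illustrative only; a real pipeline would aggregate in a warehouse like BigQuery and train models on a platform like SageMaker or Vertex AI:

```python
from collections import defaultdict

events = [("alice", 1), ("bob", 1), ("alice", 2), ("alice", 3)]  # (user, day)

by_user = defaultdict(int)                   # grouping / aggregation
for user, _day in events:
    by_user[user] += 1

per_day = defaultdict(int)                   # total events per day
for _user, day in events:
    per_day[day] += 1

# Fit a least-squares trend line through (day, count) and forecast day 4.
xs = sorted(per_day)                         # [1, 2, 3]
ys = [per_day[d] for d in xs]                # [2, 1, 1]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx
prediction = slope * 4 + intercept           # forecast for day 4

print(dict(by_user), round(prediction, 2))
```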


Andre Yai

Follow me on this journey of learning more about cloud, machine learning systems, and big data. https://br.linkedin.com/in/andre-yai