A data pipeline is a workflow consisting of one or more tasks that ingest, move, and transform raw data from one or more sources to a destination. Usually, the data at the destination is then used for analysis, machine learning, or other business functions. You can generally separate data pipelines into two categories: batch processing (the most common) and real-time processing.
"Data Pipeline Patterns" by Informatica.com
A data pipeline's architecture is made up of four main parts: a data source, business logic, a destination, and a scheduler (for batch pipelines). It can be as simple as a script running on your computer, automated with cron or Windows Task Scheduler. A more complex example might stream event data, process it, and then use it to power dashboards or train machine learning models.
The architecture you choose can vary widely, and there is no one-size-fits-all design. However, once you learn the basics of building data pipelines, you'll be able to better understand the tradeoffs between architecture decisions.
Common data sources are application databases, APIs, or files from an SFTP server. If you are looking for data sources to practice with or analyze, many governments publish their datasets publicly and you can find them by searching "open data portal."
Business logic is a general term for the transformations that need to be applied to the data inside the data pipeline. It usually involves cleaning, filtering, and applying rules that are specific to the business.
Typically, the target where you send your data is another database. Common data targets are databases or data storage areas built for analytics, such as a data warehouse or data lake.
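To make the source, business logic, and destination parts concrete, here is a minimal sketch using only the Python standard library. The CSV content, the `orders` table, and the function names are all hypothetical, and an in-memory SQLite database stands in for a real analytics destination.

```python
import csv
import io
import sqlite3

# Hypothetical raw export standing in for a data source
# (for example, a file pulled from an SFTP server).
RAW_CSV = """order_id,customer,amount
1,alice,19.99
2,bob,
3,alice,5.00
"""


def extract(raw_text):
    """Source: read CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Business logic: drop rows with no amount and convert types."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # filter out incomplete records
        cleaned.append(
            (int(row["order_id"]), row["customer"], float(row["amount"]))
        )
    return cleaned


def load(rows, conn):
    """Destination: write the cleaned rows to a table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_pipeline(conn):
    """Run the three stages in order."""
    load(transform(extract(RAW_CSV)), conn)


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    run_pipeline(connection)
    print(connection.execute("SELECT * FROM orders").fetchall())
```

Keeping each stage in its own function makes it easier to swap the source or destination later without touching the business logic.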
For simple pipelines, cron is one of the most commonly used tools for scheduling a pipeline to run. You typically set it up on the server where you want your script to run and then use cron syntax to tell it when to run.
For more advanced data pipelines where there are multiple steps that depend on each other, an advanced scheduler called a workflow orchestrator is more appropriate. Popular workflow orchestration tools include Airflow, Prefect, and Dagster.
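As a rough illustration of cron syntax, the sketch below splits a standard five-field cron expression into its named parts. The `describe_cron` helper and the script path in the sample crontab entry are made up for this example; they are not part of cron itself.

```python
# The five time fields of a standard cron expression, in order.
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expression):
    """Map each field of a five-field cron expression to its name."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


# "0 2 * * *" means minute 0 of hour 2, every day: a daily run at 02:00.
schedule = describe_cron("0 2 * * *")

# Hypothetical crontab entry that would run a pipeline script on that schedule.
crontab_line = "0 2 * * * /usr/bin/python3 /opt/pipelines/daily_load.py"

if __name__ == "__main__":
    for field, value in schedule.items():
        print(f"{field}: {value}")
```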
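The core idea behind an orchestrator, running steps in an order that respects their dependencies, can be sketched with the standard library's `graphlib` module (Python 3.9+). This is not the API of Airflow, Prefect, or Dagster, and the task names are hypothetical; real orchestrators add retries, scheduling, and monitoring on top of this.

```python
from graphlib import TopologicalSorter

run_log = []  # records the order in which tasks actually ran


def make_task(name):
    """Build a placeholder task that only records that it ran."""
    def task():
        run_log.append(name)
    return task


TASK_NAMES = (
    "extract_app_data",
    "extract_marketing_data",
    "join_sources",
    "publish_dashboard",
)
TASKS = {name: make_task(name) for name in TASK_NAMES}

# Each task maps to the set of tasks that must finish before it starts.
DEPENDENCIES = {
    "extract_app_data": set(),
    "extract_marketing_data": set(),
    "join_sources": {"extract_app_data", "extract_marketing_data"},
    "publish_dashboard": {"join_sources"},
}


def run_all():
    """Run every task after all of its upstream tasks have run."""
    order = list(TopologicalSorter(DEPENDENCIES).static_order())
    for name in order:
        TASKS[name]()
    return order


if __name__ == "__main__":
    print(run_all())
```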
If a business sells software as a service, a data engineer might create a data pipeline that runs daily, takes some of the data generated by the software application, combines it with data from the marketing department, and sends the result to a dashboard. The dashboard could then be used to better understand how customers are using the software.
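A toy version of that combining step might look like the sketch below. The event records, the marketing lookup, and the field names are invented for illustration; a real pipeline would read them from the application database and the marketing team's system.

```python
from collections import defaultdict

# Hypothetical usage events generated by the software application.
usage_events = [
    {"customer_id": 1, "feature": "reports"},
    {"customer_id": 1, "feature": "exports"},
    {"customer_id": 2, "feature": "reports"},
]

# Hypothetical marketing data keyed by customer ID.
marketing_contacts = {
    1: {"campaign": "spring_webinar"},
    2: {"campaign": "paid_search"},
}


def build_dashboard_rows(events, contacts):
    """Count events per customer and attach the marketing campaign."""
    counts = defaultdict(int)
    for event in events:
        counts[event["customer_id"]] += 1

    rows = []
    for customer_id in sorted(counts):
        # Customers missing from the marketing data get "unknown".
        campaign = contacts.get(customer_id, {}).get("campaign", "unknown")
        rows.append(
            {
                "customer_id": customer_id,
                "campaign": campaign,
                "event_count": counts[customer_id],
            }
        )
    return rows


if __name__ == "__main__":
    for row in build_dashboard_rows(usage_events, marketing_contacts):
        print(row)
```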
A change data capture (CDC) pipeline is a very commonly created type of data pipeline. You can use a CDC pipeline to replicate changes (inserted, updated, and deleted data) from a source database (typically a database backing an application) into a destination for analytics, such as a data warehouse or data lake.
%%{init: { "flowchart": { "useMaxWidth": true } } }%%
graph LR
subgraph Source
direction TB
A[(Application Database)]
A--->|Save incoming changes to log first|AB
AB[Database Transaction Log]
end
subgraph Server
B[CDC Process/Software]
end
Source --->|Read transaction log|B
subgraph Target
B --->|Replicate changes|C
C[(Data Warehouse)]
end
class C internal-link;
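The "replicate changes" step in the diagram can be sketched as replaying an ordered list of change events against a copy of the table. The event format (`op`, `key`, `row`) is an assumption made for this example; real CDC tools each define their own log format, and a dictionary stands in for the data warehouse table.

```python
def apply_changes(target, change_log):
    """Replay inserts, updates, and deletes onto the target, in log order."""
    for change in change_log:
        op, key = change["op"], change["key"]
        if op in ("insert", "update"):
            # Both operations leave the latest version of the row in place.
            target[key] = change["row"]
        elif op == "delete":
            target.pop(key, None)  # ignore deletes for rows never seen
        else:
            raise ValueError(f"unknown operation: {op}")
    return target


# Hypothetical changes read from a source database's transaction log.
change_log = [
    {"op": "insert", "key": 1, "row": {"name": "alice", "plan": "free"}},
    {"op": "insert", "key": 2, "row": {"name": "bob", "plan": "free"}},
    {"op": "update", "key": 1, "row": {"name": "alice", "plan": "pro"}},
    {"op": "delete", "key": 2},
]

if __name__ == "__main__":
    warehouse_table = apply_changes({}, change_log)
    print(warehouse_table)
```

Because changes are applied in the order they were logged, the target ends up matching the source's current state without re-copying the whole table.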