A data pipeline is a series of automated steps that move data from one or more source systems to a destination, applying transformations along the way. The source might be a transactional database, an event stream, a third-party API, or flat files landing in object storage. The destination is typically a data warehouse, a data lake, a search engine, or another operational system. The pipeline handles extraction, validation, transformation, and loading -- and when built well, it does all of this reliably, repeatedly, and with minimal human intervention.
The term gets used loosely. Sometimes it refers to a simple cron job that copies rows between two databases. Other times it describes a distributed system processing millions of events per second across dozens of services. Both are data pipelines. The complexity varies, but the core idea is the same: data needs to get from where it's produced to where it's consumed, and something has to make that happen correctly.
How Data Pipelines Work
At a high level, every data pipeline follows the same pattern: extract data from sources, apply transformations, and load results into a target system. The details of each step depend on the volume, velocity, and variety of the data, as well as the latency requirements of the use case.
Ingestion is the entry point. Data arrives from databases (via change data capture (CDC) or periodic queries), message brokers (Kafka, Pulsar, RabbitMQ), APIs, log files, or object storage. Ingestion can be pull-based (the pipeline queries the source on a schedule) or push-based (the source emits events that the pipeline consumes).
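As a minimal sketch of pull-based ingestion, the extract below uses a watermark column to fetch only rows created since the last run. The table and column names are illustrative, and sqlite3 stands in for whatever source database the pipeline actually queries:

```python
import sqlite3

def extract_new_rows(conn: sqlite3.Connection, last_watermark: str) -> list[tuple]:
    """Incremental pull: fetch only rows created since the last successful run."""
    cursor = conn.execute(
        "SELECT id, amount, created_at FROM orders "
        "WHERE created_at > ? ORDER BY created_at",
        (last_watermark,),
    )
    return cursor.fetchall()

# Each scheduled run records the maximum created_at it saw and passes it
# back in as last_watermark on the next run.
```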
Transformation is where the raw data gets cleaned, enriched, aggregated, or reshaped. This might mean deduplicating records, joining data from multiple sources, converting timestamps to a common timezone, applying business logic, or computing aggregations. Transformations can happen before loading (ETL) or after (ELT), depending on where the compute power is cheapest and most practical.
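A small illustration of the kind of logic that lives in this stage -- deduplicating by key and normalizing timestamps to UTC. The field names are invented for the example:

```python
from datetime import datetime, timezone

def transform(records: list[dict]) -> list[dict]:
    """Deduplicate by id, keeping the most recent record, and normalize timestamps to UTC."""
    latest: dict[str, dict] = {}
    for record in records:
        # Parse the ISO-8601 timestamp and convert it to a common timezone (UTC).
        ts = datetime.fromisoformat(record["updated_at"]).astimezone(timezone.utc)
        record = {**record, "updated_at": ts}
        current = latest.get(record["id"])
        if current is None or record["updated_at"] > current["updated_at"]:
            latest[record["id"]] = record
    return list(latest.values())
```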
Loading writes the processed data into the target system -- a warehouse like Snowflake or BigQuery, a data lake on S3 with Apache Iceberg tables, an Elasticsearch cluster for search, or a downstream operational database.
Orchestration ties the steps together. An orchestrator like Apache Airflow or Dagster defines task dependencies as a directed acyclic graph (DAG), manages scheduling, handles retries on failure, and provides visibility into pipeline status.
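A minimal Airflow sketch of the extract, transform, and load dependency chain might look like the following. The callables are placeholders, and the exact DAG arguments vary by Airflow version (this assumes Airflow 2.x):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(): ...    # pull from the source system
def transform(): ...  # clean and reshape
def load(): ...       # write to the warehouse

with DAG(
    dag_id="orders_daily",           # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",      # one batch run per day
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Task dependencies define the DAG: extract runs first, load runs last.
    t_extract >> t_transform >> t_load
```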
Monitoring and observability run alongside everything else. You need to know when a pipeline fails, when data arrives late, when row counts deviate from expectations, and when schema changes in a source system break downstream consumers.
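A rough sketch of what a data-level check might look like, comparing a load's row count and freshness against simple expectations; the thresholds here are placeholders that would come from the pipeline's own baselines:

```python
from datetime import datetime, timedelta, timezone

def check_load(row_count: int, latest_event_time: datetime,
               expected_min_rows: int = 10_000,
               max_lag: timedelta = timedelta(hours=2)) -> list[str]:
    """Return a list of alert messages; an empty list means the load looks healthy."""
    alerts = []
    if row_count < expected_min_rows:
        alerts.append(f"Row count {row_count} below expected minimum {expected_min_rows}")
    lag = datetime.now(timezone.utc) - latest_event_time
    if lag > max_lag:
        alerts.append(f"Data is {lag} behind; freshness threshold is {max_lag}")
    return alerts
```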
Types of Data Pipelines
Batch Pipelines
Batch pipelines process data in discrete chunks on a schedule -- every hour, every night, every week. They're the oldest and most common pattern. A batch pipeline might extract the previous day's transactions from a production database, transform them into a star schema, and load them into a warehouse before the business day starts.
Batch is straightforward to reason about, easy to test, and works well when latency requirements are measured in hours rather than seconds. Apache Spark, dbt, and SQL-based warehouse transformations are the typical tools here. The tradeoff is that data is always stale by at least one processing interval.
Streaming Pipelines
Streaming pipelines process data continuously as it arrives, with latencies ranging from sub-second to a few minutes. The source is typically a message broker like Apache Kafka or Amazon Kinesis, and the processing engine is Apache Flink, Kafka Streams, or Spark Structured Streaming.
Streaming is essential for use cases like fraud detection (you can't wait an hour to flag a suspicious transaction), real-time personalization, operational monitoring, and IoT sensor processing. The tradeoff is significantly higher complexity -- you have to deal with out-of-order events, exactly-once semantics, state management, and backpressure.
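As a rough sketch of the shape of a streaming consumer -- here using the kafka-python client, with placeholder topic, broker, and field names -- each event is handled as it arrives rather than on a schedule:

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "payments",                               # placeholder topic
    bootstrap_servers=["localhost:9092"],
    group_id="fraud-checker",
    auto_offset_reset="earliest",
    enable_auto_commit=False,                 # commit only after successful processing
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # Stateful checks (velocity rules, model scoring, etc.) would go here.
    if event.get("amount", 0) > 10_000:
        print("flagging transaction", event.get("id"))
    consumer.commit()  # at-least-once: a crash before commit means reprocessing
```

Committing offsets only after processing gives at-least-once delivery, which is one reason downstream writes need to be idempotent.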
Hybrid (Lambda and Kappa) Architectures
Lambda architecture runs batch and streaming pipelines in parallel against the same data. The batch layer provides complete, accurate results with higher latency; the speed layer provides approximate, low-latency results. A serving layer merges the two views. This was the dominant pattern for years, but the operational cost of maintaining two separate codebases for the same logic is significant.
Kappa architecture simplifies this by treating everything as a stream. Historical reprocessing is handled by replaying the event log (typically Kafka with long retention) through the same streaming pipeline. Apache Flink's ability to handle both real-time and historical data through a single API has made Kappa increasingly practical. Many modern teams adopt a Kappa-heavy design with selective batch components where true batch processing is genuinely simpler.
Key Components and Tools
The data pipeline ecosystem is broad, but certain tools have become standard:
Apache Kafka is the de facto standard for event streaming and data integration. It serves as the central nervous system in many architectures -- a durable, distributed log that decouples producers from consumers. Kafka Connect provides pre-built connectors for databases, cloud services, and file systems.
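For illustration, producing events to a Kafka topic from Python looks roughly like this (kafka-python client; the topic name and broker address are placeholders):

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",  # wait for the in-sync replicas to acknowledge the write
)

producer.send("orders", {"id": "o-123", "amount": 42.50})
producer.flush()  # block until buffered messages are delivered
```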
Apache Flink is a distributed stream processing engine with strong guarantees around exactly-once processing, event-time semantics, and stateful computation. It handles both streaming and batch workloads through a unified API and is widely used for real-time ETL, analytics, and event-driven applications.
Apache Spark remains the workhorse for large-scale batch processing and is increasingly capable in streaming via Structured Streaming. Its DataFrame API and SQL support make it accessible, and its integration with nearly every storage and catalog system makes it a practical default for batch transformations.
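A compact PySpark sketch of a batch transformation -- the paths and column names are invented for the example:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily_orders_batch").getOrCreate()

orders = (
    spark.read.parquet("s3://example-bucket/raw/orders/")  # placeholder path
    .dropDuplicates(["order_id"])
    .withColumn("order_date", F.to_date("created_at"))
)

daily_revenue = orders.groupBy("order_date").agg(F.sum("amount").alias("revenue"))

daily_revenue.write.mode("overwrite").parquet("s3://example-bucket/marts/daily_revenue/")
```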
Apache Airflow is the most widely adopted workflow orchestrator for data pipelines. Pipelines are defined as Python code (DAGs), with built-in support for scheduling, retries, alerting, and a web UI for monitoring. Dagster and Prefect are newer alternatives that address some of Airflow's architectural limitations, particularly around testing and development workflows.
dbt (data build tool) has become the standard for transformation logic in ELT pipelines. It lets analysts and engineers write transformations as SQL SELECT statements, handles dependency resolution, and provides built-in testing and documentation. dbt assumes data is already loaded into the warehouse and focuses purely on the T in ELT.
Managed cloud services -- AWS Glue, Google Cloud Dataflow, Azure Data Factory, Amazon MWAA (managed Airflow), Confluent Cloud (managed Kafka) -- reduce operational burden but introduce vendor coupling.
ETL vs. ELT
The distinction matters more than it might seem. In traditional ETL (Extract, Transform, Load), data is transformed before it reaches the destination. This was necessary when warehouses were expensive and compute-limited -- you cleaned and aggregated data before loading to minimize storage and query costs.
ELT (Extract, Load, Transform) flips the order. Raw data is loaded into a modern cloud warehouse or data lake first, then transformed in place using the destination's own compute engine. This approach became viable as warehouses like Snowflake and BigQuery made storage cheap and compute elastic. ELT preserves the raw data, which means you can always reprocess it when business logic changes. dbt is the tool most associated with ELT workflows.
In practice, most pipelines are a mix. You might do light cleaning and validation during extraction (that's ETL), load into a lake, then run heavy transformations in the warehouse (that's ELT).
Common Use Cases
Analytics and business intelligence. The classic use case -- moving transactional data into a warehouse where analysts can query it, build dashboards, and generate reports. This is batch ETL/ELT at its most straightforward.
Real-time operational intelligence. Monitoring infrastructure health, tracking order status, detecting anomalies in financial transactions, or powering live dashboards. These require streaming pipelines with low latency.
Machine learning feature pipelines. ML models need features computed from raw data. Feature pipelines extract data from various sources, compute features (often involving complex joins and aggregations), and serve them to model training and inference systems. Both batch (for training) and streaming (for real-time inference) pipelines are common.
Data synchronization and CDC. Change data capture pipelines replicate changes from operational databases into analytical systems in near real-time, using tools like Debezium to read database transaction logs and stream changes through Kafka.
Search indexing. Keeping search engines like Elasticsearch or OpenSearch in sync with source data requires pipelines that detect changes and update search indices, handling document transformations and enrichment along the way.
Challenges
Schema evolution. Source systems change -- columns get added, types get modified, fields get renamed. A robust pipeline must detect schema drift and either adapt automatically or fail gracefully with clear error messages. Schema registries (like Confluent Schema Registry) and table formats with built-in schema evolution (like Apache Iceberg) help, but the problem never fully goes away.
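One lightweight defense is to compare incoming records against the columns the pipeline expects and fail loudly on drift. A minimal sketch, with an illustrative expected schema:

```python
EXPECTED_COLUMNS = {"id", "amount", "currency", "created_at"}  # illustrative contract

def check_schema(record: dict) -> None:
    """Raise with a clear message when a source record drifts from the expected schema."""
    actual = set(record.keys())
    missing = EXPECTED_COLUMNS - actual
    unexpected = actual - EXPECTED_COLUMNS
    if missing or unexpected:
        raise ValueError(
            f"Schema drift detected: missing columns {sorted(missing)}, "
            f"unexpected columns {sorted(unexpected)}"
        )
```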
Data quality. Garbage in, garbage out. Pipelines need validation at every stage -- null checks, type enforcement, range validation, referential integrity checks, anomaly detection on row counts and value distributions. Tools like Great Expectations and dbt tests help, but data quality is fundamentally a discipline, not just a tool.
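The individual checks are often simple; the discipline is in running them at every stage and acting on the results. A plain-Python sketch of record-level validation, with placeholder field names and ranges:

```python
def validate(record: dict) -> list[str]:
    """Return the data quality violations found in a single record."""
    errors = []
    if record.get("id") is None:
        errors.append("null id")
    amount = record.get("amount")
    if not isinstance(amount, (int, float)):
        errors.append("amount is not numeric")
    elif not 0 <= amount <= 1_000_000:
        errors.append(f"amount {amount} outside expected range")
    if record.get("currency") not in {"USD", "EUR", "GBP"}:
        errors.append(f"unexpected currency {record.get('currency')!r}")
    return errors
```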
Exactly-once processing. Ensuring every record is processed exactly once -- not dropped, not duplicated -- is genuinely hard in distributed systems. Kafka and Flink provide exactly-once semantics under specific configurations, but achieving end-to-end exactly-once across an entire pipeline requires careful design at every boundary.
Operational complexity. A production data pipeline is a distributed system with all the usual failure modes: network partitions, disk failures, OOM kills, upstream API rate limits, credential expiry. Monitoring, alerting, and automated recovery are not optional.
Cost management. Cloud data pipelines can become expensive quickly. Streaming infrastructure runs continuously, warehouse compute scales with query volume, and storage grows indefinitely unless managed. Right-sizing resources, implementing data retention policies, and choosing between real-time and batch based on actual requirements (not aspirational ones) are ongoing concerns.
Best Practices
Define pipelines as code. Pipeline definitions, transformation logic, and infrastructure configuration should all live in version control. This enables code review, rollback, reproducibility, and collaboration. The standard approach is Airflow DAGs as Python, dbt models as SQL, and infrastructure as Terraform.
Design for idempotency. Every pipeline step should produce the same result whether it runs once or ten times. This makes retries safe and simplifies recovery from failures. Idempotent writes (using MERGE or REPLACE operations) are essential.
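A sketch of an idempotent load using an upsert keyed on a stable identifier. sqlite3 (3.24+) stands in for the warehouse, where the equivalent would typically be a MERGE statement; the table and columns are illustrative and assume a unique constraint on order_id:

```python
import sqlite3

def load_idempotent(conn: sqlite3.Connection, rows: list[tuple]) -> None:
    """Upsert keyed on order_id: re-running the same load leaves the table unchanged."""
    conn.executemany(
        """
        INSERT INTO daily_orders (order_id, amount, order_date)
        VALUES (?, ?, ?)
        ON CONFLICT(order_id) DO UPDATE SET
            amount = excluded.amount,
            order_date = excluded.order_date
        """,
        rows,
    )
    conn.commit()
```

Because the write is keyed rather than append-only, a retry after a partial failure simply converges on the same final state.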
Monitor data, not just infrastructure. CPU and memory metrics tell you whether your pipeline is running. Data quality metrics tell you whether it's running correctly. Track row counts, null rates, value distributions, freshness, and schema consistency.
Start with batch, move to streaming when justified. Streaming adds significant complexity. If your business requirement is a dashboard that refreshes every morning, a batch pipeline running overnight is simpler, cheaper, and easier to maintain. Reserve streaming for use cases where latency genuinely matters.
Plan for failure. Build in dead letter queues for records that fail processing, circuit breakers for upstream dependencies, exponential backoff on retries, and clear alerting for when human intervention is needed. The question is not whether your pipeline will fail, but how quickly and cleanly it recovers.
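A minimal sketch of two of these patterns -- exponential backoff on retries and a dead-letter path for records that keep failing; the processing function and dead-letter sink are passed in as placeholders:

```python
import random
import time

def process_with_retry(record: dict, process, dead_letter, max_attempts: int = 5) -> None:
    """Retry with exponential backoff; hand the record to the dead letter sink after exhausting attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            process(record)
            return
        except Exception as exc:
            if attempt == max_attempts:
                dead_letter(record, reason=str(exc))  # park the record for human inspection
                return
            # Exponential backoff with jitter: roughly 1s, 2s, 4s, ... plus up to 1s of noise.
            time.sleep(2 ** (attempt - 1) + random.random())
```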
Decouple stages with durable messaging. Using Kafka or a similar system between pipeline stages means that a failure in one stage does not cascade to others. The upstream stage can continue producing while the downstream stage recovers and catches up from the log.
Organizations like Uber, Shopify, Netflix, Airbnb, and LinkedIn run data pipelines at massive scale -- processing trillions of events daily. But the same architectural principles apply whether you're handling a hundred records or a hundred billion: extract reliably, transform correctly, load efficiently, and monitor everything.