Airbyte is an open-source data integration platform for consolidating data from disparate sources into a single destination. It handles the EL in ELT -- extracting data from APIs, databases, SaaS applications, and files, then loading it into a warehouse, data lake, or analytical database for transformation and querying. The platform ships with a catalog of pre-built connectors and a framework for building custom ones, so you can centralize data without writing bespoke ingestion scripts for every source.
Founded in 2020 by Michel Tricot and John Lafleur, Airbyte has grown rapidly within the modern data stack ecosystem. The project has significant venture backing and an active open-source community contributing connectors and improvements. Its connector-first approach tackles one of data engineering's most persistent pain points: getting data from where it lives to where it needs to be analyzed.
How Airbyte Works
Airbyte's architecture is built around source and destination connectors. A source connector extracts data from an origin system -- a database, an API, a SaaS tool like Salesforce or Stripe, or a file in cloud storage. A destination connector writes that data into the target -- a warehouse like Snowflake or BigQuery, an analytical database like ClickHouse, or a search engine like Elasticsearch or OpenSearch.
A source-destination pair is configured as a connection; each run of that connection is a sync. When a sync runs, Airbyte extracts data, optionally normalizes it, and loads it into the destination. Syncs operate in several modes:
- Full refresh extracts the complete dataset on every run. Simple but expensive for large tables.
- Incremental sync extracts only new or modified records since the last sync, using a cursor column (timestamp or auto-incrementing ID). This dramatically reduces transfer volume and source system load.
- Change data capture (CDC) reads the database transaction log (WAL for PostgreSQL, binlog for MySQL) to capture inserts, updates, and deletes in near real-time. CDC provides the most accurate and efficient replication for supported databases.
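The incremental mode above can be sketched in a few lines of Python. This is an illustration, not Airbyte's internals: the `updated_at` cursor column and the shape of the state dict are assumptions for the example.

```python
# Illustrative incremental sync: only rows whose cursor column exceeds
# the state saved by the previous run are extracted.
def incremental_sync(rows, state):
    """rows: dicts with an 'updated_at' cursor column (ISO timestamps,
    which compare correctly as strings). state: cursor from last run."""
    cursor = state.get("updated_at")
    new_rows = [r for r in rows if cursor is None or r["updated_at"] > cursor]
    if new_rows:
        # Persist the max cursor so the next run starts where this one ended.
        state["updated_at"] = max(r["updated_at"] for r in new_rows)
    return new_rows, state

source = [
    {"id": 1, "updated_at": "2024-01-01T00:00:00Z"},
    {"id": 2, "updated_at": "2024-01-02T00:00:00Z"},
]
batch, state = incremental_sync(source, {})     # first run: everything
source.append({"id": 3, "updated_at": "2024-01-03T00:00:00Z"})
delta, state = incremental_sync(source, state)  # second run: only id 3
```

A full refresh, by contrast, would return every row on every run and carry no state between runs.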
Each connector runs as an isolated Docker container, so connectors can be written in any language and are sandboxed from each other. The platform handles orchestration, scheduling, error handling, retries, and state management for incremental syncs.
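Language independence works because connectors talk to the platform through a simple protocol: newline-delimited JSON messages on stdout. The sketch below is simplified (real Airbyte Protocol messages carry additional fields such as `emitted_at`), but it shows the RECORD/STATE flow that makes language-agnostic containers possible.

```python
import json

# Simplified Airbyte-Protocol-style messages: a connector emits RECORD
# messages with data and STATE messages checkpointing its progress.
def emit_record(stream, data):
    return json.dumps({"type": "RECORD",
                       "record": {"stream": stream, "data": data}})

def emit_state(state):
    return json.dumps({"type": "STATE", "state": {"data": state}})

# What a source connector's stdout might look like for one small batch.
output = "\n".join([
    emit_record("users", {"id": 1, "name": "Ada"}),
    emit_state({"users": {"cursor": 1}}),
])

# The platform reads each line, routes RECORDs to the destination
# connector, and persists the STATE for the next incremental run.
for line in output.splitlines():
    msg = json.loads(line)
```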
Airbyte Open Source vs Airbyte Cloud
Airbyte comes in two forms.
Airbyte Open Source (Self-Managed) is the community edition you deploy on your own infrastructure. It runs on Docker Compose for small setups or Kubernetes for production. You get the full connector catalog, the web UI, and complete control over data and infrastructure. The tradeoff: you handle upgrades, scaling, monitoring, and high availability yourself.
Airbyte Cloud is the fully managed SaaS offering. No infrastructure to manage -- Airbyte handles upgrades, connector updates, scaling, and reliability. Cloud adds role-based access control, SSO, and dedicated support. Pricing is usage-based, calculated by data volume.
Both editions share the same connector framework and catalog. Teams needing data residency controls or wanting to avoid sending sensitive data through a third party typically go self-managed. Teams optimizing for speed and minimal ops overhead lean toward Cloud.
Key Features
350+ pre-built connectors. The catalog covers databases (PostgreSQL, MySQL, MongoDB, SQL Server), cloud warehouses (Snowflake, BigQuery, Redshift), SaaS applications (Salesforce, HubSpot, Stripe, Shopify, Google Analytics, Facebook Ads), file formats, and APIs. The catalog grows through both the core team and community contributions.
Connector Development Kit (CDK). When no pre-built connector exists, the CDK lets you build custom connectors in Python or Java. It handles pagination, rate limiting, authentication, and schema discovery, so you focus on extraction logic specific to your source.
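To give a flavor of what the CDK abstracts away, here is a hand-rolled pagination loop of the kind you would otherwise write for every API source. The `fetch_page` function and its token scheme are hypothetical stand-ins for an HTTP API; with the CDK you declare this logic instead of writing the loop yourself.

```python
# A fake paginated API: each page returns some records and a token
# for the next page (None when there are no more pages).
PAGES = {
    None: {"records": [1, 2], "next": "p2"},
    "p2": {"records": [3], "next": None},
}

def fetch_page(token):
    # Stand-in for an HTTP GET against a paginated endpoint.
    return PAGES[token]

def read_all():
    """Follow next-page tokens until the API reports no more pages --
    exactly the loop the CDK runs for you around your extraction logic."""
    records, token = [], None
    while True:
        page = fetch_page(token)
        records.extend(page["records"])
        token = page["next"]
        if token is None:
            return records
```

Rate limiting, retries, and authentication headers would bolt onto the same loop, which is why factoring them into a shared framework pays off quickly.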
Schema normalization. Airbyte can automatically normalize raw JSON into typed, tabular structures in the destination. Nested API responses get flattened into relational tables that are immediately queryable, no separate transformation step needed.
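The core of normalization is flattening nested structures into column/value pairs. A minimal sketch, with an invented nested record for illustration:

```python
# Flatten a nested API response into flat columns suitable for a
# relational table, joining nested keys with underscores.
def flatten(record, prefix=""):
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, f"{name}_"))
        else:
            flat[name] = value
    return flat

raw = {"id": 7, "customer": {"name": "Ada", "address": {"city": "London"}}}
row = flatten(raw)
# row: {'id': 7, 'customer_name': 'Ada', 'customer_address_city': 'London'}
```

Real normalization also handles arrays (typically by splitting them into child tables keyed back to the parent), but the flattening principle is the same.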
Change data capture. For supported databases, CDC-based replication captures every insert, update, and delete from the transaction log. Near real-time replication with minimal source database impact.
Typing and deduplication. The destination layer handles type casting and deduplication during loading, so final tables contain clean, correctly typed, deduplicated data.
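A rough sketch of what typing and deduplication mean in practice: cast raw values to their declared types, then keep only the newest version of each primary key. The schema, column names, and cursor here are illustrative assumptions, not Airbyte's actual implementation.

```python
# Declared column types; anything not listed stays a string.
SCHEMA = {"id": int, "amount": float}

def type_and_dedupe(raw_rows, pk="id", cursor="updated_at"):
    # 1. Cast each value to its declared type.
    typed = [{k: SCHEMA.get(k, str)(v) for k, v in row.items()}
             for row in raw_rows]
    # 2. Keep only the latest version of each primary key,
    #    judged by the cursor column.
    latest = {}
    for row in typed:
        if row[pk] not in latest or row[cursor] > latest[row[pk]][cursor]:
            latest[row[pk]] = row
    return list(latest.values())

rows = [
    {"id": "1", "amount": "9.99", "updated_at": "2024-01-01"},
    {"id": "1", "amount": "12.50", "updated_at": "2024-01-02"},  # newer version
]
final = type_and_dedupe(rows)  # one row survives, with the newer amount
```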
Airbyte with ClickHouse, OpenSearch, and Elasticsearch
Airbyte provides native destination connectors for ClickHouse, OpenSearch, and Elasticsearch. You can sync data from any of its 350+ sources directly into these systems:
- The ClickHouse destination loads data into ClickHouse tables with append and overwrite modes -- useful for building analytical datasets that aggregate data from multiple operational systems.
- The Elasticsearch destination indexes data for full-text search and analytics over data originally in databases, SaaS tools, or other sources.
- The OpenSearch destination works similarly, loading data for search, log analytics, and observability use cases.
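To make the search-engine destinations concrete: both Elasticsearch and OpenSearch accept documents through a `_bulk` endpoint that takes newline-delimited action/document pairs. The index name and documents below are illustrative; this sketches the payload format a destination connector ultimately produces.

```python
import json

def to_bulk_body(index, docs, id_field="id"):
    """Build an NDJSON _bulk request body: for each document, one
    action line (index name + document id) followed by the document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index,
                                           "_id": doc[id_field]}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # bulk bodies must end with a newline

body = to_bulk_body("products", [{"id": 1, "name": "widget"}])
# POST this body to /_bulk with Content-Type: application/x-ndjson
```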
Getting data in is only half the challenge. The real complexity lies on the destination side: schema design, index mappings, shard configuration, query optimization, and performance tuning. Poorly configured mappings in Elasticsearch or suboptimal table engines in ClickHouse lead to slow queries, excessive storage, and operational headaches that no ingestion tool can fix on its own.
Common Use Cases
- Centralizing SaaS data for analytics. Pull data from dozens of SaaS tools -- CRM, marketing automation, billing, support -- into a central warehouse or analytical database for joining, aggregating, and dashboarding.
- Database replication. CDC capabilities enable near real-time replication of operational databases into analytical systems, keeping reporting data fresh without impacting production.
- Populating search engines. Sync product catalogs, knowledge bases, or customer data into Elasticsearch or OpenSearch, keeping search indices current as source data changes.
- Data lake ingestion. Load data into S3, GCS, or other object storage in Parquet or JSON, feeding data lake architectures built on Apache Iceberg or Delta Lake.
- Migrating between systems. When switching databases, warehouses, or search engines, Airbyte provides a connector-based approach to moving data without throwaway migration scripts.