What is dbt (Data Build Tool)?

dbt (data build tool) is an open-source command-line framework for managing data transformations inside a data warehouse. Analytics engineers and data teams write modular SQL SELECT statements; dbt compiles them, resolves dependencies, and executes them against the warehouse in the correct order. It handles the T in ELT -- it doesn't extract or load data, but assumes your data has already landed in a warehouse (Snowflake, BigQuery, Redshift, Databricks, PostgreSQL, and others) and focuses entirely on transforming it into clean, tested, documented models.

Created in 2016 by Tristan Handy, Drew Banin, and Connor McArthur at Fishtown Analytics (now dbt Labs), dbt started as a tool to add transformation capabilities on top of Stitch. It has since become the industry standard for analytics engineering, used by over 80,000 teams including JetBlue, HubSpot, NASDAQ, and Anheuser-Busch. dbt Labs has surpassed $100 million in annual recurring revenue and 5,000 customers, with 85% year-over-year growth among its Fortune 500 customers.


How dbt Works

At its core, dbt does one thing: it takes SQL files from your project, compiles them, and runs the resulting SQL against your warehouse. Every transformation is a model -- a single SQL SELECT statement in a .sql file. You never write CREATE TABLE or INSERT; dbt generates that DDL and DML based on a materialization strategy: view, table, incremental, or ephemeral (inlined as a CTE).
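A sketch of what a model file looks like (model, source, and column names are illustrative; the source() call assumes a matching sources entry is defined in YAML):

```sql
-- models/staging/stg_orders.sql
-- dbt wraps this SELECT in the DDL for the configured materialization.
{{ config(materialized='view') }}

select
    id as order_id,
    customer_id,
    order_date,
    status
from {{ source('shop', 'raw_orders') }}
```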

Models reference each other via the ref() function. Writing SELECT * FROM {{ ref('stg_orders') }} tells dbt to resolve that reference, build a directed acyclic graph (DAG) of all dependencies, and execute them in topological order. The DAG determines execution order, enables partial builds (only a model and its downstream dependents), and powers lineage visualization.
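A minimal downstream model showing ref() in practice (assuming a stg_orders model exists; names are illustrative):

```sql
-- models/marts/fct_daily_orders.sql
-- ref() resolves to the built relation AND registers the DAG edge
-- stg_orders -> fct_daily_orders, so dbt always builds stg_orders first.
select
    order_date,
    count(*) as order_count
from {{ ref('stg_orders') }}
group by order_date
```

Partial builds use graph selectors on the CLI: `dbt run --select stg_orders+` builds stg_orders and everything downstream of it.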

dbt uses Jinja templating. SQL files are actually Jinja templates compiled before execution, enabling macros, control flow, environment variables, and dynamic SQL generation. Analysts write SELECT statements; the framework handles everything else.
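A common use of Jinja is generating repetitive SQL from a loop. A sketch (payment methods and model names are illustrative):

```sql
-- models/marts/payments_pivoted.sql
-- The for-loop expands into one aggregated column per payment method
-- at compile time; the warehouse only ever sees plain SQL.
{% set payment_methods = ['credit_card', 'bank_transfer', 'gift_card'] %}

select
    order_id,
    {% for method in payment_methods %}
    sum(case when payment_method = '{{ method }}' then amount else 0 end)
        as {{ method }}_amount{% if not loop.last %},{% endif %}
    {% endfor %}
from {{ ref('stg_payments') }}
group by order_id
```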

Materializations determine how models are physically built. Views are cheap and always fresh but slow for heavy queries. Tables are fast to query but require full rebuilds. Incremental models are the workhorse for large datasets -- dbt appends or merges only new or changed rows using a configured strategy (append, merge, delete+insert, or microbatch), dramatically cutting build times and warehouse costs for billion-row tables.
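An incremental model sketch using the merge strategy (source and column names are illustrative):

```sql
-- models/marts/fct_events.sql
{{ config(
    materialized='incremental',
    unique_key='event_id',
    incremental_strategy='merge'
) }}

select event_id, user_id, event_type, occurred_at
from {{ source('tracking', 'raw_events') }}

{% if is_incremental() %}
  -- Only on incremental runs: scan rows newer than the existing table.
  -- {{ this }} resolves to the already-built target relation.
  where occurred_at > (select max(occurred_at) from {{ this }})
{% endif %}
```

On the first run (or with `dbt run --full-refresh`) the filter is skipped and the table is built from scratch; afterwards only new rows are merged in.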

dbt Core vs dbt Cloud

dbt comes in two forms sharing the same transformation engine.

dbt Core is the open-source Python CLI. Install via pip, configure profiles.yml with warehouse credentials, run dbt run, dbt test, dbt build from the terminal. Free, runs anywhere Python does -- laptops, Docker containers, Airflow or Dagster tasks, CI pipelines. You manage scheduling, orchestration, and environments yourself.
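A hedged sketch of a Snowflake profiles.yml for dbt Core (account, names, and credentials are placeholders):

```yaml
# ~/.dbt/profiles.yml -- connection details live outside the repo
my_project:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: my_account            # placeholder
      user: analyst                  # placeholder
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      database: analytics
      warehouse: transforming
      schema: dbt_dev
      threads: 4
```

The profile name must match the `profile:` key in dbt_project.yml; secrets are typically injected via env_var() rather than committed.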

dbt Cloud is the commercial SaaS from dbt Labs. It wraps Core with a web IDE, built-in scheduling, CI/CD integration, hosted documentation, the Semantic Layer, and governance features. The Cloud IDE handles Git operations, provides autocomplete and live previews, and lowers the barrier for analysts uncomfortable with terminals. Enterprise plans add SSO, audit logging, SOC 2 compliance, and RBAC.

The free Developer plan supports a single user. The Team plan runs ~$100/developer/month with the Semantic Layer and higher job limits. Enterprise pricing is custom, unlocking dbt Mesh and advanced governance.

Key Features

Testing. A built-in testing framework. Generic tests -- unique, not_null, accepted_values, relationships -- are declared in YAML and run with dbt test. Singular tests are custom SQL queries returning failing rows. Source freshness tests catch stale upstream data before it propagates. The dbt ecosystem (dbt-utils, dbt-expectations) adds dozens more reusable test types.
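Generic tests and documentation live together in a schema YAML file. A sketch (model and column names are illustrative; recent dbt versions also accept `data_tests:` as the key):

```yaml
# models/staging/schema.yml
version: 2

models:
  - name: stg_orders
    description: "One row per order, cleaned from the raw source."
    columns:
      - name: order_id
        description: "Primary key."
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']
      - name: customer_id
        tests:
          - relationships:
              to: ref('stg_customers')
              field: customer_id
```

`dbt test` compiles each declaration into a query that returns failing rows; zero rows means the test passes.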

Documentation. Every model, column, source, and test can be documented in YAML alongside the code. dbt docs generate compiles a static site with search and a visual DAG explorer. Because docs live in version control next to the SQL, they stay in sync -- a real improvement over wikis and spreadsheets that inevitably drift.

Packages. dbt Hub hosts community packages declared in packages.yml and installed with dbt deps. Popular ones include dbt-utils, dbt-expectations, and source-specific packages like dbt-snowflake-utils. Versioned and composable, following the same dependency resolution as models.
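A packages.yml sketch (version ranges are illustrative; pin whatever your project actually needs):

```yaml
# packages.yml -- resolved and installed by `dbt deps`
packages:
  - package: dbt-labs/dbt_utils
    version: [">=1.1.0", "<2.0.0"]
  - package: calogica/dbt_expectations
    version: [">=0.10.0", "<0.11.0"]
```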

Snapshots. A built-in implementation of Type-2 slowly changing dimensions (SCD2). Snapshots track changes to mutable source data over time by recording when each row version was valid, using timestamp or check strategies -- a full audit trail without CDC infrastructure.
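A snapshot sketch using the timestamp strategy (source and column names are illustrative):

```sql
-- snapshots/customers_snapshot.sql
{% snapshot customers_snapshot %}

{{ config(
    target_schema='snapshots',
    unique_key='customer_id',
    strategy='timestamp',
    updated_at='updated_at'
) }}

select * from {{ source('shop', 'raw_customers') }}

{% endsnapshot %}
```

Each `dbt snapshot` run compares current source rows to the stored history and closes out changed versions, maintaining dbt_valid_from / dbt_valid_to columns on every row.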

Seeds. Small reference datasets -- country codes, category lookups, status enums -- stored as CSVs and loaded with dbt seed. Version-controlled alongside transformations.

The Semantic Layer

The dbt Semantic Layer, powered by MetricFlow, lets you define business metrics as code in YAML. You declare semantic models (mapped onto existing dbt models), then define entities, dimensions, measures, and metrics on top of them; MetricFlow dynamically generates the SQL to compute those metrics at query time, including any required joins.

The practical value: metric consistency. Instead of every BI tool reimplementing "monthly recurring revenue" with slightly different logic, the metric is defined once and served through the Semantic Layer API. Integrations with Tableau, Hex, Mode, Google Sheets, and other tools query it directly, ensuring everyone works from the same definitions. Available in dbt Cloud Team and Enterprise plans.
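A hedged sketch of a semantic model and metric definition (model, measure, and metric names are illustrative -- a real MRR metric would encode your actual revenue logic):

```yaml
# models/marts/orders_semantic.yml
semantic_models:
  - name: orders
    model: ref('fct_orders')
    defaults:
      agg_time_dimension: order_date
    entities:
      - name: order_id
        type: primary
    dimensions:
      - name: order_date
        type: time
        type_params:
          time_granularity: day
    measures:
      - name: revenue
        agg: sum
        expr: amount

metrics:
  - name: monthly_recurring_revenue
    label: "Monthly Recurring Revenue"
    type: simple
    type_params:
      measure: revenue
```

Downstream tools query the metric by name; MetricFlow generates the SQL, so the aggregation logic never has to be re-implemented per BI tool.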

dbt Mesh

dbt Mesh addresses scaling dbt across large organizations with multiple data teams. Instead of one monolithic project with hundreds of models, Mesh splits work into domain-aligned projects -- each with its own repo, CI pipeline, and ownership.

Teams publish stable interfaces using model contracts (which enforce expected columns and types) and model versions (which enable graceful deprecation). Other projects reference published models via cross-project ref(), and dbt maintains lineage across the entire mesh -- federated data ownership with full cross-organization traceability.
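A contract sketch for a published model (model, column, and project names are illustrative):

```yaml
# models/marts/fct_orders.yml
models:
  - name: fct_orders
    access: public            # visible to other projects in the mesh
    config:
      contract:
        enforced: true        # build fails if the SELECT drifts from this shape
    columns:
      - name: order_id
        data_type: integer
      - name: revenue
        data_type: numeric
```

A downstream project then references it with the two-argument form, e.g. `{{ ref('finance_project', 'fct_orders') }}` (project name illustrative), and dbt stitches the lineage across both projects.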

Ecosystem and Integrations

dbt connects to platforms through adapters -- Python plugins translating operations into platform-specific SQL. Official adapters: Snowflake, BigQuery, Redshift, Databricks, PostgreSQL, Apache Spark, Microsoft Fabric. Community adapters extend to ClickHouse, DuckDB, Trino, Amazon Athena, SQL Server, and dozens more.

For orchestration, dbt Core integrates with Airflow (via Cosmos), Dagster (native support), Prefect, and Mage. dbt Cloud has its own scheduler, triggering jobs on Git push, API call, or cron.

For data quality and observability, test results and metadata feed into Elementary, Monte Carlo, and Datafold. The artifacts (manifest.json, run_results.json, catalog.json) contain rich project metadata used by third-party tools for lineage, impact analysis, and monitoring.

The Fusion Engine and AI Features

In 2025, dbt Labs introduced the Fusion engine -- a Rust rewrite replacing the Python execution engine. Fusion delivers near-instant SQL parsing, native SQL dialect understanding across platforms, live error detection, and faster builds. Currently in beta, starting with Snowflake.

Fusion also powers the dbt VS Code extension (compatible with Cursor and Windsurf), bringing IntelliSense, real-time compilation, and column-level lineage into the editor. Alongside Fusion, dbt Labs launched a Model Context Protocol (MCP) server that lets AI coding assistants query project metadata, column lineage, and documentation -- enabling AI-assisted analytics engineering where an LLM understands your project structure and generates contextually accurate transformations.

Who Uses dbt and When

dbt fits any organization loading data into a cloud warehouse that needs to transform it into analytics-ready models. The typical path starts with a data analyst or analytics engineer wanting version control, testing, and modularity for transformations that previously lived as ad-hoc scripts or embedded BI logic.

Startups often begin with dbt Core in Airflow or GitHub Actions. Mid-size companies adopt dbt Cloud for scheduling, the IDE, and governance. Enterprises use dbt Mesh to coordinate domain teams. The common thread: dbt works best when transformations are SQL executed inside the warehouse -- the ELT pattern.

JetBlue, HubSpot, NASDAQ, Dunelm, Anheuser-Busch, ThermoFisher Scientific, and thousands of other organizations run dbt in production, transforming data that powers dashboards, ML features, operational analytics, and regulatory reporting.
