How to Choose a Database for Your Next Project

A practical, engineering-first framework for picking a database: start from access patterns and consistency needs, understand ACID vs BASE and CAP/PACELC, weigh operational cost, and know when to reach past Postgres - with a decision checklist.

The database is the one decision in your stack that resists being changed later. Frameworks get swapped, services get rewritten, the frontend gets a new coat of paint every few years. The database accretes data, schema, integrations, and operational scar tissue until migrating it becomes a multi-quarter project nobody wants to staff. So the choice deserves more than a reflex toward whatever you used last time.

Here is the short version, and the rest of this post defends it: start with PostgreSQL unless you have a specific, measurable reason not to. Then let your access patterns, consistency requirements, scale, and operational budget tell you when a specialized engine actually earns its place. Most projects never reach that point. The ones that do usually need one specialized store alongside Postgres, not a constellation of five.

Start with the workload, not the database

The wrong first question is "SQL or NoSQL?" The right first question is "what does this system actually do to data?" Read-heavy or write-heavy? Point lookups by key, or wide scans and aggregations? Strict transactional integrity, or fire-and-forget event ingestion? The workload determines the engine; the engine does not get to determine the workload.

The sharpest dividing line is OLTP versus OLAP. OLTP (online transaction processing) means high volumes of short reads and writes against current state - the canonical "fetch this user, update that order" pattern. Row-oriented storage with B-tree indexes serves these in single-digit milliseconds because the whole row lives together on disk. OLAP (online analytical processing) means wide scans and aggregations over historical data, where you touch billions of rows but only a handful of columns. Columnar storage with vectorized execution and data skipping wins here, returning aggregates in sub-second time that would cripple a row store. Run analytical queries against your OLTP database for long enough and you will feel exactly why these are different engines.

Access pattern detail matters down to how you write predicates. A sargable ("search-argument-able") predicate can use an index to avoid a full scan; a non-sargable one cannot. WHERE created_at >= '2026-01-01' is sargable and rides a B-tree index. Wrap the column in a function - WHERE date_part('year', created_at) = 2026 - and the planner has to evaluate every row, because it cannot reason about the function's output from the index order. Same logical question, very different cost. This is also why indexing is a tradeoff and not a free win: every index you add speeds reads and taxes every insert, update, and delete that now has to maintain it. Add the indexes your access patterns demand, and not one more. Our PostgreSQL indexes guide goes deep on choosing the right index per query.

Consistency and scale: where the real tradeoffs live

Two database systems can both call themselves "consistent" and mean completely different things. Pin down the vocabulary before it costs you a production incident.

ACID describes transactional guarantees: Atomicity (all-or-nothing), Consistency (every commit preserves declared constraints), Isolation (concurrent transactions do not interfere), Durability (committed data survives a crash). BASE is the NoSQL counterpoint - Basically Available, Soft state, Eventual consistency - trading strict guarantees for availability and write scalability. Worth flagging: the "C" in ACID (your schema's integrity rules) is a different thing from the "C" in CAP (all nodes seeing the same data). Conflating the two is one of the most common sources of confusion in database discussions.

Once data lives on more than one node, the CAP theorem applies: during a network partition you can keep Consistency or Availability, not both. Since partitions are inevitable in real networks, CAP collapses to a C-versus-A choice when things go wrong. The more useful framing is PACELC: if Partition, choose Availability or Consistency; Else, choose Latency or Consistency. PACELC captures what CAP omits - that even when the network is healthy, stronger consistency demands coordination that costs latency. Cassandra and Dynamo-style stores sit at PA/EL (availability and low latency); Spanner and HBase-style systems sit at PC/EC (consistency on both sides of the choice).

"Eventual consistency" is not a single setting either. It is a spectrum, and you should know which rung you are standing on:

Consistency model	Guarantee	Typical cost
Linearizability	Every read returns the latest write, as if operations happened instantaneously in real-time order	Highest latency; cross-node coordination
Sequential	One global order consistent with each client's program order, not tied to wall-clock time	High
Causal	Causally related operations seen in order everywhere; concurrent ops may differ	Moderate
Read-your-writes	A client always sees its own prior writes (a session guarantee)	Low
Eventual	Replicas converge once writes stop; temporary divergence allowed	Lowest

Scale is the other axis. Vertical scaling means a bigger machine: simple, and the path of least resistance until you hit a ceiling. Horizontal scaling means sharding data across nodes, which is how wide-column, document, and distributed SQL systems absorb write growth. Distributed SQL engines like CockroachDB and TiDB deliver near-linear throughput scaling while keeping ACID, but serializable isolation across nodes adds write-coordination overhead you pay on every transaction. There is no free horizontal scale; you buy it with latency, complexity, or both. A single well-tuned Postgres box handles far more load than most teams assume - reach for distribution when you have measured the wall, not before.

The database families and what each is actually for

Categories are a starting map, not a verdict. Most engines blur their boundaries now (Postgres does JSON, vectors, and time-series; many "multi-model" systems span several rows below). Match the family to your dominant workload, then validate the specific engine.

Family	Leading examples	Best at	Watch out for
Relational / SQL	PostgreSQL, MySQL, SQL Server, SQLite	ACID transactions, joins, point lookups and range scans on a known schema	Analytical scans; rigid schema migrations
Document	MongoDB, Couchbase, Postgres JSONB	Evolving schemas, nested/semi-structured data, high write throughput	Modeling discipline; cross-document consistency
Key-value	Redis, DynamoDB, etcd	Caching, sessions, single-digit-ms lookups at scale	Anything needing rich queries or joins
Wide-column	Cassandra, ScyllaDB, HBase	Massive write throughput, IoT/time-series ingestion, HA	Query flexibility; eventual consistency by default
Graph	Neo4j, Neptune, ArangoDB	Relationship traversal, fraud, recommendations, knowledge graphs	No standard query language; whole-graph scans
Time-series / OLAP	ClickHouse, TimescaleDB, InfluxDB	Aggregations over billions of timestamped rows	Transactions (ClickHouse has none); storage vs query tradeoffs
Vector	pgvector, Milvus, Pinecone, Qdrant, Weaviate	Similarity search over embeddings for RAG and semantic search	Recall tuning; scaling ceilings on simpler options
Search	Elasticsearch, OpenSearch, Solr	Full-text relevance, fuzzy matching, log analytics	Not a system of record; operational weight
Distributed SQL	Spanner, CockroachDB, YugabyteDB, TiDB	SQL + ACID + horizontal scale + geo-distribution	Coordination latency; operational complexity

A few honest notes on the popular paths. NoSQL document stores buy schema flexibility and write scale, but you pay it back in modeling discipline and weaker query/reporting capabilities - the schema does not disappear, it just moves from the database into your application code (schema-on-read versus schema-on-write, in Martin Kleppmann's framing). Vector databases are genuinely necessary at hundreds of millions of embeddings, but pgvector handles the 10-100M range that covers most real RAG systems without adding a second datastore. Graph databases are unbeatable for relationship traversal and a poor fit for warehouse-style scans. The pattern repeats: each specialized engine wins decisively in its lane and underperforms a general-purpose database everywhere else.

The costs nobody budgets for

The sticker price of a database is the smallest line item. The real costs show up in the org chart and the on-call rotation.

Every datastore you add needs its own backups, monitoring, upgrade path, capacity planning, and a human who understands its failure modes at 3 a.m. This is the trap in polyglot persistence - using a different database for each service or access pattern. It sounds principled, and at sufficient scale it is. Below it, the operational tax dominates. If your team is under roughly 8-10 engineers or lacks dedicated infrastructure expertise, the burden of running three specialized stores usually outweighs the marginal benefit each provides. There is also a correctness cost that is easy to miss: there is no distributed transaction spanning Postgres, Redis, and ClickHouse. Independent commit boundaries and clocks mean polyglot setups fragment your consistency guarantees in ways no application-level code fully papers over.

Licensing and hosting deserve real scrutiny up front, because both have teeth. Managed services (DynamoDB, Snowflake, Aurora, Pinecone) remove the operational burden at a higher recurring cost and some lock-in. Self-hosted open-source (Postgres, ClickHouse, Cassandra) is cheaper at scale but demands the expertise above. Factor the full lifecycle: local dev environment per engineer, ephemeral databases for CI and feature branches, load-test environments, staging, and production - sometimes at different license tiers. And keep an eye on relicensing risk; the last few years have seen several once-permissive databases move to source-available licenses, which can reprice your stack overnight.

Then there is the cost that compounds for a decade: market support. A boring, popular database has well-understood failure modes, deep tooling, abundant documentation, and a large hiring pool. A niche engine has none of those when you need them most. This is the core of Dan McKinley's Choose Boring Technology: "Let's say every company gets about three innovation tokens. You can spend these however you want, but the supply is fixed for a long while." Spend them on the product that differentiates you, not on replacing a database that already works. Boring does not mean bad - it means the capabilities and the failure modes are known quantities.

A decision checklist

Run a candidate through these questions before committing. The order is roughly the order in which a wrong answer hurts.

Regulatory constraints. Are there privacy, residency, or compliance rules (GDPR, data residency) that immediately rule options in or out? Decide this first; it is non-negotiable.
Access patterns. OLTP, OLAP, or both? Point lookups, range scans, full-text, similarity, or graph traversal? This selects the family.
Consistency needs. Where on the linearizable-to-eventual spectrum does the business actually live? Be honest - most systems tolerate read-your-writes and do not need linearizability everywhere.
Scale and growth. Transactions per second today, and the realistic 3-year trajectory. Does a single node carry you, or do you genuinely need horizontal write scaling?
Operational budget. Who runs this, and can they run it well? Managed or self-hosted? How many distinct stores can the team actually operate?
Cost over time. License, infrastructure, and salary across every environment, projected out years not weeks.
Ecosystem and longevity. Will you find tooling, documentation, and people who know this database a decade from now?
Interoperability. What else talks to this data, and who owns the schema?

If the answers do not clearly point somewhere specialized, the answer is Postgres. It is the most-used database among developers for the second year running (48.7% of all respondents, 51.9% of professionals), and its extension ecosystem lets one instance cover relational, JSON, vector, time-series, and geospatial workloads before you ever split it apart.

Key takeaways

Start from the workload. Access patterns (OLTP vs OLAP, sargable predicates, read/write mix) select the database. The database never gets to select the workload.
Default to PostgreSQL. It is the developer default for good reasons and covers most workloads - including vectors and JSON - without a second datastore.
Know your consistency vocabulary. ACID is not BASE, ACID's "C" is not CAP's "C," and "eventual consistency" spans linearizable down to convergent. PACELC beats CAP for reasoning about real systems.
Specialized engines earn their place, they are not granted it. Reach for a vector, graph, columnar, or distributed SQL store when you have measured a wall a single Postgres node cannot clear.
Budget the operational cost, not the sticker price. Every added datastore is backups, monitoring, upgrades, on-call, and a fragmented consistency story. Polyglot persistence is a scale tool, not a default.
Spend your innovation tokens on the product. Boring, well-supported databases have known failure modes and a deep hiring pool. That is a feature.