MySQL vs. PostgreSQL
Executive Summary
The selection of a Relational Database Management System (RDBMS) constitutes one of the most critical architectural decisions in modern software engineering. It dictates not only the immediate mechanisms of data persistence but also the long-term trajectory of application scalability, operational complexity, and developer velocity. For over two decades, the open-source database landscape has been defined by a binary choice between two dominant paradigms: MySQL, the pragmatic, speed-oriented engine of the web, and PostgreSQL, the rigorous, academically rooted object-relational system.
This report provides an exhaustive, expert-level analysis of these two technologies as of late 2025 (covering MySQL 8.4/9.0 and PostgreSQL 17). It transcends superficial feature comparisons to dissect the kernel-level architectural divergences—specifically the process-versus-thread connection models, heap-versus-clustered storage organizations, and distinct implementations of Multi-Version Concurrency Control (MVCC)—that drive their respective performance profiles. Furthermore, it integrates historical context, such as the seminal "Uber Migration" case study, with contemporary developments in vector search for Artificial Intelligence (AI) workloads, to provide a holistic strategic framework for engineering leaders.
The analysis reveals that while the functional gap between the two systems has narrowed, their optimal use cases remain distinct. MySQL continues to excel in read-intensive, massive-concurrency environments where its lightweight threading model and clustered index architecture provide superior throughput. PostgreSQL, conversely, has solidified its position as the "universal database," leveraging its extensible architecture to handle complex data modeling, geospatial analytics, and vector operations that effectively render specialized NoSQL systems redundant for many applications.
1. Historical Origins and Governance Models
To understand the technical divergence of MySQL and PostgreSQL, one must first examine their disparate origins and the governance structures that continue to influence their roadmaps.
1.1 PostgreSQL: The Academic Standard
PostgreSQL traces its lineage directly to the POSTGRES project at the University of California, Berkeley, initiated in 1986 under the guidance of Michael Stonebraker. The project was explicitly designed to address the limitations of existing relational systems (specifically Ingres) by introducing object-oriented concepts—such as inheritance and complex data types—into the relational model. This gave rise to the term "Object-Relational Database Management System" (ORDBMS).
The project was released as open source in 1996 as PostgreSQL. Crucially, it is not owned by a single corporation. Instead, it is governed by the PostgreSQL Global Development Group (PGDG), a decentralized collective of independent contributors and companies (including EDB, Microsoft, AWS, and Red Hat). This structure ensures that the roadmap is driven by community consensus and technical correctness rather than the commercial interests of a single vendor. The software is released under the PostgreSQL License, a permissive free software license similar to MIT or BSD, which allows for proprietary modification and redistribution without the obligation to release source code. This freedom has fostered a rich ecosystem of commercial derivatives, such as Greenplum (data warehousing) and TimescaleDB (time-series), which modify the core engine for specialized tasks.
1.2 MySQL: The Engine of the Web
MySQL was created in 1995 by Michael "Monty" Widenius and David Axmark in Sweden. Its design philosophy was radically different: pragmatism and speed over strict theoretical compliance. In its early years, MySQL lacked features considered standard in RDBMS theory, such as transactions, subqueries, and foreign keys, in favor of raw read speed. This made it the ideal companion for the burgeoning World Wide Web, cementing its place as the "M" in the ubiquitous LAMP stack (Linux, Apache, MySQL, PHP).
Ownership of MySQL has passed through corporate hands: from MySQL AB to Sun Microsystems in 2008, and subsequently to Oracle Corporation in 2010. While MySQL is open source under the GNU General Public License (GPLv2), its development is centralized within Oracle. The GPLv2 is a "copyleft" license, which mandates that any derivative work distributed to customers must also be open source. This has significant legal implications for Independent Software Vendors (ISVs) who wish to embed the database in proprietary appliances, often necessitating the purchase of a commercial license from Oracle.
The Oracle acquisition also sparked the creation of MariaDB, a fork led by Widenius to ensure a free version of the database remained available independent of Oracle’s control. While this report focuses on MySQL, it is important to note that the divergence between MySQL and MariaDB has grown, particularly in areas like JSON handling and query optimization.
2. Core Architectural Divergence
The most profound differences between MySQL and PostgreSQL are found in the "engine room"—the mechanisms by which they handle connections and manage memory. These low-level design choices impose hard constraints on scalability and resource utilization.
2.1 Connection Handling: Process vs. Thread
The mechanism by which the database server handles concurrent client connections is the single most significant architectural differentiator, influencing everything from memory consumption to connection pooling strategies.
PostgreSQL: The Process-Per-Connection Model
PostgreSQL employs a process-based architecture derived from traditional Unix design patterns. The central process, known as the postmaster, listens for incoming connections. Upon receiving a connection request, the postmaster invokes the fork() system call to spawn a new, dedicated backend process to handle that client session.
Isolation and Stability: This model prioritizes stability. In a process-based system, memory is protected by the operating system kernel. If a backend process encounters a fatal error—such as a segmentation fault caused by a buggy third-party extension or a malformed query—the kernel terminates only that specific process. The shared memory segment remains intact, and other active connections continue processing without interruption. This architectural resilience is a primary reason for PostgreSQL's reputation for "bulletproof" reliability in enterprise environments.
Resource Overhead: The trade-off is resource intensity. Forking a process, while optimized in modern kernels via Copy-on-Write (CoW), is significantly heavier than spawning a thread. Each PostgreSQL backend process requires its own private memory structures (Program Global Area or PGA equivalent) for catalog caches, sort buffers, and query plans. Even an idle connection consumes a non-trivial amount of RAM (often measured in megabytes).
Scalability Constraints: Consequently, PostgreSQL cannot natively scale to tens of thousands of concurrent connections on a single server without exhaustion of OS process limits or memory. In high-concurrency environments, it is practically mandatory to deploy an external connection pooler, such as PgBouncer or Pgpool-II. These tools maintain a small pool of persistent connections to the database while managing thousands of lightweight client connections, multiplexing queries onto the available backends.
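The per-connection processes are directly visible from SQL. The following sketch (assuming a role permitted to read pg_stat_activity) shows how an operator might compare the current backend count against the configured ceiling:

-- Each row corresponds to one dedicated backend process forked by the postmaster
SELECT pid, usename, state, backend_start
FROM pg_stat_activity
WHERE backend_type = 'client backend';

-- Hard ceiling on concurrent backends; new connections beyond it are rejected
SHOW max_connections;

A pooler such as PgBouncer sits in front of this limit, holding client connections itself and handing queries to a much smaller set of real backends.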
MySQL: The Thread-Per-Connection Model
MySQL follows a multithreaded architecture. The primary server daemon, mysqld, is a single operating system process. When a client connects, mysqld spawns a new thread within its own address space to handle the session (or allocates one from a thread cache).
Efficiency: Threads are lightweight. They share the same memory space (heap) and file descriptors. Context switching between threads is computationally cheaper than switching between processes, resulting in lower CPU overhead per connection.
Memory Footprint: Because threads share global memory structures (such as the InnoDB Buffer Pool and code caches), the memory overhead per connection is minimal—often just the stack space and session variables. This allows MySQL to handle significantly higher numbers of concurrent connections (often 5,000 to 10,000+) on standard hardware without the immediate need for external multiplexing, although thread contention can eventually become a bottleneck.
Risk Profile: The shared memory model introduces a risk: a severe bug (e.g., a buffer overrun) in any single thread has the potential to corrupt the global heap, causing the entire database service to crash. While rare in the core server, this risk is elevated when using third-party plugins.
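For comparison with the process model above, the thread-per-connection behavior is observable through standard status counters (a minimal sketch; the variables shown are stock MySQL settings, with values depending on the deployment):

-- One OS thread (new or taken from the cache) per client session inside the single mysqld process
SHOW STATUS LIKE 'Threads_%';            -- Threads_connected, Threads_running, Threads_cached
SHOW VARIABLES LIKE 'max_connections';   -- commonly set in the thousands
SHOW VARIABLES LIKE 'thread_cache_size'; -- reuse threads rather than creating new ones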
2.2 Memory Management Architecture
The distinct connection models necessitate different approaches to memory management, which in turn affect how database administrators (DBAs) tune the systems.
PostgreSQL: Shared Buffers vs. OS Cache
PostgreSQL relies heavily on the operating system's file system cache. The internal Shared Buffers (configured via shared_buffers) typically consume only 25% to 40% of available RAM. This area stores the "hot" pages of data and indexes.
Double Buffering: When PostgreSQL writes data, it writes to the Shared Buffers. Eventually, this is flushed to the OS cache, and then to disk. This can lead to "double buffering," where the same data page resides in both PostgreSQL's buffer and the OS page cache, effectively wasting RAM. However, this design simplifies the engine, allowing it to leverage the sophisticated paging algorithms of the Linux kernel.
Work Memory: Memory for sorting (work_mem) and maintenance (maintenance_work_mem) is allocated per operation. A complex query with multiple sorts and hash joins can consume significant private memory, risking Out-Of-Memory (OOM) kills if many such queries run concurrently.
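A hedged configuration sketch for the parameters discussed above (the sizes are hypothetical, chosen for a server with roughly 32 GB of RAM; shared_buffers changes require a restart):

ALTER SYSTEM SET shared_buffers = '8GB';        -- roughly 25% of RAM; the rest is left to the OS page cache
ALTER SYSTEM SET work_mem = '64MB';             -- per sort/hash operation, per backend, so it multiplies
ALTER SYSTEM SET maintenance_work_mem = '1GB';  -- used by VACUUM, CREATE INDEX, and similar maintenance
SELECT pg_reload_conf();                        -- applies reloadable parameters; shared_buffers still needs a restart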
MySQL (InnoDB): The Buffer Pool
The InnoDB storage engine bypasses the OS cache for data management whenever possible. DBAs typically configure the InnoDB Buffer Pool (innodb_buffer_pool_size) to consume 70% to 80% of available RAM.
Direct Management: InnoDB manages its own caching of data pages, index pages, and insert buffers. It uses O_DIRECT to write to disk, bypassing the OS file system cache to avoid double buffering.
Optimization: This allows MySQL to implement custom eviction algorithms (LRU variants) specifically tuned for database access patterns, rather than relying on the generic file system heuristics of the OS. This centralized memory management is one factor contributing to MySQL's efficiency in read-heavy workloads.
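An equivalent sketch for InnoDB (again with hypothetical sizes; the buffer pool has been resizable online since MySQL 5.7):

SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SET GLOBAL innodb_buffer_pool_size = 24 * 1024 * 1024 * 1024;  -- 24 GiB, roughly 75% of a 32 GB server
SHOW VARIABLES LIKE 'innodb_flush_method';                     -- O_DIRECT bypasses the OS page cache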
3. Storage Engine Mechanics: The Data on Disk
The physical organization of data on disk (the storage engine) is arguably the most consequential differentiator between the two systems, dictating write amplification, index performance, and maintenance requirements.
3.1 Pluggable vs. Monolithic Architecture
MySQL employs a Pluggable Storage Engine architecture. The server layer handles SQL parsing and optimization, while the storage engine handles data retrieval and transactional storage.
InnoDB: The default and most widely used engine. It provides full ACID compliance, row-level locking, and crash recovery.
MyISAM: The legacy engine, fast but lacking transactions and relying on table-level locking. It is largely obsolete for modern applications but still exists.
Memory: Stores data in RAM for ephemeral needs.
CSV/Archive: Specialized engines for specific formats and archival storage.
This modularity allows users to swap engines for individual tables, although in practice the overwhelming majority of modern deployments use InnoDB exclusively due to its transactional safety.
PostgreSQL uses a unified, monolithic storage abstraction. While it supports Table Access Methods (introduced in version 12) allowing for pluggable storage (like zheap or OrioleDB), the standard Heap storage is used almost universally. The extensibility of Postgres is focused on data types and indexing methods rather than the raw storage format itself.
3.2 Heap vs. Clustered Index: The "Uber" Divide
The specific way InnoDB (MySQL) and Heap (PostgreSQL) organize data was the central technical reason behind Uber's high-profile migration from PostgreSQL to MySQL in 2016. This distinction is critical for architects to understand.
MySQL (InnoDB): Clustered Index Organization
In InnoDB, the table is the Primary Key. The data is organized physically as a B-Tree, where the Primary Key values act as the nodes, and the leaf nodes contain the actual row data.
Primary Key Lookups: Retrieving a row by its Primary Key (SELECT * FROM users WHERE id=1) is exceptionally fast (O(log N)) because the traversal of the index leads directly to the data payload. There is no second "hop."
Secondary Indexes: A secondary index (e.g., on email) stores the indexed value and the Primary Key of the row. To find a row via email, the engine searches the secondary index to find the PK, then searches the clustered index to find the row. This is known as a "double lookup."
Write Characteristics: If a row is updated but the Primary Key remains unchanged, secondary indexes do not need to be modified. This minimizes maintenance overhead on secondary indexes during updates.
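The following hypothetical schema makes the two lookup paths concrete (table and column names are illustrative):

CREATE TABLE users (
  id              BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  email           VARCHAR(255)    NOT NULL,
  last_login_time DATETIME,
  PRIMARY KEY (id),          -- the clustered index: row data lives in its leaf pages
  KEY idx_email (email)      -- secondary index: stores (email, id)
) ENGINE = InnoDB;

-- Primary key lookup: one B-Tree descent lands directly on the row data
SELECT * FROM users WHERE id = 1;

-- Secondary lookup: idx_email yields the PK, then the clustered index is searched (the "double lookup")
SELECT * FROM users WHERE email = 'alice@example.com';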
PostgreSQL: Heap Storage
PostgreSQL stores table data in a "Heap"—an unordered collection of 8KB pages. The Primary Key is simply a B-Tree index, structurally identical to any secondary index.
Direct Pointers: All indexes (Primary and Secondary) store the indexed value and a pointer to the physical location of the row in the heap, known as the Tuple Identifier (TID) or ctid (comprising Block Number and Offset).
Secondary Index Speed: Lookups via secondary indexes are efficient because they point directly to the heap data; there is no intermediate primary key lookup.
The Update Problem (Write Amplification): In PostgreSQL's MVCC implementation, an UPDATE operation does not modify the row in place. Instead, it creates a new version of the row (a new tuple) at a new physical location (new TID).
Consequence: Because the physical location of the row has changed, every single index pointing to that row must be updated to point to the new TID, even if the indexed columns themselves were not modified.
Uber's Pain Point: For a table with a dozen indexes, updating a single non-indexed column (like last_login_time) forces the database to rewrite entries in all dozen indexes. This "Write Amplification" generates massive I/O load and burns through SSD write endurance, which was the primary driver for Uber's switch to MySQL.
Mitigation (HOT): PostgreSQL employs Heap-Only Tuples (HOT) optimization. If the new row version fits on the same physical page as the old version, Postgres can create a lineage chain within the page and avoid updating the indexes. However, if the page is full, HOT cannot be used, and write amplification occurs.
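A small sketch, assuming an equivalent PostgreSQL users table with a last_login_time column, shows the tuple movement and the counters that reveal whether HOT is working:

-- ctid is the tuple's physical address (block, offset); an UPDATE writes a new version
SELECT ctid, id FROM users WHERE id = 1;
UPDATE users SET last_login_time = now() WHERE id = 1;
SELECT ctid, id FROM users WHERE id = 1;   -- typically a new ctid unless the update qualified as HOT

-- Ratio of HOT updates to total updates for the table
SELECT relname, n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'users';

-- Reserving free space per page raises the odds that new versions fit on the same page
ALTER TABLE users SET (fillfactor = 80);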
4. Concurrency Control and Transaction Isolation
Both databases implement Multi-Version Concurrency Control (MVCC) to allow readers to access data without being blocked by writers. However, their implementation of "versioning" differs radically, leading to different operational challenges regarding "bloat."
4.1 PostgreSQL: The Vacuum Model
In PostgreSQL, old versions of rows (dead tuples) are kept in the main heap alongside the live data. They are marked with visibility information (xmin and xmax transaction IDs).
Mechanism: When a transaction reads the table, it checks these IDs to see which version of the row is visible to its snapshot.
The Cost of History: Over time, tables accumulate dead tuples. If these are not removed, the table grows physically larger ("bloats"), slowing down sequential scans and wasting disk space.
Vacuuming: The Autovacuum daemon is responsible for scanning tables, identifying dead tuples that are no longer visible to any active transaction, and marking their space as reusable. Tuning autovacuum is a critical administrative task. If autovacuum cannot keep up with the write rate, performance degrades significantly.
Transaction Wraparound: PostgreSQL uses 32-bit transaction IDs. The system must periodically "freeze" old IDs to prevent the counter from wrapping around, which would cause data loss. This maintenance task is unique to Postgres and requires careful management in extreme-scale systems.
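The bookkeeping described above is exposed through the statistics views; a monitoring sketch (table name hypothetical):

-- Dead-tuple accumulation and the most recent autovacuum pass, per table
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;

-- Trigger autovacuum earlier on a write-hot table (the default scale factor is 0.2, i.e. 20% dead tuples)
ALTER TABLE users SET (autovacuum_vacuum_scale_factor = 0.02);

-- Distance to transaction ID wraparound, per database; freezing must happen well before ~2 billion
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;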
4.2 MySQL (InnoDB): The Undo Log Model
InnoDB updates rows in place. Before modifying a page, it writes the old value of the row to a separate structure called the Undo Log.
Snapshot Reconstruction: If a reader needs an older version of the row (for consistent read capability), InnoDB retrieves the current row and applies the changes from the Undo Log in reverse to reconstruct the snapshot.
Space Efficiency: Because old versions are moved to the Undo Log (which is a circular buffer or system tablespace), the main data files do not become bloated with dead rows. This generally keeps the primary storage footprint smaller and more predictable than PostgreSQL's heap.
Purge Threads: A background "Purge" thread is responsible for deleting Undo Log entries when they are no longer needed. If purge falls behind (e.g., because a long-running transaction prevents old versions from being discarded), the Undo Log grows, but primary table scan performance is unaffected.
Rollback Performance: Rolling back a transaction in MySQL is expensive because the system must physically re-apply the old data from the Undo Log to the tables. In PostgreSQL, rollback is instantaneous (it simply marks the new tuples as aborted in the commit log).
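Two standard commands expose the state of this machinery (a sketch; no special schema assumed):

-- Long-running transactions keep old row versions alive in the Undo Log and delay purge
SELECT trx_id, trx_state, trx_started, trx_rows_modified
FROM information_schema.innodb_trx
ORDER BY trx_started;

-- The "History list length" figure in the TRANSACTIONS section approximates unpurged undo records
SHOW ENGINE INNODB STATUS;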
Summary of the two MVCC implementations:

Feature | PostgreSQL | MySQL (InnoDB)
Old Version Storage | Main Heap (alongside live data) | Undo Log (System Tablespace)
Cleanup Mechanism | VACUUM (Autovacuum Daemon) | Purge Threads
Bloat Impact | Main Table and Indexes grow | Undo Log grows (System space)
Rollback Speed | Instant | Slow (Requires physical revert)
Write Amplification | High (Updates affect all indexes) | Low (Updates affect only modified indexes)
5. Replication and High Availability
In distributed systems, the method of replicating data defines the system's availability and disaster recovery capabilities.
5.1 MySQL Replication: The Logical Approach
MySQL uses Binary Logs (binlogs) for replication. These logs contain the logical changes—either the SQL statements themselves (Statement-Based) or the row changes (Row-Based).
Flexibility: Logical replication is highly flexible. A replica can run a different version of MySQL, a different operating system, or even a different storage engine than the primary. It allows for complex topologies, such as circular replication or replicating specific databases/tables.
Global Transaction IDs (GTID): Modern MySQL uses GTIDs to track transactions across the cluster, simplifying failover and topology changes compared to the old file-position-based tracking.
Group Replication (InnoDB Cluster): MySQL offers a native High Availability solution called InnoDB Cluster. It uses a distributed consensus algorithm (Paxos-based) to ensure strong consistency and automatic failover. It provides a multi-primary mode where updates can occur on any node (with conflict detection), offering virtually synchronous replication.
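A minimal replica-side setup sketch using GTID auto-positioning (MySQL 8.0.23+ syntax; the host and credentials are placeholders, and GTID mode is assumed to be enabled on both servers):

CHANGE REPLICATION SOURCE TO
  SOURCE_HOST = 'primary.example.com',
  SOURCE_USER = 'repl',
  SOURCE_PASSWORD = 'repl_password',
  SOURCE_AUTO_POSITION = 1;   -- GTID-based positioning instead of binlog file/offset bookkeeping
START REPLICA;
SHOW REPLICA STATUS;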
5.2 PostgreSQL Replication: The Physical Approach
PostgreSQL's default replication is Streaming Replication, which transmits the Write-Ahead Log (WAL) records from the primary to the replicas.
Physical Exactness: This is physical, byte-for-byte replication. The replica is an exact binary copy of the primary's disk blocks. This ensures absolute data consistency and robust crash recovery.
Version Lock-in: Because the binary format of disk pages can change between major versions, streaming replication requires the primary and replicas to run the exact same major version of PostgreSQL. Upgrades typically require downtime or logical replication setups.
Logical Replication: Introduced in version 10, PostgreSQL now supports Logical Replication, which acts similarly to MySQL's binlog replication. This enables zero-downtime upgrades and ETL (Extract, Transform, Load) pipelines to external systems.
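A minimal publication/subscription sketch (object names and connection string are hypothetical; the publisher must run with wal_level = logical):

-- On the publisher
CREATE PUBLICATION orders_pub FOR TABLE orders;

-- On the subscriber, which may run a newer major version, enabling rolling upgrades
CREATE SUBSCRIPTION orders_sub
  CONNECTION 'host=primary.example.com dbname=shop user=repl password=repl_password'
  PUBLICATION orders_pub;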
High Availability: Unlike MySQL, PostgreSQL does not ship with a built-in automated failover controller. High Availability is typically achieved using third-party tools like Patroni, which leverages a Distributed Consensus Store (such as etcd or Consul) to manage leader election and cluster state. Patroni is widely regarded as the industry standard for production Postgres HA.
6. Performance and Query Optimization
Performance is contextual. While synthetic benchmarks often favor one system, real-world performance depends heavily on the complexity of the queries and the nature of the workload.
6.1 The Query Optimizer
The PostgreSQL Query Optimizer (planner) is generally considered more sophisticated than MySQL's.
Join Algorithms: PostgreSQL supports Nested Loop, Hash Join, and Merge Join algorithms. Historically, MySQL only supported Nested Loops. While MySQL 8.0 introduced Hash Joins, PostgreSQL's implementation is mature and highly tuned for complex multi-table joins typical in analytical workloads.
Cost-Based Optimization: Both use cost-based optimizers, but PostgreSQL collects more granular statistics (histograms, most common values, correlation) allowing it to make better decisions for complex queries involving sub-selects and window functions.
Parallel Query: PostgreSQL can parallelize a single query across multiple CPU cores. This is a game-changer for analytical queries (OLAP) running on large servers. MySQL generally executes a single query in a single thread.
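The effect is visible directly in the execution plan; a sketch against a hypothetical sales table:

SET max_parallel_workers_per_gather = 4;   -- allow up to four worker processes per Gather node

EXPLAIN (ANALYZE)
SELECT region, count(*), avg(amount)
FROM sales
GROUP BY region;
-- A parallel plan shows Gather / Partial Aggregate / Parallel Seq Scan nodes;
-- MySQL would execute the same aggregation in a single thread.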
6.2 Read vs. Write Throughput
Read-Heavy (Simple): For simple primary key lookups (Point Selects), MySQL often outperforms PostgreSQL. The lightweight thread model and clustered index efficiency give it a distinct edge in raw throughput (Queries Per Second) for web-scale workloads. Benchmarks frequently show MySQL 8.0 leading by 15-20% in simple read scenarios.
Write-Heavy (Contention): PostgreSQL's group commit implementation and its handling of high-concurrency writes (provided write amplification is managed) often scale better on high-core-count machines.
Complex Queries: For queries involving CTEs, Window Functions (RANK(), LEAD()), or heavy aggregations, PostgreSQL dominates. Its execution engine is designed for complexity.
7. Extensibility and Modern Feature Sets
It is in the realm of features and extensibility that PostgreSQL distinguishes itself as more than just a relational database.
7.1 JSON and The "NoSQL" Capability
Both databases have responded to the rise of MongoDB by adding native JSON support, but the implementations differ.
PostgreSQL (JSONB): The JSONB type stores JSON in a decomposed binary format. Crucially, it supports Generalized Inverted Indexes (GIN). A GIN index can efficiently index every key and value in a JSON document. This allows a query like "Find all users whose preferences object contains dark_mode: true" to be executed using an index search, offering performance competitive with dedicated document stores.
MySQL (JSON): MySQL stores JSON in an optimized binary format that allows for quick read access to specific keys. However, it cannot directly index the entire JSON blob. To index a specific field, a developer must create a Generated Column (Virtual Column) that extracts the value (or, since MySQL 8.0.13, an equivalent functional index) and then place a standard B-Tree index on that column. This requires knowing the access patterns in advance, whereas PostgreSQL's GIN is generic.
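The contrast is clearest side by side. A hedged sketch, assuming a users table whose preferences column is JSONB in PostgreSQL and JSON in MySQL (the key name is illustrative):

-- PostgreSQL: one generic GIN index covers arbitrary keys; containment queries can use it
CREATE INDEX idx_users_prefs ON users USING GIN (preferences);
SELECT * FROM users WHERE preferences @> '{"theme": "dark"}';

-- MySQL: materialize the key of interest as a generated column, then B-Tree index that column
ALTER TABLE users
  ADD COLUMN pref_theme VARCHAR(32)
    GENERATED ALWAYS AS (preferences ->> '$.theme') VIRTUAL,
  ADD INDEX idx_pref_theme (pref_theme);
SELECT * FROM users WHERE pref_theme = 'dark';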
7.2 Geospatial Analysis: PostGIS
PostGIS: An extension for PostgreSQL that is widely considered the industry standard for geospatial data. It supports advanced features like topology, raster processing, 3D geometries, and geocoding. It strictly adheres to OGC (Open Geospatial Consortium) standards.
MySQL Spatial: MySQL has improved significantly in version 8.0, offering true geometric computations (on the ellipsoid) rather than bounding box approximations. However, it lacks the depth of analytical functions found in PostGIS. For simple "store locator" functionality, MySQL is sufficient; for GIS analysis (e.g., "calculate the watershed area of this river"), PostGIS is mandatory.
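A small distance calculation illustrates the two APIs (the coordinates are arbitrary; note MySQL's latitude-first axis-order convention for SRID 4326, hence the explicit option):

-- PostGIS: geodesic distance in metres on the geography type
SELECT ST_Distance(
         ST_SetSRID(ST_MakePoint(-122.4194, 37.7749), 4326)::geography,   -- San Francisco (lon, lat)
         ST_SetSRID(ST_MakePoint(-122.2711, 37.8044), 4326)::geography);  -- Oakland

-- MySQL 8.0: great-circle distance in metres between two SRID-4326 points
SELECT ST_Distance_Sphere(
         ST_GeomFromText('POINT(-122.4194 37.7749)', 4326, 'axis-order=long-lat'),
         ST_GeomFromText('POINT(-122.2711 37.8044)', 4326, 'axis-order=long-lat'));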
7.3 The Vector Revolution: AI and RAG
With the advent of Generative AI, databases must now store and search high-dimensional vector embeddings.
PostgreSQL (pgvector): The pgvector extension has become the de facto standard for vector search in the relational world. It adds a vector data type and supports HNSW (Hierarchical Navigable Small World) indexing for Approximate Nearest Neighbor (ANN) search. This allows developers to combine semantic search with relational filtering in a single SQL query (e.g., "find products similar to this image, but only if they are in stock and cost less than $50").
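A compact sketch of that pattern (the three-dimensional vectors are a toy stand-in for real embeddings, which typically have 768 to 1,536 dimensions; the table is hypothetical):

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE products (
  id        bigserial PRIMARY KEY,
  name      text,
  price     numeric,
  in_stock  boolean,
  embedding vector(3)                  -- toy dimension; use the embedding model's real size
);

CREATE INDEX ON products USING hnsw (embedding vector_l2_ops);   -- ANN index

-- Semantic similarity combined with ordinary relational filters in one statement
SELECT id, name
FROM products
WHERE in_stock AND price < 50
ORDER BY embedding <-> '[0.11, 0.62, 0.27]'   -- <-> is L2 distance in pgvector
LIMIT 10;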
MySQL Vector Support: MySQL 9.0 introduces vector capabilities (a native VECTOR type), as do forks such as MariaDB. While promising, the ecosystem is less mature. pgvector benefits from the broader PostgreSQL extension interface, allowing it to integrate deeply with the planner and index access methods.
7.4 Programmability and Extensions
PostgreSQL's architecture is designed to be extended. Its extension framework allows users to load shared libraries (via shared_preload_libraries) that modify the database's behavior.
Examples: TimescaleDB converts Postgres into a time-series database. Citus transforms it into a distributed, sharded cluster. ZomboDB integrates Elasticsearch.
MySQL allows plugins (mostly for authentication or storage engines), but it does not offer the same level of deep hooks into the query planner and executor that PostgreSQL exposes.
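A sketch of the loading mechanics for an extension such as TimescaleDB (a server restart is required after changing shared_preload_libraries):

-- Server-level hooks must be preloaded at startup
ALTER SYSTEM SET shared_preload_libraries = 'timescaledb';
-- ...restart the server, then enable per database:
CREATE EXTENSION IF NOT EXISTS timescaledb;

-- Inventory of installed extensions and their versions
SELECT extname, extversion FROM pg_extension;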
8. Operational Management and Ecosystem
8.1 Backups and Recovery
MySQL: Tools like Percona XtraBackup allow for hot, non-blocking physical backups. Logical backups are done via mysqldump or the MySQL Shell dump utilities (mysqlpump is deprecated and has been removed in MySQL 8.4). Point-in-Time Recovery (PITR) is achieved by replaying binary logs.
PostgreSQL: Physical backups are handled by tools like pgBackRest or WAL-G, which archive WAL files to object storage (S3). pg_dump handles logical backups. PITR is achieved by replaying the WAL archives.
8.2 Licensing Considerations
PostgreSQL: Released under the PostgreSQL License (permissive). This allows companies to fork, modify, and sell closed-source versions of Postgres without legal repercussions. This is ideal for embedding in proprietary appliances.
MySQL: Released under GPLv2 (copyleft). If an application links against MySQL client libraries or modifies the server, and is distributed to customers, the application code may need to be open-sourced. Oracle sells commercial licenses to exempt companies from this requirement. This dual-licensing model can be a friction point for legal compliance in enterprise software.
9. Strategic Decision Framework
Choosing between MySQL and PostgreSQL requires mapping technical capabilities to business requirements.
9.1 Scenario A: The High-Volume Web Application
Context: A B2C e-commerce platform or social media feed.
Requirements: Massive concurrency (100k+ users), simple read/write patterns (CRUD), high availability is non-negotiable.
Recommendation: MySQL. The lightweight thread model scales efficiently on commodity hardware. The clustered index optimizes primary key lookups, which are 90% of the workload. Group Replication provides a native, easy-to-manage HA solution. The extensive talent pool of MySQL DBAs facilitates hiring.
9.2 Scenario B: The Enterprise Data Platform
Context: A fintech core banking system or a SaaS platform with complex data models.
Requirements: Strict ACID compliance, complex reporting queries (JOINs over 10 tables), JSON document storage, audit trails.
Recommendation: PostgreSQL. The superior query planner handles complex analytics without choking. Transactional DDL ensures that schema migrations do not corrupt the database. JSONB allows for flexible data modeling without needing a separate NoSQL store. The rigorous implementation of isolation levels safeguards financial data.
9.3 Scenario C: The Geospatial or AI-Driven App
Context: A logistics routing engine or a RAG-based chatbot.
Requirements: Spatial calculus (distance, intersection) or Vector similarity search.
Recommendation: PostgreSQL. PostGIS and pgvector are best-in-class solutions that live within the database. Using MySQL would likely require a separate specialized database (like Pinecone or Elasticsearch), increasing architectural complexity and data synchronization overhead.
10. Conclusion
In the comparison between MySQL and PostgreSQL, there is no objective "winner," only a more appropriate tool for a given context.
MySQL 8.4/9.0 remains the champion of specialized efficiency. It is a razor-sharp tool for online transaction processing (OLTP) in web environments. Its architecture minimizes overhead, maximizing the performance extracted from every CPU cycle and byte of RAM. For workloads defined by massive scale and simple access patterns—the "Facebook" or "Twitter" archetype—MySQL remains unbeatable.
PostgreSQL 17 stands as the champion of versatility and capability. It has successfully positioned itself as the "Universal Database." By offering features that rival document stores, spatial engines, and vector databases, all wrapped in a strictly standards-compliant SQL interface, it allows architects to consolidate their tech stack. It trades some raw throughput efficiency (due to processes and the heap) for an unmatched richness of features and extensibility.
For the majority of modern "greenfield" projects in 2025, PostgreSQL is the recommended default. Its permissive license, rich feature set, and ability to pivot from relational to document to vector workloads provide a safety net against changing requirements. However, legacy environments and specific high-scale read workloads will continue to thrive on the optimized, threaded architecture of MySQL. The choice is ultimately a trade-off between the specialized speed of the web vs. the generalized power of the enterprise.