Recently, PolarDB topped the TPC-C benchmark list, exceeding the previous record by 2.5 times and setting a new TPC-C world record for both performance and cost-effectiveness: 2.055 billion transactions per minute (tpmC) at a unit cost of CNY 0.8 per tpmC (price/tpmC).
Behind each seemingly simple number lies the relentless pursuit of database performance, cost-effectiveness, and stability by countless engineers. The pace of innovation in PolarDB has never stopped. We are releasing the series "PolarDB's Technical Secrets of Topping TPC-C" to tell the story behind the "Double First Place". Please stay tuned!
This is the second article in the series - Limitless Clusters and Distributed Scaling.
We used PolarDB for MySQL 8.0.2 for this TPC-C benchmark test. PolarDB for MySQL supports the Multi-master Cluster (Limitless) architecture: an existing cluster with one primary node and multiple read-only nodes can be converted into a Multi-master Cluster (Limitless) by adding read/write nodes.
PolarDB for MySQL Multi-master Cluster (Limitless) supports multiple primary (read/write) nodes, memory fusion, distributed storage, and scale-out in seconds, as well as transparent cross-server scaling of single-table reads and writes to a massive number of nodes. In this TPC-C benchmark test, a PolarDB cluster of 2,340 read/write nodes (each with 48 cores and 512 GB of memory) delivered 2.055 billion transactions per minute (tpmC), refreshing the world records in both the TPC-C performance (tpmC) and cost-effectiveness (price/tpmC) rankings.
Early cloud-native relational databases mainly took the form of one primary node and multiple read-only nodes built on compute-storage separation. Although this design is a good replacement for native or managed MySQL/PostgreSQL databases, its single-writer architecture limits write scalability. As cloud-native database technology continued to evolve, PolarDB for MySQL Multi-master Cluster (Limitless) was developed to support large-scale deployment and read/write scale-out. Its core advantage is the ability to elastically scale compute nodes, with every node able to process both read and write requests, enabling concurrent writes on multiple nodes. In addition, while retaining the high performance and elasticity of a single server, it uses technologies such as RDMA and CXL to efficiently fuse transactions and data across nodes, achieving efficient, transparent cross-server read/write scale-out in seconds while ensuring transaction consistency.
This article describes the overall architecture of PolarDB for MySQL Multi-master Cluster (Limitless) and the core technical innovations that support its success in topping TPC-C.
Figure 1: Architecture of PolarDB for MySQL Multi-master Cluster (Limitless)
A PolarDB for MySQL cluster consists of multiple read/write compute nodes, multiple read-only compute nodes, and optional Cache Coordinator nodes, all built on distributed shared storage. Each read/write node is both readable and writable and communicates with the Cache Coordinators over a high-speed RDMA/CXL network. The Cache Coordinators are responsible for cluster metadata management and transaction information coordination; they also support the change data capture (CDC) feature and output unified global binlogs.
In addition, the architecture dynamically distributes the partitions of a single partitioned table across different read/write nodes, and the partition pruning module determines the read/write node to which an execution plan should be routed. Different partitions can therefore be written concurrently on different read/write nodes, scaling out write capability and greatly improving the cluster's overall concurrent read/write performance. The cluster also supports cross-node distributed queries, DML operations, and DDL operations, and guarantees the consistency and atomicity of cross-node transactions through RDMA memory fusion. The architecture additionally includes global read-only nodes to support global queries, multi-table joins, and columnar query acceleration.
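To make the routing idea concrete, here is a minimal sketch of how the partitions of a single table might be mapped to read/write nodes and how a single-partition statement could be routed after partition pruning. The class name `PartitionRouter`, the modulo-based ownership scheme, and the node names are illustrative assumptions, not PolarDB's actual implementation.

```python
# Hypothetical sketch: routing statements to read/write (RW) nodes by partition.
# Names and the modulo-based partitioning scheme are illustrative assumptions.

class PartitionRouter:
    def __init__(self, num_partitions: int, rw_nodes: list[str]):
        self.num_partitions = num_partitions
        # Each partition is owned by exactly one RW node; ownership can be
        # remapped when nodes are added (scale-out) or fail over.
        self.partition_owner = {
            p: rw_nodes[p % len(rw_nodes)] for p in range(num_partitions)
        }

    def prune(self, partition_key: int) -> int:
        """Partition pruning: derive the single partition a key falls into."""
        return partition_key % self.num_partitions

    def route(self, partition_key: int) -> str:
        """Return the RW node that should execute a single-partition statement."""
        return self.partition_owner[self.prune(partition_key)]

    def rebind(self, partition: int, new_node: str) -> None:
        """Move ownership of a partition to another RW node (scale-out/failover)."""
        self.partition_owner[partition] = new_node


router = PartitionRouter(num_partitions=16, rw_nodes=["rw1", "rw2", "rw3"])
print(router.route(partition_key=42))        # -> 'rw2', the owner of partition 10
router.rebind(partition=10, new_node="rw4")  # scale-out: a new node takes over
```

Because every node shares the same distributed storage, a rebind of this kind only changes metadata; no data needs to be copied when a partition is reassigned during scale-out or failover.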
In PolarDB clusters, the read/write compute nodes use a symmetric architecture: a single read/write node provides the same capabilities as both the coordinator node (CN) and the data node (DN) of a traditional distributed database, so there is no need to configure the numbers of coordinator nodes and data nodes separately. This symmetric design offers significant advantages:
• It improves resource utilization and avoids the waste that occurs when, under varying workloads, either the CN or the DN is saturated while the other sits idle.
• It removes the extra communication overhead caused by CN/DN separation, improving performance.
• Compute nodes reuse the native MySQL parser, optimizer, and executor, offering 100% MySQL compatibility.
The native InnoDB transaction system is built around the active transaction list. Maintaining this list requires a large global lock, which is costly and easily becomes a performance bottleneck for the entire system in high-concurrency scenarios. Considering that most distributed transaction consistency solutions are based on commit-timestamp schemes (TSO, TrueTime, and HLC), PolarTrans optimizes the native transaction system with a Commit Timestamp Store (CTS). The core data structure is the CTS log, a ring buffer in which a transaction ID (trx_id) is mapped to its slot by modulo, and each slot stores the trx pointer and the cts value. The core idea of the optimization is to eliminate the maintenance of complex data structures: CTS logs record the core transaction data, such as transaction status updates and transaction visibility, which is far more lightweight. In addition, PolarTrans makes most of this logic lock-free. As a result, PolarTrans greatly improves performance in both mixed read/write and write-only scenarios.
Figure 2: Structure diagram of CTS logs
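Following the description above, the sketch below models a CTS-log-style structure: a ring buffer indexed by trx_id modulo its size, each slot holding a transaction's commit timestamp, with visibility decided by comparing that timestamp against a reader's snapshot timestamp. The buffer size, the `UNCOMMITTED` sentinel, and the slot-recycling rule are simplifying assumptions; this is not the actual InnoDB/PolarTrans code.

```python
# Simplified sketch of a CTS (Commit Timestamp Store) log: a ring buffer of
# slots indexed by trx_id modulo the buffer size; each slot records the commit
# timestamp (cts) of a transaction. Constants and the visibility rule shown
# here are illustrative assumptions.

UNCOMMITTED = 0  # assumption: 0 marks a transaction that has not committed yet

class CtsLog:
    def __init__(self, size: int = 1 << 16):
        self.size = size
        self.trx_ids = [0] * size          # which trx currently occupies the slot
        self.cts = [UNCOMMITTED] * size    # commit timestamp of that trx

    def _slot(self, trx_id: int) -> int:
        return trx_id % self.size          # trx_id is mapped to its slot by modulo

    def begin(self, trx_id: int) -> None:
        s = self._slot(trx_id)
        self.trx_ids[s] = trx_id
        self.cts[s] = UNCOMMITTED

    def commit(self, trx_id: int, commit_ts: int) -> None:
        # A single store of the commit timestamp replaces the lock-protected
        # maintenance of an active-transaction list.
        self.cts[self._slot(trx_id)] = commit_ts

    def is_visible(self, trx_id: int, snapshot_ts: int) -> bool:
        """A row version written by trx_id is visible to a reader whose snapshot
        timestamp is snapshot_ts iff the writer committed before the snapshot."""
        s = self._slot(trx_id)
        if self.trx_ids[s] != trx_id:
            return True  # assumption: slot was recycled, so the trx committed long ago
        commit_ts = self.cts[s]
        return commit_ts != UNCOMMITTED and commit_ts <= snapshot_ts


log = CtsLog()
log.begin(trx_id=1001)
log.commit(trx_id=1001, commit_ts=500)
print(log.is_visible(trx_id=1001, snapshot_ts=600))  # True: committed before snapshot
print(log.is_visible(trx_id=1001, snapshot_ts=400))  # False: snapshot is older
```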
The native InnoDB consistent snapshot read relies on the active transaction list to maintain transaction status and judge transaction visibility. This mechanism exposes serious defects in distributed multi-write scenarios: maintaining a global active transaction list across the nodes of a cluster causes high-frequency global lock contention, and scalability drops sharply as the number of nodes grows. PolarDB therefore builds a distributed transaction system on top of PolarTrans and RDMA to provide distributed snapshot consistency.
We designed and implemented a cloud-native distributed transaction solution with hardware-software collaboration that achieves good cross-server write scalability. Its core idea is to exploit the low latency of new hardware such as RDMA to accelerate the Prepare and Commit phases of the commit process. Compared with traditional and existing database distributed transaction solutions, the hardware-software co-designed approach introduces several core optimizations around this commit path.
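As a rough conceptual illustration of a commit-timestamp-based distributed commit (a hypothetical sketch, not PolarDB's actual protocol), the flow below uses a central timestamp source, akin to the TSO-style schemes mentioned earlier, to publish one global commit timestamp to every participating node during the Commit phase; the in-memory calls stand in for the low-latency RDMA communication.

```python
# Hypothetical sketch of a commit-timestamp-based distributed commit. Names,
# the TimestampOracle, and the in-memory "publish" are illustrative stand-ins
# for low-latency one-sided communication; this is NOT PolarDB's actual protocol.

import itertools
from typing import List, Optional

class TimestampOracle:
    """Single source of commit timestamps (stand-in for a coordinator service)."""
    def __init__(self):
        self._counter = itertools.count(start=1)

    def next_ts(self) -> int:
        return next(self._counter)

class Participant:
    """A read/write node holding some of the transaction's writes."""
    def __init__(self, name: str):
        self.name = name
        self.committed = {}  # trx_id -> commit timestamp

    def prepare(self, trx_id: int) -> bool:
        # In a real system: persist redo, validate, acquire commit intent, etc.
        return True

    def commit(self, trx_id: int, commit_ts: int) -> None:
        # Stand-in for publishing the commit timestamp into the node's CTS log,
        # which makes the transaction visible on that node.
        self.committed[trx_id] = commit_ts

def distributed_commit(trx_id: int, participants: List[Participant],
                       tso: TimestampOracle) -> Optional[int]:
    # Phase 1: prepare on every node that wrote data for this transaction.
    if not all(p.prepare(trx_id) for p in participants):
        return None  # abort; no node has published a commit timestamp
    # Phase 2: fetch one global commit timestamp and publish it everywhere,
    # so every node judges visibility against the same cts value.
    commit_ts = tso.next_ts()
    for p in participants:
        p.commit(trx_id, commit_ts)
    return commit_ts

tso = TimestampOracle()
nodes = [Participant("rw1"), Participant("rw2")]
print(distributed_commit(trx_id=7, participants=nodes, tso=tso))  # -> 1
```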
PolarDB shares the full data dictionary (DD) tables globally and shards user tables so that different read/write nodes are responsible for reading and writing different table shards, which improves the throughput and data capacity of a single table. In real business scenarios, a difficult issue in distributed DDL design is controlling the DDL process across all table shards, coordinating their relationship with the logical table, and preventing metadata inconsistency.
To ensure high availability, traditional OLTP databases maintain one or more standby nodes for each primary node. We provide a variety of high-availability solutions so that users can reduce costs while keeping the system highly available.
First, in the high-availability solution based on private read-only nodes, users can configure a private read-only node for each read/write node. The private read-only node shares the same data as its read/write node, so no additional data storage is required. When a read/write node fails, its private read-only node can be promoted to a read/write node within seconds and take over its traffic, leaving the overall performance of the cluster unaffected. Alternatively, if users do not configure private read-only nodes, we provide a mechanism in which read/write nodes act as standby nodes for one another. When a read/write node fails, another read/write node with a low load is selected to quickly take over the traffic of the failed node; this only requires remapping the databases and tables on the failed node to the low-load read/write node. Under high pressure, however, the cluster's overall performance may be affected due to resource constraints.
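As a simplified illustration of the mutual-standby path described above (a hypothetical sketch; in the real system, node selection and remapping are handled by the control plane and may weigh many more factors), the takeover logic could look like this:

```python
# Hypothetical sketch of the mutual-standby failover path: when a read/write
# node fails, pick a surviving node with low load and remap the failed node's
# databases to it. The load metric and the remapping structure are illustrative
# assumptions, not PolarDB's actual control-plane API.

def pick_takeover_node(nodes: dict[str, float], failed: str) -> str:
    """nodes maps node name -> current load (e.g. CPU utilization, 0.0-1.0)."""
    candidates = {n: load for n, load in nodes.items() if n != failed}
    return min(candidates, key=candidates.get)  # lowest-load survivor

def failover(db_bindings: dict[str, str], nodes: dict[str, float],
             failed: str) -> dict[str, str]:
    """Remap every database bound to the failed node onto the takeover node.
    Because all nodes share the same distributed storage, no data is copied;
    only the binding (metadata) changes."""
    takeover = pick_takeover_node(nodes, failed)
    return {db: (takeover if node == failed else node)
            for db, node in db_bindings.items()}

bindings = {"db1": "rw1", "db2": "rw2", "db3": "rw2", "db4": "rw3"}
loads = {"rw1": 0.35, "rw2": 0.80, "rw3": 0.20}
print(failover(bindings, loads, failed="rw2"))
# -> db2 and db3 are remapped to rw3, the lowest-load surviving node
```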
To support global queries and multi-table joins across tables and databases, we designed global read-only (RO) nodes. Since every read/write node in the cluster can access all data in the shared storage, a global read-only node is designed as an aggregation view over multiple read/write nodes: it can directly query the data written by all read/write nodes, avoiding the need to query each read/write node and aggregate the results, and it requires no additional copy of the data, saving the storage overhead of a separate aggregation database. For global queries spanning multiple read/write nodes, the cluster provides transparent routing, automatically sending them to the global read-only node through PolarProxy.
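To illustrate the transparent routing idea (a hedged sketch; PolarProxy's actual routing rules are more sophisticated and are not shown here), a proxy could send single-node statements to the owning read/write node and global, multi-node queries to a global read-only node:

```python
# Hypothetical sketch of transparent routing in front of the cluster:
# single-node statements go to the owning read/write node, while global
# queries spanning multiple nodes go to a global read-only (RO) node.
# The Query type and routing rule are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Query:
    tables: list[str]      # tables referenced by the statement
    partitions: set[int]   # partitions touched after partition pruning
    is_write: bool

def route_query(q: Query, partition_owner: dict[int, str],
                global_ro_node: str) -> str:
    owners = {partition_owner[p] for p in q.partitions}
    if len(owners) == 1:
        # The whole statement touches a single RW node: route it there.
        return owners.pop()
    if q.is_write:
        # Cross-node write: one involved RW node coordinates the distributed
        # transaction (picked arbitrarily here for the sketch).
        return sorted(owners)[0]
    # Global / multi-table read: send it to the global RO node, which sees the
    # data written by all RW nodes through the shared storage.
    return global_ro_node

owner = {0: "rw1", 1: "rw1", 2: "rw2", 3: "rw3"}
print(route_query(Query(["orders"], {1}, is_write=True), owner, "gro1"))          # rw1
print(route_query(Query(["orders", "users"], {0, 2, 3}, False), owner, "gro1"))   # gro1
```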
The performance of a PolarDB for MySQL Limitless ultra-large cluster (2,340 read/write nodes, each with 48 cores and 512 GB of memory) was tested with the TPC-C benchmark. TPC-C is regarded as the "Olympics" of the database field and is the only internationally authoritative ranking for OLTP (online transaction processing) database performance.
Figure 3: Excerpt from the PolarDB TPC-C report
During the 8-hour continuous stress test, the cluster's overall tpmC fluctuation remained within 0.16% (the standard requires no more than 2%), achieving 8 hours of continuous, error-free, stable operation. The cluster's overall tpmC reached 2.055 billion, setting a new TPC-C world record and ranking first in the world.
Figure 4: Performance data in PolarDB TPC-C disaster recovery scenarios
In the disaster recovery test, faults were injected into different components and a real physical server was powered off. The test verified that, in the event of an unexpected physical server failure, the cluster completes failover within 10 seconds and restores full performance within 2 minutes. The impact on the cluster's overall performance during the disaster recovery test was kept within 2% (the standard requires no more than 10%), and data integrity and distributed transaction consistency were guaranteed.
Figure 5: Test results of scale-out in seconds
In the scale-out test, the four databases were initially all on read/write node 1. During the stress test, the number of read/write nodes was gradually increased and databases 2-4 were bound to the new read/write nodes in sequence. The figure shows the overall performance of the cluster during the scale-out process, demonstrating that the cluster can be scaled out in seconds.
PolarDB is the first cloud-native relational database that is deeply integrated with new hardware such as RDMA and CXL. PolarDB supports large-scale, high-performance, cross-server read/write scale-out and can scale up to thousands of compute nodes. This architecture supports high-performance transaction consistency across nodes, the cloud-native transaction system PolarTrans that is deeply integrated with RDMA technology, and elastic parallel query (ePQ). Its distributed capabilities enable near-linear performance scale-out.
In addition, through more than 90 optimizations in areas such as index structures and I/O paths, each node in the cluster achieves the ultimate single-server performance. It is worth emphasizing that PolarDB for MySQL Multi-master Cluster (Limitless) has set a new TPC-C world record. This shows that PolarDB's innovative multi-master, multi-write cloud-native architecture, which scales reads and writes across servers, not only breaks through the scalability bottleneck of a single cluster but also successfully withstood the world's largest concurrent transaction peak, making it a global leader in performance, scalability, and other dimensions.