
Technical Secrets of PolarDB: Limitless Clusters and Distributed Scaling

This article highlights PolarDB's record-setting performance and cost-effectiveness in the TPC-C benchmark, and introduces its limitless clusters and distributed scaling capabilities.

Recently, PolarDB topped the TPC-C benchmark list with a result 2.5 times higher than the previous record, setting a new TPC-C world record for both performance and cost-effectiveness: 2.055 billion transactions per minute (tpmC) at a unit cost of CNY 0.8 per tpmC (price/tpmC).

Behind each seemingly simple number lies the relentless pursuit of database performance, cost-effectiveness, and stability by countless engineers. The pace of innovation in PolarDB has never stopped. This series of articles, "PolarDB's Technical Secrets of Topping TPC-C", tells the story behind the "Double First Place". Please stay tuned!

This is the second article in the series - Limitless Clusters and Distributed Scaling.

Overview


We use PolarDB for MySQL 8.0.2 for this TPC-C benchmark test. PolarDB for MySQL supports the Multi-master Cluster (Limitless) edition: a cluster with one primary node and multiple read-only nodes can be converted into a Multi-master Cluster (Limitless) by adding read/write nodes.

PolarDB for MySQL Multi-master Cluster (Limitless) supports multiple primary (read/write) nodes, memory convergence, distributed storage, and scale-out in seconds. It also supports transparent cross-server scaling of single-table reads and writes to a massive number of nodes. In this TPC-C benchmark test, PolarDB delivers 2.055 billion transactions per minute (tpmC) with 2,340 read/write nodes (each with 48 cores and 512 GB of memory), setting new world records on both the TPC-C performance and cost-effectiveness rankings in terms of tpmC and price/tpmC.

The main form of early cloud-native relational databases is one primary node and multiple read-only nodes built on compute-storage separation. Although this can readily replace native or managed MySQL/PostgreSQL databases, its single-writer architecture limits the scaling of write capability. As cloud-native database technologies continue to evolve, PolarDB for MySQL Multi-master Cluster (Limitless) was developed to support large-scale deployment and read/write scale-out. Its core advantage lies in elastically scaling compute nodes, with each node capable of processing both read and write requests, enabling concurrent writes on multiple nodes. In addition, while retaining the high performance and elasticity of a single server, it uses RDMA/CXL and other technologies to efficiently integrate transactions and data between nodes, achieving efficient and transparent cross-server read/write scale-out in seconds while ensuring transaction consistency.

This article describes the overall architecture of PolarDB for MySQL Multi-master Cluster (Limitless) and the core technical innovations that support its success in topping TPC-C.

1. Overall Architecture of PolarDB for MySQL Multi-master Cluster (Limitless)

Figure 1: Architecture of PolarDB for MySQL Multi-master Cluster (Limitless)

A PolarDB for MySQL cluster consists of multiple read/write compute nodes, multiple read-only compute nodes, and optional Cache Coordinator nodes. Its underlying layer is based on distributed shared storage. Each read/write node is readable and writable and communicates with Cache Coordinators over the RDMA/CXL high-speed network. Cache Coordinators are responsible for cluster metadata management and transaction information coordination. They also support the CDC feature and are in charge of outputting unified global binlogs.

In addition, the architecture dynamically distributes the partitions of a single partitioned table across different read/write nodes. The partition pruning module determines the read/write node to which an execution plan should be routed, so different partitions can be written concurrently on different read/write nodes. This scales out write capability and greatly improves the overall concurrent read/write performance of the cluster. The cluster also supports cross-node distributed queries, DML operations, and DDL operations, and ensures the consistency and atomicity of cross-node transactions through RDMA memory fusion. The architecture additionally includes global read-only nodes to support global queries, multi-table joins, and columnar query acceleration.
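The sketch below is a minimal, illustrative model of this partition-based routing (not PolarDB's actual implementation): partitions of a table are mapped to read/write nodes, and a statement is routed according to the partition its key falls into. The hash-based pruning, node IDs, and table names are assumptions made for the example.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// routingTable maps each partition of a table to the read/write (RW) node
// that currently owns it. In a real cluster this metadata would live in the
// Cache Coordinator; here it is just an in-memory slice for illustration.
type routingTable struct {
	partitions int
	owner      []int // partition index -> RW node ID
}

// partitionFor prunes a statement down to a single partition from its
// partition-key value (hash partitioning is assumed purely for this sketch).
func (rt *routingTable) partitionFor(partitionKey string) int {
	h := fnv.New32a()
	h.Write([]byte(partitionKey))
	return int(h.Sum32() % uint32(rt.partitions))
}

// routeStatement returns the RW node that should execute a single-partition
// read or write, so different partitions can be written on different nodes.
func (rt *routingTable) routeStatement(partitionKey string) int {
	return rt.owner[rt.partitionFor(partitionKey)]
}

func main() {
	// 8 partitions spread over 4 RW nodes (round-robin placement as an example).
	rt := &routingTable{partitions: 8, owner: []int{1, 2, 3, 4, 1, 2, 3, 4}}
	for _, key := range []string{"warehouse-17", "warehouse-42", "warehouse-99"} {
		fmt.Printf("key %-12s -> partition %d -> RW node %d\n",
			key, rt.partitionFor(key), rt.routeStatement(key))
	}
}
```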

In PolarDB clusters, multiple read/write compute nodes use a symmetric architecture. A single read/write node provides the same features as the coordinator node (CN) and data node (DN) of a traditional distributed database, so you do not need to separately configure the number of coordinator nodes and data nodes. This symmetric design offers significant advantages:

• It improves resource utilization and avoids the waste that occurs under uneven loads when one of the CN/DN pair is saturated while the other sits idle.

• It can reduce the extra communication cost caused by CN/DN separation and improve performance.

• Compute nodes use the native MySQL parser, optimizer, and executor, offering 100% MySQL compatibility.

2. High-performance Cross-server Read/Write Scale-out

2.1 Cloud-native Transaction System PolarTrans

The native InnoDB transaction system is implemented on top of the active transaction list. Maintaining this list requires a large global lock, which is costly and easily becomes a performance bottleneck for the entire system in high-concurrency scenarios. In addition, since most distributed transaction consistency solutions are based on commit timestamps (TSO, TrueTime, and HLC), PolarTrans optimizes the native transaction system with the Commit Timestamp Store (CTS). Its core data structure is the CTS log, a ring buffer in which a transaction ID (trx_id) is mapped to its slot by modulo, and each slot stores the transaction pointer and the cts value. The core idea of the optimization is to remove the maintenance of complex data structures: CTS logs record only the essential transaction data, such as status updates and visibility information, which is far more lightweight. In addition, PolarTrans makes most of this logic lock-free. As a result, PolarTrans greatly improves performance in mixed read/write and write-only scenarios.

Figure 2: Structure diagram of CTS logs
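As a rough illustration of the structure described above (and only that; the real PolarTrans layout is more involved), the sketch below models the CTS log as a fixed-size ring buffer indexed by trx_id modulo its length, with the commit timestamp published by a single atomic store. Field names and sizes are assumptions for the example.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// ctsSlot is one entry of the CTS log ring buffer. Following the description
// above, a slot holds the commit timestamp (cts) of the transaction currently
// mapped to it; 0 means "not yet committed". The exact field layout is an
// assumption made for this sketch, not PolarTrans's real structure.
type ctsSlot struct {
	trxID uint64
	cts   atomic.Uint64
}

// ctsLog is a fixed-size ring buffer indexed by trx_id modulo its length.
type ctsLog struct {
	slots []ctsSlot
}

func newCTSLog(size int) *ctsLog {
	return &ctsLog{slots: make([]ctsSlot, size)}
}

func (l *ctsLog) slot(trxID uint64) *ctsSlot {
	return &l.slots[trxID%uint64(len(l.slots))]
}

// begin registers a transaction in its slot with no commit timestamp yet.
func (l *ctsLog) begin(trxID uint64) {
	s := l.slot(trxID)
	s.trxID = trxID
	s.cts.Store(0)
}

// commit publishes the commit timestamp with a single atomic store,
// avoiding any global lock on an active-transaction list.
func (l *ctsLog) commit(trxID, cts uint64) {
	l.slot(trxID).cts.Store(cts)
}

// visible applies the timestamp rule: a row version written by trxID is
// visible to a snapshot taken at readTS only if the writer committed with
// cts <= readTS.
func (l *ctsLog) visible(trxID, readTS uint64) bool {
	cts := l.slot(trxID).cts.Load()
	return cts != 0 && cts <= readTS
}

func main() {
	log := newCTSLog(1024)
	log.begin(42)
	fmt.Println("before commit, visible at ts=100?", log.visible(42, 100)) // false
	log.commit(42, 90)
	fmt.Println("after commit at cts=90, visible at ts=100?", log.visible(42, 100)) // true
}
```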

2.2 Distributed Snapshot Consistency

The native InnoDB consistent snapshot read relies on the active transaction list to maintain transaction status and provide transaction visibility judgment. This mechanism exposes serious defects in distributed multi-write scenarios: maintaining a global active transaction list across nodes in a cluster causes high-frequency global lock competition, and scalability decreases sharply as the number of nodes increases. PolarDB builds a distributed transaction system based on PolarTrans and RDMA to provide distributed snapshot consistency.

  1. All nodes in a cluster are interconnected through RDMA. CTS logs are registered in RDMA-accessible memory and shared across servers, so global transaction status can be synchronized and updated directly across physical servers. This avoids maintaining a global active transaction list on a central node. In addition, CTS logs provide remote transaction commit and status query capabilities.
  2. The TSO or HLC timestamp management solution is used to generate transaction commit timestamps. All sub-transactions of a distributed write transaction use the same timestamp for atomic commit. The distributed read transaction uses a unified timestamp and CTS logs to check data visibility and ensure distributed read consistency.
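To make the second point above concrete, here is a minimal sketch of the timestamp rule (a simplification, not PolarDB's code): all sub-transactions of a distributed write commit with the same timestamp, and a distributed read applies one read timestamp on every node, so it either sees the whole write or none of it. The map-based "CTS log" and node names are stand-ins for illustration.

```go
package main

import "fmt"

// node simulates one read/write node's view of commit timestamps
// (trx_id -> cts); a missing entry means "not committed yet".
type node struct {
	name string
	cts  map[uint64]uint64
}

// commitDistributed gives all sub-transactions (one per participating node)
// the same commit timestamp, so the write becomes visible atomically.
func commitDistributed(parts []*node, trxID, commitTS uint64) {
	for _, n := range parts {
		n.cts[trxID] = commitTS
	}
}

// visibleAt applies the snapshot rule on a single node: the writer must have
// committed with cts <= the reader's snapshot timestamp.
func (n *node) visibleAt(trxID, readTS uint64) bool {
	cts, ok := n.cts[trxID]
	return ok && cts != 0 && cts <= readTS
}

func main() {
	n1 := &node{name: "rw1", cts: map[uint64]uint64{}}
	n2 := &node{name: "rw2", cts: map[uint64]uint64{}}

	const trxID, commitTS = 7, 120
	commitDistributed([]*node{n1, n2}, trxID, commitTS)

	// A reader that took its snapshot before the commit sees the write on
	// neither node; a later reader sees it on both, never a mix.
	for _, readTS := range []uint64{100, 150} {
		fmt.Printf("readTS=%d: rw1=%v rw2=%v\n",
			readTS, n1.visibleAt(trxID, readTS), n2.visibleAt(trxID, readTS))
	}
}
```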

2.3 Multi-server Transaction Consistency: Cloud-native Distributed Transactions

We design and implement a cloud-native distributed transaction solution with hardware and software collaboration, which can realize good multi-server write scalability. Its core idea is to take advantage of the low latency of new hardware like RDMA to accelerate the execution of the Prepare and Commit phases in the commit process. Compared with traditional or existing database distributed transaction solutions, the core optimization points of the cloud-native distributed transaction solution with hardware and software collaboration are as follows:

  1. By leveraging the low-latency advantage of new RDMA hardware and the LSN mechanism, the execution of the Prepare phase of a distributed transaction is optimized: the Prepared state of a transaction can be determined using only remote RDMA read operations, avoiding the SQL interaction overhead between the coordinator node and the participant nodes in the Prepare phase.
  2. CTS logs record the transaction commit timestamp, and the coordinator node uses RDMA to write the transaction commit timestamp cts to the CTS log of the participant node with microsecond latency. The execution logic of the Commit phase is optimized so that the Commit phase can be executed asynchronously.
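The sketch below illustrates the shape of this optimized commit path under simplifying assumptions: the "RDMA read" and "RDMA write" are plain memory accesses standing in for one-sided RDMA verbs, and the LSN check and structure names are made up for the example rather than taken from PolarDB's implementation.

```go
package main

import "fmt"

type participant struct {
	name        string
	preparedLSN uint64            // LSN up to which the participant has durably prepared
	ctsLog      map[uint64]uint64 // trx_id -> commit timestamp
}

// rdmaReadPreparedLSN stands in for a one-sided RDMA read: the coordinator
// learns the participant's prepared state without any SQL round trip.
func rdmaReadPreparedLSN(p *participant) uint64 { return p.preparedLSN }

// rdmaWriteCTS stands in for a one-sided RDMA write of the commit timestamp
// into the participant's CTS log; the participant applies the commit lazily.
func rdmaWriteCTS(p *participant, trxID, cts uint64) { p.ctsLog[trxID] = cts }

// commitDistributedTrx is the coordinator's view of the optimized protocol.
func commitDistributedTrx(parts []*participant, trxID, prepareLSN, commitTS uint64) bool {
	// Prepare phase: confirm every participant has persisted the transaction's
	// changes at least up to prepareLSN, using only remote reads.
	for _, p := range parts {
		if rdmaReadPreparedLSN(p) < prepareLSN {
			fmt.Println("abort: participant", p.name, "not prepared")
			return false
		}
	}
	// Commit phase: publish the same commit timestamp to every participant's
	// CTS log; the rest of the commit work can run asynchronously on each node.
	for _, p := range parts {
		rdmaWriteCTS(p, trxID, commitTS)
	}
	return true
}

func main() {
	p1 := &participant{name: "rw1", preparedLSN: 500, ctsLog: map[uint64]uint64{}}
	p2 := &participant{name: "rw2", preparedLSN: 520, ctsLog: map[uint64]uint64{}}
	ok := commitDistributedTrx([]*participant{p1, p2}, 7, 480, 130)
	fmt.Println("committed:", ok, "rw1 cts:", p1.ctsLog[7], "rw2 cts:", p2.ctsLog[7])
}
```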

2.4 Multi-server Parallel DDL

PolarDB shares the full data dictionary (DD) tables globally and splits each table into shards, with different read/write nodes responsible for reading and writing different table shards to improve the throughput and data capacity of a single table. In real business scenarios, the hard problems in distributed DDL design are controlling the DDL process across all table shards, keeping it consistent with the logical table, and preventing metadata inconsistency. The list below describes the two key mechanisms, followed by a small sketch.

  1. DDL concurrency control: In the distributed table scenario, to prevent concurrent DDL operations from occurring in the same table, we reuse the concept and management logic of database/table locks, add an XB lock type for global databases/tables, and take the XB lock first when executing DDL, thus avoiding concurrent DDL.
  2. Distributed DDL atomicity guarantee: A mechanism based on the multi-phase commit protocol is designed. At any time before the Commit phase, if an exception occurs in the DDL execution process, it will be immediately reported to the coordinator node (CN). The CN sends instructions to all DNs to roll back the DDL operations executed on the DN, thus ensuring the atomicity of distributed DDL operations.
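The sketch below mimics the two guarantees just listed with a toy coordinator: an exclusive XB-style lock serializes DDL on a logical table, and any failure before the commit phase rolls the DDL back on every shard that already applied it. Shard execution is simulated with callbacks, and all names are illustrative assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type ddlCoordinator struct {
	mu        sync.Mutex
	tableLock map[string]bool // logical table -> XB-style exclusive lock held
}

func (c *ddlCoordinator) acquireXB(table string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.tableLock[table] {
		return false // another DDL is already running on this table
	}
	c.tableLock[table] = true
	return true
}

func (c *ddlCoordinator) releaseXB(table string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.tableLock, table)
}

// runDDL executes the DDL on every shard; if any shard fails before commit,
// all shards that already executed it are rolled back, keeping the logical
// table's metadata consistent.
func (c *ddlCoordinator) runDDL(table string, shards []string,
	execute func(shard string) error, rollback func(shard string)) error {

	if !c.acquireXB(table) {
		return errors.New("concurrent DDL on " + table + " rejected")
	}
	defer c.releaseXB(table)

	done := []string{}
	for _, s := range shards {
		if err := execute(s); err != nil {
			for _, d := range done {
				rollback(d) // undo the shards that already applied the DDL
			}
			return fmt.Errorf("DDL failed on %s, rolled back %d shard(s): %w", s, len(done), err)
		}
		done = append(done, s)
	}
	return nil // commit phase: all shards succeeded
}

func main() {
	c := &ddlCoordinator{tableLock: map[string]bool{}}
	err := c.runDDL("orders", []string{"orders_p0", "orders_p1", "orders_p2"},
		func(shard string) error {
			if shard == "orders_p2" {
				return errors.New("simulated failure")
			}
			fmt.Println("applied DDL on", shard)
			return nil
		},
		func(shard string) { fmt.Println("rolled back DDL on", shard) })
	fmt.Println("result:", err)
}
```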

3. High Availability Mechanism

To ensure high availability, traditional OLTP databases maintain one or more standby nodes for each primary node. We provide a variety of high-availability solutions for users to choose from so that they can reduce costs while maintaining the high availability of the system.

First, in the high-availability solution based on private read-only nodes, users can configure a private read-only node for each read/write node. The private read-only node shares the same data as its read/write node, so no additional data storage is required. When a read/write node fails, its private read-only node can be promoted to a read/write node within seconds to take over traffic, and the overall performance of the cluster is not affected. Alternatively, if users do not configure private read-only nodes, we provide a mechanism in which read/write nodes serve as standby nodes for one another. When a read/write node fails, another read/write node with a low load is selected to quickly take over its traffic; this only requires remapping the databases and tables on the failed node to the low-load read/write node. Under high pressure, however, the overall performance of the cluster may be affected due to resource constraints.
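As a minimal sketch of the second option (mutual standby without private read-only nodes), the example below remaps the databases owned by a failed node to the surviving node with the lowest load; because all nodes share the same storage, only the mapping changes. The load values, node names, and selection policy are assumptions for illustration.

```go
package main

import "fmt"

type cluster struct {
	load    map[string]float64 // RW node -> current load (0..1)
	mapping map[string]string  // database -> owning RW node
}

// lowestLoadNode picks the surviving node with the smallest load.
func (c *cluster) lowestLoadNode(exclude string) string {
	best, bestLoad := "", 2.0
	for node, l := range c.load {
		if node != exclude && l < bestLoad {
			best, bestLoad = node, l
		}
	}
	return best
}

// failover moves every database owned by the failed node to the least loaded
// survivor; since all nodes share the same storage, only the mapping changes.
func (c *cluster) failover(failed string) {
	target := c.lowestLoadNode(failed)
	for db, node := range c.mapping {
		if node == failed {
			c.mapping[db] = target
			fmt.Printf("remapped %s: %s -> %s\n", db, failed, target)
		}
	}
	delete(c.load, failed)
}

func main() {
	c := &cluster{
		load:    map[string]float64{"rw1": 0.9, "rw2": 0.3, "rw3": 0.6},
		mapping: map[string]string{"db1": "rw1", "db2": "rw1", "db3": "rw2", "db4": "rw3"},
	}
	c.failover("rw1")
	fmt.Println("final mapping:", c.mapping)
}
```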

4. Global Read-only Nodes

To support global queries and multi-table joins across tables and databases, we design global read-only (RO) nodes. Because each read/write node in the cluster can access all data in the shared storage, the global read-only node acts as an aggregated view of multiple read/write nodes: it can directly query the data written by all read/write nodes, avoiding the need to fetch and aggregate data from multiple read/write nodes. No additional data copy is stored for the global read-only node, which saves the storage overhead of a separate aggregation database. For global queries that span multiple read/write nodes, the cluster provides transparent routing, automatically routing such queries to the global read-only node through PolarProxy.
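A toy routing rule in the spirit of this transparent routing might look like the sketch below: if a statement touches tables owned by more than one read/write node, it goes to the global read-only node; otherwise it stays on the single owning node. The table-to-node ownership map and node names are assumptions for the example, not PolarProxy's actual logic.

```go
package main

import "fmt"

const globalRONode = "global-ro"

// tableOwner maps each table (or table shard) to the RW node that writes it.
var tableOwner = map[string]string{
	"orders_p0": "rw1",
	"orders_p1": "rw2",
	"customers": "rw3",
}

// route returns the node a query should run on, given the tables it touches.
func route(tables []string) string {
	owners := map[string]bool{}
	for _, t := range tables {
		owners[tableOwner[t]] = true
	}
	if len(owners) > 1 {
		return globalRONode // cross-node query: use the aggregated view on the global RO node
	}
	for owner := range owners {
		return owner // single-node query: stay on the owning RW node
	}
	return globalRONode
}

func main() {
	fmt.Println(route([]string{"orders_p0"}))              // rw1
	fmt.Println(route([]string{"orders_p1", "customers"})) // global-ro
}
```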

5. Extreme Performance and Scalability

The performance of PolarDB for MySQL Limitless ultra-large clusters (2,340 read/write nodes, each with 48 cores and 512 GB of memory) is tested with the TPC-C benchmark. TPC-C is considered the "Olympics" of the database field and the only internationally authoritative ranking for OLTP (online transaction processing) database performance testing.

5.1 Stable Performance in Ultra-large Clusters

Figure 3: Excerpt from the PolarDB TPC-C report

During the 8-hour continuous stress test, the overall tpmC fluctuation rate of the cluster remained within 0.16% (the standard requires within 2%), completing 8 hours of continuous, error-free, and stable stress testing. The overall tpmC of the cluster reached 2.055 billion, setting a new TPC-C world record and ranking first in the world.

5.2 High Availability and Disaster Recovery Capabilities in Ultra-large Clusters

Figure 4: Performance data in PolarDB TPC-C disaster recovery scenarios

In the disaster recovery test, faults are injected into different components, including powering off a real physical server. The test verifies that, in the event of an unexpected physical server failure, the cluster completes failover within 10 seconds and restores overall performance within 2 minutes. The impact on the overall performance of the cluster during the disaster recovery test is kept within 2% (the standard requires within 10%), and data integrity and distributed transaction consistency are guaranteed.

5.3 Scale-out Capabilities in Seconds

Figure 5: Test results of scale-out in seconds

During the scale-out test, the four databases created are initially all on read/write node 1. During the stress test, the number of read/write nodes is gradually increased, and databases 2-4 are bound to the new read/write nodes in sequence. The figure shows the overall performance of the cluster during the scale-out process, demonstrating that the cluster can be scaled out in seconds.

Summary

PolarDB is the first cloud-native relational database that is deeply integrated with new hardware such as RDMA and CXL. PolarDB supports large-scale, high-performance, cross-server read/write scale-out and can scale up to thousands of compute nodes. This architecture supports high-performance transaction consistency across nodes, the cloud-native transaction system PolarTrans that is deeply integrated with RDMA technology, and elastic parallel query (ePQ). Its distributed capabilities enable near-linear performance scale-out.

In addition, through more than 90 optimizations in areas such as index structures and I/O paths, each node in the cluster achieves extreme single-server performance. It is worth emphasizing that PolarDB for MySQL Multi-master Cluster (Limitless) has set a new world record in TPC-C. This shows that PolarDB's innovative multi-master, multi-write cloud-native architecture, which scales reads and writes across servers, not only breaks the scalability bottleneck of a single cluster but also withstands the world's largest concurrent transaction peak, leading globally in performance, scalability, and other dimensions.
