When using the Percona Operator for MySQL based on Percona XtraDB Cluster (PXC), it’s common to encounter scenarios where cluster nodes request a full State Snapshot Transfer (SST) when rejoining the cluster.

One typical scenario where a State Snapshot Transfer (SST) is required is when a node has been offline long enough that the GCache no longer contains the necessary write sets for an Incremental State Transfer (IST). Unlike SST, which involves a full data copy from another node, IST is a much lighter process that replays the missing write sets from the donor’s GCache, avoiding the need for a complete data transfer.

Another situation that triggers SST is scaling up the cluster by adding new nodes. Each joiner node will require a full SST to synchronize with the cluster.

Additionally, when adding multiple nodes at once, the cluster must perform a separate backup for each joiner. This results in repeated reads from the donor and multiple data transfers over the network, which can quickly become a bottleneck.

In PXC, SST is performed by default using Percona XtraBackup, a physical backup tool. The process involves reading the donor’s data files and streaming them to the joiner node. While the backup operation can be optimized by increasing parallelism and enabling compression, the data must still be read and transferred over the network.

This process can be time-consuming in environments with large database sizes, as it involves transferring a full backup from an existing node to a new one.

SST based on K8s Volume Snapshots:

These scenarios are ideal for K8s volume snapshots, which operate at the storage layer via the Container Storage Interface (CSI). Creating a snapshot is almost immediate and doesn’t involve compressing data, sending it over the network, or even reading the full dataset.

The PXC Operator supports creating a new cluster from a volume snapshot, a useful feature for cloning or disaster recovery scenarios. In this blog post, however, we’ll explore how volume snapshots can also be used to add new nodes to an existing cluster, significantly reducing the time and resource cost, especially when dealing with large datasets.

Disclaimer:

The procedure described in this post involves directly manipulating PersistentVolumeClaims (PVCs), including deletion and restoration operations. These actions can lead to data loss or cluster instability if not performed carefully.

Ensure you have proper backups and fully understand the implications before proceeding in a production environment. Always test in a staging setup first.

For this test, I used Google Kubernetes Engine (GKE) with the Percona XtraDB Cluster Operator v1.16.1, running Percona XtraDB Cluster 8.0.39 images. The PersistentVolumeClaims (PVCs) were 1 TiB in size, hosting a database dataset of approximately 500 GiB.

Prerequisites:

K8s relies on the CSI (Container Storage Interface) to manage volume operations, including snapshots. To use snapshots, your StorageClass must be associated with a CSI driver that supports the VolumeSnapshot feature.

The Volume Snapshot Class should be created first, as below:
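
For example, on GKE with the Persistent Disk CSI driver, a minimal VolumeSnapshotClass could look like the sketch below (the class name is arbitrary, and the driver must match the CSI driver backing your StorageClass); save it to a file and apply it with kubectl apply -f:

    apiVersion: snapshot.storage.k8s.io/v1
    kind: VolumeSnapshotClass
    metadata:
      name: pxc-snapshot-class
    driver: pd.csi.storage.gke.io   # GKE Persistent Disk CSI driver; adjust for your environment
    deletionPolicy: Delete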

There are two approaches to the PVC restore procedure: online, which allows adding nodes while the cluster is still running, and offline, which involves scaling down the cluster and performing the restore while all pods are stopped.

Online:

In this example, we use volume snapshots to re-join existing nodes that would otherwise request SST. The process for joining cluster members without downtime involves the following:

  1. Taking a snapshot of the volume from a healthy, running node.
  2. Scaling down the cluster temporarily to prepare for volume restoration.
  3. Deleting the PVCs associated with Joiner nodes.
  4. Restoring each Joiner PVC from the snapshot, so the nodes start with fully populated data volumes.

One important caveat is that the healthy pod used for the snapshot must be the number zero PXC member, pxc-0. This ensures that when the cluster is scaled down, the joiner nodes (e.g., pxc-1, pxc-2, etc.) are terminated, allowing their PVCs to be safely deleted and recreated from the snapshot. This is typically the most common scenario, as the pxc-0 pod is often the most up-to-date node in the cluster since, by default, the HAProxy service routes traffic to this member when it’s available, making it a reliable source for snapshotting.

Snapshots are crash-consistent by nature: they capture the state of the filesystem at a specific point in time without coordinating with the database to flush in-memory data to disk. When restored, InnoDB performs crash recovery to bring the database to a consistent state. To minimize the risk of data corruption or recovery failure, it’s critical to ensure that the instance used for the snapshot is fully ACID-compliant at the moment of capture.

This behavior is controlled by the innodb_flush_log_at_trx_commit variable. When set to 1, InnoDB writes and flushes the redo log to disk at every transaction commit, ensuring durability and reducing the chance of data loss during recovery.

By default, the PXC Operator sets innodb_flush_log_at_trx_commit to 2 to optimize performance. In terms of durability, a transaction on a PXC node is only considered committed after it has been replicated and certified by the cluster. While it may not yet be applied on the remote nodes, it has already been safely propagated, ensuring consistency across the cluster. This makes it generally safe to use the value 2, as you would need to lose all nodes simultaneously to lose up to one second of transactions.

We confirm the innodb_flush_log_at_trx_commit variable value by running the following command:
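
For example, assuming the default cluster name cluster1 and the root password stored in the operator-managed cluster1-secrets Secret (adjust the names to your deployment):

    ROOT_PASS=$(kubectl get secret cluster1-secrets -o jsonpath='{.data.root}' | base64 -d)
    kubectl exec cluster1-pxc-0 -c pxc -- mysql -uroot -p"$ROOT_PASS" \
      -e "SHOW GLOBAL VARIABLES LIKE 'innodb_flush_log_at_trx_commit';"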

We’ll need to enforce stricter ACID compliance to take the snapshot, which may impact database performance due to an increased number of fsync operations:
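
One way to apply this dynamically on the donor, reusing the ROOT_PASS variable from above:

    kubectl exec cluster1-pxc-0 -c pxc -- mysql -uroot -p"$ROOT_PASS" \
      -e "SET GLOBAL innodb_flush_log_at_trx_commit = 1;"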

Next, ensure the Donor instance retains the required write sets to serve an Incremental State Transfer (IST) to the Joiners after the snapshot is restored. This is done by freezing the GCache with the following command:
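
One way to do this is the gcache.freeze_purge_at_seqno provider option, which recent PXC releases expose through wsrep_provider_options (verify your version supports it before relying on it):

    kubectl exec cluster1-pxc-0 -c pxc -- mysql -uroot -p"$ROOT_PASS" \
      -e "SET GLOBAL wsrep_provider_options = 'gcache.freeze_purge_at_seqno=now';"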

In write-intensive workloads, you may want to increase the pxc.livenessProbes.initialDelaySeconds from its default value of 300 seconds. This allows the instance more time to apply IST write sets before the liveness probe checks kick in, reducing the risk of premature pod restarts during recovery. Please note that this change will trigger a restart of the PXC pods, which is not the intended outcome of this procedure. This applies to both snapshot and regular XtraBackup SST. So, if you’ve previously handled SST under a heavy workload on this cluster, the pxc.livenessProbes.initialDelaySeconds setting should already be adjusted accordingly.
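
If you do need to raise it ahead of time, a merge patch against the custom resource could look like this sketch (600 is just an illustrative value, and pxc is the short resource name for PerconaXtraDBCluster):

    kubectl patch pxc cluster1 --type=merge \
      -p '{"spec":{"pxc":{"livenessProbes":{"initialDelaySeconds":600}}}}'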

The next step is to create the sleep-forever file inside the pxc-0 data directory. This ensures the file is included in the snapshot and will be present on the Joiner nodes after restore, preventing the MySQL process from starting automatically and giving us the chance to adjust each node before it joins the cluster.
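
Assuming the default data directory inside the pxc container:

    kubectl exec cluster1-pxc-0 -c pxc -- touch /var/lib/mysql/sleep-forever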

The next step is to take a snapshot of the pxc-0 pod’s PVC:
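
A VolumeSnapshot manifest for the pxc-0 data PVC might look like this sketch (the snapshot name is arbitrary, and it references the VolumeSnapshotClass created earlier); apply it with kubectl apply -f:

    apiVersion: snapshot.storage.k8s.io/v1
    kind: VolumeSnapshot
    metadata:
      name: pxc-0-snapshot
    spec:
      volumeSnapshotClassName: pxc-snapshot-class
      source:
        persistentVolumeClaimName: datadir-cluster1-pxc-0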

You can check the snapshot’s status; once READYTOUSE changes to true, the snapshot is ready to use.
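
For example, using the snapshot name from the sketch above:

    kubectl get volumesnapshot pxc-0-snapshot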

Then, we must scale down the cluster to restore the snapshot. We will first need to set the spec.unsafeFlags.pxcSize to true to allow the cluster to scale down.
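
A minimal patch, assuming the cluster resource is named cluster1:

    kubectl patch pxc cluster1 --type=merge -p '{"spec":{"unsafeFlags":{"pxcSize":true}}}'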

Once done, we can set only one replica for the PXC cluster:
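
For example:

    kubectl patch pxc cluster1 --type=merge -p '{"spec":{"pxc":{"size":1}}}'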

We’ll see that only the pxc-0 pod is running:
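
One way to check, filtering by the component label the operator applies to PXC pods (label names may vary between operator versions):

    kubectl get pods -l app.kubernetes.io/component=pxc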

Now we can delete the cluster1-pxc-1 PVC:
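
For example:

    kubectl delete pvc datadir-cluster1-pxc-1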

We can now restore the snapshot to a new PVC with the same name. The target PVC must be at least as large as the original:
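
A sketch of the restore PVC, assuming the GKE standard-rwo StorageClass used in this test (the storage class must match the original PVC’s and support CSI snapshots):

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: datadir-cluster1-pxc-1
    spec:
      storageClassName: standard-rwo      # must match the original PVC's StorageClass
      accessModes:
        - ReadWriteOnce
      dataSource:
        name: pxc-0-snapshot
        kind: VolumeSnapshot
        apiGroup: snapshot.storage.k8s.io
      resources:
        requests:
          storage: 1Ti                    # at least as large as the source volume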

In this case, the restored PVC shows as Pending because the storage class volumeBindingMode is WaitForFirstConsumer, meaning it won’t be bound until a pod that uses it is scheduled.

Now we can scale up the cluster to start the pod pxc-1:
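
For example, bringing the size back to two so that pxc-1 is scheduled (use three once all joiner PVCs have been restored):

    kubectl patch pxc cluster1 --type=merge -p '{"spec":{"pxc":{"size":2}}}'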

We see that the datadir-cluster1-pxc-1 PVC is now bound:
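
For instance:

    kubectl get pvc datadir-cluster1-pxc-1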

And the pod state shows as running:

Note that since we added the sleep-forever file, the MySQL process is not running.

We’ll need to delete the auto.cnf file, as it contains the pxc-0 MySQL server_uuid. Additionally, we must remove the gvwstate.dat file, which stores the Galera Primary Component information and the Galera node’s UUID, also inherited from pxc-0. Finally, we delete the sleep-forever file to allow the container to start the MySQL process:
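
A sketch of the cleanup on the restored pxc-1 volume (paths assume the default data directory):

    kubectl exec cluster1-pxc-1 -c pxc -- \
      rm -f /var/lib/mysql/auto.cnf /var/lib/mysql/gvwstate.dat /var/lib/mysql/sleep-forever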

We check the pods state:

You can check the wsrep_cluster_status status variable to confirm the node is now part of the Primary Component:
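
For example, on the newly joined node, reusing the ROOT_PASS variable from earlier:

    kubectl exec cluster1-pxc-1 -c pxc -- mysql -uroot -p"$ROOT_PASS" \
      -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';"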

When using XtraBackup for SST, the process took approximately 75 minutes per instance to complete. In contrast, using volume snapshots, a node with a 500 GiB database was fully synced to the cluster in just 10 minutes.

You can reuse the same snapshot to repeat the process and add more Joiner nodes if necessary. This allows for efficient scaling without the overhead of creating new backups for each node.

Once the procedure is complete, we need to revert all the changes made to the cluster and to the pod pxc-0:
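
A sketch of the rollback, assuming only the changes made earlier in this procedure (settings are reverted to the defaults described above):

    # Scale the cluster back to its full size, if not already done
    kubectl patch pxc cluster1 --type=merge -p '{"spec":{"pxc":{"size":3}}}'
    # Restore the operator's default durability setting on pxc-0
    kubectl exec cluster1-pxc-0 -c pxc -- mysql -uroot -p"$ROOT_PASS" \
      -e "SET GLOBAL innodb_flush_log_at_trx_commit = 2;"
    # Resume GCache purging on the donor
    kubectl exec cluster1-pxc-0 -c pxc -- mysql -uroot -p"$ROOT_PASS" \
      -e "SET GLOBAL wsrep_provider_options = 'gcache.freeze_purge_at_seqno=-1';"
    # Allow only safe cluster sizes again
    kubectl patch pxc cluster1 --type=merge -p '{"spec":{"unsafeFlags":{"pxcSize":false}}}'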

Finally, we delete the volume snapshot:
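
Using the snapshot name from the earlier sketch:

    kubectl delete volumesnapshot pxc-0-snapshot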

Offline:

If the only remaining healthy node in the cluster is not pxc-0, or if we prefer a safer procedure that doesn’t depend on durability settings or IST, we can use the offline method, which has the following steps:

  1. Scale down the cluster so that no PXC pods are running (0 replicas).
  2. Take a snapshot of the healthy pod’s PVC.
  3. Delete the PVCs associated with the Joiner nodes you want to recreate.
  4. Restore each Joiner PVC from the snapshot, ensuring the new volumes are fully populated before scaling the cluster back up.

In this scenario, let’s assume that pxc-2 is the only node currently part of the Primary Component, while pxc-0 and pxc-1 require SST to rejoin the cluster.

Similar to the online method, we’ll need to create the sleep-forever file inside the healthy node’s data directory. This file will be present in the restored PVCs, allowing us to pause the Joiner nodes on startup and perform any necessary adjustments before they attempt to join the cluster.

We will need to set the spec.unsafeFlags.pxcSize to true to allow the cluster to scale down.

Once done, we scale down the replicas to 0 for the PXC cluster:

We check that all PXC pods are stopped:

Then, we take a snapshot of the healthy pod’s PVC. Since the database is stopped, this will be a database-consistent snapshot:

We’ll wait until the snapshot is ready to restore:

We check the PVC status:

We delete the PVCs from the Joiner pods, in this case, pxc-0 and pxc-1:
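
For instance:

    kubectl delete pvc datadir-cluster1-pxc-0 datadir-cluster1-pxc-1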

We restore the snapshot into pxc-0 and pxc-1 PVCs:
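
The manifests mirror the online example; only the PVC names and the snapshot source change. Assuming the snapshot taken from pxc-2 was named pxc-2-snapshot, the restore PVC for pxc-0 would look like this sketch (repeat with the name datadir-cluster1-pxc-1 for the second joiner):

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: datadir-cluster1-pxc-0        # repeat with datadir-cluster1-pxc-1
    spec:
      storageClassName: standard-rwo      # must match the original PVC's StorageClass
      accessModes:
        - ReadWriteOnce
      dataSource:
        name: pxc-2-snapshot
        kind: VolumeSnapshot
        apiGroup: snapshot.storage.k8s.io
      resources:
        requests:
          storage: 1Ti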

We check that the PVCs are created but pending binding:

We scale up the cluster to start all PXC pods:

We wait until all PVCs are bound:

And wait until all PXC pods are in the Running state:

Since we added the sleep-forever file, the pods did not start the MySQL process.

Next, we need to check grastate.dat to see whether it flags the node as safe to bootstrap:
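
For example, on the restored pxc-0 volume:

    kubectl exec cluster1-pxc-0 -c pxc -- cat /var/lib/mysql/grastate.dat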

In this case, the safe_to_bootstrap value is set to 0, because when scaling down, the pxc-2 pod, the last active member of the Primary Component, was the first to be stopped. Meanwhile, the other pods (pxc-0 and pxc-1) were still connected and requesting SST, which prevents the cluster from marking any node as safe to bootstrap.

We’ll set safe_to_bootstrap to 1 on the pxc-0 node. This ensures that when the pod starts, it will bootstrap the cluster and become the primary member, allowing the cluster to initiate faster without waiting for other nodes. We’re also removing the auto.cnf file, since it contains the same server_uuid as the original pxc-2 node, from which the snapshot was taken.
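
A sketch of both adjustments on pxc-0:

    kubectl exec cluster1-pxc-0 -c pxc -- \
      bash -c "sed -i 's/safe_to_bootstrap: 0/safe_to_bootstrap: 1/' /var/lib/mysql/grastate.dat && rm -f /var/lib/mysql/auto.cnf"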

On pxc-1, we only remove the auto.cnf file:

As for pxc-2, we don’t need to modify anything.

We remove the sleep-forever file in all pods to restart them:
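
For example, looping over the three pods:

    for pod in cluster1-pxc-0 cluster1-pxc-1 cluster1-pxc-2; do
      kubectl exec "$pod" -c pxc -- rm -f /var/lib/mysql/sleep-forever
    done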

We check the pods until all are in the Running state:

We can also verify that each node has successfully joined the cluster by checking the wsrep_cluster_status status variable. A value of “Primary” confirms that the node is part of the Primary Component.

Finally, once all Joiner nodes are up and part of the Primary Component, we can safely delete the VolumeSnapshot to free up resources and avoid unnecessary storage costs.

In just 15 minutes, we successfully joined two nodes, each with a 500 GiB dataset, using the snapshot-based restore procedure. In contrast, performing the same operation with XtraBackup SST would take several hours, due to the time required to create, transfer, and apply a full physical backup.

Conclusion:

This approach significantly reduces the time and resources required to scale the Percona Operator for MySQL based on Percona XtraDB Cluster in Kubernetes environments. By leveraging VolumeSnapshots, we eliminate the overhead of full backup and restore cycles, reduce network traffic, and accelerate adding nodes to the cluster. It’s a powerful alternative to traditional SST, especially in cloud-native deployments where time, cost, and efficiency matter.
