Replication and failover architecture

cnmsql builds a primary-replica topology with Percona Server for MySQL GTID replication. The operator owns topology policy, while each instance manager owns local mysqld role changes. This split keeps primary changes declarative: the operator writes the intended primary in Cluster status, and the pods converge themselves toward that target.

An alternative Group Replication topology is also available. When spec.replication.mode: groupReplication, the group itself owns membership, write certification, and primary election. The operator becomes an observer of the group's decisions. This page documents the default async replication mode.

Replication model

Replicas are created from a physical XtraBackup stream taken from the current primary. After prepare and copy-back, the instance manager configures MySQL with GTID auto-positioning so it follows the primary from the restored GTID point.

MySQL transport TLS is rendered for every instance. Replication uses the per-instance certificate material and a dedicated replication account requiring X509. The application-facing require_secure_transport setting remains a user choice through spec.mysql.parameters.

Semi-synchronous replication can be enabled with:

spec:
  instances: 3
  minSyncReplicas: 1
  maxSyncReplicas: 1
  mysql:
    semiSync:
      enabled: true
      timeoutMillis: 1000

Semi-sync improves failover durability when configured so acknowledged commits reach at least one replica. Without that guarantee, an acknowledged write that only existed on a lost primary can be lost before a replica or the object store sees it.

Semi-sync self-healing (data durability)

With semi-sync on, the primary blocks each commit until minSyncReplicas replicas acknowledge it. If a synchronous replica becomes unhealthy, that floor can stall writes. spec.mysql.semiSync.dataDurability decides how the operator responds:

spec:
  minSyncReplicas: 2
  mysql:
    semiSync:
      enabled: true
      dataDurability: preferred # or "required"

With preferred (the default), the operator keeps lowering the primary's required acknowledgement count (rpl_semi_sync_source_wait_for_replica_count) to the number of healthy replicas, never below one, so writes keep flowing during a replica outage. It raises the count back to minSyncReplicas as replicas recover. This favours availability over strict durability.

With required, the count stays pinned to minSyncReplicas. When fewer healthy replicas can acknowledge, writes block until timeoutMillis elapses and replication falls back to async. This favours durability over availability.

The operator applies the change on the primary over the mTLS control API during its steady-state reconcile. It is a runtime adjustment only; the static my.cnf floor does not change.

Probes

The liveness probe (/livez) does not depend on mysqld being up. The manager answering the probe is itself the liveness signal, so a deliberately stopped mysqld (for example while the instance is fenced) does not trigger a kubelet restart. The only failure mode that restarts a container is a primary that has lost contact with the Kubernetes API server: each cluster-managed primary keeps probing the API server, and if it cannot reach it for 30 seconds it treats itself as network-isolated and fails liveness. This is a last-resort guard against a partitioned primary the operator can no longer coordinate, which is a split-brain risk. The check runs locally inside the instance, so a genuinely isolated primary still restarts itself even while it is unreachable from the control plane. A replica never restarts itself on isolation; there is nothing to protect against.

The startup probe (/startupz) gates on mysqld first accepting connections, so the container is considered started only once the server is up. The readiness probe (/readyz) additionally requires healthy replication on a replica, so a replica that is up but not replicating is held out of routing.

Fencing an instance

Fencing takes a single instance out of service without deleting it or its data. Use the plugin:

kubectl cnmsql fence on <cluster> <cluster>-2
kubectl cnmsql fence off <cluster> <cluster>-2

The operator drops the Pod from every routing Service (rw, ro, r, and any user-defined ones) by clearing its routable label, and records it under status.fencedInstances. The instance's in-Pod reconciler reads that list and stops mysqld. The manager stays alive as PID 1 so the Pod keeps running and keeps answering its control and liveness endpoints; only the database is down. The continuous archiver stands down with it, and a fenced instance is skipped as a failover candidate, so the operator never promotes it. Because mysqld is stopped, the Pod reports NotReady and shows as 0/1 Running. Unfencing restarts mysqld and the instance rejoins normal routing and role reconciliation.

Fencing the primary stops writes for the whole cluster, because the rw Service loses its only endpoint. That is deliberate: use it to freeze an instance for inspection or maintenance rather than as a failover trigger.

Primary lease fencing

Each cluster creates a standard coordination.k8s.io/v1 Lease named {cluster}-primary. Before an instance promotes itself to primary, it must acquire this lease. While it is primary, it renews the lease every 30 seconds. When it demotes or shuts down, it releases (deletes) the lease.

During automatic failover, the operator checks whether the old primary's lease is still held before promoting a candidate. If the lease is still active, failover waits for the lease TTL (15 seconds) before proceeding. This prevents the operator from promoting a new primary while the old one might still be accepting writes on a network-partitioned node.

Combined with Pod deletion, this gives two independent timeouts before the old primary can be replaced: the lease TTL and the kubelet Pod termination grace period. The split-brain window narrows to whichever expires first.

Feature gate

spec:
  enablePrimaryLease: false

The feature is on by default. Set it to false to skip all lease management. This is safe for single-instance or test clusters where the extra API call is unnecessary.

Lifecycle

Event	Lease action
Instance promotes itself	Acquire (create or take over the lease, set `holderIdentity`)
Instance is already primary	Renew `renewTime`
Instance demotes or is fenced	Delete the lease
Instance shuts down	Delete the lease (graceful shutdown handler)
Lease acquisition fails	Instance does not promote; retries on next reconcile
API server unreachable	Lease cannot be renewed. The instance's liveness probe eventually fails (isolation detector) and kubelet restarts the container.

Failover interaction

The lease adds a 15-second wait to failover when the old primary is still renewing. The operator sequence becomes:

Detect primary failure, wait for failoverDelay.
Select a candidate.
Check the old primary's lease: if still held, requeue for 15 seconds.
Fence the old primary Pod.
Set targetPrimary and let the candidate promote.

If the old primary's Pod is deleted quickly (step 4), the lease is also released because the in-Pod shutdown handler fires. In that case the wait in step 3 is negligible. The lease guard is a backstop for when the Pod takes longer than 15 seconds to terminate.

Monitoring

The lease exists in the same namespace as the Cluster. Inspect it:

kubectl get lease <cluster>-primary -o yaml

The holderIdentity, renewTime, and leaseTransitions fields show which instance holds the lease and when it last renewed. An expired lease (renewTime older than 15 seconds with a different holderIdentity) means the holder stopped renewing and the lease is safe to take.

Deleting a cluster

Deleting a Cluster tears down its instances. Like CloudNativePG, the operator does not hold a finalizer to block the delete: kubectl delete cluster <cluster> removes the Pods immediately, and owned resources are garbage-collected via owner references.

To protect the data, rely on the standard Kubernetes mechanisms — set a Retain reclaim policy on the StorageClass (or retain the PVCs) so the volumes survive the Cluster deletion, and use RBAC to restrict who can delete Clusters.

Role services

cnmsql creates three default Services:

<cluster>-rw: selects the current primary.
<cluster>-ro: selects ready replicas.
<cluster>-r: selects any ready instance.

Default Services can be disabled by name (the rw service cannot be disabled):

spec:
  managed:
    services:
      disabledDefaultServices:
        - ro

Customising default services

A shared template is merged onto the three default Services, letting you change their type and add labels and annotations. The operator always owns the selector and the mysql:3306 port.

spec:
  managed:
    services:
      template:
        metadata:
          labels:
            app.kubernetes.io/part-of: my-app
          annotations:
            service.beta.kubernetes.io/aws-load-balancer-scheme: internal
        spec:
          type: LoadBalancer

Additional services

You can declare extra Services routed to a role (rw, ro, or r). Each entry is rendered as <cluster>-<name> and carries its own template. updateStrategy: patch (default) merges your template onto the role defaults; replace swaps them entirely, keeping only the operator-owned selector, ports, and owner-tracking labels. Additional service names must be unique and must not collide with the default rw/ro/r names.

spec:
  managed:
    services:
      additional:
        - name: mysql-lb
          selectorType: rw
          serviceTemplate:
            spec:
              type: LoadBalancer
        - name: mysql-internal-read
          selectorType: ro
          updateStrategy: replace
          serviceTemplate:
            metadata:
              labels:
                pool: reporting
            spec:
              type: ClusterIP

The user-customisable spec fields are type, externalTrafficPolicy, sessionAffinity, loadBalancerSourceRanges, externalName, and healthCheckNodePort. The selector, ports, and clusterIP are operator-managed and cannot be overridden. Per-instance headless Services remain internal ClusterIP: None and are not user-configurable.

Service routing is driven by Pod labels. The operator updates labels only after the database role change is considered safe, so client traffic follows the observed MySQL topology.

Dynamic role reconciliation

Every instance starts read only. The in-pod role reconciler watches the owning Cluster and compares its own name with status.targetPrimary and status.currentPrimary.

If the pod is the target primary, it drains replication state, promotes itself, clears read-only mode, and writes status.currentPrimary.

If the pod is not the target primary, it stays or becomes read only and follows the current primary. A diverged instance is kept read only and is not silently re-cloned over its retained PVC.

Planned switchover

A planned switchover promotes a named healthy replica. cnmsql models this like CloudNativePG: the request is a status transition rather than a spec change. The normal trigger is to set status.targetPrimary to a replica name.

The operator then:

validates that the target exists, is ready, is a replica, and has healthy replication threads;
checks that the target GTID set contains the old primary's observed GTID set;
demotes the old primary while it is still reachable;
lets the target promote itself;
reconfigures remaining replicas to follow the new primary;
updates role labels, Services, currentPrimary, and conditions.

spec.maxSwitchoverDelay bounds how long the target may take to catch up before the switchover is aborted and surfaced as blocked.

Automatic failover

Automatic failover begins when the established primary is unreachable or not healthy. The operator records when the primary first started failing and waits for spec.failoverDelay. A value of 0 means immediate failover.

After the delay, the operator selects a candidate only if it can prove the move is safe enough:

the candidate must be ready and a replica;
replication SQL state must be healthy and free of a last error;
the candidate must not be a known-diverged instance;
GTID sets must be comparable;
the chosen candidate must contain the best known GTID history among candidates.

If the GTID sets are divergent or no safe candidate exists, failover is blocked instead of risking data loss.

During failover, the old primary is fenced by removing its primary role and deleting its Pod while retaining the PVC. The promoted replica becomes currentPrimary, and surviving replicas follow it.

Excluding diverged candidates

A diverged replica carries errant transactions, so its GTID set is a superset of the clean replicas'. The "contains the best known GTID history" rule would therefore pick it, and promoting it makes those errant transactions canonical and strands the clean replicas. To prevent this, the operator filters known-diverged instances out of the candidate pool before the GTID comparison.

The subtlety is that divergence is detected by comparing each replica against the primary's GTID set, but during a failover the primary is unreachable, so there is no live baseline. The operator therefore consults status.divergedInstances as it was recorded on an earlier reconcile, while the primary was still reachable. That persisted signal survives the outage.

If every surviving candidate is known-diverged, failover blocks with "every replica candidate has diverged from the failed primary (errant transactions); manual recovery required". Re-initialise a survivor (see below) to recover. One case it cannot catch is a replica that diverged during the same outage, with no earlier reconcile to record it. That is the fundamental limit of MySQL replication once the old primary is gone to compare against.

Former primary rejoin

When a fenced or crashed primary returns, it does not automatically become primary again. It boots read only, observes Cluster status, and attempts to follow the promoted primary.

If its GTID set is contained in the new primary's GTID set, it can safely rejoin as a replica. If it contains errant transactions that the promoted primary never saw, cnmsql marks it diverged and keeps it out of service. The retained PVC is left for deliberate human recovery instead of being destroyed.

Detecting a silently broken replica

Divergence is detected by comparing GTID sets, but a replica can also stop replicating at the SQL layer without diverging. A duplicate-key conflict (error 1062), for example, halts the SQL thread with a recorded last error. Such a replica may still report Running, so it would otherwise sit unnoticed while falling further behind.

The operator polls the control API of every reachable instance, including pods that are Running but not yet Ready, and surfaces any replica with a stopped IO or SQL thread and a recorded error under status.replicationBrokenInstances. That marks the cluster Degraded with a reason naming the instance and its replication error, rather than leaving it to look like an instance still finishing provisioning. Re-initialise the instance to recover.

Re-initialising an instance

MySQL has no pg_rewind, so a diverged or irrecoverably broken replica cannot be surgically realigned. The remediation, which mirrors CloudNativePG's destroy-and-rebootstrap fallback, is to re-initialise the instance. The operator deletes its Pod and PVC and recreates them empty, so the bootstrap re-clones a fresh copy from a backup and rejoins replication. The instance keeps its name and ordinal, so it keeps its server_id; only its data is discarded.

This is always human-triggered. The operator never re-clones an instance over its retained PVC on its own, so errant data is preserved for diagnosis until you decide to discard it. Trigger it with the plugin or the annotation:

kubectl cnmsql reinit <cluster> <cluster>-2
# or, equivalently:
kubectl annotate cluster <cluster> \
  cnmsql.cnmsql.co/reinit=<cluster>-2

The annotation is a comma-separated list of instance names the operator consumes, tearing each down and recreating it before clearing the entry. The current primary is refused, because it is the replication source, so switch over first if you need to rebuild a former primary. See the operations runbook for the full procedure.

Status and events

Useful status fields during topology changes:

currentPrimary: the instance currently accepted as primary.
targetPrimary: the desired primary.
currentPrimaryTimestamp: when the current primary became primary.
targetPrimaryTimestamp: when a primary change was requested.
primaryFailingSince: when the current primary became unhealthy.
divergedInstances: instances excluded because their GTID set is unsafe.
replicationBrokenInstances: reachable replicas whose replication aborted with a recorded SQL/IO error.
fencedInstances: instances fenced out of routing, with mysqld stopped.
gtidExecutedByInstance: last observed GTID state per instance.

Watch Events for phase transitions such as switchover, failover, fencing, and blocked operations. The operator also reports Ready, Progressing, and Degraded conditions on the Cluster.

Operational notes

Use three instances for meaningful automatic failover.
Prefer semi-sync when the recovery objective requires acknowledged writes to survive primary loss.
Keep failover delay low for availability, but high enough to avoid promoting during transient node or network blips.
Do not manually write to a replica or recovered former primary. Errant GTIDs intentionally block automatic rejoin.
Do not delete retained PVCs until you have decided the data is no longer needed for diagnosis or manual recovery.

Verification coverage

Unit tests cover GTID parsing and containment, candidate selection, switchover validation, failover delay, blocked failover, role-label updates, and divergence detection. Kind e2e coverage validates planned switchover, automatic failover by primary Pod deletion, service rerouting, writes after promotion, and compatible former-primary rejoin.

Replication model​

Semi-sync self-healing (data durability)​

Probes​

Fencing an instance​

Primary lease fencing​

Feature gate​

Lifecycle​

Failover interaction​

Monitoring​

Deleting a cluster​

Role services​

Customising default services​

Additional services​

Dynamic role reconciliation​

Planned switchover​

Automatic failover​

Excluding diverged candidates​

Former primary rejoin​

Detecting a silently broken replica​

Re-initialising an instance​

Status and events​

Operational notes​

Verification coverage​