Troubleshooting

This page starts with symptoms and points to the first places to inspect. cnmsql surfaces most issues through Cluster/Backup status, Kubernetes Events, and the instance-manager logs.

First commands

kubectl cnmsql status <cluster>
kubectl cnmsql logs <cluster>
kubectl describe cluster <cluster>
kubectl get events --sort-by=.lastTimestamp
kubectl get backup
kubectl get scheduledbackup

Operator logs:

kubectl logs -n cnmsql-system deployment/cnmsql-controller-manager -c manager

Instance logs:

kubectl logs pod/<cluster>-1 -c manager

Cluster is not Ready

Check:

kubectl cnmsql status <cluster>
kubectl cnmsql logs <cluster>
kubectl describe pod <pod>

Common causes:

cert-manager has not produced TLS Secrets yet;
PVC is Pending due to storage class or capacity;
image pull failed;
unsupported Cluster shape is blocked by the controller;
instance-manager /status is unavailable;
initdb, restore, or join init container failed.

Look at status.phase, status.phaseReason, and Events first.

Replica will not join

Check the replica init container logs:

kubectl logs pod/<replica-pod> -c initdb

Common causes:

primary is not Ready yet;
mTLS material is missing or invalid;
source manager endpoint is unreachable;
XtraBackup stream failed;
target PVC already contains incompatible data;
MySQL version/image is incompatible with the source backup.

Replica provisioning uses XtraBackup over the existing instance-manager mTLS port. Network policies or service DNS issues can break the join path.

Replica is Running but not replicating

A replica whose Pod is Running but whose replication has stopped at the SQL layer (a halted IO/SQL thread with a recorded error, e.g. a duplicate-key conflict) is reported under status.replicationBrokenInstances and marks the cluster Degraded, even though the Pod looks healthy. The Degraded condition reason names the instance and its replication error.

Check:

kubectl cnmsql status <cluster>
kubectl get cluster <cluster> -o jsonpath='{.status.replicationBrokenInstances}'
kubectl logs pod/<replica-pod> -c manager

If the break is not transient, re-initialise the replica (see below) to re-clone it from a backup.

Diverged or broken replica recovery

A replica listed in status.divergedInstances (errant GTIDs) or status.replicationBrokenInstances (stopped replication) is held out of service. MySQL has no pg_rewind to realign it surgically, so the remediation is to re-initialise it. The operator deletes its Pod and PVC and re-clones a fresh copy from a backup, keeping the instance's name and server_id:

kubectl cnmsql reinit <cluster> <replica>

This is destructive: data only on that instance (including errant transactions) is lost. It is always human-triggered, and the current primary is refused, so switch over first if you need to rebuild a former primary. See the operations runbook.

Primary change is stuck

Inspect:

kubectl cnmsql status <cluster>

Common causes:

target replica is not healthy;
target GTID set does not contain the old primary's observed GTID set;
spec.maxSwitchoverDelay expired;
old primary could not be demoted or fenced;
a former primary returned with errant transactions.

Check status.currentPrimary, status.targetPrimary, status.targetPrimaryTimestamp, status.divergedInstances, and Events.

Automatic failover did not happen

cnmsql blocks failover when it cannot prove a safe candidate.

Check:

kubectl cnmsql status <cluster>

Likely explanations:

failover delay has not elapsed;
Kubernetes still reports the primary Pod as Ready;
no ready replica exists;
replication SQL state is unhealthy;
GTID sets are incomparable or divergent;
every surviving candidate is known-diverged (listed in status.divergedInstances), so promoting one would make errant transactions canonical; the blocked reason says "every replica candidate has diverged ... manual recovery required";
the only candidate is being deleted.

Failover should not be triggered solely by a temporary manager status endpoint failure while Kubernetes still routes the primary as Ready.

When failover is blocked because the only survivors are diverged, recover by re-initialising a survivor (see below) and letting it re-clone from a backup.

Backup failed

Inspect:

kubectl describe backup <backup>
kubectl get job <backup>-backup
kubectl logs job/<backup>-backup

Common causes:

missing object-store configuration;
missing or invalid S3 credentials;
no healthy backup source;
source instance-manager stream failed;
XtraBackup failed;
object-store upload failed.

The controller writes the backup phase, error, Job name, selected source instance, destination path, and conditions into Backup status.

ScheduledBackup did not create a Backup

Inspect:

kubectl describe scheduledbackup <scheduledbackup>
kubectl get backup -l mysql.cnmsql.co/scheduled-backup=<scheduledbackup>

Common causes:

spec.suspend: true;
invalid six-field cron expression;
a child Backup is still running, so the concurrency guard is deferring;
deterministic Backup name collision with a non-owned Backup;
first scheduled time has not arrived and immediate is false.

The schedule has six fields including seconds.

Continuous archiving is degraded

Inspect:

kubectl get cluster <cluster> -o jsonpath='{.status.continuousArchiving}'
kubectl describe cluster <cluster>

Common causes:

object-store endpoint or credentials are wrong;
primary cannot upload objects;
active binlog has not rotated yet;
object-store outage;
archiver cannot update manifests or _index.json;
purge guard is detecting lag.

PITR depends on the archive index and manifests, not just raw binlog objects.

PITR target is unsatisfiable

Common causes:

recovery target is before the base backup anchor;
target GTID or target time is beyond archived coverage;
_index.json is missing or stale;
required binlog segment or manifest was deleted;
archive has a forked or incoherent timeline.

Prefer targetGTID for exact recovery boundaries. targetTime depends on binlog event timestamps and server clocks.

Object-store data remains after deleting Backup

This is expected today. Deleting a Backup object does not delete backup.xbstream or metadata.json from the object store. Remote cleanup is a planned finalizer/retention feature.

Useful labels

mysql.cnmsql.co/cluster=<cluster>
mysql.cnmsql.co/instance=<instance>
mysql.cnmsql.co/role=primary|replica
mysql.cnmsql.co/scheduled-backup=<scheduledbackup>

These labels make it easier to list Pods, PVCs, Services, and generated Backups for one Cluster or schedule.

First commands​

Cluster is not Ready​

Replica will not join​

Replica is Running but not replicating​

Diverged or broken replica recovery​

Primary change is stuck​

Automatic failover did not happen​

Backup failed​

ScheduledBackup did not create a Backup​

Continuous archiving is degraded​

PITR target is unsatisfiable​

Object-store data remains after deleting Backup​

Useful labels​

First commands

Cluster is not Ready

Replica will not join

Replica is Running but not replicating

Diverged or broken replica recovery

Primary change is stuck

Automatic failover did not happen

Backup failed

ScheduledBackup did not create a Backup

Continuous archiving is degraded

PITR target is unsatisfiable

Object-store data remains after deleting Backup

Useful labels