Troubleshooting
This page starts with symptoms and points to the first places to inspect. cnmsql surfaces most issues through Cluster/Backup status, Kubernetes Events, and the instance-manager logs.
First commands
kubectl cnmsql status <cluster>
kubectl cnmsql logs <cluster>
kubectl describe cluster <cluster>
kubectl get events --sort-by=.lastTimestamp
kubectl get backup
kubectl get scheduledbackup
Operator logs:
kubectl logs -n cnmsql-system deployment/cnmsql-controller-manager -c manager
Instance logs:
kubectl logs pod/<cluster>-1 -c manager
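The first-look commands can be bundled into one helper so they always run in the same order. This is a minimal sketch: the function name cnmsql_first_look is hypothetical, and it assumes the kubectl cnmsql plugin is installed and that the current namespace holds the Cluster.

```shell
# Hypothetical helper: run the first-look commands for one Cluster in order.
# Each step is allowed to fail so one broken command does not hide the rest.
cnmsql_first_look() {
  cluster=$1
  if [ -z "$cluster" ]; then
    echo "usage: cnmsql_first_look <cluster>" >&2
    return 2
  fi
  echo "== status ==";  kubectl cnmsql status "$cluster" || true
  echo "== cluster =="; kubectl describe cluster "$cluster" || true
  echo "== events ==";  kubectl get events --sort-by=.lastTimestamp || true
  echo "== backups =="; kubectl get backup || true
  kubectl get scheduledbackup || true
}
```

Call it as cnmsql_first_look <cluster> and redirect the output to a file when you need to attach it to a report.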
Cluster is not Ready
Check:
kubectl cnmsql status <cluster>
kubectl cnmsql logs <cluster>
kubectl describe pod <pod>
Common causes:
- cert-manager has not produced TLS Secrets yet;
- PVC is Pending due to storage class or capacity;
- image pull failed;
- unsupported Cluster shape is blocked by the controller;
- instance-manager /status is unavailable;
- initdb, restore, or join init container failed.
Look at status.phase, status.phaseReason, and Events first.
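One way to read those fields without scanning the full describe output is a jsonpath query. The sketch below is illustrative: cluster_phase is a hypothetical function name, and the Events filter only matches Events recorded against the Cluster object itself, not against its Pods or PVCs.

```shell
# Hypothetical helper: print phase and phaseReason on one line,
# then the Events whose involved object is the Cluster itself.
cluster_phase() {
  cluster=$1
  kubectl get cluster "$cluster" \
    -o jsonpath='{.status.phase}{"\t"}{.status.phaseReason}{"\n"}'
  kubectl get events \
    --field-selector "involvedObject.name=$cluster" \
    --sort-by=.lastTimestamp
}
```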
Replica will not join
Check the replica init container logs:
kubectl logs pod/<replica-pod> -c initdb
Common causes:
- primary is not Ready yet;
- mTLS material is missing or invalid;
- source manager endpoint is unreachable;
- XtraBackup stream failed;
- target PVC already contains incompatible data;
- MySQL version/image is incompatible with the source backup.
Replica provisioning uses XtraBackup over the existing instance-manager mTLS port. Network policies or service DNS issues can break the join path.
Replica is Running but not replicating
A replica whose Pod is Running but whose replication has stopped at the SQL layer
(a halted IO/SQL thread with a recorded error, e.g. a duplicate-key conflict) is
reported under status.replicationBrokenInstances and marks the cluster
Degraded, even though the Pod looks healthy. The Degraded condition reason
names the instance and its replication error.
Check:
kubectl cnmsql status <cluster>
kubectl get cluster <cluster> -o jsonpath='{.status.replicationBrokenInstances}'
kubectl logs pod/<replica-pod> -c manager
If the break is not transient, re-initialise the replica (see below) to re-clone it from a backup.
Diverged or broken replica recovery
A replica listed in status.divergedInstances (errant GTIDs) or
status.replicationBrokenInstances (stopped replication) is held out of service.
MySQL has no equivalent of PostgreSQL's pg_rewind to realign it in place, so the remediation is to
re-initialise it. The operator deletes its Pod and PVC and re-clones a fresh
copy from a backup, keeping the instance's name and server_id:
kubectl cnmsql reinit <cluster> <replica>
This is destructive: data only on that instance (including errant transactions) is lost. It is always human-triggered, and the current primary is refused, so switch over first if you need to rebuild a former primary. See the operations runbook.
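Because the operator refuses to re-initialise the current primary, a wrapper can check for that before issuing the command. This is a sketch under one assumption: that status.currentPrimary holds the instance name in the same form as the <replica> argument. The function name safe_reinit is hypothetical.

```shell
# Hypothetical wrapper: refuse locally when the target is the current primary,
# otherwise hand over to the plugin's reinit command.
# Assumes status.currentPrimary uses the same naming as the reinit argument.
safe_reinit() {
  cluster=$1
  replica=$2
  primary=$(kubectl get cluster "$cluster" -o jsonpath='{.status.currentPrimary}')
  if [ "$replica" = "$primary" ]; then
    echo "refusing: $replica is the current primary; switch over first" >&2
    return 1
  fi
  kubectl cnmsql reinit "$cluster" "$replica"
}
```

The check is advisory only; the operator still enforces the rule on its side.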
Primary change is stuck
Inspect:
kubectl cnmsql status <cluster>
Common causes:
- target replica is not healthy;
- target GTID set does not contain the old primary's observed GTID set;
- spec.maxSwitchoverDelay expired;
- old primary could not be demoted or fenced;
- a former primary returned with errant transactions.
Check status.currentPrimary, status.targetPrimary,
status.targetPrimaryTimestamp, status.divergedInstances, and Events.
Automatic failover did not happen
cnmsql blocks failover when it cannot prove a safe candidate.
Check:
kubectl cnmsql status <cluster>
Likely explanations:
- failover delay has not elapsed;
- Kubernetes still reports the primary Pod as Ready;
- no ready replica exists;
- replication SQL state is unhealthy;
- GTID sets are incomparable or divergent;
- every surviving candidate is known-diverged (listed in status.divergedInstances), so promoting one would make errant transactions canonical; the blocked reason says "every replica candidate has diverged ... manual recovery required";
- the only candidate is being deleted.
Failover should not be triggered solely by a temporary manager status endpoint failure while Kubernetes still routes the primary as Ready.
When failover is blocked because the only survivors are diverged, recover by re-initialising a survivor (see "Diverged or broken replica recovery" above) and letting it re-clone from a backup.
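To see whether Kubernetes still reports the primary Pod as Ready, the role label can be combined with a jsonpath filter on the Pod's Ready condition. A minimal sketch, assuming the cluster and role labels are set on the instance Pods; primary_ready is a hypothetical function name.

```shell
# Hypothetical helper: print "<pod-name><TAB><Ready status>" for the primary Pod.
primary_ready() {
  cluster=$1
  kubectl get pods \
    -l "mysql.cnmsql.co/cluster=$cluster,mysql.cnmsql.co/role=primary" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
}
```

A status of True here, combined with an unhealthy primary, matches the "Kubernetes still reports the primary Pod as Ready" case in the list above.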
Backup failed
Inspect:
kubectl describe backup <backup>
kubectl get job <backup>-backup
kubectl logs job/<backup>-backup
Common causes:
- missing object-store configuration;
- missing or invalid S3 credentials;
- no healthy backup source;
- source instance-manager stream failed;
- XtraBackup failed;
- object-store upload failed.
The controller writes the backup phase, error, Job name, selected source instance, destination path, and conditions into Backup status.
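The Backup status and its Job can be pulled together in one pass. This sketch prints the raw status object rather than individual fields, because the exact field names are not listed here; backup_debug is a hypothetical function name.

```shell
# Hypothetical helper: dump Backup status, then the Job and the tail of its logs.
# The Job name follows the <backup>-backup pattern used in the commands above.
backup_debug() {
  backup=$1
  kubectl get backup "$backup" -o jsonpath='{.status}{"\n"}'
  kubectl get job "${backup}-backup"
  kubectl logs "job/${backup}-backup" --tail=50
}
```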
ScheduledBackup did not create a Backup
Inspect:
kubectl describe scheduledbackup <scheduledbackup>
kubectl get backup -l mysql.cnmsql.co/scheduled-backup=<scheduledbackup>
Common causes:
- spec.suspend: true;
- invalid six-field cron expression;
- a child Backup is still running, so the concurrency guard is deferring;
- deterministic Backup name collision with a non-owned Backup;
- first scheduled time has not arrived and immediate is false.
The schedule has six fields including seconds.
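A common mistake is pasting a classic five-field crontab entry. The sketch below only counts fields; it does not validate ranges or the operator's actual parser rules, and has_six_fields is a hypothetical function name.

```shell
# Hypothetical check: succeed only when the schedule has exactly six
# whitespace-separated fields. awk does the splitting so that '*' is
# never glob-expanded by the shell.
has_six_fields() {
  count=$(printf '%s\n' "$1" | awk '{ print NF }')
  [ "$count" -eq 6 ]
}
```

For example, has_six_fields "0 0 2 * * *" succeeds, while has_six_fields "0 2 * * *" fails because it has only five fields.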
Continuous archiving is degraded
Inspect:
kubectl get cluster <cluster> -o jsonpath='{.status.continuousArchiving}'
kubectl describe cluster <cluster>
Common causes:
- object-store endpoint or credentials are wrong;
- primary cannot upload objects;
- active binlog has not rotated yet;
- object-store outage;
- archiver cannot update manifests or _index.json;
- purge guard is detecting lag.
PITR depends on the archive index and manifests, not just raw binlog objects.
PITR target is unsatisfiable
Common causes:
- recovery target is before the base backup anchor;
- target GTID or target time is beyond archived coverage;
- _index.json is missing or stale;
- required binlog segment or manifest was deleted;
- archive has a forked or incoherent timeline.
Prefer targetGTID for exact recovery boundaries. targetTime depends on
binlog event timestamps and server clocks.
Object-store data remains after deleting Backup
This is expected today. Deleting a Backup object does not delete
backup.xbstream or metadata.json from the object store. Remote cleanup is a
planned finalizer/retention feature.
Useful labels
mysql.cnmsql.co/cluster=<cluster>
mysql.cnmsql.co/instance=<instance>
mysql.cnmsql.co/role=primary|replica
mysql.cnmsql.co/scheduled-backup=<scheduledbackup>
These labels make it easier to list Pods, PVCs, Services, and generated Backups for one Cluster or schedule.
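As a sketch of how the labels are used as selectors, the two hypothetical helpers below list the objects for one Cluster and the Backups generated by one schedule. They assume the labels are present on each of the listed resource kinds.

```shell
# Hypothetical helpers built on the label selectors above.

# List Pods, PVCs, and Services that belong to one Cluster.
cluster_objects() {
  kubectl get pods,pvc,svc -l "mysql.cnmsql.co/cluster=$1"
}

# List Backups created by one ScheduledBackup.
schedule_backups() {
  kubectl get backup -l "mysql.cnmsql.co/scheduled-backup=$1"
}
```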