Consul Server Upgrade
Upgrading a Consul cluster must follow the principle of "step by step, Servers before Clients, maintain compatibility". If the version gap is too large (e.g., jumping directly from a 1.1x release to a 1.2x release), a direct upgrade can cause Raft protocol incompatibility or data corruption.
Below is a detailed upgrade roadmap and procedure:
1. Core Principles Before Upgrading
- Version gap limit: For non-LTS versions, it is recommended to skip no more than 2 minor versions at a time (e.g., 1.18 → 1.20). For very old versions, follow the official intermediate version path and upgrade step by step.
- Backup, backup, backup: Before touching any node, take a snapshot: `consul snapshot save backup_name.snap`
- Order: Always upgrade Server nodes first, then Client nodes.
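The backup step can be sketched as follows. This is a minimal example, not a definitive procedure; the file name and date suffix are assumptions, and the commands must run on a machine with access to the cluster (and an ACL token, if ACLs are enabled):

```shell
# Take a snapshot before the upgrade (any Server with a healthy Leader will do)
consul snapshot save "backup-$(date +%Y%m%d).snap"

# Verify the snapshot is readable before proceeding with the upgrade
consul snapshot inspect "backup-$(date +%Y%m%d).snap"
```

`consul snapshot inspect` prints the snapshot's index and size; if it fails, do not proceed with the upgrade until you have a good backup.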
2. Server Cluster Rolling Upgrade
Server upgrades must maintain Quorum at all times; otherwise the cluster will enter an election loop.
- Identify the current Leader: Use `consul operator raft list-peers` to find the current Leader.
- Upgrade Follower nodes first:
  - Pick a Follower node and run `consul leave` (graceful exit).
  - Replace the old Consul binary.
  - Start the new version of Consul.
  - Verify health: Check logs and confirm the node has rejoined the cluster and synced its index (`commit_index` close to the Leader's).
  - Repeat for all remaining Followers.
- Upgrade the Leader last:
  - Run `consul leave` on the old Leader. This forces a new election, promoting an already-upgraded Follower to Leader.
  - Replace the old Leader's binary and start it using the same procedure.
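The per-node steps above can be sketched as a short sequence. This is an illustrative sketch only: the binary path and the use of systemd are assumptions about your environment, and each command should be verified manually before moving to the next node:

```shell
# Run on the cluster: confirm which node is the Leader before touching anything
consul operator raft list-peers

# --- On the Follower being upgraded ---

# Leave the cluster gracefully so Quorum is handed over cleanly
consul leave

# Replace the binary (destination path is an assumption; adjust for your install)
sudo install -m 0755 ./consul-new /usr/local/bin/consul

# Restart the agent (systemd unit name is an assumption)
sudo systemctl start consul

# Verify the node rejoined and its Raft index is catching up to the Leader's
consul info | grep -E 'commit_index|last_log_index'
```

Wait until `commit_index` is close to the Leader's before moving on to the next Follower; upgrading a second node while the first is still syncing risks losing Quorum.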
3. Upgrading in Kubernetes (Helm)
If you run Consul on K8s, you can use `updatePartition` for fine-grained rolling upgrades:
- Update `values.yaml`: Change the image version to the target version.
- Set the update partition: For safety, set `server.updatePartition: 3` (assuming 3 replicas) so Helm won't automatically update all Pods at once.
- Roll manually: Decrease `updatePartition` one step at a time (3 → 2 → 1 → 0), triggering each Server Pod to restart with the new version.
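A minimal sketch of this partitioned roll, assuming the official `hashicorp/consul` Helm chart with a release named `consul` in the `consul` namespace (the release name, namespace, target image tag, and StatefulSet name are all assumptions to adapt to your setup):

```shell
# Step 1: bump the image but hold all Server Pods back behind the partition
helm upgrade consul hashicorp/consul -n consul \
  --set global.image=hashicorp/consul:1.20.0 \
  --set server.updatePartition=3

# Step 2: lower the partition one step at a time; each step restarts one Pod
for p in 2 1 0; do
  helm upgrade consul hashicorp/consul -n consul \
    --set global.image=hashicorp/consul:1.20.0 \
    --set server.updatePartition="$p"
  # StatefulSet name depends on your release name; verify with `kubectl get sts`
  kubectl -n consul rollout status statefulset/consul-server
done
```

Pausing between partition steps until `rollout status` reports the Pod is ready mirrors the "verify health before the next node" rule from the VM-based procedure above.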
4. Common Troubleshooting
- Split-brain / No Leader: If no Leader can be elected after the upgrade, check whether the Raft protocol version is consistent across all nodes (use `consul info`).
- Sync failure: If a new node fails to sync data, check disk space and whether permissions on the data directory (`data_dir`) changed after the restart.
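These checks can be run as a quick triage pass on each Server. A sketch, assuming a conventional `data_dir` of `/opt/consul` (your path may differ; check your agent configuration):

```shell
# Compare Raft protocol versions and leader status across all Servers
consul info | grep -E 'protocol_version|leader'

# Inspect the current peer set; a node stuck outside it points at join problems
consul operator raft list-peers

# Check free space and ownership on the data directory
# (/opt/consul is an assumed data_dir; substitute your own)
df -h /opt/consul
ls -ld /opt/consul
```

If the protocol versions differ across nodes, finish upgrading the stragglers before attempting any further recovery; mixed Raft protocol versions are the most common cause of a post-upgrade election loop.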