Changing Proxmox cluster IPs in a live environment


There's a sentence in the Proxmox cluster setup guide: "Changing the hostname and IP is not possible after cluster creation." We run a 4-node cluster that was placed in a legacy network. As the company grew from a few people to something bigger, I wanted to split the network into VLANs for better security and manageability, but taking the cluster down for many hours wasn't really an option. It turned out that cluster members can be renamed and moved to a different subnet fairly easily, without affecting the running VMs.

Background

A Proxmox cluster uses the corosync process to synchronize vital configuration and commands between cluster members. To avoid corrupting anything, it relies on the concept of quorum, which works like voting: every member has a vote, and changes are allowed only while the members that can reach each other hold more than half of all votes. A cluster in that state is said to be quorate.

This prevents conflicting changes. For example, when one member loses its connection to the rest of the cluster, it has only one vote, so it is not quorate on its own and cannot make any changes. The rest of the cluster keeps working normally, and once the connection is back, the changes made in the meantime are propagated to this member. However, when the cluster as a whole loses quorum (too many nodes are down), no changes to the cluster are possible until the nodes are back.

Imagine a two-node cluster: once one member is down, quorum is lost and nothing can be changed. But what if you need to make changes and you are sure the other member is down and won't touch the configuration in the meantime? It's possible to temporarily increase the number of votes for the remaining node, make the changes and decrease it back; once the other node comes back online, it will fetch the changes. You should know what you are doing, though, to avoid damaging the cluster.
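
The quorum value is a strict majority of the expected votes. Below is a minimal sketch of that arithmetic, assuming every node carries one vote; quorum_for and is_quorate_with are hypothetical helper names, not Proxmox commands.

```shell
#!/usr/bin/env bash
# quorum_for EXPECTED: hypothetical helper, not part of Proxmox.
# Prints the smallest vote count that is strictly more than half of the
# expected votes, i.e. floor(expected / 2) + 1.
quorum_for() {
  local expected=$1
  echo $(( expected / 2 + 1 ))
}

# is_quorate_with PRESENT EXPECTED: succeeds if the votes currently
# present reach the quorum for the given number of expected votes.
is_quorate_with() {
  local present=$1 expected=$2
  [ "$present" -ge "$(quorum_for "$expected")" ]
}

# Print the quorum for a few cluster sizes (one vote per node assumed).
for n in 2 3 4; do
  echo "expected=$n quorum=$(quorum_for "$n")"
done
```

With 4 expected votes the quorum is 3, matching the pvecm status output below; with 2 expected votes it is 2, so the surviving node of a two-node cluster (1 vote) cannot reach quorum on its own.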

To see the state of the cluster members, the current number of votes and the quorum, run:

pvecm status 

It generates output similar to this:

hp2# pvecm status
Quorum information
~~~~~~~~~~~~~~~~~~
Date:             Mon Apr 20 12:30:13 2015
Quorum provider:  corosync_votequorum
Nodes:            4
Node ID:          0x00000001
Ring ID:          1/8
Quorate:          Yes

Votequorum information
~~~~~~~~~~~~~~~~~~~~~~
Expected votes:   4
Highest expected: 4
Total votes:      4
Quorum:           3
Flags:            Quorate

Membership information
~~~~~~~~~~~~~~~~~~~~~~
    Nodeid      Votes Name
0x00000001          1 192.168.15.91
0x00000002          1 192.168.15.92 (local)
0x00000003          1 192.168.15.93
0x00000004          1 192.168.15.94

As you can see, we have 4 members in the cluster and all of them are online (4 votes in total); the minimum number of votes needed for quorum is 3, and the cluster is quorate.
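
When you repeat this check many times during the migration, it can help to test the output in a script. A small sketch, assuming the text layout shown above (it may differ between Proxmox versions); is_quorate and total_votes are hypothetical helper names that read pvecm status output on stdin.

```shell
#!/usr/bin/env bash
# is_quorate: reads `pvecm status` output on stdin and succeeds only if the
# "Quorate:" line says "Yes". The "Flags: Quorate" line does not match,
# because the pattern is anchored at the start of the line.
is_quorate() {
  grep -Eq '^Quorate:[[:space:]]+Yes'
}

# total_votes: prints the "Total votes" figure from the same output.
total_votes() {
  awk -F: '/^Total votes:/ { gsub(/[[:space:]]/, "", $2); print $2 }'
}

# Usage on a cluster node:
#   pvecm status | is_quorate && echo "cluster is quorate"
#   pvecm status | total_votes
```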

Before you touch anything, it's important to know how the corosync service works. There are two config files: /etc/pve/corosync.conf and /etc/corosync/corosync.conf. The content of the /etc/pve directory is kept in sync between the nodes over corosync (by the Proxmox cluster file system, pmxcfs): once you change any file in there, it changes immediately on all nodes. The corosync.conf file in there is even more special: once you save it, the changes are immediately applied to the running corosync process where possible, and the local /etc/corosync/corosync.conf is automatically overwritten. If you change the local /etc/corosync/corosync.conf instead, nothing is propagated. You shouldn't do that: after a reboot the node uses the local file, which can break the cluster.

If you need to edit the corosync config, it's best to copy /etc/pve/corosync.conf somewhere else, edit the copy, and once you are sure it's correct, copy it back over the original. If you edit it directly, don't save the file until you are done and sure it's correct; saving it several times while editing can break things, because every save is synced and applied immediately.
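
The copy, edit, copy-back workflow can be wrapped in a small function so that the synced file is written only once, and only if the edit succeeded. This is a sketch, not a Proxmox tool: edit_via_copy is a hypothetical name, and the target path is a parameter so you can try it on a scratch file first.

```shell
#!/usr/bin/env bash
# edit_via_copy TARGET CMD [ARGS...]: hypothetical helper.
# Copies TARGET to a scratch file, runs CMD with the scratch file as its
# last argument (an editor, a sed script, ...), and copies the result back
# over TARGET only if CMD succeeded and the copy is not empty.
edit_via_copy() {
  local target=$1; shift
  local work rc=1
  work=$(mktemp) || return 1
  if cp "$target" "$work" && "$@" "$work" && [ -s "$work" ]; then
    # One copy back instead of repeated saves on the synced file.
    cp "$work" "$target"
    rc=$?
  fi
  rm -f "$work"
  return "$rc"
}

# Usage on a node (the editor receives the scratch copy as its argument):
#   edit_via_copy /etc/pve/corosync.conf nano
```

Note that quitting the editor without an error counts as success, so this only guards against failed or empty edits, not against a wrong configuration.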

Changing the IP/hostname

There's a guide on changing the cluster IPs on the forum; however, it requires rebooting all the nodes and taking the cluster completely down for a while, which is not something you want on a production system when you are trying to avoid downtime. I managed to do it node by node without affecting anything. Let's start with a single node: first migrate all VMs from it to a different node, because we'll need to reboot the machine after the changes.

Next, update the /etc/pve/corosync.conf file. Changes are propagated as soon as the file is saved, so either save only once all changes are done, or edit a copy and copy it over the original. Open the file, find the block defining your node and change its IP/hostname as needed; also increase config_version in the totem section. The file looks like this:

logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: br-srv-virt01
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.0.1.10
  }
  node {
    name: br-srv-virt02
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.0.1.11
  }
  node {
    name: br-srv-virt03
    nodeid: 3
    quorum_votes: 1
    ring0_addr: 10.0.1.12
  }
  node {
    name: br-srv-virt04
    nodeid: 4
    quorum_votes: 1
    ring0_addr: 10.0.1.13
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: br-cluster
  config_version: 10
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
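
For a pure IP change, the two edits to the working copy (the node's ring0_addr and config_version) can be scripted. A minimal sketch with hypothetical helper names, assuming the layout above where each node block lists name: before ring0_addr:. It rewrites the file through a temporary file, so run it on the scratch copy rather than on /etc/pve/corosync.conf directly.

```shell
#!/usr/bin/env bash
# set_ring0_addr FILE NODE_NAME NEW_IP: hypothetical helper.
# Replaces ring0_addr inside the node block whose "name:" matches NODE_NAME.
set_ring0_addr() {
  local file=$1 node=$2 ip=$3 tmp
  tmp=$(mktemp) || return 1
  if awk -v node="$node" -v ip="$ip" '
    /^[[:space:]]*name:/ { cur = $2 }              # remember current node
    /^[[:space:]]*ring0_addr:/ && cur == node {
      sub(/ring0_addr:.*/, "ring0_addr: " ip)       # keeps indentation
    }
    /^[[:space:]]*}/ { cur = "" }                   # leaving a block
    { print }
  ' "$file" > "$tmp"; then
    mv "$tmp" "$file"
  else
    rm -f "$tmp"; return 1
  fi
}

# bump_config_version FILE: hypothetical helper that increments the
# config_version value in the totem section by one.
bump_config_version() {
  local file=$1 tmp
  tmp=$(mktemp) || return 1
  if awk '/^[[:space:]]*config_version:/ { sub(/[0-9]+/, $2 + 1) } { print }' \
      "$file" > "$tmp"; then
    mv "$tmp" "$file"
  else
    rm -f "$tmp"; return 1
  fi
}

# Usage on a scratch copy (node name and IP are placeholders):
#   cp /etc/pve/corosync.conf /root/corosync.conf.new
#   set_ring0_addr /root/corosync.conf.new br-srv-virt02 10.0.20.11
#   bump_config_version /root/corosync.conf.new
#   cp /root/corosync.conf.new /etc/pve/corosync.conf
```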

Congrats, now you broke your cluster! It's time to fix it: log into the node whose IP you are changing and update the /etc/hosts and /etc/network/interfaces entries with the new IP address (if you are changing the hostname, use the hostnamectl command). Check whether the local /etc/corosync/corosync.conf was updated accordingly; if not, change it manually to match the content on the other hosts, otherwise the node won't join the cluster afterwards. Once all changes are done, reboot the node (or spend some time restarting all the necessary services).
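
The node-side steps can be sketched as two small functions: one swaps the old IP for the new one in the given files, the other checks that the local corosync.conf matches a copy taken from a healthy node. Both names are hypothetical, the IPs are placeholders, and replace_ip assumes GNU sed (the Debian default) for -i and \b.

```shell
#!/usr/bin/env bash
# replace_ip OLD_IP NEW_IP FILE...: hypothetical helper that swaps an IPv4
# address in the given files, e.g. /etc/hosts and /etc/network/interfaces.
replace_ip() {
  local old=$1 new=$2; shift 2
  local old_re
  # Escape the dots so "10.0.1.10" does not match "10.0.1x10".
  old_re=$(printf '%s' "$old" | sed 's/\./\\./g')
  # \b stops 10.0.1.1 from matching inside 10.0.1.11.
  sed -i -E "s/\b${old_re}\b/${new}/g" "$@"
}

# configs_match LOCAL OTHER: succeeds if both files are byte-identical.
configs_match() {
  cmp -s "$1" "$2"
}

# Usage on the node being moved (OTHER_NODE is a placeholder):
#   replace_ip 10.0.1.11 10.0.20.11 /etc/hosts /etc/network/interfaces
#   scp root@OTHER_NODE:/etc/pve/corosync.conf /tmp/corosync.conf.cluster
#   configs_match /etc/corosync/corosync.conf /tmp/corosync.conf.cluster \
#     || echo "local corosync.conf differs from the cluster copy"
```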

Once the node is up, check the pvecm status output: it should show that the node is up and has the correct IP, but cannot see the other nodes and is not quorate. You can inspect the errors with journalctl -u corosync, but it's much easier to get adventurous and run systemctl restart corosync on all the other nodes than to make sense of them. After restarting the corosync service on all nodes, the node with the new IP should join the cluster within a few seconds and work as expected!
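
Restarting corosync on the remaining nodes one after another can be done from a single shell. A sketch, assuming root SSH access between the nodes; restart_corosync_on and the DRY_RUN switch are hypothetical, and the node names are placeholders taken from the example config.

```shell
#!/usr/bin/env bash
# restart_corosync_on NODE...: hypothetical helper that restarts corosync
# on each listed node over SSH, stopping at the first failure.
# Set DRY_RUN=1 to print the commands instead of running them.
restart_corosync_on() {
  local node
  for node in "$@"; do
    if [ "${DRY_RUN:-0}" = 1 ]; then
      echo "ssh root@$node systemctl restart corosync"
    else
      ssh "root@$node" systemctl restart corosync || return 1
    fi
  done
}

# Usage, from the node with the new IP:
#   DRY_RUN=1 restart_corosync_on br-srv-virt01 br-srv-virt03 br-srv-virt04
#   restart_corosync_on br-srv-virt01 br-srv-virt03 br-srv-virt04
#   pvecm status   # the moved node should appear again within seconds
```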

If you are changing the IPs/hostnames of all nodes, just repeat these steps on each node. If you can take the whole cluster down for a while, you can instead change the corosync config for all of them at once, update the IPs on all nodes and reboot them all.
