Building a highly available FreeBSD cluster

This howto describes how to build a highly available load balancer, DNS server, or web server on FreeBSD that also runs on VMware vSphere. It does not cover a database cluster or an NFS cluster, where you have to synchronize your data among all nodes.

Base setup

I tested this howto on FreeBSD 14.2 using two virtual machines named node1 and node2. Relevant settings from /etc/rc.conf:

node1	node2
`hostname="node1.free.bsd"`	`hostname="node2.free.bsd"`
`ifconfig_vtnet0="inet 192.168.0.31 netmask 255.255.255.0"`	`ifconfig_vtnet0="inet 192.168.0.32 netmask 255.255.255.0"`

CARP

CARP allows multiple hosts to share one or more ip addresses. You have to load the carp kernel module and configure the virtual ip address as an alias:

node1	node2
`kld_list="carp"`	`kld_list="carp"`
`ifconfig_vtnet0_alias0="vhid 30 192.168.0.30/24 pass S3cr3t"`	`ifconfig_vtnet0_alias0="vhid 30 192.168.0.30/24 pass S3cr3t"`

By default, carp sends its announcements to the multicast ip address 224.0.0.18, which in turn resolves to the multicast mac address 01:00:5e:00:00:12. Since VMware blocks such multicast addresses, carp can instead send its announcements directly to the peer node's unicast address:

node1	node2
`ifconfig_vtnet0_alias0="vhid 30 192.168.0.30/24 pass S3cr3t peer 192.168.0.32"`	`ifconfig_vtnet0_alias0="vhid 30 192.168.0.30/24 pass S3cr3t peer 192.168.0.31"`
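
The mapping from the multicast ip address to its multicast mac address mentioned above follows the standard IPv4 rule: the fixed prefix 01:00:5e followed by the low 23 bits of the ip address. The function below is a minimal sketch for checking this by hand. The name mcast_mac is made up for this example and is not part of FreeBSD.

```sh
#!/bin/sh
# Map an IPv4 multicast address to its Ethernet multicast mac address:
# prefix 01:00:5e plus the low 23 bits of the ip address.
# mcast_mac is a hypothetical helper name.
mcast_mac() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address into its four octets
    IFS=$old_ifs
    # The first octet does not contribute; the top bit of the second is masked.
    printf '01:00:5e:%02x:%02x:%02x\n' "$(( $2 & 127 ))" "$3" "$4"
}

mcast_mac 224.0.0.18   # prints 01:00:5e:00:00:12
```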

To activate these settings, either reboot or do:

# service kld restart
# service netif restart vtnet0
# service routing restart

The cluster ip address 192.168.0.30, which is shared by the two hosts, resolves to the mac address 00:00:5e:00:01:1e, which comes from IANA's range for VRRP addresses (the last byte, 0x1e, is the vhid 30). VRRP is the patented predecessor of CARP, and thus both protocols share several technical details. The problem is that VMware drops packets destined for such mac addresses, too. Therefore, you have to activate an additional ip address that resolves to a regular unicast mac address whenever the node becomes master for the (unusable) cluster ip address, and deactivate it again whenever the node becomes backup. Luckily, carp events are forwarded to devd. Thus, you can create the file /etc/devd/carp.conf on both nodes containing this configuration:

notify 0 {
  match  "system"    "CARP";
  match  "subsystem" "30@vtnet0";
  match  "type"      "MASTER";
  action "exec /sbin/ifconfig vtnet0 inet 192.168.0.29/24 alias";
};
notify 0 {
  match  "system"    "CARP";
  match  "subsystem" "30@vtnet0";
  match  "type"      "BACKUP";
  action "exec /sbin/ifconfig vtnet0 inet 192.168.0.29/24 -alias";
};

That makes 192.168.0.29 the actual cluster ip address, which your server daemons should listen on and your clients should talk to.
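
If you use a different vhid, the virtual mac address that VMware drops changes with it. As a minimal sketch, assuming the VRRP-style layout 00:00:5e:00:01:XX with XX being the vhid in hexadecimal, it can be computed like this. The function name carp_mac is made up for this example.

```sh
#!/bin/sh
# Print the virtual mac address that belongs to a given vhid (1..255).
# carp_mac is a hypothetical helper name.
carp_mac() {
    if [ "$1" -lt 1 ] || [ "$1" -gt 255 ]; then
        echo "vhid must be between 1 and 255" >&2
        return 1
    fi
    printf '00:00:5e:00:01:%02x\n' "$1"
}

carp_mac 30   # prints 00:00:5e:00:01:1e
```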

In my tests, carp changed its state faster than the other node released the address, and I got kernel messages like "xx:xx:xx:xx:xx:xx is using my IP address 192.168.0.29 on vtnet0!" whenever that happened. Thus, I added a small delay:

notify 0 {
  match  "system"    "CARP";
  match  "subsystem" "30@vtnet0";
  match  "type"      "MASTER";
  action "sleep 0.5; exec /sbin/ifconfig vtnet0 inet 192.168.0.29/24 alias";
};
notify 0 {
  match  "system"    "CARP";
  match  "subsystem" "30@vtnet0";
  match  "type"      "BACKUP";
  action "exec /sbin/ifconfig vtnet0 inet 192.168.0.29/24 -alias";
};

Since waiting a fixed amount of time before activating the cluster ip address on the local node may be inaccurate, you can instead ping that address first and wait until it no longer answers:

notify 0 {
  match  "system"    "CARP";
  match  "subsystem" "30@vtnet0";
  match  "type"      "MASTER";
  action "while ping -q -c 1 -t 1 192.168.0.29; do sleep 0.5; done; exec /sbin/ifconfig vtnet0 inet 192.168.0.29/24 alias";
};
notify 0 {
  match  "system"    "CARP";
  match  "subsystem" "30@vtnet0";
  match  "type"      "BACKUP";
  action "exec /sbin/ifconfig vtnet0 inet 192.168.0.29/24 -alias";
};

To activate these settings, either reboot or do:

# service devd restart
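
The ping loop shown earlier waits forever if the old master never releases the address. A possible variation, sketched here under the assumption that you move the logic into a small script of your own, is to cap the number of attempts. The function name wait_until_free and its arguments are made up for this example.

```sh
#!/bin/sh
# Wait until a probe command FAILS (i.e. the address no longer answers),
# but give up after MAX_TRIES attempts.
# Usage: wait_until_free MAX_TRIES DELAY PROBE [ARGS...]
# Returns 0 once the probe fails, 1 if it still succeeds after MAX_TRIES.
wait_until_free() {
    max_tries=$1 delay=$2
    shift 2
    tries=0
    while "$@" >/dev/null 2>&1; do
        tries=$((tries + 1))
        if [ "$tries" -ge "$max_tries" ]; then
            return 1    # still answering: let the caller decide what to do
        fi
        sleep "$delay"
    done
    return 0
}

# Possible use from the devd action (commented out; needs root on FreeBSD).
# Here the address is configured even if the old master still answers:
# wait_until_free 10 0.5 ping -q -c 1 -t 1 192.168.0.29
# /sbin/ifconfig vtnet0 inet 192.168.0.29/24 alias
```

Whether the new master should take the address anyway after the last attempt, or rather log an error and leave it alone, depends on how much you trust the old master to be really gone.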

Gratuitous ARP

When configuring a new ip address on a local network interface, FreeBSD sends one gratuitous ARP packet for that address by default. If just one gratuitous arp is not enough to update the arp caches of all hosts connected to the local network segment (read: the broadcast domain), you can increase the number of additional arp packets via a sysctl setting, e.g. for 2 additional gratuitous arp packets, 3 in total:

# sysctl net.link.ether.inet.garp_rexmit_count=2

By default, garp_rexmit_count is zero. To have it set during system boot, add it to /etc/sysctl.conf.
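
Since this howto adds several lines to /etc/sysctl.conf, a small helper that appends a line only if it is not there yet can keep the file free of duplicates. This is a minimal sketch; the name ensure_line is made up for this example, and it only detects identical lines, not the same key with a different value.

```sh
#!/bin/sh
# Append LINE to FILE unless an identical line is already present.
# ensure_line is a hypothetical helper name.
ensure_line() {
    file=$1 line=$2
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Possible use (commented out; needs root):
# ensure_line /etc/sysctl.conf "net.link.ether.inet.garp_rexmit_count=2"
```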

Service

I installed BIND on both nodes as an example of a service that needs neither its state nor its data synchronized among the nodes. Installation is simple. Either build it from ports:

# make -C /usr/ports/dns/bind920 install clean

or install the binary package:

# pkg install bind920

I configured BIND to listen on any locally configured ip address in the 192.168.0.0/24 range. On both nodes, edit /usr/local/etc/namedb/named.conf and add that range to the listen-on statement in the options block:

...
options {
        ...
        listen-on       { 127.0.0.1; 192.168.0.0/24; };
        ...
};
...

Of course you have to enable BIND in /etc/rc.conf:

# sysrc named_enable="YES"

Either reboot or start named manually:

# service named start

Monitoring

If your service, e.g. named, goes down on a node, carp won't notice. If that happens on the master node, queries to the cluster ip address (192.168.0.29) go unanswered. Thus, put watchdog.sh into /usr/local/sbin, make it executable, and create /etc/cron.d/carp:

* * * * * root /usr/local/sbin/watchdog.sh named

As mentioned in the comments section at the top of watchdog.sh, add this to /etc/sysctl.conf to allow carp to forcefully change a node's role to master, but not during system boot when named might not be running yet:

net.inet.carp.preempt=1
net.inet.carp.demotion=150

Reload these settings:

# service sysctl restart
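
To illustrate the idea behind such a watchdog, here is a minimal sketch of the decision it has to make. This is not the watchdog.sh referred to above. It relies on the behaviour documented in carp(4) that a signed value written to net.inet.carp.demotion is added to the current demotion factor, and it assumes that the watchdog is the only source of demotion on the node. The function name demotion_delta is made up for this example.

```sh
#!/bin/sh
# Hypothetical sketch, NOT the watchdog.sh mentioned in this howto.
# Decide how to adjust net.inet.carp.demotion for one watched service.
#   $1: "up" or "down" (current state of the service)
#   $2: current value of net.inet.carp.demotion
#   $3: penalty for a failed service (150 in this howto)
# Prints the signed value to write to the sysctl, or 0 for "no change".
demotion_delta() {
    state=$1 current=$2 penalty=$3
    if [ "$state" = up ] && [ "$current" -ge "$penalty" ]; then
        echo "-$penalty"    # service is running: lift the penalty
    elif [ "$state" != up ] && [ "$current" -lt "$penalty" ]; then
        echo "$penalty"     # service failed: demote this node
    else
        echo 0              # nothing to do
    fi
}

# Possible use on FreeBSD (commented out; needs root and a carp setup):
# if service named status >/dev/null 2>&1; then s=up; else s=down; fi
# d=$(demotion_delta "$s" "$(sysctl -n net.inet.carp.demotion)" 150)
# [ "$d" = 0 ] || sysctl net.inet.carp.demotion="$d"
```

With the boot-time value of 150 from /etc/sysctl.conf, a node starts demoted and is promoted by the first run that finds named running. Other demotion sources, such as an interface going down, would confuse this simple comparison.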

Unprivileged listen()

named, for one, monitors all local network interfaces for ip address changes. If an ip address appears that falls within one of the configured listen-on ranges, named starts listening on that particular address, on port 53 by default. Since named drops root privileges after its initialization, it is by then no longer allowed to bind() to port 53, and thus can neither accept() connections nor recvfrom() packets on the new address. You can work around this using at least one of the following four methods:

Run named as a privileged user with uid 0, i.e. root. This may break security if named contains a remotely exploitable bug.
Let named listen on a port number higher than 1023, e.g. 8053. This breaks almost any client.
Allow unprivileged users to listen on ports higher than 52 by lowering the sysctl setting net.inet.ip.portrange.reservedhigh to 52. This may break security, too, and prevent crucial services like ntpd or syslogd from starting if an unprivileged user is already listening on a port lower than 1024.
Use mac_portacl to allow only certain unprivileged users to listen on ports below 1024. This is explained in the following paragraph.

This sysctl setting will allow the user id 53 to use local TCP and UDP port 53:

security.mac.portacl.rules=uid:53:tcp:53,uid:53:udp:53
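
If you need the same kind of rule for another service, the rule string can be generated instead of typed. This is a minimal sketch that covers only the uid form used above; the function name portacl_rules is made up for this example.

```sh
#!/bin/sh
# Build a mac_portacl rule list that lets user id UID bind to PORT
# over both TCP and UDP. portacl_rules is a hypothetical helper name.
portacl_rules() {
    uid=$1 port=$2
    printf 'uid:%s:tcp:%s,uid:%s:udp:%s\n' "$uid" "$port" "$uid" "$port"
}

portacl_rules 53 53   # prints uid:53:tcp:53,uid:53:udp:53
```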

The ports protected by mac_portacl must not be covered by the range given by net.inet.ip.portrange.reservedlow and net.inet.ip.portrange.reservedhigh. To force the use of mac_portacl for any low ports, add this to /etc/sysctl.conf as well:

net.inet.ip.portrange.reservedlow=0
net.inet.ip.portrange.reservedhigh=0
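
The effect of the reserved range on port 53 can be sketched as a simple range test. This is an illustration based on the documented meaning of the two sysctls, not a copy of the kernel code; the function name is_reserved is made up for this example.

```sh
#!/bin/sh
# Sketch of the reserved-port test: binding to PORT requires root
# if reservedlow <= PORT <= reservedhigh.
# is_reserved is a hypothetical helper name.
is_reserved() {
    port=$1 low=$2 high=$3
    [ "$port" -ge "$low" ] && [ "$port" -le "$high" ]
}

# Port 53 under the default range, with reservedhigh lowered to 52,
# and with the range used for mac_portacl.
for range in "0 1023" "0 52" "0 0"; do
    set -- $range
    if is_reserved 53 "$1" "$2"; then
        echo "reservedlow=$1 reservedhigh=$2: port 53 is root-only"
    else
        echo "reservedlow=$1 reservedhigh=$2: port 53 is not reserved"
    fi
done
```

With the range set to 0 and 0, the reserved-port check no longer applies to port 53, so mac_portacl alone decides who may bind to it.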

Not strictly necessary, but these two settings explicitly allow the root user to use any local port number and prevent mac_portacl from protecting port numbers above 1023. These are the defaults, but I added them to sysctl.conf just to be sure:

security.mac.portacl.suser_exempt=1
security.mac.portacl.port_high=1023

To load mac_portacl during system boot, add it to the list of kernel modules:

# sysrc kld_list+=mac_portacl

Now either reboot, or load and enforce mac_portacl manually:

# service kld restart
# service sysctl restart