Sunday morning (5-7AM UTC) is the scheduled maintenance window for the Toolserver. During this time, we apply patches (software updates), and do any other work that might result in downtime. Now, patching a system is an inherently dangerous process: you’re changing the software running on the system, and there’s always a chance of new bugs being introduced. Of course, patches go through QA testing before being released, but sometimes problems still slip through. And since many patches are large and have to be installed in single-user mode (where no services are running), even a harmless patch means downtime during installation.
Live Upgrade
At the Toolserver, we mitigate patching problems with a process known as Live Upgrade. Live Upgrade works by creating a copy of the running system (called a boot environment), patching the copy, then rebooting into the new system. There is no risk to the running system while patches are being installed – because they are applied to the alternate boot environment – which means the system can continue to run while patches are being installed. And, because the original boot environment is untouched, if any patches do cause problems, we can simply reboot into the old boot environment and diagnose the problem from there. Live Upgrade can even be used to upgrade Solaris itself to a new version – meaning downtime for an operating system upgrade is reduced to a single reboot. If the upgrade fails, simply reboot into the old boot environment.
There are some prerequisites for using Live Upgrade. Most importantly, there needs to be free disk space on which to create the new boot environment. There are two common ways to provide it: #1, split the root mirror, or #2, allocate unused disk space for the new environment.
#1: Split the root mirror
Most servers have mirrored root disks, so failure of one disk doesn’t cause downtime. If the root disk has been mirrored using Solaris Volume Manager (the software RAID functionality in Solaris), Live Upgrade has built-in support for creating a copy of the existing environment by detaching one side of the mirror. This turns a single device (a mirror of two disks) into two new devices: the individual disks. Because both sides of the mirror are already identical, this method also means there’s no need for a lengthy copy of the current root filesystem.
The downside of this method is that splitting the mirror means it’s no longer redundant. If the disk containing the new boot environment fails during the Live Upgrade process, the process can’t be restarted until a replacement disk has been installed. Even worse, if the disk containing the current boot environment fails, the entire system will probably need to be reinstalled. Failure of a disk during a 30-minute window is perhaps unlikely, but the possibility needs to be considered.
On x86 systems, there’s another disadvantage to this method: since the new boot environment is on a different disk, the boot loader (e.g. GRUB) has to be relocated to the new disk, and the BIOS must be told to boot from this new disk. This is easy to forget, which causes surprising problems. (On SPARC systems, Live Upgrade can automatically change the default boot device in the system’s firmware.)
#2: Allocate unused space for Live Upgrade
In this method, the root mirror (if there is one) remains intact. Instead of detaching a device to create space, the system is partitioned at install time to contain an additional unused slice which is the same size as the root slice. When Live Upgrade is invoked, it copies the current root filesystem into the unused slice. When it’s complete, the previously unused slice holds the new root filesystem, and the existing root slice becomes the spare.
The most obvious downside of this method is that it needs more space. However, in a typical server configuration, two disks will be dedicated to the root filesystem. Disks these days are large enough (typically at least 72GB) that there’s easily enough space to spare. (We use 20GB root filesystems, which are around 50% used.) The upside, compared to method #1, is that no redundancy is lost during the Live Upgrade operation.
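The space arithmetic is simple enough to check in the shell. The sketch below is only an illustration: the function name can_fit_alternate_root is made up for this example, and it deliberately ignores swap and any other slices that would also need room on a real disk.

```shell
#!/bin/sh
# can_fit_alternate_root DISK_GB ROOT_GB   (hypothetical helper, not a Solaris tool)
# Succeeds if a disk of DISK_GB can hold two slices of ROOT_GB each:
# the active root slice plus the spare slice reserved for Live Upgrade.
can_fit_alternate_root() {
    disk_gb=$1
    root_gb=$2
    needed_gb=$((root_gb * 2))
    echo "need ${needed_gb}GB of ${disk_gb}GB, leaving $((disk_gb - needed_gb))GB"
    [ "$needed_gb" -le "$disk_gb" ]
}

# The figures from the text: a 72GB disk and a 20GB root filesystem.
can_fit_alternate_root 72 20   # prints: need 40GB of 72GB, leaving 32GB
```

With the numbers quoted above, the two root slices take 40GB and leave 32GB for everything else on the disk.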
Using Live Upgrade
We use method #2 at the Toolserver, for the reasons I mentioned above. So, how do we actually use Live Upgrade to apply patches?
The first step is to create the new boot environment. I’ll demonstrate this on yarrow, one of our database servers. Yarrow’s current root filesystem is /dev/md/dsk/d1. We’ll be creating the new boot environment on /dev/md/dsk/d7. This is pretty simple:
root@yarrow:~# lucreate -n be1 -m /:/dev/md/dsk/d7:ufs -x /aux0
This tells Live Upgrade to create a new boot environment, called be1, whose root (/) filesystem will be a UFS filesystem on /dev/md/dsk/d7. The -x option excludes /aux0 from the copy. After some time (around an hour), lucreate finishes, and the new environment is ready. Next, we need to apply the patches we want to it.
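If you run this on more than one machine, it can help to assemble the command from its parts, so the boot environment name and target device are easy to review before committing to an hour-long copy. This is a sketch under my own assumptions: build_lucreate_cmd is a made-up helper, it only prints the command rather than running it, and it uses only the options that appear in the command above.

```shell
#!/bin/sh
# build_lucreate_cmd BE_NAME ROOT_DEVICE [EXCLUDE_PATH]   (hypothetical helper)
# Prints a lucreate command line for review; it does NOT execute lucreate.
#   BE_NAME       name for the new boot environment (-n)
#   ROOT_DEVICE   device that will hold the new root filesystem (-m /:DEVICE:ufs)
#   EXCLUDE_PATH  optional path to pass to -x
build_lucreate_cmd() {
    be_name=$1
    root_dev=$2
    exclude=${3:-}
    cmd="lucreate -n $be_name -m /:$root_dev:ufs"
    if [ -n "$exclude" ]; then
        cmd="$cmd -x $exclude"
    fi
    printf '%s\n' "$cmd"
}

# Reproduces the command used on yarrow.
build_lucreate_cmd be1 /dev/md/dsk/d7 /aux0
```

Printing rather than executing is deliberate here: a typo in the device name is much cheaper to catch before lucreate starts writing to it.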
We use a tool called pca to download the patches we need. Since we want pca to operate on the new boot environment, we mount the environment’s filesystems, and use the -R switch to pca to make it operate on that filesystem.
root@yarrow:~# lumount be1 /.alt.be1
root@yarrow:~# pca -R /.alt.be1 -d
Downloading xref file to /var/tmp/patchdiag.xref
Trying http://sunsolve.sun.com/patchdiag.xref (1/1)
[...]
(We could also use pca -l first, which would list the patches it’s going to download.)
After pca is finished (it produces more output while it’s running, which I haven’t shown here), the patches we need to install on be1 are now in /root/patches, because that’s where we configured pca to download them to. Next, we extract the patches (which come in .zip files):
root@yarrow:~root/patches# ls
119060-41.zip 124864-04.zip 126510-05.zip 127144-03.zip 127965-05.zip 137018-02.zip 138049-01.zip
119961-03.zip 125333-03.zip 127002-04.zip 127923-04.zip 128401-01.zip 137131-01.zip 138076-01.zip
root@yarrow:~root/patches# for p in *.zip; do unzip -q $p; done
root@yarrow:~root/patches# rm *.zip
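One weakness of the loop above is that rm *.zip deletes every archive even if one of them failed to extract. A slightly more careful variant might look like the following. It is a sketch rather than what we actually run: extract_patches is a name I've invented, and it assumes the Info-ZIP unzip, whose -d option selects the extraction directory.

```shell
#!/bin/sh
# extract_patches DIR   (hypothetical helper)
# Unpacks every .zip file in DIR into DIR. An archive is deleted only after
# it has been extracted successfully; failed archives are kept for inspection.
# Returns non-zero if any archive failed to extract.
extract_patches() {
    dir=$1
    failed=0
    for zip in "$dir"/*.zip; do
        [ -e "$zip" ] || continue    # the glob matched nothing
        if unzip -q "$zip" -d "$dir"; then
            rm -- "$zip"
        else
            echo "failed to extract $zip; keeping it" >&2
            failed=1
        fi
    done
    return "$failed"
}
```

It would be called as extract_patches /root/patches, and a non-zero exit status is a signal to stop before going on to luupgrade.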
Now we use luupgrade to apply these patches to the new boot environment:
root@yarrow:~root/patches# luupgrade -t -n be1 -s $PWD *
Validating the contents of the media .
The media contains 14 software patches that can be added.
Mounting the BE .
Adding patches to the BE .
[...]
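The * in that command hands luupgrade every name in the directory, which is fine when the directory contains nothing but extracted patches. If there might be stray files lying around, one option is to pass only the directories whose names look like patch IDs (six digits, a dash, two digits, as in the listing above). The helper below is an illustrative sketch; list_patch_ids is my own name for it, and the pattern is inferred from the patch names shown here rather than from any formal specification.

```shell
#!/bin/sh
# list_patch_ids DIR   (hypothetical helper)
# Prints, one per line, the names of subdirectories of DIR that look like
# extracted patches (NNNNNN-NN). Plain files and other names are skipped.
list_patch_ids() {
    dir=$1
    for path in "$dir"/[0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9]; do
        [ -d "$path" ] || continue   # skip non-directories and an unmatched glob
        basename "$path"
    done
}
```

The result can then be substituted for the wildcard, for example luupgrade -t -n be1 -s /root/patches $(list_patch_ids /root/patches).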
Lastly, we need to luactivate the new boot environment, so it becomes active at the next reboot:
root@yarrow:~root/patches# luactivate be1
[...]
Activation of boot environment successful.
Now we reboot (using init or shutdown rather than reboot or halt, as luactivate’s own output instructs), and the new system comes up with our patches installed.
You can read more about Live Upgrade in the Solaris Live Upgrade and Upgrade Planning manual on docs.sun.com (it’s probably a good idea to read that before you try this at home!).