Main Page

From Thumper


Thumper's Wiki

Herein you will find notes, ideas, and howtos on various things that I'm working on. Many pages are works in progress, which is why they are in wiki form. I encourage everyone to make edits that correct errors or add information from their own experience with a topic.

Featured Page

A cool project I'm currently working on is the Remotely Replacing Linux Root Disks page. The key problem we're trying to solve is how to replace the OS software on 40 machines in a painless way, so that we can regularly update the machines. Each of the 40 runs Linux, and they are essentially the same except for a few key configuration files and the software that gets run. (For instance, Apache only runs on the webservers, and BIND only on the DNS servers.)

There are a few different ways that installing the OS on 40 machines could be handled; e.g., using the Fully Automatic Installation project. An added complication was that we had a large set of unreliable hard drives (we had a failure every few months), so we needed to do installations quite often while minimizing downtime of our servers. The solution we came up with was to install a single machine with a copy of all the software any server needs to run (but with the services turned off), and use this disk image to clone disks for all the other machines. Using cfengine, we could implement the customizations for each machine as a short script that ran right when we replaced the root disk.

Fast forward a few years, and things have changed. We have exhausted the supply of unreliable drives, so we don't have an urgent need to physically replace drives. Instead, our pressing need is to keep software up-to-date on each machine without breaking any of our services; controlled rollouts of updated software would be ideal.

Enter the Remotely Replacing Linux Root Disks project. By booting off the network (using PXE Booting), we are able to download a new root image and write it directly to the root partition of the hard drive. One more reboot, and we're back to a "normal" running machine. This will allow us to "burn in" a version of the OS to make sure it works, then roll it out in a controlled fashion to each of our servers. It also allows us to easily roll out images for different architectures (we have new AMD64 machines).
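As a rough illustration of the "download a new root image and write it directly to the root partition" step, here is a minimal Python sketch. It is not the project's actual tooling: the function name write_image, the chunk size, and the SHA-256 check are all assumptions made for this example. It streams an image from any readable source to an existing file or block device and compares a checksum afterwards.

```python
import hashlib
import os
from typing import BinaryIO

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB per read; an arbitrary choice for this sketch


def write_image(source: BinaryIO, target_path: str, expected_sha256: str) -> int:
    """Stream `source` to `target_path` and return the number of bytes written.

    `write_image` is a hypothetical helper, not part of any existing tool.
    Raises ValueError if the SHA-256 of the streamed data does not match.
    """
    digest = hashlib.sha256()
    written = 0
    # "r+b" opens an existing file or block device without truncating it.
    with open(target_path, "r+b") as target:
        while True:
            chunk = source.read(CHUNK_SIZE)
            if not chunk:
                break
            digest.update(chunk)
            target.write(chunk)
            written += len(chunk)
        # Push buffered data to the device before anyone reboots.
        target.flush()
        os.fsync(target.fileno())
    if digest.hexdigest() != expected_sha256:
        raise ValueError("image checksum mismatch; do not reboot into this disk")
    return written
```

In a network-booted environment the source could be the response object returned by urllib.request.urlopen for wherever the image is served, and target_path would be the root partition's device node; both depend on the site's setup and are not specified here. Note that the checksum is only known once the whole stream has been written, so a mismatch means the partition holds a bad image and should be rewritten before the final reboot.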
