1. The Challenge
We received a call from a customer with a very specific need: the company wanted 88 dedicated hypervisors deployed in our Iceland-based cloud footprint. Furthermore, the client wanted a very specific hardware spec, which required a custom order to the manufacturer. That was all fine, but because it would take six weeks to physically build and ship the servers to Iceland, we wanted to minimize OS installation and configuration time once they arrived. Manually installing and configuring 88 servers to work in our private cloud would typically take somewhere between 10 minutes and an hour per server, so we were looking at potentially another two weeks before we could get everything online for the customer. Our COO thought some automation was in order and called on my colleagues and me to deliver.
2. The Approach
We decided to deploy something new and cutting-edge: Razor, originally created by Nick Weaver and later handed over to Puppet Labs for ongoing development. We hadn’t used Razor before, and when we explained what we planned to do, our COO asked to see a proof of concept by Friday. It was Wednesday, July 3rd. We worked pretty hard through the 4th of July holiday and managed to have Razor deploying vanilla RHEL KVM and VMware ESXi 5.1 by Friday. It wasn’t perfect, but it was a working proof of concept.
3. Nitty Gritty Details
Razor works by running a DHCP server and using iPXE to boot a discovery image (Tiny Core Linux) that has a customized version of Facter running on it. The discovery image loads and takes inventory of the node, then reports the inventory back to the Razor server. Razor then consults its list of “wants” and instructs the node to start installing whichever OS its hardware profile matches. The “wants” are called policies and are set using a Puppet Enterprise module. Here’s the sequence of events:
1. Using Puppet, Datapipe sets a policy on the Razor server that says, “Install ESXi on the next eight HP ProLiant DL360 G7 servers with dual hex-core CPUs that connect to me.”
2. Eight HP G7s with dual hex-core CPUs are racked, cabled, and powered on.
3. The eight G7s boot the discovery image and report in as HP G7s with dual hex-cores.
4. Razor consults its policies, finds a need for eight G7s running ESXi 5.1, and installs the eight new servers automatically, pulling IP and hostname info from our internal canonical database and setting IPs, bonds, and storage networks.
5. The servers are online and available for the customer to use.
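At its core, the match step above is tag-based rule evaluation: a policy carries a set of fact requirements and a node count, and each node that checks in is tested against the policies in order. Razor’s real matcher is considerably more involved; this is only a minimal sketch in Ruby, with hypothetical names, of the idea:

```ruby
# Minimal sketch (hypothetical names, not Razor's actual classes) of how a
# checked-in node's inventory facts are matched against a policy's rules.
Policy = Struct.new(:name, :os, :rules, :max_count, :bound) do
  # A policy matches if it still has capacity and every rule fact agrees.
  def matches?(facts)
    bound.size < max_count && rules.all? { |key, want| facts[key] == want }
  end

  def bind(node_id)
    bound << node_id
  end
end

# Find the first policy that matches, bind the node to it, and return the
# OS that policy installs (nil if no policy wants this hardware).
def assign_os(policies, node_id, facts)
  policy = policies.find { |p| p.matches?(facts) }
  return nil unless policy
  policy.bind(node_id)
  policy.os
end

policies = [
  Policy.new("esxi-g7", "ESXi 5.1",
             { "vendor" => "HP", "product" => "ProLiant DL360 G7", "cores" => 12 },
             8, [])
]

facts = { "vendor" => "HP", "product" => "ProLiant DL360 G7", "cores" => 12 }
assign_os(policies, "node-1", facts)  # => "ESXi 5.1"
```

Once `max_count` nodes are bound, the policy stops matching, which is what lets a rule like “the next eight G7s” exhaust itself automatically.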
4. Customizing Razor
The problem with using cutting-edge tools is that you can’t always perform a Google search to find a quick solution to problems that arise. While we waited for the servers to ship from the manufacturer, we developed some REST APIs with Grape, a fast and easy Ruby API builder that is excellent for small tasks, to glue our inventory system to our canonical DB and allow Razor to pull hostnames and IPs from it. We modified Razor’s templates for ESXi to work better in our environment. We also got hold of a few HP ProLiant DL385 Gen8 servers with dual 16-core AMD Opterons to test with.
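The heart of that glue layer is simple: hand out the next free hostname/IP pair and mark it reserved so no two installs collide. The sketch below is purely illustrative — the class, record shape, and endpoint path are assumptions, not our actual schema or Grape code:

```ruby
# Hypothetical sketch of the allocation logic behind one of our Grape
# endpoints. Record shape and names are illustrative only.
class CanonicalDB
  def initialize(records)
    @records = records
  end

  # Return and reserve the next unallocated record, or nil when exhausted.
  def allocate!
    rec = @records.find { |r| !r[:used] }
    rec[:used] = true if rec
    rec
  end
end

# In production this sat behind a Grape REST endpoint (e.g. a GET route
# returning the record as JSON for Razor's install templates to consume).
db = CanonicalDB.new([
  { hostname: "hv-001", ip: "10.0.0.11", used: false },
  { hostname: "hv-002", ip: "10.0.0.12", used: false }
])
db.allocate!  # first call reserves and returns the hv-001 record
```

Keeping the reservation server-side means Razor’s templates stay dumb: they just ask for the next identity and apply it.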
As it turned out, it was a good idea to obtain some of the same hardware the customer wanted. The specific Broadcom chipset had a quirk that was hard to track down amidst all the PXE chaining, discovery, and boot images: after a DHCP OFFER was sent by dnsmasq on behalf of Razor, the bnx chipset would sometimes wait more than 20 seconds to respond, which resulted in a timeout. To fix this, we had to delve into the C code for iPXE and patch in a custom timeout. Additionally, because we use dual 10GbE NICs in our cloud, the discovery image would sometimes load the qlcnic drivers for the 10G card and never get around to initializing the bnx2 drivers for the onboard NIC. We took apart the discovery image ISO, added fixes for this problem, and were all set when the servers arrived.
5. Puppetizing Every Aspect of Razor
Having a one-off, unique flower of a Razor server wouldn’t be very DevOps-y, would it? Razor makes use of the Linux filesystem, a TFTP server, and a Git server, and therefore also xinetd, sshd, multiple networks, dnsmasq, DNS, RVM, MongoDB, Node.js, and more. Configuring all of this by hand leaves a lot of room for error, so we used community modules from the Puppet Forge and GitHub where we could, and wrote our own custom modules where they weren’t available. Along the way we discovered that the puppet-network module didn’t support RHEL/CentOS, so we added support for that osfamily and wrote specs for it. You can find it on GitHub.
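To give a flavor of what “Puppetizing” that stack looks like, here is a condensed, illustrative Puppet fragment. The class name, package list, and service names are placeholders for this sketch, not our production manifests (which lean on Forge modules rather than raw resources):

```puppet
# Illustrative sketch only -- names are placeholders, not our real manifests.
class profile::razor_server {
  # Supporting daemons Razor depends on.
  package { ['dnsmasq', 'xinetd', 'mongodb', 'nodejs']:
    ensure => installed,
  }

  service { ['dnsmasq', 'xinetd', 'sshd']:
    ensure  => running,
    enable  => true,
    require => Package['dnsmasq', 'xinetd'],
  }

  # Directory the discovery and OS images are served from.
  file { '/var/lib/razor/image':
    ensure => directory,
  }
}
```

The payoff is that a replacement Razor server is one `puppet agent` run away, instead of an afternoon of hand-configuration.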
6. Installing 88 Servers with Razor
We took delivery of the custom compute and all the requisite cabling, and the DC team went to work. Reflashing ROMs, setting iLO configs, connecting four network cables per server to four different switches and power supplies to two different PDUs, and cabling the racks themselves and all the switches is no mean feat, but our team managed to get it all done in a couple of days. When we turned the machines on, I watched on the Razor server as the systems checked in, received their OS assignments, and installed. Forty-five minutes later, all 88 machines were up and configured!
7. The Takeaway
This solution worked so well for our customer that we decided to deploy it around the world in every DC, for every flavor of hypervisor we use. Today we have Razor installed in 12 data centers, provisioning KVM and VMware ESXi 5.1 and 5.5 for both public and private cloud deployments. Servers can be deployed faster than the DC techs can physically mount them in the racks, and the wait time for compute is next to nothing.