Servers Hardware

Table of Contents

This article talks over how and why to do burn in testing on hypervisor and storage servers. I work at a fairly large cloud provider, where we have a lot of hardware. Think thousands of hardware servers and multiple ten thousand harddisks. It's all technology, so stuff breaks, and at our scale, stuff breaks often. One of my pet projects for the last period has been to automate the burn-in testing for our virtualisation servers and the storage machines. We run OpenStack and use KVM for the hypervisors and a combination of different storage technology for the volume storage servers. Before they go in production, they are tested for a few days with very intensive automated usage. We've noticed that they either fail then, or not. This saves us from having to migrate customers off of new production servers just a few days after they've gone live. The testing is of course all automated.

A very busy hypervisor node

Preface

This article is not a copy and paste tutorial. It's more a walkthrough of the processes used and the thought process behind it.

As said, I currently work at an OpenStack public cloud provider. That means that you can order a virtual server with us and get full administrative access to it. We use OpenStack, so you can also automate that part using the API and deploy instances automatically. Those virtual machines do need to run on actual hardware, which is where this article comes in.

The regular process for deploying new hardware is fully automated. We've got capacity management nailed down, once the OpenStack environment reaches a certain threshold, PDF's are automatically generated with investment requests and sent off to the finance department and the hardware is ordered. Then after some time, the nodes are shipped to the datacenters. Our datacenter team handles the racking and stacking. They setup the remote out of band management (ILO, iDrac, IPMI) and put the credentials into the PXE deployment servers (MaaS).

The machine get's installed with the required OS automatically and after that the software stack (OpenStack) is installed. The machine firmwares are then updated to the latest versions, the rest of the cluster configuration is updated and it's ready to go. This is all done using Ansible, after the racking and stacking no human is involved. The only thing I have to do is to enable the nova or cinder service in OpenStack for that machine, and it's ready to go.

The machine is automatically put into our monitoring system as well. We not only monitor the software side of things, like cpu load, disk usage, network connections, required services running, but also the hardware itself. Either via the remote out of band management or vendor provided tools (omreport anyone?). Which means that when a disk breaks, or a faulty memory module is found, our monitoring system alerts us and takes action automatically. When defective disks are detected, for example, the vendor automatically gets an RMA sent from our monitoring. Once a week a bunch of disks arrive at the office or the data center, depending on the vendor, and the datacenter team replace the faulty ones. Even the list with disks to replace by that team it automatically sent from the monitoring.

This level of automation is required when you reach a scale like this. By automating all of this, our sysadmin team can focus on other things than the gruntwork of installing software or ordering hardware. This level of automation and monitoring also provides a layer to build stuff on top of, which we will be doing here.

The Problem

Stuff breaks. It's technology, so just like your car, things break. That's not a problem if you've built your environment redundantly and highly available, but my experience is that not a lot of people do that.

Related posts: