A Comedy of Errors, or How to recover (hopefully) from overwriting an entire system (part 1)

Here’s hoping this’ll help someone else.

Yesterday, I was tasked with rebuilding the OS on a server that is fairly important and holds a lot of data that the developers who run it take responsibility for backing up.

Normally, we would toss rebuilds over to a less senior member of the team, but because of the abnormal requirements, my boss gave it to me to do.

In our environment, we typically nuke and pave servers when we rebuild them. Generally all data is kept on shared storage, so no big deal. In this case, the developers wanted to build their own storage. Fine, sez us, buy a Dell R720xd, and we’ll put our standard Scientific Linux 6.x build on it, and have fun. This was a few years ago, and to hit their budget number I had specced it out with eight 4TB 7200RPM SAS drives and no separate boot drives. (Mistake #1.)

But a few days ago, one of them had too much fun, and managed to torch the RPM database. My boss worked on it for several hours, and we came to the conclusion it needed to be rebuilt. Enter me. I backed up the logging SSD with a few hundred gigs of data on it and backed up the one directory in /opt that the dev said they needed. The dev told us not to bother backing up the 8TB of data in the 24TB partition of the RAID 6 that held both OS and data, and assured us he had taken a backup the night before.

My plan was to put a couple of new drives in the R720xd, install the OS on those, and then later expand the /data partition over the old OS partitions (/boot, /, swap, and /opt).

We image servers using PXE and kickstart, and with a few exceptions, our kickstarts are set up to erase the MBR, create new partitions, and put LVM volumes on them before starting the install. We have a few outliers which are set up to ignore a certain drive, or to do manual partitioning.

What we didn’t have was a Scientific Linux 6.7 kickstart that did either. So I copied over a 6.6 one, did a search/replace for 6.6/6.7, and had me a 6.7 normal kickstart. Copied that, commented out all the formatting/erasing lines, and Bob’s your uncle.
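In hindsight, a quick sanity check on the copied kickstart would have been cheap. Here's a minimal sketch of the idea in Python, assuming the destructive bits are the usual kickstart directives (zerombr, clearpart, part, volgroup, logvol, and friends). The sample kickstart text is made up for illustration and is not our actual file.

```python
# Kickstart directives that wipe or repartition disks.
DESTRUCTIVE = ("zerombr", "clearpart", "autopart", "part", "partition",
               "raid", "volgroup", "logvol")

# Section headers that end the command section of a kickstart.
SECTION_HEADERS = ("%packages", "%pre", "%post")


def destructive_lines(kickstart_text):
    """Return (line_number, line) pairs for uncommented destructive directives.

    Only the command section is scanned; scanning stops at the first
    %packages, %pre, or %post header.
    """
    hits = []
    for number, raw in enumerate(kickstart_text.splitlines(), start=1):
        line = raw.strip()
        if line.startswith(SECTION_HEADERS):
            break
        if not line or line.startswith("#"):
            continue
        if line.split()[0] in DESTRUCTIVE:
            hits.append((number, line))
    return hits


# Hypothetical "manual partitioning" kickstart with one line left uncommented.
SAMPLE = """\
install
#zerombr
#clearpart --all --initlabel
part /boot --fstype=ext4 --size=500
%packages
@core
%end
"""

if __name__ == "__main__":
    for number, line in destructive_lines(SAMPLE):
        print(f"line {number}: {line}")
```

An empty result only says the kickstart itself won't touch the disks. It says nothing about whether the PXE config actually points at that kickstart.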

When I went to change the pxeboot files, that’s where I ran into trouble. My coworker who used to maintain this stuff recently left, and I was a titch fuzzy on how to set up the pxeboot stuff. So I copied over a 6.6, did the same thing as above, and I figured I was good. Here’s where I screwed up. In my manual partitioning pxelinux.cfg for 6.7, I forgot to actually call the manual partitioning kickstart. DUR.
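The check I skipped is easy to describe: read the pxelinux config and confirm that the ks= argument on the append line names the kickstart you think it does. A rough sketch in Python follows. The label, hostname, paths, and file names are all hypothetical, and it assumes the kickstart is passed with the ks= boot option on an append line.

```python
def kickstart_targets(pxe_config_text):
    """Return the ks= value from every 'append' line in a pxelinux config."""
    targets = []
    for raw in pxe_config_text.splitlines():
        words = raw.split()
        # pxelinux keywords are case-insensitive, so accept APPEND as well.
        if not words or words[0].lower() != "append":
            continue
        for word in words[1:]:
            if word.startswith("ks="):
                targets.append(word[len("ks="):])
    return targets


def calls_expected_kickstart(pxe_config_text, expected_name):
    """True only if at least one ks= is present and all of them end in expected_name."""
    targets = kickstart_targets(pxe_config_text)
    return bool(targets) and all(t.endswith("/" + expected_name) for t in targets)


# Hypothetical config: labelled "manual" but still calling the normal kickstart.
SAMPLE_PXE = """\
default sl67-manual
label sl67-manual
  kernel sl67/vmlinuz
  append initrd=sl67/initrd.img ks=http://deploy.example.com/ks/sl67-normal.cfg
"""

if __name__ == "__main__":
    if not calls_expected_kickstart(SAMPLE_PXE, "sl67-manual.cfg"):
        print("WARNING: PXE config does not call sl67-manual.cfg:",
              kickstart_targets(SAMPLE_PXE))
```

Run against a config like the sample, this would flag the mismatch before the deployment script ever fires.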

I fire off the deployment script and wander out to the datacenter to do the manual partitioning. To my horror, I see the script erasing /dev/sdc, /dev/sdd, /dev/sde… and creating new LVMs. I hit Ctrl-Alt-Del before it started writing actual data to the drives, but not before it had finished the repartitioning and LVM creation.

So to recap:

The server had four LVM logical volumes, / (ext4), /opt (ext4), /data (xfs, and that’s the important one) and swap, on a hardware RAID 6 partitioned with GPT, all in a single physical volume and a single volume group.

The kickstart overwrote the GPT table and then created /boot, and new LVMs (/, a very large /opt, swap) in a single physical volume/single volume group. It also overwrote the SSD (separate volume) and probably the new disks I put in for the OS.

Realizing that recovery from the backup of /data was possibly a non-starter, my boss and I decided the best thing for me to do was try to recover the data in place on the RAID 6.

On to part 2…


About kcarlile
Twitter: @overclockdlemon
