A Comedy of Errors, or How to recover (hopefully) from overwriting an entire system (part 2)

Part 1 contains the setup for this. Basic recap:

I needed to recover a reformatted RAID 6 containing LVM with 4 partitions (/, /opt, swap, /data) on a single VG/single PV.

After I stopped panicking and got a hot beverage, I started googling for how to deal with this. I came across TestDisk, which is a very powerful piece of software for recovering data from hard drives in various states of distress.

I needed a boot disk (since I’d overwritten that…), and TestDisk has a list of various ones. I chose Alt Linux Rescue, somewhat randomly, and pulled down the ISO. I used dd to put the ISO on a USB stick from my Mac:

dd if=/path/to/downloaded/rescue.iso of=/dev/<disk number of USB stick> bs=1M 
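
Since dd will happily overwrite whatever it is pointed at, it can be worth wrapping the copy in a small check. What follows is only a minimal sketch under stated assumptions: the function name write_image is made up for illustration, and it assumes a POSIX shell with dd, head, and cmp available. It spells the block size out in bytes because some versions of BSD dd on macOS only accept a lowercase suffix (bs=1m), while GNU dd expects bs=1M.

```shell
#!/bin/sh
# Hypothetical helper: copy an image to a target, then read back as many
# bytes as the image holds and confirm they match.
# usage: write_image /path/to/downloaded/rescue.iso /dev/<target device>
write_image() {
    image=$1
    target=$2

    # Refuse to continue if the image is missing or empty.
    [ -s "$image" ] || { echo "image not found or empty: $image" >&2; return 1; }

    # 1048576 bytes = 1 MiB; a plain number works with both GNU and BSD dd.
    dd if="$image" of="$target" bs=1048576 2>/dev/null || return 1
    sync

    # A real device is larger than the ISO, so compare only the image's length.
    size=$(wc -c < "$image" | tr -d ' ')
    head -c "$size" "$target" | cmp -s - "$image"
}
```

The function returns non-zero if the copy fails or the read-back differs. Writing to a real device will normally need root, and the device name still has to be double-checked by hand.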

Then I went to the server, crossed my fingers, and (finally) convinced it to boot from the USB stick. (BIOS/EFI on Dell Rx20 vintage servers are a pain. Rx30 BIOS/EFI are soooo much better, not to mention a LOT faster.) Note that the server booted in BIOS mode, as we are sticks-in-the-mud and haven’t moved to EFI. #sysadmins.

I used the first option on the Rescue boot screen, which is Rescue LiveCD. This boots up without mounting any disks it finds.

Note, I was unable to get the DHCP selection on that boot screen to work on this system.

After booting, I ran TestDisk. TestDisk has excellent documentation, which I am liberally cadging from here.  I started by selecting Create a log file.

I then chose the RAID array from the list of disks and selected Proceed.


Fortunately, it automatically recognized the partitioning as GPT and I was able to run Analyze on it.

At that point, it of course came back with the list of new partitions the kickstart had created. It did give me the option to do a Deeper Search. At this point I began to despair, because the Deeper Search, while finding all sorts of (possible?) partitions, was moving at a rather glacial rate. Do not despair at this point! I let it run for about 15-30 minutes and then gave up, hitting the Enter key to stop the search.

Let me say that again (since I didn’t figure it out until the second time around):

Hit Enter to stop the Deeper Search.

Hitting Enter will NOT quit the program and lose the list, as hitting, oh, ESC or Ctrl-C or what have you will. It will drop out to interactive, where you can press p to list the data on a theoretical partition. It helps at this point to know what the format of the partition was and about how large it was.
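
Matching candidates by size is easier if the sector ranges are turned into something human-sized. Here is a rough sketch of that arithmetic, assuming the list reports a first and last sector for each candidate, that the last sector belongs to the partition, and that sectors are 512 bytes unless told otherwise. The function name partition_gib and the sector numbers are invented for illustration.

```shell
#!/bin/sh
# Hypothetical helper: size in whole GiB of a partition, given its first and
# last sector. Sector size defaults to 512 bytes (an assumption).
partition_gib() {
    start=$1
    end=$2
    sector_bytes=${3:-512}
    # Both endpoints are counted as part of the partition, hence the +1.
    sectors=$(( end - start + 1 ))
    # 1073741824 bytes = 1 GiB; integer division rounds down.
    echo $(( sectors * sector_bytes / 1073741824 ))
}

# Made-up numbers: a partition spanning sectors 2048 through 209717247.
partition_gib 2048 209717247   # prints 100
```

A candidate whose size is nowhere near what you remember can be skipped without pressing p on it.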

In my case, I had a list of possible partitions a mile long, so I started at the top and went through them, pressing p to see what was there (typically nothing) and then q to return to the list of partitions. Note, if you press q too often, it will drop out of the partition list and you’ll have to do the scan again. Nothing will be lost, at least.

When I discovered a partition, I wrote down the start and end, and then returned to the partition list and used right arrow to change that partition from Deleted to Primary.

I did NOT inspect any of the LVM partitions I found. This is key (and surprising). 

The first partition I found, naturally, was /. I then found /opt, ignored the swaps, and went looking for the big /data partition (~23.5TB). That, however, was formatted xfs, and TestDisk can’t list files on XFS. So that was a guessing game. There were two XFS partitions. However, one of them started immediately after the end of the /opt partition I found (remember that writing down the Start and End?), so I took a gamble and chose that one to change from D to P.
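
The "does this one start right where the last one ended?" gamble can be written down as a one-line check. This is a sketch only: the function name follows_directly and all sector numbers are made up, and it assumes the End value is the last sector of the previous partition, so the next one would begin at End + 1. LVM does not promise to lay volumes out back to back, so a match is a hint rather than proof.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a candidate begins on the sector right
# after the previous partition's last sector.
follows_directly() {
    prev_end=$1
    candidate_start=$2
    [ "$candidate_start" -eq $(( prev_end + 1 )) ]
}

# Made-up sector numbers, not the ones from the real array.
opt_end=209717247
for candidate_start in 209717248 314574848; do
    if follows_directly "$opt_end" "$candidate_start"; then
        echo "$candidate_start starts right after /opt"
    else
        echo "$candidate_start leaves a gap after /opt"
    fi
done
```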

Taking a deep breath after I had found the 3 partitions I cared about, I pressed Enter to proceed to the next step.

I selected Write to write the recovered partition structure to disk. TestDisk told me I would have to reboot for the change to take effect. I backed out through the menus and attempted recovery on the SSD. Unfortunately, that didn’t work. However, since I had a backup of that, I didn’t really care much.

After reboot into TestDisk, I was presented with the option to mount the disks it found either read/write (normal) or read-only (forensic mode). I chose the forensic mode, and it mounted the partitions under /mnt. It indeed had the /, /opt, and /data, all of which had the correct files!

HOWEVER, they were not LVs any more. They had been converted to regular partitions, which was rather nice, since it simplified my life a great deal, not having to try to recover the LVs.

After verifying that everything was still there after a second reboot (and after an aborted attempt to back up the 8TB of data on the disks; 24 hrs was far too long for a rebuild I had told the user would take a couple of hours), I bit the bullet and imaged the server using the correct kickstart/PXE combination.

At the disk partitioning screen, I was able to confirm that the 3 recovered partitions were present on the RAID6. I changed the mount points, set up the LVMs on the new boot RAID1, and ran the installation.

Unfortunately, it still didn’t boot.

It turns out that Dell PERC RAID cards must have a single array designated as boot, and they will not boot if it is not so designated. This was doubly weird, because the MBR stub on the RAID6 was still present, and it kept trying to run grub, only to have (hd2,0) identified as the only one with a grub.conf on it.
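
For anyone who doesn't read GRUB device names every day: in GRUB legacy (the grub.conf era), both numbers in (hdN,M) count from zero. A small sketch of the translation, with a made-up function name and assuming the name is always of the plain (hdN,M) form:

```shell
#!/bin/sh
# Hypothetical helper: turn a GRUB legacy device name such as "(hd2,0)"
# into 1-based disk and partition numbers.
grub_legacy_to_human() {
    spec=$1
    # Strip the leading "(hd" and trailing ")" to leave "N,M".
    inner=${spec#"(hd"}
    inner=${inner%")"}
    disk=${inner%%,*}
    part=${inner##*,}
    # GRUB legacy counts both from zero, so add one to each.
    echo "disk $(( disk + 1 )), partition $(( part + 1 ))"
}

grub_legacy_to_human "(hd2,0)"   # prints: disk 3, partition 1
```

So (hd2,0) is the first partition on the third disk GRUB can see, assuming the BIOS hands the arrays to GRUB in the same order the controller lists them.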

The fix was in the BIOS (ok, fine, the UEFI console) under Integrated Devices->Dell PERC H710->Controller Configuration. I selected the third array as the boot array, and I was in business!

After the fresh install booted up, my 3 recovered partitions were mounted where I had designated them in the installer.