A Comedy of Errors, or How to recover (hopefully) from overwriting an entire system (part 2)

Part 1 contains the setup for this. Basic recap:

I need to recover a reformatted RAID 6 containing an LVM setup with 4 partitions (/, /opt, swap, /data) on a single VG/single PV.

After I stopped panicking and got a hot beverage, I started googling for how to deal with this. I came across TestDisk, which is a very powerful piece of software for recovering data from hard drives in various states of distress.

I needed a boot disk (since I’d overwritten that…), and TestDisk has a list of various ones. I chose Alt Linux Rescue, somewhat randomly, and pulled down the ISO. I used dd to put the ISO on a USB stick from my Mac:

dd if=/path/to/downloaded/rescue.iso of=/dev/<disk number of USB stick> bs=1M 

Then I went to the server, crossed my fingers, and (finally) convinced it to boot from the USB stick. (BIOS/EFI on Dell Rx20 vintage servers are a pain. Rx30 BIOS/EFI are soooo much better, not to mention a LOT faster.) Note that the server booted in BIOS mode, as we are sticks in the mud and haven't moved to EFI. #sysadmins.

I used the first option on the Rescue boot screen, which is Rescue LiveCD. This boots up without mounting any disks it finds.

Note, I was unable to get the DHCP selection on that boot screen to work on this system.

After booting, I ran TestDisk. TestDisk has excellent documentation, which I am liberally cadging from here.  I started by selecting Create a log file.

I then chose the RAID array from the list of disks and selected Proceed.

Fortunately, it automatically recognized the partitioning as GPT, and I was able to run Analyze on it.

At that point, it of course came back with the list of new partitions the kickstart had created. It did give me the option to do a Deeper Search. At this point I began to despair, because the Deeper Search, while finding all sorts of (possible?) partitions, was moving at a rather glacial rate. Do not despair at this point! I let it run for about 15-30 minutes and then gave up, hitting the Enter key to stop the search.

Let me say that again (since I didn’t figure it out until the second time around):

HIT THE ENTER KEY TO END THE DEEPER SEARCH.

This will NOT quit the program and lose the list, as hitting, oh, Esc or Ctrl-C or what have you will. It will drop you into the interactive list, where you can press p to list the files on a theoretical partition. It helps at this point to know the format of the partition and roughly how large it was.

In my case, I had a list of possible partitions a mile long, so I started at the top and went through them, pressing p to see what was there (typically nothing) and then q to return to the list of partitions. Note, if you press q too often, it will drop out of the partition list and you'll have to do the scan again. Nothing will be lost, at least.

When I discovered a partition, I wrote down the start and end, and then returned to the partition list and used right arrow to change that partition from Deleted to Primary.

I did NOT inspect any of the LVM partitions I found. This is key (and surprising). 

The first partition I found, naturally, was /. I then found /opt, ignored the swaps, and went looking for the big /data partition (~23.5TB). That, however, was formatted xfs, and TestDisk can't list files on XFS. So that was a guessing game. There were two XFS partitions. However, one of them started immediately after the end of the /opt partition I had found (remember writing down the Start and End?), so I took a gamble and chose that one to change from D to P.

Taking a deep breath after I had found the 3 partitions I cared about, I pressed Enter to proceed to the next step: TestDisk's write menu.

I selected Write to write the recovered partition structure to disk. TestDisk told me I would have to reboot for the change to take effect. I backed out through the menus and attempted recovery on the SSD. Unfortunately, that didn’t work. However, since I had a backup of that, I didn’t really care much.

After rebooting into the rescue environment, I was presented with the option to mount the disks it found either read/write (normal) or read-only (forensic mode). I chose forensic mode, and it mounted the partitions under /mnt. It indeed had /, /opt, and /data, all of which had the correct files!

HOWEVER, they were not LVs any more. They had been converted to regular partitions, which was rather nice, since it simplified my life a great deal, not having to try to recover the LVs.

After verifying that everything was still there after a second reboot (and an aborted attempt to back up the 8TB of data on the disks; 24 hrs was far too long on a rebuild I had told the user would take a couple of hours), I bit the bullet and imaged the server using the correct kickstart/pxe combination.

At the disk partitioning screen, I was able to confirm that the 3 recovered partitions were present on the RAID6. I changed the mount points, set up the LVMs on the new boot RAID1, and ran the installation.

Unfortunately, it still didn’t boot.

It turns out that Dell PERC raid cards must have a single array designated as boot, and they will not boot if it is not so designated. This was doubly weird, because the MBR stub on the RAID6 was still present, and it kept trying to run grub, only to have (hd2,0) identified as the only one with a grub.conf on it.

Fix was in the BIOS (ok, fine UEFI console) under Integrated Devices->Dell PERC H710->Controller Configuration. Selected the third array, and I was in business!

After the fresh install booted up, my 3 recovered partitions were mounted where I had designated them in the installer.

A Comedy of Errors, or How to recover (hopefully) from overwriting an entire system (part 1)

Here’s hoping this’ll help someone else.

Yesterday, I was tasked with rebuilding the OS on a server that is fairly important and holds a lot of data that the developers who run it take responsibility for backing up.

Normally, we would toss rebuilds over to a less senior member of the team, but because of the abnormal requirements, my boss gave it to me to do.

In our environment, we typically nuke and pave servers when we rebuild them. Generally all data is kept on shared storage, so no big deal. In this case, the developers wanted to build their own storage. Fine, sez us, buy a Dell R720xd, and we’ll put our standard Scientific Linux 6.x build on it, and have fun. This was a few years ago, and I had specced it out with 8 4TB 7200RPM SAS drives and no separate boot drives to hit their budget number. (mistake #1)

But a few days ago, one of them had too much fun, and managed to torch the rpm database. My boss worked on it for several hours, and we came to the conclusion it needed to be rebuilt. Enter me. I backed up the logging SSD with a few hundred gigs of data on it, backed up the one directory in /opt that the dev said they needed, and the dev told us not to bother backing up the 8TB of data in the 24TB partition of the RAID6 that held both OS and data. The dev assured us he had taken a backup last night.

My plan was to put a couple of new drives in the R720xd and put the OS on those, and then later expand the /data partition over the old OS partitions (/boot, /, swap, and /opt).

We image servers using pxe and kickstart, and with a few exceptions, our kickstarts are set up to erase the MBR, create new partitions, and put LVM volumes on them before starting the install. We have a few outliers which are set up to ignore a certain drive, or to do manual partitioning.

What we didn’t have was a Scientific Linux 6.7 kickstart that did either. So I copied over a 6.6 one, did a search/replace for 6.6/6.7, and had me a 6.7 normal kickstart. Copied that, commented out all the formatting/erasing lines, and Bob’s your uncle.

When I went to change the pxeboot files, that’s where I ran into trouble. My coworker who used to maintain this stuff recently left, and I was a titch fuzzy on how to set up the pxeboot stuff. So I copied over a 6.6, did the same thing as above, and I figured I was good. Here’s where I screwed up. In my manual partitioning pxelinux.cfg for 6.7, I forgot to actually call the manual partitioning kickstart. DUR.

I fire off the deployment script and wander out to the datacenter to do the manual partitioning. To my horror, I see the script erasing /dev/sdc, /dev/sdd, /dev/sde… and creating new LVMs. I hit Ctrl-Alt-Del before it started writing actual data to the drives, but not before it had finished the repartitioning and LVM creation.

So to recap:

Server had 4 LVMs set up for / (ext4), /opt (ext4), /data (xfs, and that’s the important one) and swap on a hardware RAID 6, set up with GPT on a single physical volume, single volume group.

The kickstart overwrote the GPT table and then created /boot, and new LVMs (/, a very large /opt, swap) in a single physical volume/single volume group. It also overwrote the SSD (separate volume) and probably the new disks I put in for the OS.

Realizing that recovery from the backup of /data was possibly a non-starter, my boss and I decided the best thing for me to do was try to recover the data in place on the RAID 6.

On to part 2…

Running Spark as a job on a Grid Engine HPC Cluster (part 1)

SEE THE UPDATED GUIDE TO RUNNING ON GRID ENGINE HERE: SparkFlex, aka running Spark jobs on a Grid Engine cluster, take II

NOTE: SOME SETTINGS BELOW ARE OBSOLETE AND NEED TO BE UPDATED

Apache Spark has become a pretty big thing where I work. We were originally approached about running it on our HPC cluster about 3 years ago, and, knowing little to nothing about Big Data clusters, I agreed to set it up and get it rolling. Over the last 3 years, we have been gradually improving the process and getting it more and more under the control of execd, which makes it a bit more resilient to crashes and memory overruns.

The first thing that needs to be done is to make sure that all the prerequisites are present for Spark. We use Java 1.7.x, Scala 2.10.x, and Python 2.7.6. We do not use HDFS for the storage; we just use our usual Isilon NFS shares (mounted with nolock, of course) and occasionally GPFS. We use the precompiled version of Spark for Hadoop 1.x, since we happen to have CDH 3.x MapReduce installed on the cluster, although we never use it.

There are two main configuration files for Spark: spark-env.sh and spark-defaults.conf. Both need to be modified for the defaults to fit your environment. As an example, here are ours:

spark-env.sh:

ulimit -n 65535
export SCALA_HOME=/usr/local/scala-2.10.3
export SPARK_WORKER_DIR=/scratch/spark/work 
export JAVA_HOME=/usr/local/jdk1.7.0_67 
export SPARK_LOG_DIR=~/.spark/logs/$JOB_ID/ 
export SPARK_EXECUTOR_MEMORY=90g 
export SPARK_DRIVER_MEMORY=50g 
export SPARK_WORKER_MEMORY=90g 
export SPARK_LOCAL_DIRS=/scratch/spark/tmp 
export PYSPARK_PYTHON=/usr/local/python-2.7.6/bin/python 
export SPARK_SLAVES=/scratch/spark/tmp/slaves 
export SPARK_SSH_OPTS="-o StrictHostKeyChecking=no -o ConnectTimeout=30"

Our nodes are 128GB/16 cores, so your mileage and settings may vary. /scratch is a local directory on the nodes. Note that you will want some kind of cleanup script for your work and tmp directories. Spark will normally clean up after itself, but if it exits abnormally, all bets are off.
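Something like the following is a minimal sketch of such a cleanup, suitable for cron on each node; the function name, the /scratch/spark layout, and the 7-day threshold are my assumptions here, not part of our actual tooling:

```shell
# Sketch of a scratch-dir cleanup for Spark (assumed paths and threshold).
cleanup_spark_scratch() {
    base=${1:-/scratch/spark}  # root of the node-local scratch area
    days=${2:-7}               # remove entries untouched for this many days
    for sub in work tmp; do
        # only top-level entries; rm -rf takes care of their contents
        find "$base/$sub" -mindepth 1 -maxdepth 1 -mtime +"$days" \
            -exec rm -rf {} + 2>/dev/null
    done
}
```

Run it as a user that owns the scratch directories, or tighten the find with `-user` if multiple people share the nodes.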

spark-defaults.conf:

spark.akka.timeout=300 
spark.storage.blockManagerHeartBeatMs=30000 
spark.akka.retry.wait=30 
spark.akka.frameSize=2047 
spark.local.dir=/scratch/spark/tmp
spark.kryoserializer.buffer.max.mb=1024
spark.core.connection.ack.wait.timeout=600
spark.driver.maxResultSize=0

These are, for the most part, settings I made many, many versions of Spark ago. Other than spark.local.dir, I'm afraid I don't have much info about them. But that one is important: NEVER place spark.local.dir on locking storage, because the workers will collide with each other.

In the next article, I’ll talk about the setup of Grid Engine to accommodate running the Spark cluster.

Addendum to mongodb on rocksdb-Haswell

So I finally got around to doing the install on the physical box today, and on the initial make static_lib, I got this:

>/opt/src/rocksdb>make static_lib
 GEN util/build_version.cc
 CC db/c.o
 /tmp/ccActpEz.s: Assembler messages:
 /tmp/ccActpEz.s:14025: Error: no such instruction: `shlx %rsi,%r9,%rax'
 /tmp/ccActpEz.s:40750: Error: no such instruction: `shlx %rcx,%r11,%rsi'
 /tmp/ccActpEz.s:40764: Error: no such instruction: `shlx %rax,%r11,%rdx'
 /tmp/ccActpEz.s:41219: Error: no such instruction: `shlx %r8,%rsi,%r9'
 /tmp/ccActpEz.s:41231: Error: no such instruction: `shlx %rdx,%rsi,%rax'
 /tmp/ccActpEz.s:41497: Error: no such instruction: `shlx %rbx,%rsi,%rdi'
 /tmp/ccActpEz.s:41511: Error: no such instruction: `shlx %r14,%rsi,%rdx'

The machine I was installing on has Haswell CPUs, whose newer instructions (shlx and friends) the stock assembler doesn't know, which is what caused this error. So we had to download/compile/install binutils 2.25 and then add that to the path. The link is here.

Installing MongoDB with RocksDB as a storage engine

I honestly don’t really understand most of the words I just wrote in that headline. Nevertheless, when our dba or our software team asks us to install something, install it we do.

This one was fairly tricky, and I’m hoping this post will help some people.

First, some words about our environment. We run Scientific Linux 6.x in our datacenter primarily, and we maintain a shared /usr/local directory (mounted via nfs) across most of the servers. This makes installing software tricky in two ways: most new/Big Data/open source/whatever software these days expects the latest and greatest code, which tends to show up in Ubuntu first. Unfortunately, Ubuntu is not all that well suited to some other stuff we do, and tends not to be as easy to manage in the datacenter. The other trickiness is that we want to avoid the default install location of /usr/local for a lot of software, unless it's being installed for our HPC cluster.

Aaaaanyway. So that basically meant I had to install a number of packages for this to work, as well as compile with gcc, g++, etc. from a non-standard location. I also had to install the software itself in a non-standard location, and to call a newer version of Python than is standard on SL (and you DON'T want to install over the standard version!)

I started out from this blog post. I’m sure those instructions work for a standard install on Ubuntu 14.xx or later, but they need a little expanding for RHEL/CentOS/Scientific Linux.

Set up the path and install some prereqs:
export PATH=/usr/local/gcc/bin:/usr/local/python-2.7.8/bin:$PATH
yum install libmpc --enablerepo=epel
yum install snappy-devel zlib-devel bzip2-devel

Install the rocksdb libraries:


git clone https://github.com/facebook/rocksdb.git
cd rocksdb
git checkout mongorocks

Change the INSTALL_PATH line in the Makefile to your preferred install location (/opt/rocksdb):

INSTALL_PATH ?= /opt/rocksdb

And make/make install

make static_lib
sudo make install

Download scons, which you will need to build mongo. Note that the version of scons available in the standard repos is not new enough, so you get to compile this from source, too:


curl -O http://iweb.dl.sourceforge.net/project/scons/scons/2.3.4/scons-2.3.4.tar.gz

Unpack and install it, and add it to your path:

tar xzf scons-2.3.4.tar.gz
cd scons-2.3.4
python setup.py install --prefix=/opt/scons
export PATH=/opt/scons/bin:$PATH

Download the latest version of Mongo that works with rocksdb:


git clone https://github.com/mongodb-partners/mongo.git
cd mongo
git checkout v3.0-mongorocks

Here’s the tricky bit. Scons seems to be rather poorly documented, or at least it’s so flexible that documenting it isn’t helpful. These are the commands I figured out, bearing in mind that our version of GCC (4.9.1) is at /usr/local/gcc and I installed the rocks libraries at /opt/rocksdb:


scons all --cxx=/usr/local/gcc/bin/g++ --cc=/usr/local/gcc/bin/gcc --cpppath=/opt/rocksdb/include --libpath=/opt/rocksdb/lib --rocksdb=1 --prefix=/opt/mongo

scons --cxx=/usr/local/gcc/bin/g++ --cc=/usr/local/gcc/bin/gcc --cpppath=/opt/rocksdb/include --libpath=/opt/rocksdb/lib --rocksdb=1 --prefix=/opt/mongo install

So I’m not certain I need all that jazz for the install line, but it seemed to work, so better safe than sorry. This took about 4 hours to compile in the little VM I was running it in, so I didn’t feel like rebuilding over and over again once I had it actually functional.

Now, in my case, I found that because I couldn’t figure out how to make scons use static libraries and actually work, my dba will need to use this command before running mongo:


export LD_LIBRARY_PATH=/usr/local/gcc/lib64

If anyone can figure out how to make the static libraries work with a non-standard location, please include it in the comments. I tried --static, but that caused the compile to throw errors about not being able to find -lpthread (??), which it finds just fine without the static libraries.

Breaking Down the Monster III

So, finishing this off.

It-sa bunch-a case lines!

Write first:

echo $1 $2 "filesize: "$3 "totalsize: "$4"G" "filesperdir: "$5
case $1 in
	write)
		if [ $2 = scality ]; then
			filecount=$totfilecount
			time scalitywrite
			exit 0
		fi

So if it’s a Scality (or other pure object storage), it’s simple. Just run the write and time it, which will output the info you need. OTHERWISE…

#Chunk file groups into folders if count is too high
	if [ $totfilecount -ge 10000 ]; then
		for dir in `seq 1 $foldercount`; do
			createdir $fspath/$dir
		done
		time for dir in `seq 1 $foldercount`; do
			path=$fspath/$dir
			filecount=$(( $totfilecount / $foldercount ))
			writefiles
		done
	else
		path=$fspath
		createdir $path
		filecount=$totfilecount
		time writefiles
	fi
	;;

Do what the comment says: chunk the files into folders, since when you write to a filesystem, the count of files per directory makes a big difference. Make sure you create the directories before you try to write to them… and then time how long it takes to write all of them. If the count is less than the critical file count number, just write the files and time it.

Neeeext….

read) #in order read
	sync; echo 1 > /proc/sys/vm/drop_caches
	if [ $2 = scality ]; then
		filecount=$totfilecount
		time scalityread
		exit 0
	fi
	if [ $totfilecount -ge 10000 ]; then
		time for dir in `seq 1 $foldercount`; do
			path=$fspath/$dir
			filecount=$(( $totfilecount / $foldercount ))
			readfiles
		done
	else
		path=$fspath
		filecount=$totfilecount
		time readfiles
	fi
	;;

That sync line is how you clear the filesystem cache (as root) on a Linux system. This is important for benchmarking, because let me tell you, 6.4GB/sec is not a speed that most network storage systems can reach. Again, we split it and time all of the reads, or we just straight up time the reads if the file count is low enough. This routine reads files in the order they were written.

	rm) #serial remove files
		if [ $2 = scality ]; then
			time for i in `seq 1 $totfilecount`; do
				curl -s -X DELETE http://localhost:81/proxy/bparc/$fspath/$i-$suffix > /dev/null
			done
			exit 0
		fi
		if [ $totfilecount -ge 10000 ]; then
			time for i in `seq 1 $foldercount`; do
				rm -f $fspath/$i/*-$suffix
				rmdir $fspath/$i
			done
		elif [ -d $fspath/$3 ]; then
			time rm -f $fspath/*-$suffix
		fi
	;;

Similar to the other two routines: if it's object-based storage, do something completely different; otherwise, remove based on the file path and the count of files.

	parrm) #parallel remove files
		time ls $fspath | parallel -N 64 rm -rf $fspath/{}
	;;

This one is remarkably simple. Just run parallel against an ls of the top-level directory, and pipe it into rm -rf. The {} is parallel's placeholder for the input it reads from stdin. The -N 64 hands 64 arguments to each rm -rf invocation; the number of concurrent jobs is controlled separately (with -j, which defaults to one job per core).
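If you want to see the argument-grouping behavior without GNU parallel installed, plain xargs does the same kind of batching. This is an analogue, not the script's actual command: xargs -n groups arguments the way parallel -N does, and -P would add the concurrency.

```shell
# Group 6 arguments into invocations of 3 each (add -P 4 for 4-way parallelism).
seq 1 6 | xargs -n 3 echo batch:
# prints:
#   batch: 1 2 3
#   batch: 4 5 6
```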

This one’s kind of neat:

	shufread) #shuffled read
		sync; echo 1 > /proc/sys/vm/drop_caches
		if [ $totfilecount -ge 10000 ]; then
			folderarray=(`shuf -i 1-$foldercount`)
			time for dir in ${folderarray[*]}; do
				path=$fspath/$dir
				filecount=$(( $totfilecount / $foldercount ))
				shufreadfiles
			done
		else
			path=$fspath
			filecount=$totfilecount
			time shufreadfiles
		fi
	;;
	

I needed a way to do random reads over the files I'd written, in order to simulate that on filesystems with little caching (i.e., make the drives do a lot of random seeks).


At first, I tried writing the file paths to a file, then reading that, but that has waaaay too much latency when you’re doing performance testing. So, after some digging, I found the shuf command, which shuffles a list. You can pass an arbitrary list with the -i flag. I tossed this all into an array, and then it proceeds like the read section.
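A quick sanity check of what shuf -i hands back: a full permutation of the range, one value per line, so sorting the output recovers the original sequence.

```shell
# shuf -i emits every integer in the range exactly once, in random order.
perm=$(shuf -i 1-5)
echo "$perm" | sort -n | tr '\n' ' '   # always the values 1 through 5
```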

	*) usage && exit 1;;
esac
echo '------------------------'

Fairly self explanatory. I tossed an echo with some characters in to keep the output clean if you’re running the command inside a for loop.

And that’s it!

Breaking down that monster

Or should I use Beast? No, this isn’t an XtremIO. (sorry, I just got back from EMCWorld 2015. The marketing gobbledygook is still strong in me.)

So, first part of the script, like many others, is a function (cleverly called usage), followed by the snippet that calls the function:


usage () {
	echo "Command syntax is $(basename $0) [write|read|shufread|rm|parrm] [test|tier1|tier2|gpfs|localscratch|localssd|object]"
        echo "[filesizeG|M|K] [totalsize in GB] (optional) [file count per directory] (optional)"
}

if [ "$#" -lt 3 ]; then
	usage
	exit 1
fi

Not much to see here if you already know what functions are and how they're formatted in bash. Basically, if it starts with () { and is closed with }, it's a function, and you can call it like a script inside the main script. The code is not executed until it is called by name. You can even pass it input variables; more on that later.
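A toy example (not from the script) showing both points: nothing runs until the call, and the function gets its own positional parameters, separate from the script's.

```shell
# Inside a function, $1 and $2 are the arguments of the call, not the script.
describe() {
	echo "name=$1 size=$2"
}
describe report.txt 4K   # prints: name=report.txt size=4K
```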

Next, we come to a case block:


case $2 in
	test) fspath=/mnt/dmtest/scicomp/scicompsys/ddcompare/$3 ;;
	tier1) fspath=/mnt/node-64-dm11/ddcompare/$3 ;;
	tier2) fspath=/mnt/node-64-tier2/ddcompare/$3 ;;
	gpfs) fspath=/gpfs1/nlsata/ddcompare/$3 ;;
	localscratch) fspath=/scratch/carlilek/ddcompare/$3 ;;
	localssd) fspath=/ssd/ddcompare/$3 ;;
	object) fspath=/srttest/ddcompare/$3 ;;
	*) usage && exit 1;;
esac

This checks the second variable and sets the base path to be used in the testing. Note that object will be used differently than the rest, because all of the rest are file storage paths. Object ain’t.

Then, we set the size of the files (or objects) to be written, read, or deleted:


case $3 in
	*G) filesize=$(( 1024 * 1024 * `echo $3 | tr -d G`));;
	*M) filesize=$(( 1024 * `echo $3 | tr -d M` ));;
	*K) filesize=`echo $3 | tr -d K`;;
	*) usage && exit 1;;
esac

Note that I should probably be using the newer command substitution style of $( ) here, rather than backticks. I'll get around to it at some point.

The bizarre $(( blah op blah )) setup is how you do math in bash. Really.
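A couple of throwaway lines showing both tricks at once, the tr suffix-stripping from the case block above and the $(( )) arithmetic:

```shell
# Strip the unit letter, then scale gigabytes to KB with integer math.
sz=4G
kb=$(( 1024 * 1024 * $(echo $sz | tr -d G) ))
echo $kb              # 4194304
echo $(( 3 + 4 * 2 )) # usual operator precedence: 11
```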

The next few bits are all prepping how many files to write to a given subdirectory, how big the files are, etc.


#set the suffix for file names
suffix=$3

#set the total size of the test set
if [ ! -z $4 ]; then
	totalsize=$(( 1024 * 1024 * $4 ))
else
	totalsize=52428800 #The size of the test set in kb
fi
	
#set the number of files in subdirectories
if [ ! -z $5 ]; then
	filesperdir=$5
else
	filesperdir=5120 #Number of files per subdirectory for large file counts
fi

#set up variables for dd commands
if [ $filesize -ge 1024 ]; then
	blocksize=1048576
else
	blocksize=$(( $filesize * 1024 ))
fi

#set up variables for subdirectories
totfilecount=$(( $totalsize / $filesize ))
blockcount=$(( $filesize * 1024 / $blocksize ))
if [ $filesperdir -le $totfilecount ]; then
	foldercount=$(( $totfilecount / $filesperdir ))
fi
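To make the sizing arithmetic concrete, here is the math above traced by hand for 1M files in a 50GB test set with the default 5120 files per directory; the numbers are mine, chosen for illustration, and the variable names just mirror the script's.

```shell
filesize=1024                      # a 1M file, expressed in KB
totalsize=$(( 1024 * 1024 * 50 ))  # 50GB test set, in KB
filesperdir=5120
blocksize=1048576                  # filesize >= 1024, so dd uses 1MB blocks
totfilecount=$(( totalsize / filesize ))       # 51200 files
blockcount=$(( filesize * 1024 / blocksize ))  # 1 block per file
foldercount=$(( totfilecount / filesperdir ))  # 10 subdirectories
echo "$totfilecount files, $foldercount dirs, $blockcount block(s) per file"
```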

OK, I’ll get into the meat of the code in my next post. But I’m done now.