Practical Technology

for practical people.

My Great Linux System Repair Adventure


Thunderstorms in the Blue Ridge Mountains can come on fast. That’s why my main Linux desktop system was still up when one, two, three lightning bolts slammed near my home. Thus began my Great Linux System Repair Adventure.

Despite no fewer than three surge protectors, including a master surge protector for the entire house, enough of a surge got through to zap my Insignia 300a, an older Best Buy house-brand desktop PC with a 2.8GHz Pentium IV, GB of RAM, and an Ultra ATA/100 60GB hard drive, running SLED 10 (SUSE Linux Enterprise Desktop).

At first, everything looked OK. Then I began getting odd disk errors and programs started misbehaving. So, I used that master tool of all Linux/Unix file repair, fsck, to see what was wrong with my drive.

It wasn’t pretty. There were file system errors here, errors there, errors just about everywhere. I use the ReiserFS, because I really like its speed and space performance. On the other hand, when things go badly wrong, getting the ReiserFS fsck file tree rebuild to work properly can be very tricky.

For simple file problems, ReiserFS restores the file system by replaying its transaction log journal. That’s as it should be. The whole point of a journaling file system is that you can replay disk writes when something goes wrong.

I was well beyond simple problems, though. It was time to unmount the file system — reiserfsck won’t repair mounted systems — and get serious. So, I ran the command:

    reiserfsck --rebuild-tree

This option forces reiserfsck to do just what it says: rebuild its b-tree map of the file system.
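Spelled out a bit more fully, and with a hypothetical device name standing in for the real partition (reiserfsck works on a block device, not a directory), the sequence looks something like this:

```shell
# Hypothetical device name -- substitute your own ReiserFS partition.
DEV=/dev/hda2

umount "$DEV"                      # reiserfsck will not repair a mounted file system
reiserfsck --check "$DEV"          # read-only pass: reports what is wrong
reiserfsck --rebuild-tree "$DEV"   # destructive pass: rebuilds the b-tree map
```

Running the read-only check first is cheap insurance; the rebuild pass rewrites the tree and cannot be undone.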

It was humming along, when bang, it stopped. OK, this can happen. Maybe there’s a bad block. The simple-minded way to find out, without reaching for another tool, is to try the command again. If it breaks at the same spot, I’ve got bad blocks, actual sectors of the hard drive that can’t hold data reliably.

If that’s the case, I’d run:

    /sbin/badblocks [-b (reiserfs-block-size)] device

to get a list of bad blocks that reiserfsck can understand. After that, I’d run dd_rescue to create a backup of the file system without the bad blocks. Yes, you can try this with other tools — dd comes to mind immediately — but dd_rescue, unlike dd, doesn’t abort on errors. I could program around that, but dd_rescue does such a good job, so why bother?
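Under those assumptions, the whole detour would look something like this (the device and file names are hypothetical; reiserfsck accepts a bad-block list through its -B option):

```shell
DEV=/dev/hda2                        # hypothetical sick partition

# Write the list of bad blocks, counted in the file system's block size
# (4096 bytes is the ReiserFS default), to a file.
/sbin/badblocks -b 4096 -o bad.blocks "$DEV"

# Hand that list to reiserfsck so it steers around the damaged sectors.
reiserfsck --rebuild-tree -B bad.blocks "$DEV"

# Or image the partition somewhere safe first; unlike dd, dd_rescue
# keeps reading past errors instead of aborting.
dd_rescue "$DEV" /mnt/backup/hda2.img
```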

Unfortunately, reiserfsck blew up at a new location… farther along in the rebuilding. OK, so it wasn’t a bad block. This required some thought.

I decided to boot the system back up and see what the system looked like from my KDE 3.5 interface. From there, I planned on backing up my system, which I hadn’t done in a week, with KDar, the KDE disk archiver, to a DVD-R. Unfortunately, I didn’t make it that far.

The system got about halfway to the desktop when the boot process failed. OK, now I was getting ticked. It was time to return to single user mode and the command line.

It was also time to get hard-core serious about this misbehaving drive. This time I ran:

    reiserfsck --rebuild-tree -S

This forces the b-tree to be rebuilt from whatever directory entries, file data, or b-tree leaves may be lying anywhere on the partition. Unless you really — I mean really — know how file systems work, don’t try this at home. Go to a friend’s house. It will be much safer there.

No, don’t. That was meant to be funny. Don’t try this anywhere, unless you really know what you’re doing.

Believe it or not, I do know file system internals, so I ran it… and I got most of the way through when the process stopped and I got the message:

    The problem has occurred looks like a hardware problem (perhaps memory).

Oh no. Could the memory also be sour? The hard drive was fouled up, no question about that, but remember, I’d also seen strange problems with applications. That, I now remembered, is often a sign of bad memory.

I ran the command again. Yes, there was the same error message, but at a different point in the repair process. This was looking more and more like I actually had two problems.

So, I got up, turned off the system, and went to have lunch. When I came back, I turned the PC back on… and it wouldn’t boot at all.

This was turning into a really bad day.

So, now I pulled out my freshly burned copy of SystemRescueCd 0.35. SystemRescueCd, if you’ve never met it, is the best single CD bootable system repair disk I know.

This special-purpose Linux distribution is based on the 2.6.20.7 Linux kernel. It includes:

  • GParted, a top-notch partition manager
  • PartImage, a great drive/partition imager tool
  • NTFS-3G, an open-source driver that lets you mount, read, and write Windows NTFS partitions

…and a host of file system repair tools and drivers. It also includes — and for me this is the cherry on top of the sundae — network file tools like Samba and NFS. With those, you can send files from a near-dead machine to a network server for safekeeping.

So, I popped in SystemRescueCd, and, thanks to its small memory footprint of 128MB, it appeared to load fine. This time I ran reiserfsck from SystemRescueCd and… it failed with a memory error, again. This time, at least, it almost completed the run.

OK, it was time to play with the hardware. When memory is going bad, you can sometimes keep it going for a while longer by slowing it down.

Normally, people only play with memory settings when they’re trying to turbo-charge a gaming system or the like. The same techniques, applied in reverse, can sometimes get some useful life from sick systems like mine.

Now, playing tricks with RAM is a subject unto itself. For more on that subject, visit sites like Extreme Tech and Tom’s Hardware and look for stories on overclocking.

I was going the other way; I was going to “underclock” my system’s memory. To do this, I went to my PC’s advanced BIOS section. For my purposes, I started by raising the CAS (column address strobe) latency. This setting determines how many clock cycles the system waits before issuing a CAS signal and outputting data from the memory chip. A higher value means more waiting, therefore a slower computer, and a bit more memory reliability.

After setting this up, I rebooted again with SystemRescueCd, ran reiserfsck with all the trimmings, and this time it worked. I once more had a viable file system.

Now, my problem was how to get the important files out of there before something else went wrong. Trying to repair the system was a task for another day. Today, I just wanted my files safe, snug and well away from that machine.

My new problem, though, was that my important files, in /home/sjvn, came to a whopping 22 gigabytes. Yes, I’m a file and email packrat.

22 gigabytes is far too much to burn to a DVD or fit on a USB stick. For the first time, I found myself wishing for a Blu-ray disc burner on a PC. Even over my 100Mbps Fast Ethernet connection, I really didn’t want to waste time sending all that data.

The solution was clearly to compress my files down and put them into a more conveniently sized archive for shipping across the network. Linux is full of tools to do that, but tar, that old faithful, was the first program that came to mind.

So, I mounted the repaired partition, headed over to /home/sjvn, and zapped a lot of junk files with “rm.” Then I hopped back up to the /home directory and ran:

    tar cvzf sjvn/sjvnhomedir.tar.gz sjvn

This created the compressed archive “sjvnhomedir.tar.gz” in /home/sjvn. The tar options were the basics: “c” for create; “v” for verbose (I wanted to know what was going on); “z” for compress files with gzip; and “f” to give the archive its name.
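If you want to rehearse those options before pointing them at a real home directory, a throwaway directory works fine (all the paths below are scratch examples):

```shell
# Build a scratch copy of the layout and archive it the same way.
mkdir -p /tmp/tar-demo/home/sjvn
echo "important stuff" > /tmp/tar-demo/home/sjvn/note.txt

cd /tmp/tar-demo/home
tar cvzf sjvnhomedir.tar.gz sjvn   # c=create, v=verbose, z=gzip, f=archive name
tar tzf sjvnhomedir.tar.gz         # t=list: confirm the archive reads back cleanly
```

The “t” listing at the end is a quick sanity check that the archive is intact before you trust it with your only copy of anything.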

Now, I was left with only one final step: getting my important files, now zipped up in “sjvnhomedir.tar.gz,” to a healthy computer. I decided once more to go with easy, over other alternatives.

This time, that meant setting up an SSH (secure shell) server on the sick machine. To do this, I had to give the machine a root password; anything will do. Then I logged in with it and ran:

    /etc/init.d/sshd start

That starts up the SSH server. And that was the last thing I had to do on that system.

I then moved to another Linux system. In my case, that just meant I used my IOGEAR KVM (keyboard, video, and mouse) switch to click over to the MEPIS 6.5 system sitting right next to the sick SLED 10 box.

Once logged in on the MEPIS PC, I logged into the SLED system’s SSH server as root, and moved to the /home/sjvn directory. Once there, I used scp (secure copy) to copy sjvnhomedir.tar.gz to my MEPIS system, like so:

    scp sjvnhomedir.tar.gz sjvn@MEPIS:
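If you’d rather drive the copy from the healthy machine instead, scp can pull in the other direction just as well (the “SLED” hostname here is a stand-in for whatever name or IP address the sick box answers to):

```shell
# Run on the healthy machine: pull the archive from the sick box.
scp root@SLED:/home/sjvn/sjvnhomedir.tar.gz .
```

Pulling has one small advantage: if the sick machine dies mid-transfer, you’re already sitting at the keyboard of the system that matters.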

At long last, I was done. I had my files safely stored away.

Today, the sick PC is back to working, albeit at a slower speed. I don’t trust it as a front-line system, so I replaced it with an HP Pavilion a6040n. That PC is now my main SLED system. On it, safe and sound, is every file I rescued from the sick computer.

My point in telling you about my misadventure is that, with a little knowledge and the Linux tools that SystemRescueCd brings together for you, you can save your files even from apparently hopeless situations.

Oh, and a final note: SystemRescueCd can also work the same magic on your Windows systems. I can’t recommend this mini-distribution enough for anyone who might face repairing a Unix, Linux, or Windows-based computer.