-
RAM
So I went back to The Chip Merchant to exchange my RAM. The guy said it was unlikely that both RAM modules were bad and went off to talk to his tech. He came back and said that my report of bad RAM wasn't the first, and that there appears to be an issue with the combination of RAM, motherboard, and processor they sold me. They gave me a different type of RAM and told me that the DDR 400 could only operate at 333, but that the BIOS's Auto setting should detect it. I popped the RAM in, set the BIOS to a RAM frequency of 333, and will cross my fingers. It only took me a few days of futzing around to come to the same conclusion.
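For anyone wanting to double-check a setting like this from inside Linux rather than trusting the BIOS screen, here is a minimal sketch. It assumes the `dmidecode` tool is installed and run as root, which is not something described above, and `memory_speeds` is just a made-up helper name. Not every board reports the speed, so an empty or "Unknown" result doesn't prove anything either way.

```shell
#!/bin/sh
# memory_speeds: hypothetical helper that keeps only the speed-related
# lines from dmidecode-style text on stdin and trims leading whitespace.
# Reading stdin means it also works on output saved to a file earlier.
memory_speeds() {
    grep -i 'speed:' | sed 's/^[[:space:]]*//'
}

# Typical use (assumption: dmidecode is installed and you are root):
#   dmidecode --type memory | memory_speeds
```

If the modules really are running at the lower setting, each populated slot would be expected to show a speed line matching what was set in the BIOS.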
-
Stable server?
Now that my server is rebuilt, my problem is that it keeps crashing with kernel panics, and I see segmentation faults all over the place. All roads point to hardware problems. So how do I solve this? Well, first off, my old memory modules work in the new machine. I installed one of them (512 MB) and the machine seemed to stay up all night, with one exception: I noticed that it had rebooted at 5:32 am. In all the other crashing, it never once rebooted. That got me thinking that the UPS I plugged the machine into (an old one) wasn't powerful enough, and that a surge that should have switched the system to battery failed to do so and the server restarted. At least, that's what I hope happened. So I got to thinking: how could 2 brand new memory modules fail? I remembered that when I was handed the memory, the modules were in adjoining pouches. I checked the serial numbers and they were 12 apart, meaning that they most likely came from the same batch, and if a batch was bad, both modules could be bad. So this evening I used a program called Memtest86, which supposedly tests RAM thoroughly. I popped in each of the new RAM modules one at a time, and after less than a minute each module showed thousands of errors. Then I put both in and after 20 minutes I saw 500+ errors; I'm not sure why the results were different with 1 vs. 2, but it convinced me that there was a real problem. I then tested my 2 old memory modules (slower, but the same capacity) and after an hour they showed no errors.
-
Server Recovery
It sure has been a nightmare getting my server running adequately again. I got almost everything working yesterday, and today I tackled converting to software RAID1 so that I have a mirror. As with most Linux tasks, there is some help on the web. A co-worker pointed me to a document for "crazy sysadmins". I didn't think that applied to me until I re-read it several times and realized that it is almost what I need. I followed the directions and was stoked that things were going smoothly. Then came the hard part: rebooting. I always have problems with grub, fstab, etc. After much Google searching and futzing, I figured out the solution: I had to rebuild the RAM disk image that gets loaded at boot so that it knows to boot off the RAID. This normally wouldn't be necessary, but the default Fedora Core 3 install used an LVM volume and the old initrd file was based on that. So, I figured out that:
mkinitrd -v --preload=raid1 --fstab=/mnt/newroot/etc/fstab initrd-2.6.12-1.1378_FC3.img 2.6.12-1.1378_FC3
worked. It's hard to tell from the documentation what is going on, but if you don't specify the fstab file, it uses the currently active one, which happens to have the LVM mess in it. Just to make sure I didn't screw anything up, I removed the original drive and set up a clean drive as the second drive for the RAID (I bought 4 drives with the idea that 2 were for the RAID and 2 were hot-swappable spares). In about 40 minutes, when the drives finish mirroring, I'll restart the server and see what happens. I'm now more convinced than ever that sysadmins (at least those who run Linux/UNIX machines) don't make enough money. It is extremely frustrating to have a server crash and then to have trouble restoring it. I also forgot to mention that one of the times I was restarting the server, it tripped my UPS and somehow killed it. The UPS definitely has enough capacity for the server, but something went haywire and I have to get it replaced. A new one will be here in 5-7 business days. I do have a spare, but it's significantly smaller.
-
Server crashed again
This time, my backup was corrupted and the server seemed hosed, so I got a new one and started rebuilding from backups. Unfortunately, the backup appears to be corrupt (I think it was the drive, as I later restored parts from another backup from last week and the files came across fine). I still have a long way to go, but mail and web are back up. I hate computers.