Things are not always as bad as they seem...

In this post I will present a little story of what happened to me today. As I was working to upgrade the kernel on one server (remote of course), something very funny (at least if I look at it now) happened. When upgrading a kernel on a remote server there is always a chance (even if you are very experienced and done this several times, still there is a small chance) that something will not work as expected and when rebooting the system to no have it back online. Even though I have a good experience on doing this and I can’t remember since I have ’lost’ a system when upgrading its kernel, I am always very careful when doing this.

Depending from the datacenter the server may have different remote management options besides the normal ssh connection: remote serial console, DRAC card (on Dell PowerEdge servers), KVM, or none. In this particular case I had a remote serial console enabled on the system. Since this server is in a load balancer setup, I could work on it without any problem, without affecting the site it is serving. I took the kernel config file from the previous kernel, verified the changes, compiled, installed, added the proper entry in grub, as you would expect on a kernel upgrade. After double-checking the grub entry again, I have logged on the remote console (if I had it, why not… if not I would have rebooted directly), and restarted the system.

The server was stuck…

The system rebooted as expected and I’ve chosen the new kernel in grub at boot time and then after normal kernel messages it stopped at the following line:

Adding 2048276k swap on /dev/sda2.  Priority:-1 extents:1 across:2048276k

hmm… very strange… I reviewed all the messages above, nothing, no error at all… Still the system was stopped there and was apparently not doing anything.

Reboot with the original kernel.

Ok, I said… no problemo… I have done probably something wrong as I am very busy and very tired these days… so maybe I have done something wrong… So I thought I will reboot with the previous kernel and double check again and see what I did wrong. Rebooted (using the datacenter control panel to reset the system by cutting its power), and started the kernel that was running previously. Surprise… the exact same thing… the system was stopping at the same line. …uff… what a way to start a Saturday morning… why did I started this today?

Trying out various kernel options

So I have started and rebooted several times the system and entered different kernel parameters (acpi=off, apic=verbose, disable udev, loglevel=7, nosmp, etc.) hoping that I will understand the real problem. Nothing helped and the system was always stopping at the same place.

What was the real problem? there was nothing wrong really, it was just running fsck…

Finally I realized that the remote serial console was not printing all the messages for me and it was redirecting them to the regular console… In order to have the kernel messages printed at the serial console I had to the kernel line added the following options:

console=ttyS0,9600n8 console=tty0

and I remembered that this kind of configuration will print ALL the kernel messages only on the LAST console. So I was not seeing everything… Uff… Rebooted with only the serial console enabled (removed console=tty0 completely). and finally I have seen that the system was not giving any error at all… It was just running fsck and since the disk was very big it was taking very long to complete:

Will now activate swap.
Adding 2048276k swap on /dev/sda2.  Priority:-1 extents:1 across:2048276k
swapon on /dev/sda2
Done activating swap.
Will now check root file system.
fsck 1.39 (29-May-2006)
[/sbin/fsck.ext3 (1) -- /] fsck.ext3 -a -C0 /dev/sda3
/dev/sda3 has gone **188 days without being checked**, check forced.
/dev/sda3: |=============                                           \ 22.8%

So there was really nothing wrong with the kernel upgrade, but since I have not seen what was happening I was assuming that there was something broken. If I didn’t had the remote serial console this would have been solved much faster since I would have just seen the system not starting, and I would have assumed that there is something wrong, and contacted the datacenter for help (reboot, KVM, etc); until I would have found them and they would had taken action the fsck would have probably finished and the system was back online.

Conclusion: things are not always as bad as they seem. If you have a similar situation and your system is not coming back online as fast as expected, in case it was not rebooted in a long time there might be a chance to have fsck running on your root device. If the root device is big (how it was in this case 500G - not my install btw) then it can take some time to complete. Of course that if I thought that this might happen I could have seen this before with tune2fs, or set it to do this check on a different time if needed. I hope that this was a fun story to read on a weekend day… now it seems funny to me also. But definitely not at that time :-).

Tune2fs output (after the successful reboot):

**tune2fs -l /dev/mapper/VolGroup00-LogVol00**
tune2fs 1.39 (29-May-2006)
Filesystem volume name:
Last mounted on:
Filesystem UUID:          3a895332-838f-41f5-8d1b-5758de68d0f8
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal filetype needs_recovery sparse_super large_file
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              60735488
Block count:              121462784
Reserved block count:     6073139
Free blocks:              101837359
Free inodes:              60358238
First block:              0
Block size:               4096
Fragment size:            4096
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         16384
Inode blocks per group:   512
Filesystem created:       Sun Dec  4 14:27:31 2005
Last mount time:          Sat Jul 22 06:03:27 2006
Last write time:          Sat Jul 22 06:03:27 2006
Mount count:              2
Maximum mount count:      37
Last checked:             Sat Jul 22 05:53:50 2006
** Check interval:           15552000 (6 months)**
** Next check after:         Thu Jan 18 04:53:50 2007**
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               128
Journal inode:            8
First orphan inode:       28311571
Default directory hash:   tea
Directory Hash Seed:      e61e1b99-d7e8-44d1-8c22-8cd72823b5c0
Journal backup:           inode blocks
comments powered by Disqus