In this post I will tell a little story about what happened to me today. As I was upgrading the kernel on a server (remote, of course), something very funny (at least when I look at it now) happened. When you upgrade a kernel on a remote server there is always a small chance, even if you are very experienced and have done it many times, that something will not work as expected and the system will not come back online after the reboot. Even though I have plenty of experience with this and can't remember the last time I 'lost' a system while upgrading its kernel, I am always very careful when doing it.
Depending on the datacenter, the server may have different remote management options besides the normal ssh connection: remote serial console, DRAC card (on Dell PowerEdge servers), KVM, or none. In this particular case I had a remote serial console enabled on the system. Since this server is part of a load balancer setup, I could work on it without affecting the site it serves. I took the config file from the previous kernel, verified the changes, compiled, installed, and added the proper entry in grub, as you would expect for a kernel upgrade. After double-checking the grub entry, I logged on to the remote console (since I had it, why not… otherwise I would have rebooted directly) and restarted the system.
The server was stuck…
The system rebooted as expected, I chose the new kernel in grub at boot time, and then, after the normal kernel messages, it stopped at the following line:
Hmm… very strange… I reviewed all the messages above: nothing, no error at all… Still, the system had stopped there and was apparently not doing anything.
Reboot with the original kernel.
OK, I said… no problemo… I had probably done something wrong, as I am very busy and very tired these days. So I thought I would reboot with the previous kernel, double-check everything and see what I did wrong. I rebooted (using the datacenter control panel to reset the system by cutting its power) and started the kernel that had been running previously. Surprise… the exact same thing… the system stopped at the same line. Uff… what a way to start a Saturday morning… why did I start this today?
Trying out various kernel options
So I rebooted the system several times and tried different kernel parameters (acpi=off, apic=verbose, disabling udev, loglevel=7, nosmp, etc.), hoping to understand the real problem. Nothing helped, and the system always stopped at the same place.
What was the real problem? There was nothing wrong really, it was just running fsck…
Finally I realized that the remote serial console was not printing all the messages for me, and that some of them were going to the regular console instead… To get the kernel messages printed on the serial console, I had added the following options to the kernel line:
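and I remembered that with this kind of configuration only the LAST console on the kernel line becomes /dev/console. The kernel's own messages go to every console listed, but everything printed after that by the boot scripts (fsck included) shows up only on the last one, which in my case was the regular console (console=tty0). So I was not seeing everything… Uff… I rebooted with only the serial console enabled (removed console=tty0 completely), and finally I saw that the system was not giving any error at all… It was just running fsck, and since the disk was very big it was taking a very long time to complete.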
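To make the console ordering a bit more concrete, here is a minimal Python sketch that reads a kernel command line and reports which device ends up as /dev/console, which is where the boot scripts (and fsck) write. It relies on the documented kernel behaviour that the last console= option wins. The command lines are only illustrative: console=tty0 comes from my setup, while root=/dev/sda1 and ttyS0,115200n8 are assumed values, and system_console is just a name I made up for this example.

```python
def system_console(cmdline):
    """Return the device that becomes /dev/console, or None without console=."""
    devices = [
        # "console=ttyS0,115200n8" -> "ttyS0" (drop the speed/parity options)
        arg.split("=", 1)[1].split(",", 1)[0]
        for arg in cmdline.split()
        if arg.startswith("console=")
    ]
    # The kernel uses the LAST console= entry as /dev/console.
    return devices[-1] if devices else None


# Hypothetical command lines; only console=tty0 is taken from the story.
before = "ro root=/dev/sda1 console=ttyS0,115200n8 console=tty0"
after = "ro root=/dev/sda1 console=ttyS0,115200n8"

print(system_console(before))  # tty0: fsck progress is invisible on serial
print(system_console(after))   # ttyS0: fsck progress shows up on serial
```

On a running system the same function could be fed the contents of /proc/cmdline. Putting the serial console last on the kernel line, instead of removing console=tty0, should have the same effect for the serial side.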
So there was really nothing wrong with the kernel upgrade, but since I could not see what was happening I assumed that something was broken. If I hadn't had the remote serial console, this would probably have been solved much faster: I would just have seen the system not coming back, assumed that something was wrong, and contacted the datacenter for help (reboot, KVM, etc.). By the time I had reached them and they had taken action, the fsck would probably have finished and the system would have been back online.
Conclusion: things are not always as bad as they seem. If you are in a similar situation and your system is not coming back online as fast as expected, and it has not been rebooted in a long time, there is a chance that fsck is running on your root device. If the root device is big (as it was in this case: 500G, not my install btw), it can take some time to complete. Of course, had I thought that this might happen, I could have checked beforehand with tune2fs, or set the check to run at a different time if needed. I hope this was a fun story to read on a weekend day… now it seems funny to me too. But definitely not at the time :-).
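For the "check beforehand with tune2fs" part, here is a rough Python sketch that reads the text printed by `tune2fs -l` and tells whether the next boot is likely to trigger a full fsck. It only approximates the two usual triggers (maximum mount count reached, check interval elapsed). The sample values below are invented for illustration and are not the output from my server, and the helper names are hypothetical.

```python
from datetime import datetime, timedelta


def parse_tune2fs(text):
    """Turn `tune2fs -l` output into a dict of field name -> value string."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without "name: value" (e.g. the version banner)
            fields[key.strip()] = value.strip()
    return fields


def fsck_due(fields, now):
    """Return the reasons why the next boot would probably run a full check."""
    reasons = []
    mounts = int(fields["Mount count"])
    max_mounts = int(fields["Maximum mount count"])  # -1 or 0 means disabled
    if max_mounts > 0 and mounts >= max_mounts:
        reasons.append("mount count reached")
    # "15552000 (6 months)" -> 15552000 seconds; 0 means no time-based check
    interval = int(fields["Check interval"].split()[0])
    if interval > 0:
        # Assumes English day/month names, as printed in the C locale.
        last = datetime.strptime(fields["Last checked"], "%a %b %d %H:%M:%S %Y")
        if last + timedelta(seconds=interval) <= now:
            reasons.append("check interval elapsed")
    return reasons


# Invented sample, shaped like the relevant lines of `tune2fs -l <device>`.
sample = """\
Mount count:              37
Maximum mount count:      36
Last checked:             Sat Jan  5 10:00:00 2008
Check interval:           15552000 (6 months)
"""

print(fsck_due(parse_tune2fs(sample), now=datetime(2008, 9, 6)))
```

If the list is not empty, expect a long boot on a big disk. The thresholds themselves can be changed with tune2fs, using -c for the maximum mount count and -i for the interval between checks, so the check can be planned for a quieter moment.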
Tune2fs output (after the successful reboot):