When I started working as a sysadmin (about 10 years ago), there was an obsession with uptime. Everyone considered it the greatest sign that you were doing a good job as a sysadmin if you were able ‘to keep the machine running’ for a long time. Looking back, I believe this was mainly because there were not so many systems in place at that time, and everything was in its early days: we were running Linux kernel 2.x, we had some ‘fancy’ Pentiums as super servers, we were doing fancy BGP exchanges with Cisco 3600 routers, and most of our clients were using dial-up lines to connect. Uff… those were fun times ;-) Anyway, we didn’t have failover systems implemented, nor did we have fancy monitoring and reporting on every possible thing, like we started to implement later as the business came to depend on those systems more and more. During that time, every sysadmin I knew would show how good he was by the uptime he was able to reach on one of his ‘core’ servers. When we had to reboot for something (a hardware upgrade, a failure, etc.) it was a tragedy, as we were losing ‘the uptime’.
Now, after all this time, I have realized that I don’t care about this at all. I work with completely different systems, and they are mostly redundant: most of the time, taking down a machine just means scheduling downtime in Nagios so it doesn’t trigger alerts, and it does not affect the system as a whole because failover takes place immediately. I am no longer looking at the uptime of one machine, but at the uptime and reliability of the system the machine belongs to. That is why this caught my eye when I was doing a consulting job and looked at a client’s servers for the first time (he had only 3 machines, independent of each other). I saw this (this is just a copy/paste):
srv01:~# uptime
 20:38:31 up 1119 days, 47 min, 1 user, load average: 0.76, 0.59, 0.68
This means a little over 3 years (so, up since around February 2006)… Wow… I said to myself: the old sysadmin didn’t care about kernel updates (@2.6.15). Then I said… hmm… maybe he was at least doing application upgrades; but looking at MySQL (@4.1.15, btw), it had been up for 1043 days and 11:52 hrs. Should I be impressed? Or disappointed by a poor job and a lack of interest in system maintenance and upgrades from the previous admin? I was disappointed, of course…
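The arithmetic behind ‘a little over 3 years’ is easy to check with a few lines of Python. This is only a sketch for illustration: the function names uptime_in_years and boot_time are made up here, and the 365.25-day year is an approximation.

```python
from datetime import datetime, timedelta


def uptime_in_years(days: int) -> float:
    """Convert an uptime in days to years (365.25 days per year, an approximation)."""
    return days / 365.25


def boot_time(now: datetime, days: int, hours: int = 0, minutes: int = 0) -> datetime:
    """Estimate when the machine booted, given the time `uptime` was run."""
    return now - timedelta(days=days, hours=hours, minutes=minutes)


# Figures taken from the uptime and MySQL numbers quoted above.
print(f"server: {uptime_in_years(1119):.2f} years")  # server: 3.06 years
print(f"mysql:  {uptime_in_years(1043):.2f} years")  # mysql:  2.86 years
```

To get the actual boot date, pass the moment the uptime command was run as now, for example boot_time(datetime.now(), 1119, 0, 47) when run on the server itself.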
I then realized that I was looking at this situation through the lens of my recent work and experience. So I asked myself: “does uptime matter or not?” My answer today would be that of course it does matter, regardless of whether we are looking at one individual system or at a bigger setup that is fully redundant. Still, I would never sacrifice security and application updates for it. We still need maintenance windows in which we keep the systems updated, secured and in good shape, even if this means rebooting them from time to time to fix whatever kernel bugs come up.
I am interested to hear your opinion… What do you think? Does uptime still matter?