As many of you probably know, we had a complete outage today on the site for more than 1 hour, due to our Web server's load being off the charts.
I arrived home from a nice, relaxing hike with the family to find our Web server not responding. A quick look at the server (after waiting almost 5 minutes for the login) revealed this:
Code:
12:58:41 up 457 days, 23:08, 1 user, load average: 255.79, 255.77, 252.85
For those of you who aren't Unix system administrators, the above output from the command "uptime" tells you how long the server has been running since its last reboot, how many users are currently logged in, and how much load the server has experienced over the last 1, 5, and 15 minutes. Two staggering statistics in that output: one, our Web server hasn't been rebooted in 457 days! Two, the load average was off the charts, the highest I've ever seen our Web server at. It typically runs at about 2-4, and on a busy day 5-6. A load average of 255 means roughly 255 processes are waiting for CPU or disk at any given moment, so the server is effectively unresponsive and nothing is going to get done.
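For anyone who wants to pull those numbers apart programmatically, here is a minimal Python sketch that splits an "uptime" line into its fields. It assumes the line looks like the one quoted above (clock, time up, user count, three load averages); the function name parse_uptime and the regular expression are my own illustration, not anything we actually run on the server, and other systems may format the line differently.

```python
import re

# Hypothetical pattern: assumes the layout of the uptime line quoted in the post.
UPTIME_RE = re.compile(
    r"^\s*(?P<clock>\d{1,2}:\d{2}:\d{2})\s+up\s+(?P<up>.+?),\s+"
    r"(?P<users>\d+)\s+users?,\s+load average:\s+"
    r"(?P<l1>[\d.]+),\s+(?P<l5>[\d.]+),\s+(?P<l15>[\d.]+)\s*$"
)


def parse_uptime(line):
    """Split one line of uptime output into clock, time up, users, and load."""
    match = UPTIME_RE.match(line)
    if match is None:
        raise ValueError("unrecognized uptime format: %r" % line)
    return {
        "clock": match.group("clock"),       # wall-clock time of the sample
        "up": match.group("up"),             # time since last reboot, as text
        "users": int(match.group("users")),  # logged-in user sessions
        # load averages over the last 1, 5, and 15 minutes
        "load": (
            float(match.group("l1")),
            float(match.group("l5")),
            float(match.group("l15")),
        ),
    }


if __name__ == "__main__":
    sample = "12:58:41 up 457 days, 23:08, 1 user, load average: 255.79, 255.77, 252.85"
    info = parse_uptime(sample)
    print(info["up"], info["users"], info["load"])
```

The three load figures are the 1-, 5-, and 15-minute averages, so all three sitting near 255 means the overload had already been going on for a good while by the time I logged in.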
I brought down the database server instance (on another server) first to see if that would help, since I couldn't really do anything on the Web server except execute a command and wait 5 minutes for a response. That didn't help, but bringing down Apache (our Web server software) did the trick. After some updates, rebooting the entire infrastructure, and bringing everything back up gracefully, it appears we are OK now.
That is the most downtime we've had in probably 2 years.
So what happened? Well, we don't definitively know. A possible bug in Apache might have caused a memory leak, or an out-of-date kernel might have caused a load problem; who knows. But after some patch updates and a nice reboot, it looks like we're OK.
Warm regards,