RadioReference Outage Today 2/22/2009

Status
Not open for further replies.

blantonl

Founder and CEO
Staff member
Super Moderator
Joined
Dec 9, 2000
Messages
11,591
Reaction score
6,812
Location
Dallas, TX
As many of you probably know, we had a complete outage today on the site for more than 1 hour, due to our Web server's load being off the charts.

I arrived home from a nice relaxing hike with the family to see that our Web server was not responding, so a quick look at the server revealed this (after waiting almost 5 minutes for the login):

Code:
12:58:41 up 457 days, 23:08,  1 user,  load average: 255.79, 255.77, 252.85

For those of you who aren't Unix system administrators, the above output from the command "uptime" tells you primarily how long the server has been running since its last reboot, how many users are logged on to the machine's consoles, and how much load the server is experiencing. Two staggering statistics in that output: one, our Web server hasn't been rebooted in 457 days! Two, the load average was off the charts, the highest I've ever seen our Web server at. It typically runs at about 2-4, and on a busy day 5-6. At 255, the server is unresponsive and nothing is going to happen.
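For anyone who wants to pull those fields out of an "uptime" line programmatically, here is a minimal sketch. The helper name parse_uptime and the regular expression are illustrative assumptions, not something we run on the site, and the pattern only targets the Linux-style layout pasted above. The three load figures are the 1-, 5-, and 15-minute averages.

```python
import re

# Hypothetical helper: the pattern assumes the Linux-style "uptime" layout
# shown in the post (clock, "up <duration>", user count, three load averages).
UPTIME_RE = re.compile(
    r"^\s*(?P<clock>\d{1,2}:\d{2}:\d{2})\s+up\s+(?P<up>.+?),\s+"
    r"(?P<users>\d+)\s+users?,\s+"
    r"load average:\s+(?P<l1>[\d.]+),\s+(?P<l5>[\d.]+),\s+(?P<l15>[\d.]+)\s*$"
)


def parse_uptime(line):
    """Split one line of uptime output into its four pieces of information."""
    match = UPTIME_RE.match(line)
    if match is None:
        raise ValueError("unrecognized uptime output: %r" % line)
    return {
        "clock": match.group("clock"),      # wall-clock time of the sample
        "up": match.group("up"),            # time since the last reboot
        "users": int(match.group("users")), # logged-in users
        # 1-, 5-, and 15-minute load averages, in that order
        "load": (
            float(match.group("l1")),
            float(match.group("l5")),
            float(match.group("l15")),
        ),
    }


sample = "12:58:41 up 457 days, 23:08,  1 user,  load average: 255.79, 255.77, 252.85"
info = parse_uptime(sample)
print(info["up"], info["users"], info["load"])
```

Since all three averages are nearly equal here, the load had been pinned at that level for at least the previous 15 minutes.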

I brought down the database server instance (on another server) first to see if that would help, since I couldn't really do anything on the Web server except execute a command and wait 5 minutes for a response. That didn't help, but bringing down Apache (our Web server software) did the trick. After some updates, rebooting the entire infrastructure, and bringing everything back up gracefully, it appears we are OK now.

That is the most downtime we've had in probably 2 years.

So what happened? Well, we don't definitively know. A possible bug in Apache might have caused a memory leak, an outdated kernel might have caused a load problem, who knows. But after some patch updates and a nice reboot, it looks like we're OK.

Warm regards,
 

burner50

The Third Variable
Joined
Dec 24, 2004
Messages
2,305
Reaction score
171
Location
NC Iowa
Thanks for the swift response!


I guess the server was due for a reboot.
 

fineshot1

Member
Joined
Sep 17, 2004
Messages
2,531
Reaction score
21
Location
NJ USA (Republic of NJ)
Thanks Lindsay - that is a long time for no reboot, even for Unix. I used to do reboots on the one Unix platform I was responsible for (about 10 years ago) at about 90-day intervals whether I thought it needed it or not. Most of the time it was a log file or log files that would not auto-purge and got too large that would gum up the works and make the system unstable.
 

epilab

Member
Premium Subscriber
Joined
Jul 29, 2007
Messages
78
Reaction score
8
Location
Mass
At least you did not get a virus that could have killed everyone
 

bootsfirst

Member
Database Admin
Joined
Sep 3, 2007
Messages
61
Reaction score
0
Location
Kansas
DDoS, perhaps?

But I think it'd continue even after the server was brought back up if that was the case...shrug.

Since we're on the subject, how about a little geek porn with you telling us all about your servers? (hardware, setup, etc etc)
 

chrismol1

Active Member
Joined
Mar 15, 2008
Messages
1,416
Reaction score
1,341
Since we're on the subject, how about a little geek porn with you telling us all about your servers? (hardware, setup, etc etc)
ohh baby!
show me them quad core xeons!
 

blantonl

Founder and CEO
Staff member
Super Moderator
Joined
Dec 9, 2000
Messages
11,591
Reaction score
6,812
Location
Dallas, TX
Here are the specs on the servers:

Production Environment - hosted in data center

1 Web Server
Dell PowerEdge 440
Dual Core Intel Xeon 3040
2GB RAM
250GB SATA II Drives
100 Mbps Public Facing Port
100 Mbps VPN Port

1 Database Server
Dell PowerEdge 1430
(2) Dual Core Intel Xeon 5120
4GB RAM
250GB SATA II Drives
100 Mbps Public Facing Port
100 Mbps VPN Port

Staging Server - Home

PowerEdge 2950
(2) Quad Core Intels
12 GB RAM
2TB RAID-50

Development

My Mac Pro Workstation, 2 quad core 6GB RAM
 

roots

Member
Joined
Sep 27, 2006
Messages
191
Reaction score
0
Location
QC, Canada
That was the longest and most boring day at work I've ever had :p

Good to hear that everything is back to normal!
 

mtindor

FMP24 PRO USER
Database Admin
Joined
Dec 5, 2006
Messages
11,991
Reaction score
3,243
Location
Carroll Co OH / EN90LN
Code:
12:58:41 up 457 days, 23:08,  1 user,  load average: 255.79, 255.77, 252.85

Heh, that isn't pretty. I agree, I haven't seen a load that high in 19 years of Unix. And I'm shocked you were able to even run top. Typically, on a machine with a load above 25 it's difficult / time consuming to even gain access via SSH, and even console access / program execution lags significantly. I'll just say I hope I never have to deal with a server load that high :)

Mike
 

K8TEK

Completely Banned for the Greater Good
Banned
Joined
Jul 13, 2004
Messages
681
Reaction score
3
Location
Ohio
Let me get this straight, you run a single dual core processor on your web server and experience loads of 2-4 on normal days? You do realize this means the machine is underpowered by at least half of what it needs to be.
 

Bubba1661

WV DB Admin
Database Admin
Joined
Oct 19, 2008
Messages
934
Reaction score
23
Location
Charleston W.V.
Don't Know About Everybody Else But I Was Going Into The DT'S. Good Job On Getting The Site Back Up!
 

mtindor

FMP24 PRO USER
Database Admin
Joined
Dec 5, 2006
Messages
11,991
Reaction score
3,243
Location
Carroll Co OH / EN90LN
Let me get this straight, you run a single dual core processor on your web server and experience loads of 2-4 on normal days? You do realize this means the machine is underpowered by at least half of what it needs to be.

It's not all about CPU usage. There are a lot of factors taken into consideration when load averages are determined, including network activity, disk activity, CPU resources used, processes in use, resources needed by processes but awaiting access to that resource, etc. A load average of 2-4 would be an indication that a server has some activity, i.e. it is doing more than just idling and is fairly active. But it doesn't mean that the server is underpowered or that the server cannot serve requests in a timely fashion. I see servers running all the time, particularly ones processing mail and performing antispam filtering, that can handle tens of thousands or hundreds of thousands of SpamAssassin scanning events per day and do it without fail, day in and day out, 365 days a year, with load averages ranging from 5 to 20 at any given time.

I'm not saying the server isn't working, but the load average is no indication of what the server is capable of delivering, nor is it the sole determinant of whether it is suffering.
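To put rough numbers on the disagreement, here is a small sketch that divides a load average by the core count. The function names are hypothetical and this is only a rule-of-thumb calculation, not a capacity measurement. On Linux the load average counts tasks that are runnable plus tasks blocked in uninterruptible waits (usually disk I/O), so a per-core figure above 1 does not by itself prove the CPU is the bottleneck.

```python
import os


def load_per_core(load_avg, cores):
    """Divide a load average by the number of CPU cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_avg / cores


def current_load_per_core():
    """Read the live 1-minute load (Unix only) and normalize it by core count."""
    one_minute, _five, _fifteen = os.getloadavg()
    return load_per_core(one_minute, os.cpu_count() or 1)


# Figures quoted in this thread, for a single dual-core CPU (2 cores).
typical_low = load_per_core(2, 2)        # 1.0 task per core
typical_high = load_per_core(4, 2)       # 2.0 tasks per core
during_outage = load_per_core(255.79, 2) # roughly 128 tasks per core

print(typical_low, typical_high, round(during_outage, 1))
```

A normal-day figure of 1 to 2 per core means some tasks are queuing or waiting on I/O at times; whether that hurts response time depends on what they are waiting for, which the load average alone does not say.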

Mike
 

Caesar

Member
Joined
Apr 11, 2005
Messages
272
Reaction score
33
Location
Lexington, SC
Wow, great job Lindsay. I was suffering through the day, and I finally broke down and went to take an afternoon nap. Luckily, I got back up and saw rr.com is back up! :)
 

blantonl

Founder and CEO
Staff member
Super Moderator
Joined
Dec 9, 2000
Messages
11,591
Reaction score
6,812
Location
Dallas, TX
Let me get this straight, you run a single dual core processor on your web server and experience loads of 2-4 on normal days? You do realize this means the machine is underpowered by at least half of what it needs to be.

Cue the know-it-alls... :roll: :cool:
 