Sunday 12/9/2012 Outage Postmortem

Status
Not open for further replies.

blantonl

Founder and CEO
Staff member
Super Moderator
Joined
Dec 9, 2000
Messages
11,098
Location
San Antonio, Whitefish, New Orleans
As many of you are aware, we had an outage on Sunday afternoon that lasted approximately 3 hours. During that time, all Web access to RadioReference.com and Broadcastify.com was unavailable.

The cause:

An invalid proxy entry for read-only database queries on our Web servers. We use a piece of software called haproxy to load balance and proxy all read-only database queries to our back-end database servers. This allows us to bring database servers online and offline without impacting Web users. In this case, the proxy believed that a server was up and running when in fact it wasn't, so it kept sending that server queries that "hung," which quickly brought the Web servers to a halt.
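To illustrate the kind of check that keeps a dead server out of rotation, here is a minimal Python sketch. It is not our haproxy configuration: the function names, the (host, port) pairs, and the 2-second timeout are all assumptions made up for the example. The idea is simply that a backend only counts as "up" if it answers within a bounded time, so nothing waits forever on a server that is gone.

```python
import socket

def is_backend_up(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True only if a TCP connection completes within `timeout` seconds.

    Hypothetical helper for illustration; a bounded timeout means a dead
    server yields a quick False instead of a connection attempt that hangs.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False

def healthy_backends(backends, check=is_backend_up):
    """Keep only the (host, port) pairs that pass the health check."""
    return [backend for backend in backends if check(*backend)]
```

A TCP connect only proves that something is listening on the port. A server that accepts connections but never answers queries would still pass, so a stricter check would run a trivial query with its own timeout.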

The reason it took so long to get fixed:

1) Alerts that the site was down went unanswered for almost 1.5 hours. The person responsible (me) did not have his cell phone on him during the outage, and on returning home found a text message from one of our senior managers asking what the problem was. Inexcusable, but it happened.

2) After realizing the site was down, problem determination took additional time because the issue was very small but complex, and difficult to isolate (translation: I had no idea what was actually wrong for about 1.5 hours). There were a lot of steps and items to check to isolate the root cause.

Going forward:

1) We will do a better job of receiving and reacting to alerts.

2) We will evaluate our database haproxy configurations on our Web servers to prevent this from occurring again.
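
One way the first item could be approached is with alert escalation: if the first person paged does not acknowledge within a set window, the next contact gets paged too. The sketch below is only an illustration under assumed names. `Alert`, `who_to_page`, the contact list, and the 15-minute window are hypothetical and do not describe our actual alerting setup.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Alert:
    raised_at: float                  # time the alert fired, in seconds
    acked_at: Optional[float] = None  # set once someone acknowledges it

def who_to_page(alert: Alert, now: float, contacts: List[str],
                escalate_after: float = 900.0) -> List[str]:
    """Return the contacts who should have been paged by time `now`.

    The first contact is paged immediately; one more contact is added
    for every `escalate_after` seconds the alert stays unacknowledged.
    """
    if alert.acked_at is not None:
        return []  # someone is already on it
    elapsed = max(0.0, now - alert.raised_at)
    level = int(elapsed // escalate_after)
    return contacts[: min(level + 1, len(contacts))]
```

With contacts `["primary", "backup"]`, an alert that is one minute old pages only the primary, and one that has sat unacknowledged for more than 15 minutes pages both.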

TL;DR: Web servers went down due to misconfiguration, administrator (me) didn't respond for almost 1.5 hours on Sunday afternoon to alerts, took another 1.5 hours to fix.
 

Redneck0410

Member
Joined
Mar 3, 2008
Messages
1,002
Location
Hutchinson, KS
Thanks for the update. I'm sure a lot of people were kind of curious as to what was going on. I know I was. Next time, may we suggest you double-check to make sure your cell phone is with you before you leave the house/office? *chuckle* These types of things happen, and thankfully our member base is very understanding.
 

N1XDS

ÆS Ø
Joined
Nov 3, 2004
Messages
1,932
Make a note to yourself saying, "Hey boss, don't forget your cell phone if you leave to go someplace." :D
 

fxdscon

¯\_(ツ)_/¯
Premium Subscriber
Joined
Jan 15, 2007
Messages
7,160
blantonl said:
1) Alerts that the site went down went unanswered for almost 1.5 hours. The person responsible (me) did not have his cell phone on his person during the outage, and when returning home received a text message from one of our senior managers querying what the problem was. Inexcusable, but it happened.

Hmmmmm... No Bombay Sapphire Martini for you today!!

On second thought, I guess that really would be "cruel and unusual punishment". :eek:

Thanks for the report!

 

kc8dhx

Member
Premium Subscriber
Joined
May 26, 2010
Messages
49
Location
Gainesville FL
Been there, done that. Unfortunately, this is how we learn not to make the same mistakes twice as administrators. Thanks for the update and your service.
 

cdknapp

Member
Premium Subscriber
Joined
Sep 26, 2009
Messages
115
Location
Rochester, NY
Lindsay, I DON'T envy your job! There's a responsibility that grows bigger with each and every member that joins here, and this has become a very big community. I don't play the one-upmanship game that many do, but if I could make a suggestion: with the size that this thing has become, maybe have a backup for urgent notifications, maybe a second person. Things do happen, and personally, I do understand that (and hope everyone else does as well). Please don't beat yourself up too badly over this fluke. You and your people are doing a great job that I know I appreciate!
 

SCPD

QRT
Joined
Feb 24, 2001
Messages
0
Location
Virginia
I never even noticed that it went down, and I am sure that I am not the only one. So no big deal, but thanks for letting us know what happened.

As a side note, I belong to another web site that went down this weekend for about 2 days with "hardware problems". They didn't say much of anything to anyone, and really it is refreshing to see just the opposite happening here at RR. I just wish that the other site had support as good as RadioReference has.
 

Dude111

An Awesome Dude
Joined
Aug 8, 2009
Messages
446
I'm sorry this happened, guys. I didn't come to the site on Sunday, so I didn't realise (otherwise I would have emailed ya straight away). I'm glad ya got it corrected :)
 