  #1 (permalink)  
12-11-2012, 5:23 PM
blantonl
Founder and CEO
RadioReference Database Administrator
Join Date: Dec 2000
Location: Whitefish, MT
Posts: 8,236

Sunday 12/9/2012 Outage Postmortem

As many of you are aware, we had an outage on Sunday afternoon that lasted approximately 3 hours. During that time, RadioReference.com and Broadcastify.com were inaccessible over the Web.

The cause:

An invalid proxy entry for read-only database queries on our Web servers. We use a piece of software called haproxy to load-balance and proxy all read-only database queries to our back-end database servers. This allows us to bring database servers online and offline without impacting Web users. In this case, the proxy believed that a server was up and running when in fact it wasn't, so it sent that server queries that "hung" and quickly brought the Web servers to a halt.
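
To illustrate the failure mode described above, here is a minimal Python sketch of the idea behind a proxy health check: only route queries to back ends that pass a probe with a hard timeout. This is not the haproxy configuration involved in the outage, which the post does not show. The names `tcp_probe` and `healthy_backends` and any addresses are hypothetical. Note also that a plain TCP connect can succeed while a database is unable to answer queries, so a real check would probably need to run an actual query.

```python
import socket
from typing import Callable, Iterable, List, Tuple

# (host, port) of a read-only database server; values used are hypothetical.
Backend = Tuple[str, int]


def tcp_probe(backend: Backend, timeout: float = 2.0) -> bool:
    """Return True only if a TCP connection completes within `timeout` seconds.

    Assumption: a TCP connect stands in for a real health check here.
    """
    try:
        with socket.create_connection(backend, timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def healthy_backends(
    backends: Iterable[Backend],
    probe: Callable[[Backend], bool] = tcp_probe,
) -> List[Backend]:
    """Keep only the back ends that currently pass the probe."""
    return [b for b in backends if probe(b)]
```

The probe is passed in as a parameter so that a stricter check (for example, one that runs a trivial query with a timeout) could be swapped in without changing the routing logic.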

The reason it took so long to get fixed:

1) Alerts that the site was down went unanswered for almost 1.5 hours. The person responsible (me) did not have his cell phone on him during the outage and, on returning home, received a text message from one of our senior managers asking what the problem was. Inexcusable, but it happened.

2) After I realized the site was down, problem determination took additional time because the issue was small but subtle and difficult to isolate (translation: I had no idea what was actually wrong for about 1.5 hours). There were a lot of steps and items to check to isolate the root cause.

Going forward:

1) We will do a better job of receiving and reacting to alerts.

2) We will evaluate our database haproxy configurations on our Web servers to prevent this from occurring again.
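
One way the first item could be approached is an escalation chain, where an alert that goes unacknowledged is passed on to a second person. The following is a rough Python sketch under that assumption, not a description of the alerting setup actually in use. The `Contact` class, the `contact_to_page` function, and the 10-minute windows are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Contact:
    name: str
    ack_window_s: int  # seconds this contact has to acknowledge before escalation


def contact_to_page(chain: List[Contact], seconds_unacknowledged: int) -> Contact:
    """Return the contact who should be paged for an alert that has gone
    unacknowledged for `seconds_unacknowledged` seconds.

    Once every window has expired, the last contact keeps being paged.
    """
    if not chain:
        raise ValueError("escalation chain must not be empty")
    deadline = 0
    for contact in chain:
        deadline += contact.ack_window_s
        if seconds_unacknowledged < deadline:
            return contact
    return chain[-1]


if __name__ == "__main__":
    # Hypothetical chain: primary admin first, then a backup after 10 minutes.
    chain = [Contact("primary", 600), Contact("backup", 600)]
    for elapsed in (0, 600, 5400):
        print(elapsed, contact_to_page(chain, elapsed).name)
```

With a chain like this, an alert ignored for 10 minutes would reach a second person instead of waiting on a single phone.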

TL;DR: Web servers went down due to misconfiguration, administrator (me) didn't respond for almost 1.5 hours on Sunday afternoon to alerts, took another 1.5 hours to fix.
__________________
Lindsay C. Blanton III
CEO - RadioReference.com / Broadcastify
Facebook: RadioReference | Broadcastify | Twitter: @RadioReference
  #2 (permalink)  
12-11-2012, 5:34 PM
Redneck0410
Member
Join Date: Mar 2008
Location: Hutchinson, KS
Posts: 392

Thanks for the update. I'm sure a lot of people were curious as to what was going on. I know I was. Next time, may we suggest you double-check that your cell phone is with you before you leave the house/office? *chuckle* These types of things happen, and thankfully our member base is very understanding.
  #3 (permalink)  
12-11-2012, 5:40 PM
Member
Premium Subscriber
Amateur Radio Operator
Join Date: Nov 2004
Location: Florida!
Posts: 1,425

Make a note to yourself saying, "Hey boss, don't forget your cell phone if you leave to go someplace."
  #4 (permalink)  
12-11-2012, 5:46 PM
Member
Premium Subscriber
Join Date: Jan 2007
Posts: 1,872

Quote:
Originally Posted by blantonl View Post
Sunday 12/9/2012 Outage Postmortem

1) Alerts that the site went down went unanswered for almost 1.5 hours. The person responsible (me) did not have his cell phone on his person during the outage, and when returning home received a text message from one of our senior managers querying what the problem was. Inexcusable, but it happened.

Hmmmmm... No Bombay Sapphire Martini for you today!!

On second thought, I guess that really would be "cruel and unusual punishment".

Thanks for the report!

  #5 (permalink)  
12-11-2012, 6:00 PM
mjbjr
Member
Join Date: Dec 2009
Location: Macon, GA, USA
Posts: 409

Sometimes I wish I could forget my phone, lol.
__________________
Macon, Georgia Police, Fire and EMS

Running off a Pro-197
  #6 (permalink)  
12-11-2012, 6:13 PM
k3td
Member
Premium Subscriber
Amateur Radio Operator
Join Date: May 2003
Location: Georgetown, TX
Posts: 189

Lindsay, thank you for the superb resource and excellent support!
__________________
Tad, K3TD
EM10dq
  #7 (permalink)  
12-11-2012, 6:27 PM
kc8dhx
Member
Premium Subscriber
Amateur Radio Operator
Join Date: May 2010
Location: Cannon Twp
Posts: 45

Been there, done that. Unfortunately, this is how we learn not to make the same mistakes twice as administrators. Thanks for the update and your service.
  #8 (permalink)  
12-12-2012, 2:14 PM
Member
Premium Subscriber
Join Date: Sep 2009
Location: Rochester, NY
Posts: 113

Lindsay, I DON'T envy your job! There's a responsibility that grows bigger with each and every member who joins here, and this has become a very big community. I don't play the one-upmanship game that many do, but if I could make a suggestion: with the size that this thing has become, maybe have a backup for urgent notifications, maybe a second person. Things do happen, and personally, I do understand that (and hope everyone else does as well). Please don't beat yourself up too badly over this fluke. You and your people are doing a great job that I know I appreciate!
__________________
Dave Knapp

"An undependable radio and/or system is unsafer than having no radio at all"....
  #9 (permalink)  
12-12-2012, 3:03 PM
Member
Join Date: Feb 2003
Posts: 1,087

I never even noticed that it went down .... and I am sure that I am not the only one. So no big deal, but thanks for letting us know what happened.

As a side note .. I belong to another web site and they went down this weekend for like 2 days with "hardware problems". They didn't say much of anything to anyone, and really it is refreshing to see just the opposite happening here at RR. I just wish that the other site had as good support as RadioReference has.
__________________
Scanners: PSR-500, Pro 2004, Pro 2006, BC780XLT, BC246T
Software: Unitrunker, Etrunker *gotta love DOS ? PSREdit500

Scanning almost as long as Harry Shute
  #10 (permalink)  
12-13-2012, 7:03 AM
Dude111
Member
Join Date: Aug 2009
Posts: 318

I'm sorry this happened, guys. I didn't come to the site on Sunday, so I didn't realize (otherwise I would have emailed ya straight away). I'm glad ya got it corrected.

All times are GMT -5.

All information here is Copyright 2012 by RadioReference.com LLC and Lindsay C. Blanton III.