RR down for 95 min.

Status
Not open for further replies.

w2xq

Mentor
Joined
Jul 13, 2004
Messages
2,363
Location
Burlington County, NJ
RadioReference via Amazon went down Mon Jul 16 at 2:00 pm EDT. Amazon hosts 72.21 and 216.182 were the last viable hops in the Amazon server farm. The service was restored at 3:35 pm.

Looks like Amazon should be hiring IT people.
 

N1XDS

ÆS Ø
Joined
Nov 3, 2004
Messages
2,014
That's why I had problems signing in to the forums... Seems everything is fixed and fast now.
 

blantonl

Founder and CEO
Staff member
Super Moderator
Joined
Dec 9, 2000
Messages
11,418
Location
San Antonio, Whitefish, New Orleans
w2xq said: "Looks like Amazon should be hiring IT people."

More like, looks like RadioReference should be hiring people :) This was our fault.

What happened is that one of our database slave servers, which is one of many that serve read-only requests, went into an unknown, inconsistent state. This caused Web requests for authenticated users (logged-in members) to hang. Eventually all of our Web servers got bogged down with the hung connections and were not available to serve new requests to anyone.
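To illustrate the failure mode described above, here is a minimal sketch of one common mitigation: put a timeout on every read sent to a replica and stop routing to a replica that hangs, so stuck connections cannot pile up on the Web tier. This is not RadioReference's actual code. The `ReplicaPool` class, the replica names, and the timeout value are all hypothetical, and the replicas are plain Python callables standing in for real database connections.

```python
import concurrent.futures
import time


class ReplicaPool:
    """Route read-only queries to replicas, skipping any that hang or fail."""

    def __init__(self, replicas, timeout_s=0.5):
        # replicas: dict of name -> callable(query) returning a result.
        # Both the names and the callables are hypothetical stand-ins.
        self.replicas = dict(replicas)
        self.timeout_s = timeout_s
        self.unhealthy = set()
        self._executor = concurrent.futures.ThreadPoolExecutor(max_workers=8)

    def read(self, query):
        for name, run_query in self.replicas.items():
            if name in self.unhealthy:
                continue
            future = self._executor.submit(run_query, query)
            try:
                # Give up on this replica if it does not answer in time.
                return name, future.result(timeout=self.timeout_s)
            except Exception:
                # Timeout or error: stop sending traffic to this replica.
                self.unhealthy.add(name)
        raise RuntimeError("no healthy read replica available")


def hung_replica(query):
    # Stands in for a replica stuck in an inconsistent state.
    time.sleep(0.5)
    return "late"


def good_replica(query):
    return f"rows for {query}"


pool = ReplicaPool(
    {"replica-1": hung_replica, "replica-2": good_replica},
    timeout_s=0.05,
)
name, rows = pool.read("SELECT 1")
print(name, rows)
```

In this sketch the first read times out on `replica-1`, marks it unhealthy, and falls through to `replica-2`; later reads skip the bad replica entirely. A real deployment would also need a way to re-check and re-admit a replica once it recovers.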

We were down for more than an hour because we had a hard time determining exactly what the root cause of the problem was and how to fix it. We started by taking the entire site offline, then checked each of the 6 Web/API servers and 4 database servers. After about 60 minutes we were finally able to begin restoring full service, and about 120 minutes after the initial outage we had all RR services fully available again (www, forums, wiki, archive servers, audio servers, stats, etc.).

There are a lot of moving parts to RadioReference.com, and as a team we were obviously frustrated today about what happened.

EDIT: Thanks, folks, for being patient while we worked through this. I really appreciate it.
 

blantonl

Founder and CEO
Staff member
Super Moderator
Joined
Dec 9, 2000
Messages
11,418
Location
San Antonio, Whitefish, New Orleans
By the way, just so everyone knows, the infrastructure that makes up RadioReference.com includes the following:

(1) proxy server (load balancer)
(6) Web/API servers
(1) memcache cluster
(5) Database servers - 1 master and n+1 slave servers (currently 2 slave servers), 1 backup database server, 1 FCC database server
(1) NFS / Administrative Server
(8) audio servers (relay and master servers) - we can scale this to 8+n+1 in the case of a major event
(5) audio archive servers
(1) development server
(1) mail server (soon to be retired)
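
For anyone who wants to tally the list above, here is a small sketch that records the counts as data and checks that the database breakdown adds up. The dictionary names are made up for illustration, and the memcache cluster is counted as a single entry because the post does not say how many nodes it has.

```python
# Counts copied from the list above; key names are illustrative only.
INVENTORY = {
    "proxy / load balancer": 1,
    "web/API": 6,
    "memcache cluster": 1,  # node count not given, so counted once
    "database": 5,
    "NFS / administrative": 1,
    "audio (relay and master)": 8,
    "audio archive": 5,
    "development": 1,
    "mail": 1,
}

# Breakdown of the 5 database servers as described in the post.
DATABASE_ROLES = {"master": 1, "slave": 2, "backup": 1, "FCC": 1}

total = sum(INVENTORY.values())
database_total = sum(DATABASE_ROLES.values())

print("total entries:", total)
print("database roles match:", database_total == INVENTORY["database"])
```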
 

w2xq

Mentor
Joined
Jul 13, 2004
Messages
2,363
Location
Burlington County, NJ
No problem with patience here. Been around this stuff more than 35 years. I -always- look at a traceroute before saying the web site itself is down. The IP identifications were all Amazon; RR must have been the first unidentifiable hop, hop 14. Sorry, Amazon... :evil:
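
The traceroute check described above can be sketched roughly as follows: read the traceroute text and report the last hop that actually answered. This is only an illustration. The `last_responding_hop` function is hypothetical, the sample output is invented, and the addresses come from the reserved documentation ranges rather than from the real trace.

```python
import re


def last_responding_hop(traceroute_output):
    """Return (hop_number, address) of the last hop that answered, or None."""
    last = None
    for line in traceroute_output.splitlines():
        # Hop lines start with the hop number; the header line does not.
        match = re.match(r"\s*(\d+)\s+(.*)", line)
        if not match:
            continue
        hop, rest = int(match.group(1)), match.group(2)
        # A hop that timed out shows "* * *" and has no IPv4 address.
        addr = re.search(r"\b(\d{1,3}(?:\.\d{1,3}){3})\b", rest)
        if addr:
            last = (hop, addr.group(1))
    return last


# Invented sample output, using documentation-only address ranges.
SAMPLE = """\
traceroute to example.invalid (192.0.2.50), 30 hops max
 1  192.0.2.1  1.2 ms  1.1 ms  1.0 ms
 2  198.51.100.7  9.8 ms  9.9 ms  10.1 ms
 3  203.0.113.21  22.4 ms  22.0 ms  22.9 ms
 4  * * *
 5  * * *
"""

print(last_responding_hop(SAMPLE))
```

As the rest of the thread shows, the last responding hop only tells you where the trail goes cold, not whose fault the outage is.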
 

blantonl

Founder and CEO
Staff member
Super Moderator
Joined
Dec 9, 2000
Messages
11,418
Location
San Antonio, Whitefish, New Orleans
No worries! :) It was a really good thing that this happened today instead of yesterday, because I was on the Blackfoot River in western Montana fly fishing yesterday. So I was totally off the grid for about 36 hours.
 

n5ims

Member
Joined
Jul 25, 2004
Messages
3,993
blantonl said: "No worries! :) It was a really good thing that this happened today instead of yesterday, because I was on the Blackfoot River in western Montana fly fishing yesterday. So I was totally off the grid for about 36 hours."

So it was the trout that hacked into the system and brought things down to get you off the river then. LOL! Glad things came up fairly easily (not that it's ever easy!) and quickly.
 

wx5uif

Member
Premium Subscriber
Joined
Aug 24, 2006
Messages
834
Location
Broken Arrow, OK
And here I thought that another hobby (surfing RR) went to being encrypted...
 

Attachment: Untitled.jpg (159.7 KB)

N1XDS

ÆS Ø
Joined
Nov 3, 2004
Messages
2,014
I figured it was a server problem causing some of the problems. Lindsay, did any of the server problems cause the login not to work? I emailed earlier and got a response from someone saying nothing was wrong.
 