Sunday 12/9/2012 Outage Postmortem
As many of you are aware, we had an outage on Sunday afternoon that lasted approximately 3 hours. During that time, RadioReference.com and Broadcastify.com were completely inaccessible over the Web.
The cause:
An invalid proxy entry for read-only database queries on our Web servers. We use a piece of software called haproxy to load balance and proxy all read-only database queries to our back-end database servers. This allows us to bring database servers online and offline without impacting Web users. In this case the proxy believed a server was up and running when it in fact wasn't, so it kept sending queries to that server; those queries "hung" and quickly brought the Web servers to a halt.
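For illustration, here is a rough sketch of the kind of haproxy backend section involved. The names, addresses, and timings below are made up and are not our actual configuration; the relevant part is the health checking ("check", "rise", "fall"), which is what should mark a dead database server as down so haproxy stops sending it traffic.

    # Hypothetical read-only MySQL backend (illustrative values only)
    backend mysql_readonly
        mode tcp
        balance roundrobin
        # Log in as a dedicated check user so a dead mysqld is detected,
        # not just a reachable TCP port
        option mysql-check user haproxy_check
        # Mark a server down after 3 failed checks, up again after 2 good ones
        server db1 10.0.0.11:3306 check inter 2s fall 3 rise 2
        server db2 10.0.0.12:3306 check inter 2s fall 3 rise 2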
The reason it took so long to get fixed:
1) Alerts that the site went down went unanswered for almost 1.5 hours. The person responsible (me) did not have his cell phone with him during the outage, and only found out when he returned home to a text message from one of our senior managers asking what the problem was. Inexcusable, but it happened.
2) After realizing the site was down, problem determination took additional time because the issue was small but subtle and difficult to isolate (translation: I had no idea what was actually wrong for about 1.5 hours). There were a lot of steps and items to check to isolate the root cause (a sketch of the kind of status check involved follows this list).
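As an aside, the quickest way to see what haproxy thinks about each back end is its stats interface. A hedged example, assuming the admin socket is enabled at /var/run/haproxy.sock (the path is an assumption, not necessarily ours):

    # Ask haproxy for per-server status over its admin socket
    # (field 18 of the CSV output is the UP/DOWN status)
    echo "show stat" | socat stdio /var/run/haproxy.sock | cut -d, -f1,2,18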
Going forward:
1) We will be more diligent about receiving and reacting to alerts.
2) We will review the haproxy configurations for database traffic on our Web servers to prevent this from occurring again (see the timeout sketch below).
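One concrete thing we will look at for item 2 is connect/server timeouts, so that queries sent to a dead server fail fast instead of hanging and tying up every Web process. A rough sketch of the kind of settings involved (values are illustrative, not a recommendation for our exact setup):

    defaults
        mode tcp
        # Fail fast if a back end cannot be reached or stops responding,
        # rather than letting Web processes pile up behind hung queries
        timeout connect 5s
        timeout client  30s
        timeout server  30s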
TL;DR: The Web servers went down due to a proxy misconfiguration, the administrator (me) didn't respond to alerts for almost 1.5 hours on Sunday afternoon, and it took another 1.5 hours to diagnose and fix.