Hi Folks,
We now definitively know what took us down during the earthquakes. See the middle of page 9 of the PDF attachment below, which is a transcript of the CNN newsroom during the earthquakes on the East Coast last week. On air, CNN said:
One more thing, I was on RadioReference.com, one of my favorite Web sites. You guys should listen to scanners all over the world, basically. And we do have some reports in Richmond, Virginia, of people smelling natural gas. That's probably not unusual if you start shaking the ground. Some of the pipes may start losing some of their structural rigidity. Especially those nat gas pipes.
That single statement, live on CNN, drove tens of thousands of simultaneous requests to our infrastructure within a few seconds, knocking out our master database server. We don't know exactly why, but the server most likely ran out of memory trying to serve all those requests at once.
What have we done to help prevent this in the future?
1) We've implemented throttling at our proxy level (the front end to the Web site) to limit the number of requests per second to levels that our back-end web servers and databases are currently provisioned to support. As our environment provisions new resources or auto-scales, we'll raise that limit accordingly. During a huge flood of traffic, visitors will see slightly slower response times, but the site won't crash. And we'll be ready to add servers within minutes to handle the load instead of responding to a down Web site.
2) We've provisioned more powerful database replica servers, and more of them running full time, to serve read-only content to the site. These servers are responsible for serving most of the content you see. In hindsight, we were not adequately provisioned in this area and that was my fault. But we've made a good investment in much more database server capacity, and that will help us in the long run.
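For the technically curious, the throttling described in item 1 can be sketched as a token bucket. This is only an illustrative sketch, not our actual proxy configuration: the class name, the rate, and the burst capacity below are all hypothetical.

```python
import time


class TokenBucket:
    """Allow roughly `rate` requests per second, with bursts up to `capacity`.

    Hypothetical illustration only; the numbers are not real site limits.
    """

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)          # tokens added per second
        self.capacity = float(capacity)  # maximum burst size
        self.clock = clock               # injectable so the logic can be tested
        self.tokens = float(capacity)    # start with a full bucket
        self.last = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last check, up to capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_acquire(self):
        """Return True if a request may proceed now, False if it must wait."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def wait_time(self):
        """Seconds until the next token is available (0.0 if one is ready)."""
        self._refill()
        if self.tokens >= 1.0:
            return 0.0
        return (1.0 - self.tokens) / self.rate
```

A front end using something like this would hold a request for `wait_time()` seconds instead of passing it straight through, which is why visitors see slower responses during a spike rather than a crashed site. Raising `rate` as capacity is added corresponds to upping the limit when new servers come online.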
Finally, we have a new site software release scheduled to go in over the next few weeks that will greatly improve not only the overall performance of the site, but also its readability and usability. Stay tuned for more details on this - our admins are actively beta testing the new site format and features.
Hopefully this gives you guys some perspective on what happened and what we are doing to prevent further issues like the one we experienced during the earthquake. While we were only down for about 40 minutes, we had tens of thousands of new people clamoring to see the site and we weren't able to give them the taste of the sweet nectar of RadioReference.
To all the visitors, members, admins and senior leadership team, many thanks for your help this weekend during the hurricane and your efforts to make our platform awesome.
Warm regards,
Lindsay