Some of you may have noticed a number of unplanned outages over the last few weeks. After quite a bit of troubleshooting, the issue has been resolved. Before getting to the problem and the eventual solution, here is a little background information.
Newgrounds uses a single-master/multiple-slave database configuration in which nearly all read queries are handled by the slaves, leaving the master to handle only the write queries. In this setup, once the queries are written to the master database, they are propagated down to the slaves. As long as the slaves are able to keep up with both the requests from the webservers and the updates from the master database, everything works very well. It is when they fall behind that problems can arise.
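To make the read/write split a bit more concrete, here is a minimal sketch in Python. It is not the code the site actually runs: the QueryRouter class, the list of write verbs, and the server names are all made up for illustration, and the "connections" are plain strings standing in for real database handles.

```python
import itertools

# Assumption: statements starting with one of these verbs count as writes.
WRITE_VERBS = {"INSERT", "UPDATE", "DELETE", "REPLACE", "CREATE", "ALTER", "DROP"}


class QueryRouter:
    """Hypothetical router: writes go to the master, reads rotate over the slaves."""

    def __init__(self, master, slaves):
        if not slaves:
            raise ValueError("at least one slave is required")
        self.master = master
        # Round-robin over the slaves so the read load is spread evenly.
        self._slaves = itertools.cycle(slaves)

    def pick(self, sql):
        """Return the server that should run this statement."""
        parts = sql.split()
        verb = parts[0].upper() if parts else ""
        if verb in WRITE_VERBS:
            return self.master
        return next(self._slaves)


router = QueryRouter("master", ["slave1", "slave2"])
print(router.pick("INSERT INTO posts VALUES (1)"))
print(router.pick("SELECT * FROM posts"))
```

A real setup would also need a way to send the occasional read to the master, which is why the description above says "nearly all" reads go to the slaves.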
A few weeks ago, we began to see the database slaves getting substantially behind the master. This would be noticeable on the site in situations such as posting to a forum then being unable to see the post when the page was refreshed. In this example, upon submitting the post it would be written to the master database, but when the view page requested data from the slave, that post had not yet been propagated to the slave. The resulting page would be based on data that was up to several hours old.
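For anyone curious how this kind of lag can be spotted, MySQL reports it in the Seconds_Behind_Master column of SHOW SLAVE STATUS, which is NULL when replication is not running. The sketch below assumes that row has already been fetched into a Python dict; the function names and the 30-second threshold are hypothetical and only meant to illustrate the check, not to describe our monitoring.

```python
def slave_lag_seconds(status_row):
    """Return the reported lag in seconds, or None if replication is not running."""
    lag = status_row.get("Seconds_Behind_Master")
    return None if lag is None else int(lag)


def is_stale(status_row, max_lag_seconds=30):
    """Treat a slave as stale if replication is stopped or the lag exceeds the threshold.

    max_lag_seconds is an arbitrary value chosen for this example.
    """
    lag = slave_lag_seconds(status_row)
    return lag is None or lag > max_lag_seconds


# A slave two hours behind, like the ones described above.
print(is_stale({"Seconds_Behind_Master": 7200}))
```

A check like this could be used to pull a badly lagging slave out of rotation, so that pages are not built from data that is hours old.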
This problem came out of nowhere and was rather unsettling. There was nothing popping up in the logs, nor any substantial increase in system load. We looked into optimizing scripts and updating the database software, and even considered upgrading some of the hardware, but nothing seemed to have any effect. Then came the outage on 8/8/12. This one was rather significant, as the site failed completely as opposed to merely running on old data. When I was putting everything back together, the slaves had to be completely cleared of data in preparation for a full import. It was then that I noticed the mysql process using a large amount of system resources while not actually hosting a database. This was clearly not a problem with the master/slave relationship, but rather a problem with mysql in general, or so I thought.
Upon further digging, it turned out that we had been hit by the leap-second bug. The problem was actually with the kernel and had nothing to do with mysql. While this bug had caused some major outages across the net on July 1st, it didn't seem to affect us immediately, instead striking seemingly at random over the next several weeks. The quick fix is comically simple (date -s "`date`"): it just sets the date to the current system date or, to put it another way, pokes the system clock with a stick. The effect was immediate and has completely solved the problem. The permanent fix is to update the kernel, which will be taken care of over the coming weeks.
Anyway, back at it...