Entry #6

Random Outages

2012-08-20 16:00:29 by Tim

Some of you may have noticed that there were some unplanned outages over the last few weeks. After quite a bit of troubleshooting, the issue has been resolved. Before getting to the problem and the eventual solution, here is a little background information.

Newgrounds uses a single-master/multiple-slave database configuration wherein nearly all read queries are handled by the slaves, leaving the master to handle only the write queries. In this setup, once the queries are written to the master database, they are then propagated down to the slaves. As long as the slaves are able to keep up with both the requests from the webservers and the updates from the master database, everything works very well. It is when they fall behind that problems can arise.
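To make the read/write split concrete, here is a minimal sketch of how a router might pick a server for each query. This is not the actual Newgrounds code: the QueryRouter class, the server names, the list of write verbs and the round-robin choice of slave are all assumptions made for illustration.

```python
import itertools

# Assumed set of statements treated as writes; a real setup may differ.
WRITE_VERBS = ("INSERT", "UPDATE", "DELETE", "REPLACE")


class QueryRouter:
    """Hypothetical router: writes go to the master, reads to the slaves."""

    def __init__(self, master, slaves):
        self.master = master
        # Round-robin over the slaves; chosen here only for simplicity.
        self._slaves = itertools.cycle(slaves)

    def pick_server(self, sql):
        parts = sql.split(None, 1)
        verb = parts[0].upper() if parts else ""
        if verb in WRITE_VERBS:
            return self.master
        return next(self._slaves)


# Placeholder server names, not real hosts.
router = QueryRouter("master", ["slave1", "slave2"])
```

With this sketch, router.pick_server("INSERT INTO posts ...") returns the master, while successive SELECT queries alternate between the slaves.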

A few weeks ago, we began to see the database slaves getting substantially behind the master. This would be noticeable on the site in situations such as posting to a forum then being unable to see the post when the page was refreshed. In this example, upon submitting the post it would be written to the master database, but when the view page requested data from the slave, that post had not yet been propagated to the slave. The resulting page would be based on data that was up to several hours old.
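The forum example can be reproduced with a toy model. The sketch below is only an illustration under simplified assumptions: the Master and Slave classes are hypothetical, and lag is counted in pending writes rather than in seconds (MySQL itself reports it as Seconds_Behind_Master in SHOW SLAVE STATUS).

```python
class Master:
    """Toy master: keeps an ordered log of writes to be replicated."""

    def __init__(self):
        self.log = []

    def write(self, post):
        self.log.append(post)


class Slave:
    """Toy slave: serves reads from whatever part of the log it has applied."""

    def __init__(self, master):
        self.master = master
        self.applied = 0  # number of log entries applied so far

    def replicate(self, max_events):
        # Apply at most max_events pending writes from the master's log.
        self.applied = min(len(self.master.log), self.applied + max_events)

    def lag(self):
        # Pending writes the slave has not applied yet.
        return len(self.master.log) - self.applied

    def read_posts(self):
        return self.master.log[:self.applied]


master = Master()
slave = Slave(master)
master.write("my new forum post")
stale_view = slave.read_posts()   # the new post is missing here
slave.replicate(max_events=10)
fresh_view = slave.read_posts()   # now the post is visible
```

Between the write and the replicate call, a page built from the slave does not include the new post, which is the behavior described above.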

This problem came out of nowhere and was rather unsettling. There was nothing popping up in the logs, nor any substantial increase in system load. We looked into optimizing scripts and updating the database software, and even considered upgrading some of the hardware, but nothing seemed to have any effect. Then came the outage on 8/8/12. This was rather significant, as the site failed completely as opposed to merely running on old data. When I was putting everything back together, the slaves had to be completely cleared of data in preparation for a full import. It was then that I noticed the mysql process using up a large amount of system resources, all while not actually hosting a database. This was clearly not a problem with the master/slave relationship, but rather a problem with mysql in general, or so I thought.

Upon further digging, it turns out that we had been hit by the leap-second bug. The problem was actually with the kernel and had nothing to do with mysql. While this had caused some major outages across the net on July 1st, it didn't seem to affect us immediately, instead occurring seemingly at random over the next several weeks. The quick fix is comically simple (date -s "`date`"): it just sets the date to the current system date or, to put it another way, pokes the system clock with a stick. The effect was immediate and has completely solved the problem. The permanent fix is to update the kernel, which will be taken care of over the coming weeks.

Anyway, back at it...


Comments



supergandhi64

2013-11-26 09:55:36

IF U WERE KILLED TOMORROW, I WOULDNT GO 2 UR FUNERAL CUZ ID B N JAIL 4 KILLIN DA PERSON THAT KILLED U!
......__________________
...../_==o;;;;;;;;______[]
.....), ---.(_(__) /
....// (..) ), ----"
...//___//
..//___//
WE TRUE HOMIES
WE RIDE TOGETHER
WE DIE TOGETHER

send this GUN to everyone you care about including me if you care. C how many times you get this, if you get a 13 your A TRUE HOMIE


mandog

2013-06-02 15:59:20

On other news I found the shuttle from drexelshaft http://web.archive.org/web/20020605055850/http://www.drexelshaft.com/shuttle.html


Cyberdevil

2013-02-07 08:22:30

Not that I understand exactly how the problem was solved or exactly what it was, but it's good to hear that things will be back to normal again!


VicariousE

2012-11-20 04:39:49

I had a boat and roof outage due to hurricane sandy. Customer support was many miles away.


daethdrain

2012-11-02 19:21:45

I had a power outage due to hurricane sandy


Luwano

2012-08-21 21:39:11

Wow, you explained that so well that even I understood it. You are a good story teller by the way and should post more interesting stories if you find some time between the more important stuff you do for NG, of course. I always wondered if AOL ever got back to you again. lol


VicariousE

2012-08-21 04:18:16

Good after action report. Most of the remaining issues seem to be software: missing pages and links, search fields and maturity rating filters on all result pages.... I had a list, but it seems the initial birthing pains are finally behind us - good show!


Jonas

2012-08-20 19:15:37

Tim, there's very few problems that can't be resolved with proper application of stick.


VidGameDude

2012-08-20 17:30:32

Newgrounds runs on an s/m program?
oh you guys and your subliminal coding systems.
But on serious note, it's great the problem was resolved, thanks for the quick and diligent fix.


Sterance

2012-08-20 17:21:26

well played tim. sometimes its as simple as a missing semicolon at the end of your structure... or something like that


Loki

2012-08-20 17:01:50

Tim is the best chef / server engineer ever.


Mia

2012-08-20 16:40:13

Interesting. I noticed unpublishing projects is now here!