Some of you may have noticed that there were some unplanned outages in the last few weeks. After quite a bit of troubleshooting the issue has been resolved. Before getting to the problem and the eventual solution, here is a little background information.
Newgrounds uses a single-master/multiple-slave database configuration wherein nearly all read queries are handled by the slaves, leaving the master to handle only the write queries. In this setup, once the queries are written to the master database, they are then propagated down to the slaves. As long as the slaves are able to keep up with both the requests from the webservers and the updates from the master database, everything works very well. It is when they get behind that problems can arise.
A few weeks ago, we began to see the database slaves getting substantially behind the master. This would be noticeable on the site in situations such as posting to a forum then being unable to see the post when the page was refreshed. In this example, upon submitting the post it would be written to the master database, but when the view page requested data from the slave, that post had not yet been propagated to the slave. The resulting page would be based on data that was up to several hours old.
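To make the forum example concrete, here is a minimal sketch of the stale-read scenario in Python. It is a simplified model, not our actual code: it assumes a slave applies writes in order and sits a fixed number of seconds behind the master. The function names, the slave names and the 30-second threshold are all made up for illustration; the lag values stand in for what MySQL reports as Seconds_Behind_Master in SHOW SLAVE STATUS, which is NULL when replication is not running.

```python
def visible_on_slave(seconds_since_write, slave_lag_seconds):
    """Return True if a write made `seconds_since_write` ago has reached
    a slave that is `slave_lag_seconds` behind the master (simplified model)."""
    return seconds_since_write >= slave_lag_seconds


def lagging_slaves(lag_by_slave, max_lag_seconds=30):
    """Return the sorted names of slaves whose lag exceeds the threshold.

    `lag_by_slave` maps a hypothetical slave name to its lag in seconds,
    or None when the lag is unknown (e.g. replication is stopped).
    """
    return sorted(
        name
        for name, lag in lag_by_slave.items()
        if lag is None or lag > max_lag_seconds
    )


if __name__ == "__main__":
    # A forum post written 5 seconds ago, read from a slave 3 hours behind.
    print(visible_on_slave(5, 3 * 60 * 60))
    # Which of three hypothetical slaves are too far behind to trust?
    print(lagging_slaves({"db2": 2, "db3": 10800, "db4": None}))
```

In this model the post only shows up on the slave once it is older than the slave's lag, which is why a refresh right after posting came back empty.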
This problem came out of nowhere and was rather unsettling. There was nothing popping up in the logs nor any substantial increase in system load. We looked into optimizing scripts and updating the database software, and even considered upgrading some of the hardware, but nothing seemed to have any effect. Then came the outage on 8/8/12. This was rather significant as the site failed completely as opposed to merely running on old data. When I was putting everything back together, the slaves had to be completely cleared of data in preparation for a full import. It was then that I noticed the mysql process using up a large amount of system resources, all while not actually hosting a database. This was clearly not a problem with the master/slave relationship, but rather a problem with mysql in general, or so I thought.
Upon further digging, it turns out that we had been hit by the leap-second bug. The problem was actually with the kernel and had nothing to do with mysql. While this had caused some major outages across the net on July 1st, it didn't seem to affect us immediately, instead occurring seemingly at random over the next several weeks. The quick fix is comically simple (date -s "`date`"), basically just setting the date to the current system date or, to put it another way, poking the system clock with a stick. The effect was immediate and has completely solved the problem. The permanent fix is to update the kernel, which will be taken care of over the coming weeks.
Anyway, back at it...
With all the snow we've gotten in recent weeks, we ran out of places to put it. We had a Bobcat pile it up as best it could, but it was still taking up a few spots. So, for the common good, it was decided that my Jeep could no longer occupy an empty space.
According to the Newgrounds staff page: "Tim spends a great deal of time on the phone, screaming at vendors for their incompetence." Well, AOL isn't technically one of our vendors, but close enough.
A few years ago, it was brought to my attention that AOL users were not receiving our nightly/weekly emails. We have a few different mail servers for various purposes, and upon further inspection, we found that the emails in question were all coming from our bulkmail server. This server's only task is to send out hundreds of thousands of emails, all solicited mind you, each night. To handle such issues, AOL has a Postmaster department which is easily contacted via a toll-free number. I placed the call and found that the IP address associated with our bulkmail server was indeed on their blacklist. They explained that there were a few things we needed to set up on our end (reverse dns, etc.) in order to get on their whitelist (basically the opposite of blacklist, emails are automatically accepted). After filling out the paperwork and fixing the various technical issues, the whitelisting was granted and we were back in business. While somewhat annoying, I'll admit it was a relatively painless process.
When I was informed late last year that once again we were having problems emailing AOL users, I figured that our whitelisting had expired or something to that effect. I checked our mail servers and found repeated "550 Mailbox not found" errors. According to AOL's own FAQ:
550 Mailbox not found: This error indicates that the AOL Member no longer exists on AOL or the address is misspelled.
Now, there are only two possible causes. Either every single person we're sending emails to cancelled their service all at once, or we're being blocked. Once again I called AOL Postmaster and explained the problem. After assuring them that the addresses in question did, in fact, exist, they asked what IP was sending messages in order to check its blacklist status. I provided the IP's for our main SMTP servers all of which came back clean. At that point they suggested that there was a problem with our mailserver. Rather than debate mailserver configuration options, I explained to them that even our corporate email addresses, which are hosted by Google, were being affected. After checking Google's IP against their blacklist, they suggested that I call Google to have them look into it and sent me the following email:
Dear Tim Miller,
Thank you for contacting the AOL Postmaster Helpdesk. This email should help you find a solution to your email delivery problem.
Please note: AOL always gives an error message during the SMTP transaction or in a bounce message for anti-spam functions.
The AOL error code format is: 554 or 3 digits:2-3 digits (Example 554 foo:b2)
Included are several suggestions on how to obtain the AOL error code if you are unable to obtain it within the bounce message.
Manual Telnet. You must utilize this method directly from the MTA receiving the block. The directions are located at: http://postmaster.aol.com/tools/telnet.html
SMTP Logs. Although you may have to set your SMTP logging to verbose or Debug mode you should be able to obtain the error code from your server logs.
Bounce Messages. Double check the sending mail box for bounce messages from AOL.
Once you obtain the error message, please contact our Postmaster Helpdesk.
Thank you for your time and patience.
Note the part where it says "AOL always gives an error message during the SMTP transaction or in a bounce message for anti-spam functions". I don't quite understand how a false "550 Mailbox not found" could be helpful when the address in question exists, but then again I'm a logical person.
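The "everyone cancelled at once, or we're being blocked" reasoning can be checked against the mail logs. Below is a rough Python sketch that tallies how often each recipient domain answers with a 550. The log format (recipient address followed by the SMTP reply line) is a hypothetical one chosen for the example, as are the addresses and the function name; a real mail server's logs would need their own parsing.

```python
import re
from collections import defaultdict

# Hypothetical log format: "<recipient> <smtp reply line>".
# SMTP replies start with a 3-digit code followed by a space or a hyphen.
REPLY_RE = re.compile(
    r"^(?P<rcpt>\S+@(?P<domain>\S+))\s+(?P<code>\d{3})[ -](?P<text>.*)$"
)


def rejection_rates(log_lines):
    """Return {domain: fraction of deliveries answered with a 550}."""
    totals = defaultdict(int)
    rejected = defaultdict(int)
    for line in log_lines:
        match = REPLY_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not fit the assumed format
        domain = match.group("domain").lower()
        totals[domain] += 1
        if match.group("code") == "550":
            rejected[domain] += 1
    return {domain: rejected[domain] / totals[domain] for domain in totals}


if __name__ == "__main__":
    sample = [
        "alice@aol.com 550 Mailbox not found",
        "bob@aol.com 550 Mailbox not found",
        "carol@example.org 250 OK",
        "dave@example.org 550 Mailbox not found",
    ]
    for domain, rate in sorted(rejection_rates(sample).items()):
        print(domain, rate)
```

A domain where every single delivery comes back 550 looks a lot more like a block than a wave of closed accounts, while a domain with the occasional 550 is probably just seeing stale addresses.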
Now, I knew full well that if Google were having a problem sending messages to AOL users, it would be on the homepage of CNN, but I figured I'd go through the motions with Google in the hopes that it would convince AOL to fix what I knew to be their problem. The response I promptly received from Google was:
Thanks for your response. We've investigated the issue and it seems that
messages are being blocked by aol.com. I'd suggest you to contact aol.com
support team with the message headers which they may need to investigate
Feel free to reply to this message if you require further assistance.
The Google Apps Team
With this obvious fact being verified, I called AOL back to give them an update. Even though I had an existing case, which I hoped included helpful notes, they promptly asked which IP I was sending email from. It took another 5 or so minutes to explain to them the problem all over again and to assure them that it wasn't a problem with Google's email servers (which are, in fact, whitelisted). She suggested that I might not be providing her with the correct IP and asked me to send a blank email to email@example.com which would auto-reply with the IP from which the email was sent. I immediately sent the email but after 10 minutes without a reply she admitted that the "tool" was probably down. A decision was made to escalate the issue.
After 4 days passed, I received an email from Mahesh. He requested that I send an identical email to an @aol.com and an @postmaster.aol.com email address, both of which were his. He said that the Postmaster address was unfiltered and would definitely go through and asked that I forward him any error message received from the @aol.com address. This was an absolutely fantastic idea (and I'm not being sarcastic, it really was) and I was excited to speak to someone who seemingly understood the problem. I sent the messages and the @aol.com address was promptly kicked back with another 550 error which I forwarded back to his Postmaster account as per his request. That was on December 12th, 2007.
The holidays came and went without a single email from AOL Postmaster and in late January I decided to start calling every single day until this issue was resolved. Over the next several days, I spent hours on the phone going through the same questions/suggestions/responses over and over again: what IP are you connecting from, it must be Google's problem, I'll be sure to escalate this case for you. Each time I would request a supervisor, which I was told did not exist. Each time I asked if I could speak with second-level support directly, which I was told was not possible. At the end of every call I'd remind them that I would be calling back each and every day until I received an email back from Mahesh, who I was also emailing nearly every day.
Finally, on January 30th, I received the following email:
I am currently working on this issue, as soon as I find a fix for this issue I will reply back to you.
Over the course of 49 days, he was able to type up a total of 23 words. Now, I'm no expert, but I think even a guy with no arms or legs could type more than that with just his nose. I replied and reminded him that this has been going on for months and requested daily updates until it was fixed. Here was his reply (note the poor grammar):
The reason for you getting this block is because, your domain (newgrounds.com) was listed with AOL in the past (free or vanity domain). You have moved it to other domain now. When you send email from your domain its comes from the internet and your getting a MAILBOX NOT FOUND error. I have escalated this issue to another dept/team who work on this, its going to take some time and our technicians are working on this issue. As soon as I get a reply from them I will let you know.
Appreciate your patience.
I've not since heard back from him. I still call nearly every day asking for updates, or to talk to Mahesh directly or to be transferred to anyone with the ability to fix this, but the call always goes the same way. Here is a paraphrased excerpt:
AOL Postmaster: Hello, how can I help you.
Me: I have an existing ticket number: 123456.
AOL Postmaster: What IP's are you connecting from.
Me: If you read the case notes you'll see that isn't the problem.
AOL Postmaster: Ok, I see. Well, the issue is still being worked on, is there something I can do for you?
Me: Actually, I was wondering if there was anything I could do for you.
AOL Postmaster: How do you mean?
Me: Well, since Mahesh and I agree that it's a problem on your end yet you are seemingly incapable of fixing it, I'd just like to know what I can do differently to help you better.
AOL Postmaster: At this point you should just wait for a response from Mahesh.
Me: I'm not sure you understand, I've tried waiting. What I'm asking is what I can do DIFFERENTLY to better help you fix YOUR problem.
I won't go on, but rest assured it gets worse. From my experience, I think I have their "support" structure pretty well mapped out. There is only one phone number and it goes directly to an outsourced call-center. Those people only have the ability to check IP's against their blacklist and recommend ways to get off of it. Now, admittedly, that probably solves about 99% of the calls that come in. For the other calls, they "escalate" the case to the next level. What this means is that they send an internal email to a catch-all address which is checked by the AOL Postmaster support team. There is no way for the call-center to transfer a call to the support team nor is there a way for the call-center to get in touch with any specific individual, only that one catch-all internal email address. From the end-user point of view, there is no way to check on the status of your case other than to call the call-center (whose primary goal is to put the blame on you) or send emails which are never replied to.
So, I'm asking all AOL users to please cancel their AOL accounts. I know from my point of view, every time I see an @aol.com email address I wonder why in the world would someone still be using AOL. It seems like a modern-day scarlet letter to me.
Seriously people, just quit AOL already.
The other night, four of us (Chris, Lisa, Ross and I) decided to grab a bite to eat after Comic-Con had closed for the day. As we strolled through San Diego's Gaslamp Quarter, we were on the lookout for an eatery with little or no wait. As such, we figured it would be best to try one of the side streets as the crowds seemed to be sticking to the main drag. The chosen street provided us with two options, Chinese or American Pub fare, and we decided on the latter.
When we walked into the restaurant, it was amazingly empty. There were a few people sitting at the bar and two or three groups of 6-8 at the tables, but otherwise the place was empty, and we were thrilled with our apparent luck. After a brief look at the menu, we decided to take a seat.
There was only one waitress handling the tables but with so few customers we didn't expect any problems. After about 5 minutes she came over to take our drink orders and when she quickly rambled off the beer list I was convinced of her competency. Oh man, was I wrong.
Ten beverage-free minutes went by when Ross decided he'd grown tired of waiting for our drinks and went to get his own. Five minutes of watching him enjoy his Hefeweizen prompted me to take action. I walked over to another table she'd been visiting quite often and reminded her that we were still without our drinks. She apologized and walked away, promising to remedy the situation. Yet another 5 minutes went by when she returned with 3, not 4, drinks. Of course, the missing Fat Tire Seasonal was mine so I had to remind her once again. When she finally returned with my beer we were ready to order and so we did. This was the beginning of the end.
Over the course of the next 75 minutes, we didn't see our food nor our waitress. Well, that's not true, we did see her, just not at our table. She was busy running back and forth, seemingly aimlessly, getting very little done. Lisa pointed out how she would walk halfway across the room and then quickly walk the other way obviously having forgotten something. This wouldn't be noteworthy except for the fact that she did it 10-20 times, literally. We never got the chance to order a refill on soda or another beer from their decent selection. We saw customers come in, sit for 20 minutes and then get up and leave, all without getting so much as a 'hello' from her. It's important to note that during this whole time, the place was nowhere near 'busy' and she'd visited other tables numerous times, just not ours. Discussions ranged from Comic-Con in general, to how hungry we were, to how poor her waiting skills were, to whether or not we should just leave.
This notion of leaving was finally unanimously agreed upon and we started shifting out of the chair/booth when around the corner came our waitress with our food. She set down two plates, asked Ross what salad dressing he'd like, then left to get the rest. After a minute or two, she returned with one more plate, leaving me as the only one without food. It was at this point she did the unthinkable. Before going back to get my food, or even acknowledging the lack thereof, she went to another table to take their order. I was beside myself. My food was eventually brought a few minutes later by someone else and we proceeded to eat our mediocre sandwiches/salad and cold fries, all without any refills or salad dressing.
As we were finishing up our meals, she finally came back and rather than waiting another hour for dessert, we simply asked for the check. While she was busy printing it up, our discussion turned back to her tip. Now, I must tell you that I'd worked in the restaurant/service industry for several years and I'm generally an excellent tipper. I have no problem giving 25-30% for excellent service and won't give below 15% for even poor service, but this had been unbelievable. We immediately agreed that she deserved nothing, absolutely nothing. But we wanted to send a message, and by leaving exact change, she might think we just figured the tip was included. Ross came up with our final plan which was to leave a penny, yes $0.01, extra. So for the first time in my life, I walked out of a restaurant stiffing the waitress with $55.46 on a $55.45 bill.
Back in November of '06 it was decided that the entire office would be attending Comic-Con. I took it upon myself to begin the search for flights. For those of you who don't know me, I take pride in "traveling right" (I'd like to think I coined that before Expedia, but as I have no proof I'll have to let that one slide). As such, when I came across roundtrip airfare from PHL-SAN for the low-low price of $219, I knew it was time to pull the trigger. I quickly snagged 2 tickets for Tom and me on Sunday and 6 for the rest on Wednesday. Even better, I got a 3% discount for using an American Express card, lowering the price to $213. Our reservations were in place and there was much rejoicing. Ok, well, that's a lie. There was much rejoicing among those of us excited to go, and groans from the rest. I'll let everyone figure out which responses match up with which staff members :)
Fast forward to two months ago when I received a call from Delta informing me of a flight change that would cause Tom and me to miss our connecting flight. I spent 5 minutes on the phone discussing our options and decided to take a later flight with a layover in SLC as they had a nice lounge to relax in. When I inquired about seats they said they were unable to assign them at that time, but assured me that I would be given seats upon check-in. While I didn't like the sound of that, I always end up changing my seats to the emergency exit row at the start of the 24-hour check-in window anyway, so I wasn't bothered.
The 24-hour window began as I was hiking down the Watkins Glen trail in the Finger Lakes region of NY. As such, I stepped to the side of the trail, pulled my laptop out (which I backpacked up), plugged in my cellular modem and logged into Delta's webpage to ensure proper seating for Tom and me. Now, if you're thinking this sounds extreme or a bit OCD-like, I'd have to say yes and yes, but I digress. I typed in our confirmation number and expected to land on the seat selection stage, but instead was taken directly to the print page where our tickets said "see gate attendant". It was at this point that I became concerned.
At the exit of the trail I placed a call to the reservations line and pleaded my case. The representative understood our situation and said she would go ahead and give us seats, exit row seats at that. After repeated attempts to do so, the computer wasn't allowing her so I asked for her supervisor. While I was then speaking to someone else, the story remained the same. As the flight was now under 24 hours out, it was in the hands of the airport and there was nothing she could do over the phone. I explained to her that in decades of flying, I'd never stepped into an airport without a confirmed seat, that I didn't feel comfortable doing so and that it sounded like the flight was oversold. She repeatedly insisted that my tickets were confirmed, that there was plenty of space and that it would be taken care of at the ticket counter, NOT the gate. After 60 minutes of back and forth, I gave up (this is EXTREMELY rare) and accepted my fate.
As Tom and I sat in traffic in the Terminal E departures lane at the airport, he asked why there was such a backup. I quickly responded that this terminal was shared with Southwest and it would certainly be "amateur hour". I'll save that rant for another day, but let's just say while I'm willing to take extreme measures to ensure comfortable travel, any group of flyers who are content to sit on the floor at the front of the line for 2 hours to get a good seat for a 45 minute flight aren't the type of people I want to travel with. Once we managed to get through the horde of idiots, we stepped up to the self check-in computer and slid our card. The computer quickly found the flight but still refused to let us select any seat, for either flight. I promptly alerted the attendant who suggested that we take care of this at the gate. Tom agreed and said "just hit print". While that would have been a logical thing to do, I outright refused to leave that counter without a seat, any seat. I'm not sure if he was sympathetic to our situation or it was so busy he figured complying would be the fastest way to dispense with us, but we walked away from the counter with seats for both flights, and exit row seats at that. Normally at this point, I'd proceed to the airport lounge for a complimentary beverage, but since Delta made the genius decision of placing their lounge outside of security, we thought it best to get through that gauntlet as soon as possible and just wait in the terminal for our flight.
As we're waiting by our gate, we notice there's a line forming at the counter. I point out to Tom that that would have been us and that our seats were a far more comfortable place to be. Then comes the kicker, the gate agent comes on the microphone and offers $400 and a hotel room to anyone willing to take a flight out the next day. In other words, the previous day the twat on the phone LIED to me and the flight was indeed OVERSOLD. I finally felt vindicated and got a solid "yay for traveling with Tim" from Tom.
Sure, it seems that we won the fight that day, but I have a sneaking suspicion that they purposely put that seat-kicking 3 year old behind me.
As I stood at the counter of the local coffee shop at 9:30 this morning waiting for my take-out bagel sandwich, the rest of the office walked through the door. We said "hello" after which they proceeded to look around for a place to sit. I quickly reminded them that the site was going down for the redesign at 10am, as was the stated plan. This remark was quickly shrugged off as they sat down for breakfast. I grabbed my sandwich and returned to the office. The clock struck 10am and there I sat, in an empty office (except for Josh from the Behemoth), as Newgrounds.com came to a screeching halt. Sure, I could have waited 15 minutes for everyone else to get back, but I figured that while it would take hours to complete the launch, we were at least going to start on time. Well, that, and I'm an asshole.