We woke this morning to find an impressive 20,000+ new mails in the domcomp inbox.
On an average day, we receive tens of user queries, and roughly as much spam. Needless to say, this was out of the ordinary. We were fairly sure domcomp’s popularity hadn’t gone nuclear overnight, nor had someone decided that we were the world’s best consumers of Viagra.
More alarming was the realisation that new mails were still flooding in at over 100 every second, staggered by the now struggling mail server. Our usual mail client was in overdrive, flickering as each new batch arrived. We could barely click through to the individual mails to investigate.
We saw that the majority of arrivals belonged to one of three categories:
Domcomp error logs, which we send ourselves when something goes amiss with the site
Bounce notifications from AWS, which we use to send the above error mails
Delivery failures from our mail server
The error logs were the best place to start, as the other two were surely caused by them. We found that all the error mails were repeats of the same failure: MongoDB refusing to create a document of over 16MB.
A quick tour of Stack Overflow told us that MongoDB has a 16MB hard limit on its document size, which we were not actually aware of before. When people run into the limit, the response is usually “If you need such a big doc then you’re doing it wrong”. And we tend to agree with that. So the question was: why was domcomp creating such a huge document?
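The limit is easy to trip once a document is allowed to grow without bound. Here is a minimal sketch of a pre-flight size guard; the function name and the JSON-size approximation are our own illustration (the real MongoDB driver checks the actual BSON size and rejects oversized writes itself):

```python
import json

# MongoDB's hard cap on a single BSON document.
MAX_BSON_SIZE = 16 * 1024 * 1024  # 16MB

def fits_in_mongo(doc):
    """Rough sanity check: approximate the encoded size via JSON.

    BSON and JSON sizes differ, so this is only an estimate; the
    driver enforces the real limit against the encoded BSON.
    """
    return len(json.dumps(doc).encode("utf-8")) < MAX_BSON_SIZE

small = {"session": "abc", "events": ["view"] * 10}
huge = {"session": "abc", "events": ["view"] * 4_000_000}

print(fits_in_mongo(small))  # True
print(fits_in_mongo(huge))   # False: ~32MB of JSON, well over the cap
```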
Our domain model (excuse the pun) of registrars, domain extensions, new/transfer/renewal prices etc. is a perfect fit for a document DB, and we chose Mongo. As well as domain price and availability data, we also store visitors’ interactions with the site within each session. This lets us improve domcomp based on user behaviour, and also track suspicious traffic, such as from scrapers or malicious bots (an interesting topic for another time).
Separately, we’ve set domcomp up to send automated mails, complete with stacktraces, should exceptions occur. It has been useful to receive instant notification of any errors, such as an issue with a price monitor.
It turns out the ‘16MB limit’ stacktrace was the result of one of the session-tracking documents in Mongo being filled. This should never happen under normal circumstances, as the data being written is minuscule. A normal user session, even one lasting hours, would take up next to no space. So what happened here?
Tracing back through the session logs showed that this particular user’s session involved making requests to one of domcomp’s endpoints, repeatedly, at a rate of many hundreds per second. Throughout the night. This eventually filled that session’s doc and, as the requests continued, caused our error notification system to send out alert emails at the same rate.
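One defence against this (our sketch, not the code domcomp actually runs, and the cap value is made up) is to bound the per-session event list so a runaway client can never grow the document indefinitely:

```python
from collections import deque

# Hypothetical cap on stored events per session; a real deployment
# would also enforce this server-side when writing to the DB.
MAX_EVENTS_PER_SESSION = 1000

class SessionLog:
    """Keeps only the most recent events for one visitor session."""

    def __init__(self):
        self.events = deque(maxlen=MAX_EVENTS_PER_SESSION)

    def record(self, event):
        self.events.append(event)  # oldest entries fall off automatically

log = SessionLog()
for i in range(500_000):          # simulate a client hammering an endpoint
    log.record({"path": "/api/prices", "n": i})

print(len(log.events))  # 1000 — bounded, no matter how many requests arrive
```

With a bounded structure like this, an all-night spamming script costs some CPU but can never push a session document towards the 16MB ceiling.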
As domcomp inadvertently spammed us with tens of thousands of emails, the mail server decided this was too much to handle and failed to deliver many of them. This caused AWS to further spam us with just as many email bounce notifications. By the time we realised what had happened, our inbox had already grown tenfold again.
We quickly stopped domcomp from sending mails for this particular error type. Even so, our inbox kept growing for the rest of the day as delayed mails slowly trickled in.
So who decided to spam domcomp? As far as we could trace, the requests came from a single Comcast user on the west coast. He/she clearly wrote a spamming script, turned it on and went to sleep. As to ‘why’, your guess is as good as ours.
So it turns out that domcomp can handle large amounts of traffic pretty well, which is actually not surprising. However, our design oversight in the error-tracking framework ironically caused a pretty major annoyance. We spent much of the afternoon deleting the flood of emails. What made matters worse was that, the night before, we had been exchanging emails with many other websites and companies about partnership ideas. We had no idea whether their replies had come through, never mind the disruption to responding to our usual users.
Needless to say, we’ve now fixed the flaw. This was an unexpected, but interesting, experience. And we’re sure it won’t be the last one of its kind.