So, I’ve just blocked msnbot (which is, I assume, the search spider for Microsoft Live Search) from indexing the Toolserver. Many spiders, such as Google and Yahoo!, index our website every day, and cause no problems; in fact, we don’t even notice them. Msnbot is different. Specifically, it seems to have no rate limiting. Microsoft claim it will only request pages around once every 10 seconds; in reality, it was making 5-10 requests per second. Unfortunately, the page in question was a slow CGI script, and msnbot seemed to have obtained a list of every possible parameter it could pass to the script, which it then did, as fast as possible, until the web server was so overloaded it could hardly serve user requests:
wolfsbane up 53+10:55, 1 user, load 53.93, 55.49, 55.22
It doesn’t seem to have noticed that it’s blocked. It’s still hammering away as fast as it can, and getting nothing but 403 in reply. I’ve even added it to robots.txt, but it doesn’t seem to have noticed that either yet. Fortunately, our web server is quite fast at returning 403, so the load is looking much happier:
wolfsbane up 53+11:58, 4 users, load 0.68, 1.15, 3.44
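For anyone who wants to compare a crawler's claimed request rate with what it actually does, here is a rough sketch that counts requests per second for one user agent in an access log. It assumes the Apache "combined" log format, which may not match the Toolserver's setup. The function name `peak_requests_per_second` and the regular expression are made up for illustration.

```python
import re
from collections import Counter

# Assumption: Apache "combined" log format. Adjust the pattern for other formats.
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[(?P<ts>[^\]]+)\] "[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def peak_requests_per_second(lines, agent_substring):
    """Return (peak requests seen in any one second, total requests)
    for log lines whose User-Agent contains agent_substring."""
    per_second = Counter()
    needle = agent_substring.lower()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and needle in match.group("agent").lower():
            # Combined-format timestamps have one-second resolution,
            # so the raw timestamp string works as a per-second bucket.
            per_second[match.group("ts")] += 1
    total = sum(per_second.values())
    peak = max(per_second.values(), default=0)
    return peak, total
```

Called as `peak_requests_per_second(open("access.log"), "msnbot")` (the file name is a placeholder), a peak well above one request per ten seconds would confirm the kind of behaviour described above.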
After I blocked it, I tried to find a contact at Microsoft to report the problem too. As the spider clearly isn’t behaving like they expect, I thought they might appreciate a warning. Well, I can now report that Live Search really don’t want to be contacted. The closest thing I could find to a contact form, linked from the “troubleshooting problems with msnbot” page, had a list of categories for me to choose from. None of them was even slightly related to search. Some Googling suggested that “msnbot@microsoft.com” might work, but nope (“Returned mail: user unknown”). There’s a feedback link on the MSN front page, but who knows where that would go, and whether the feedback would ever reach someone who could deal with it? (Certainly not me, as they clearly state that they won’t reply to your feedback.)
I gave up in the end. If someone reading this happens to have a contact at Microsoft who would be interested in this issue, please feel free to let them know. Otherwise, I imagine Live Search users will just have to live without the Toolserver.
PS: I know msnbot (supposedly) supports the Crawl-Delay parameter in robots.txt. But given what I’ve seen today, I don’t particularly want to rely on this, even if it does, some day, reload our robots.txt.
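For reference, this is roughly what the two robots.txt options look like: throttling with Crawl-delay versus blocking outright. The sketch below uses Python's standard `urllib.robotparser` to show how a well-behaved client would read each one. The rule text is illustrative and not the Toolserver's actual robots.txt, and the 10-second value is simply the interval Microsoft claims. Nothing here says anything about how msnbot itself interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only; not the real Toolserver robots.txt.
THROTTLE_RULES = """\
User-agent: msnbot
Crawl-delay: 10

User-agent: *
Disallow:
"""

BLOCK_RULES = """\
User-agent: msnbot
Disallow: /

User-agent: *
Disallow:
"""


def load_rules(text):
    """Parse robots.txt content from a string instead of fetching a URL."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


throttle = load_rules(THROTTLE_RULES)
block = load_rules(BLOCK_RULES)

# A compliant crawler would wait this many seconds between requests.
delay = throttle.crawl_delay("msnbot")

# Under the blocking rules, a compliant crawler would not fetch anything.
allowed_when_blocked = block.can_fetch("msnbot", "/some/page")
```

Either way, robots.txt only helps if the spider re-reads it and obeys it, which is why the 403 at the web server stays in place.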
#1 by OverlordQ at June 9th, 2009
I mailed them back in 2005 about this same thing and commented “Blocked your netrange”; they replied with “OMGPLZNO, try robots.txt”. I tried that. Their FAQ (at that time) said they rechecked robots.txt every day, but 3 days later they were still hammering away, so I blocked them outright with iptables.