Wherein the Toolserver becomes more reliable
Posted by river in Uncategorized on November 17th, 2009
Some months ago, the Wikimedia Foundation approved a $40,000 grant to Wikimedia Deutschland, for the purpose of improving Toolserver reliability. We’ve now implemented the first part of this plan: redundant NFS and LDAP.
When we first proposed the grant, the plan (which you can read more about at the above link) was to purchase three database servers, which we would use to provide a redundant backup for the three current servers. However, before we made the purchase, we realised that for the same amount of money, we could purchase two database servers, two smaller servers and a disk array. The Foundation approved this change, and that’s what we ended up buying.
The purpose of the two small servers and array was to provide redundant service for NFS and LDAP. These services are critical to the platform operation; if either is offline, the entire platform is down. Previously, both were hosted on a single server (hyacinth), which meant the entire Toolserver depended on this server being up. As well as hurting reliability, this made it very difficult to do any maintenance on that server.
Now, however, the NFS and LDAP data is stored on the disk array, which is connected to two servers (turnera and damiana) running Solaris Cluster software. If one server breaks, or we need to do maintenance on it, the services are automatically moved to the other server, with no interruption in service. The array itself has two redundant, independent controllers, making failure quite unlikely.
The previous NFS/LDAP server, which is now idle, has exactly the same specification as a database server. We will use it as the third redundant database server, alongside the two we purchased with the grant, to provide redundant access to the MySQL databases. More news on that later.
Platform outage on August 24th
Posted by river in Uncategorized on August 25th, 2009
From the afternoon of August 24th until August 25th, the Toolserver was offline due to an unscheduled outage. Everything is now back online, and the technical details of the outage are documented here for anyone who’s interested.
Wikimedia Foundation Grants $40,000 to the Toolserver
Posted by daniel in Uncategorized on July 29th, 2009
The Wikimedia Foundation has approved a grant of $40,000 towards improving the Toolserver’s reliability. We requested the grant in April, and we are very happy it worked out. The background of the grant is that the most central feature of the Toolserver, live replication of the nearly 800 wiki databases, is far too shaky. If it breaks for a day or so, or we have any kind of corruption, we need to import a full new dump, leaving Toolserver users (and the users of their tools) with outdated information for days or even weeks. It also means that during such times, there is no up-to-date off-site backup of the wiki databases.
To improve this situation, we plan to buy three new database servers, so we can keep two copies of each database, instead of just one. This way, one copy will remain available when the other breaks, and we will be able to fix things without too much interruption. The new servers will very likely be the same as our other newer database servers, namely, Sun Fire X4250s with 32 GB of RAM and 16 internal disks of 146 GB each. We hope to have these online some time in September or October. This should greatly improve the availability of live replication, and thus of any tools relying on real-time information.
Wherein msnbot behaves badly, and is banished
Posted by river in Uncategorized on June 9th, 2009
So, I’ve just blocked msnbot (which is, I assume, the search spider for Microsoft Live Search) from indexing the Toolserver. Many spiders, such as Google and Yahoo!, index our website every day, and cause no problems; in fact, we don’t even notice them. Msnbot is different. Specifically, it seems to have no rate limiting. Microsoft claim it will only request pages around once every 10 seconds; in reality, it was making 5-10 requests per second. Unfortunately, the page in question was a slow CGI script, and msnbot seemed to have obtained a list of every possible parameter it could pass to the script, which it then did, as fast as possible, until the web server was so overloaded it could hardly serve user requests:
wolfsbane up 53+10:55, 1 user, load 53.93, 55.49, 55.22
It doesn’t seem to have noticed that it’s blocked. It’s still hammering away as fast as it can, and getting nothing but 403 in reply. I’ve even added it to robots.txt, but it doesn’t seem to have noticed that yet either. Fortunately, our web server is quite fast at returning 403, so the load is looking much happier:
wolfsbane up 53+11:58, 4 users, load 0.68, 1.15, 3.44
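For anyone who wants to check a spider’s real request rate against what its operator claims, something along these lines would do as a rough sketch. It assumes an Apache-style “combined” access log, which may not match what the Toolserver web server actually writes; the function name and the log pattern are made up for illustration and are not taken from our setup.

```python
import re
from collections import Counter

# Assumption: Apache-style "combined" log lines, e.g.
# 192.0.2.1 - - [09/Jun/2009:10:00:00 +0000] "GET /x HTTP/1.1" 200 512 "-" "agent"
# The Toolserver's real log format is not given in the post.
LINE_RE = re.compile(
    r'\[(?P<ts>[^\]]+)\] "(?P<req>[^"]*)" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def peak_requests_per_second(lines, agent_substring):
    """Return the largest number of requests logged in any single second
    by user agents whose name contains agent_substring (case-insensitive)."""
    per_second = Counter()
    wanted = agent_substring.lower()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # skip lines that do not look like combined-format entries
        if wanted not in match.group("agent").lower():
            continue
        # The timestamp has one-second resolution, so it can serve as the bucket key.
        per_second[match.group("ts")] += 1
    return max(per_second.values(), default=0)
```

Called as `peak_requests_per_second(open("access.log"), "msnbot")` (the file name is a placeholder), it reports the busiest second for that spider, which is the figure to compare against a claimed “one request every 10 seconds”.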
After I blocked it, I tried to find a contact at Microsoft to report the problem too; as the spider clearly isn’t behaving as they expect, I thought they might appreciate a warning. Well, I can now report that Live Search really don’t want to be contacted. The closest thing I could find to a contact form, linked from the “troubleshooting problems with msnbot” page, had a list of categories for me to choose from. None of them was even slightly related to search. Some Googling suggested that “msnbot@microsoft.com” might work, but nope (“Returned mail: user unknown”). There’s a feedback link on the MSN front page, but who knows where that would go, and whether the feedback would ever reach someone who could deal with it? (Certainly not me, as they clearly state that they won’t reply to your feedback.)
I gave up in the end. If someone reading this happens to have a contact at Microsoft who would be interested in this issue, please feel free to let them know. Otherwise, I imagine Live Search users will just have to live without the Toolserver.
PS: I know msnbot (supposedly) supports the Crawl-Delay parameter in robots.txt. But given what I’ve seen today, I don’t particularly want to rely on this, even if it does, some day, reload our robots.txt.
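To show what the two robots.txt approaches look like to a well-behaved crawler, here is a small sketch using Python’s standard `urllib.robotparser`. The robots.txt text is hypothetical (our actual file isn’t reproduced here), and the 10-second delay is just the figure Microsoft quotes, not something we have configured.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only: msnbot is shut out entirely,
# while every other crawler is asked to wait 10 seconds between requests.
ROBOTS_TXT = """\
User-agent: msnbot
Disallow: /

User-agent: *
Crawl-delay: 10
"""


def load_rules(text):
    """Parse robots.txt text and return a parser that can be queried per agent."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
msnbot_allowed = rules.can_fetch("msnbot", "/some/tool")   # blocked by "Disallow: /"
other_allowed = rules.can_fetch("Googlebot", "/some/tool")  # falls under "User-agent: *"
other_delay = rules.crawl_delay("Googlebot")                # delay from the "*" entry
```

Of course, all of this only helps if the spider re-reads robots.txt and honours it, which is exactly what msnbot wasn’t doing; the 403 block at the web server doesn’t depend on its cooperation.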