Within just 12 hours after the public beta launch of Union Station we were experiencing performance issues. Users noticed that their data wasn’t showing up and that the site is slow. We tried to expand our capacity on March 3 without taking it offline, but eventually we were left with no choice but to do so anyway. Union Station is now back online and we’ve expanded our capacity three fold. Being forced to take it offline for scaling so soon is both a curse and a blessing. On the one hand we deeply regret the interruption of service and the inconvenience that it has caused. On the other hand we are glad that so many people are interested in Union Station that caused scaling issues on day 1.
In this blog post we will explain:
- The events that had led to the scaling issues.
- What capacity we had before scaling, how our infrastructure looked like and what kind of traffic we were expecting for the initial launch.
- How we’ve scaled Union Station now and how our infrastructure looks like now.
- What we’ve learned from it and what we will do to prevent similar problems in the future.
Preparation work before day 1
Even before the launch, our software architecture was designed to be redundant (no single point of failure) and to scale horizontally across multiple machines if necessary. The most important components are the web app, MySQL, MongoDB (NoSQL database), RabbitMQ (queuing server) and our background workers. The web app receives incoming data, validates them and puts them in RabbitMQ queues. The background workers listen on the RabbitMQ queues, transform the data into more usable forms and index them into MongoDB. The bulk of the data is stored in MongoDB while MySQL is used to for storing small data sets.
Should scaling ever be necessary then every component can be scaled to multiple machines. The web app is already trivially scalable: it is written in Ruby so we can just add more Phusion Passenger instances and hook them behind the HTTP load balancer. The background workers and RabbitMQ are also very easy to scale: each worker is stateless and can listen from multiple RabbitMQ queues so we can just add more workers and RabbitMQ instances indefinitely. Traditional RDBMSes are very hard to write-scale across multiple servers and typically require sharding at the application level. This is the primary reason why we chose MongoDB for storing the bulk of the data: it allows easy sharding with virtually no application-level changes. It allows easy replication and its schemaless nature fits our data very well.
In extreme scenarios our cluster would end up looking like this:
We started with a single dedicated server with an 8-core Intel i7 CPU, 8 GB of RAM, 2×750 GB harddisks in RAID-1 configuration and a 100 Mbit network connection. This server hosted all components in the above picture on the same machine. We explicitly chose not to use several smaller, virtualized machines in the beginning of the beta period for efficiency reasons: our experience with virtualization is that they impose significant overhead, especially in the area of disk I/O. Running all services on the bare metal allows us to get the most out of the hardware.
Update: we didn’t plan on running on a single server forever. The plan was to run on a single server for a week or two, see whether people are interested in Union Station, and if so add more servers for high availability etc. That’s why we launched it as a beta and not as a final.
During day 1 and day 2
Equipped with a pretty beefy machine like this we thought it would hold out for a while, allowing us to gauge whether Union Station will be a success before investing more in hardware. It turned out that we had underestimated the amount of people who would join the beta as well as the amount of data they have.
Within 12 hours of launching the beta, we had already received over 100 GB of data. Some of our users sent hundreds of request logs per second to our service, apparently running very large websites. The server had CPU power in abundance, but with this much data our hard disk started becoming the bottleneck: the background workers were unable to write the indexed data to MongoDB quickly enough. Indeed, ‘top’ showed that the CPUs were almost idle while ‘iotop’ showed that the hard disks were running at full speed. As a result the RabbitMQ queues started becoming longer and longer. This is the reason why many users who registered after 8 hours didn’t see their data for several minutes. By the next morning the queue had become even longer, and the background workers were still busy indexing data from 6 hours ago.
During day 2 we performed some emergency filesystem tweaks in an attempt to make things faster. This did not have significant effect, and after a short while it was clear: the server could no longer handle the incoming data quickly enough and we need more hardware. The plan in the short term was to order additional servers and shard MongoDB across all 3 servers, thereby cutting the write load on each shard by 1/3rd. During the afternoon we started ordering 2 additional servers with 24 GB RAM and 2×1500 GB hard disks each, which were provisioned within several hours. We setup these new harddisks in RAID-0 instead of RAID-1 this time for better write performance, and we formatted with the XFS filesystem because that tends to perform best with large database files like MongoDB’s. By the beginning of the evening, the new servers were ready to go.
Update: RAID-0 does mean that if one disk fails we lose pretty much all data. We take care of this by making separate backups and setting up replicas on other servers. We’ve never considered RAID-1 to be a valid backup strategy. And of course RAID-0 is not a silver bullet for increasing disk speed but it does help a little, and all tweaks and optimizations add up.
Update 2: Some people have pointed out that faster setups exist, e.g. by using SSD drives or by getting servers with more RAM for keeping the data set in memory. We are well aware of these alternatives, but they are either not cost-effective or couldn’t be provisioned by our hosting provider within 6 hours. We are well aware of the limitations of our current setups and should demand ever rise to a point where the current setup cannot handle the load anymore we will definitely do whatever is necessary to scale it, including considering SSDs and more RAM.
MongoDB shard rebalancing caveats
Unfortunately MongoDB’s shard rebalancing system proved to be slower than we hoped it would be. During rebalancing there was so much disk I/O we could process neither read nor write requests in reasonable time. We were left with no choice but to take the website and the background workers offline during the rebalancing.
After 12 hours the rebalancing still wasn’t done. The original server still had 60% of the data, while the second server had 30% of the data and the third server 10% of the data. Apparently MongoDB performs a lot of random disk seeks and inefficient operations during the rebalancing. Further down time was deemed unacceptable so after an emergency meeting we decided to take the following actions:
- Disable the MongoDB rebalancer.
- Take MongoDB offline on all servers.
- Manually swap the physical MongoDB database files between server 1 and 3, because the latter has both a faster disk array as well as more RAM.
- Change the sharding configuration and swap server 1 and 3.
- Put everything back online.
These operations were finished in approximately 1.5 hours. After a 1 hour testing and maintenance session we had put Union Station back online again.
Even though MongoDB’s shard rebalancing system did not perform to our expectations, we do plan on continuing to use MongoDB. Its sharding system – just not the rebalancing part – works very well and the fact that it’s schemaless is a huge plus for the kind of data we have. During the early stages of the development of Union Station we used a custom database that we wrote ourselves, but maintaining this database system soon proved to be too tedious. We have no regrets switching to MongoDB but have learned more about its current limits now. Our system administration team will take these limits in mind next time we start experiencing scaling issues.
Our cluster now consists of 3 physical servers:
- All servers are equipped with 8-core Intel i7 CPUs.
- The original server has 8 GB of RAM and 2×750 GB harddisks in RAID-1.
- The 2 new servers have 24 GB of RAM each and 2×1500 GB harddisks in RAID-0 and are dedicated database servers.
We did not spend the hours only idly waiting. We had been devising plans to optimize I/O at the application level. During the down time we had made the following changes:
- The original database schema design had a lot of redundant data. This redundant data was for internal book keeping and was mainly useful during development and testing. However it turned out that it is unfeasible to do anything with this redundant data in production because of the sheer amount of it. We’ve removed this redundant data, which in turn also forced us to remove some development features.
- One of the main differentiation points of Union Station lies in the fact that it logs all requests. All of them, not just slow ones. It logs a lot of details like events that have occurred (as you can see in the request timeline) and cache accesses. As a result the request logs we receive can be huge.
We’ve implemented application-level support for compression for some of the fields. We cannot compress all fields because we need to run queries on them, but the largest fields are now stored in gzip-compressed format in the database and are decompressed at the application level.
How we want to proceed from now
We’ve learned not to underestimate the amount of activity our users generate. We did expect to have to eventually scale but not after 1 day. In order to prevent a meltdown like this from happening again, here’s our plan for the near future:
- Registration is still closed for now. We will monitor our servers for a while, see how it holds up, and it they perform well we might open registration again.
- During this early stage of the open beta, we’ve made the data retention time 7 days regardless of which plan was selected. As time progresses we will slowly increase the data retention times until they match their paid plans. When the open beta is over we will honor the paid plans’ data retention times.