Igvita.com recently published the article Rails Performance Needs an Overhaul. Rails performance… no, Ruby performance… no, Rails scalability… well, something is being criticized here. From my experience, talking about scalability and performance can be confusing because the terms can mean different things to different people and in different situations, yet the meanings are used interchangeably all the time. In this post I will take a closer look at Igvita’s article.
Performance vs scalability
Let us first define performance and scalability. I define performance as throughput: the number of requests per second. I define scalability as the number of users a system can handle concurrently. There is a correlation between performance and scalability. Higher performance means each request takes less time, and so the system is more scalable, right? Sometimes yes, but not necessarily. It is entirely possible for a system to be scalable yet have a lower throughput than a system that’s not as scalable, or for a system to be uber-fast yet not very scalable. Throughout this blog post I will show several examples that highlight the difference.
“Scalability” is an extremely loaded word and people often confuse it with “being able to handle tons and tons of traffic”. Let’s use a different term that better reflects what Igvita’s actually criticizing: concurrency. Igvita claims that concurrency in Ruby is pathetic while referring to database drivers, Ruby application servers, etc. Some practical examples that demonstrate what he means are as follows.
Limited concurrency at the app server level
Mongrel, Phusion Passenger and Unicorn all use a “traditional” multi-process model in which multiple Ruby processes are spawned, each process handling a single request at a time. Thus, concurrency is (assuming that the load balancer has infinite concurrency) limited by the number of Ruby processes: having 5 processes allows you to handle 5 users concurrently.
Threaded servers, where the server spawns multiple threads, each handling one connection, allow more concurrency because it’s possible to spawn a whole lot more threads than processes. In the context of Ruby, each Ruby process needs to load its own copy of the application code and other resources, so memory usage increases very quickly as you spawn additional processes. Phusion Passenger with Ruby Enterprise Edition mitigates this somewhat through copy-on-write optimizations which save memory, so you can spawn a few more processes, but not significantly (as in 10x) more. In contrast, a multi-threaded app server does not need as much memory because all threads share the application code with each other, so you can comfortably spawn tens or hundreds of threads. At least, that is the theory; I will later explain why this does not necessarily hold for Ruby.
When it comes to performance however, there’s no difference between processes and threads. If you compare a well-written multi-threaded app server with 5 threads to a well-written multi-process app server with 5 processes, you won’t find either being more performant than the other. Context switching overhead between processes and between threads is roughly the same. Each process can use a different CPU core, as can each thread, so there’s no difference in multi-core utilization either. This reflects back on the difference between scalability/concurrency and performance.
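To make the thread-per-connection model concrete, here is a minimal sketch of such a server in plain Ruby. This is an illustrative toy, not how Mongrel or Phusion Passenger are actually implemented: every accepted connection gets its own thread, so the concurrency limit is the number of threads the VM and OS can sustain rather than a fixed process count.

```ruby
require "socket"

# Minimal thread-per-connection HTTP server sketch (hypothetical example).
# Each accepted connection is handled in its own thread, so many slow
# clients can be served concurrently by a single process.
def run_threaded_server(port)
  server = TCPServer.new(port)
  loop do
    client = server.accept
    Thread.new(client) do |conn|
      conn.gets  # read the request line, e.g. "GET / HTTP/1.1"
      body = "hello"
      conn.write("HTTP/1.1 200 OK\r\nContent-Length: #{body.bytesize}\r\n\r\n#{body}")
      conn.close
    end
  end
end
```

A process-per-connection server would look almost identical with `fork` in place of `Thread.new`, which is exactly why the two models perform so similarly; the difference lies in memory usage and in how many workers you can afford to run.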
Multi-process Rails app servers have a concurrency level that can be counted on a single hand, or if you have very beefy hardware, a concurrency level in the range of a couple of tens, thanks to the fact that Rails needs about 25 MB per process. Multi-threaded Rails app servers can in theory spawn a couple hundred threads. After that it’s also game over: an operating system thread needs a couple of MB of stack space, so after a couple hundred threads you’ll run out of virtual memory address space on 32-bit systems, even if you don’t actually use that much memory.
There is another class of servers, the evented ones. These servers are actually single-threaded, but they use a reactor style I/O dispatch architecture for handling I/O concurrency. Examples include Node.js, Thin (built on EventMachine) and Tornado. These servers can easily have a concurrency level of a couple of thousand. But due to their single-threaded nature they cannot effectively utilize multiple CPU cores, so you need to run a couple of processes, one per CPU core, to fully utilize your CPU.
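The reactor style that these evented servers use can be sketched in a few lines of Ruby. This is a deliberately simplified echo server, not Thin or EventMachine internals: a single thread multiplexes all connections through one `IO.select` loop, so no thread or process is tied up per client.

```ruby
require "socket"

# Single-threaded reactor sketch using IO.select (illustrative toy).
# One loop watches the listening socket and all client sockets at once;
# whichever descriptor becomes readable gets handled, then the loop
# goes back to waiting. No thread per connection is needed.
def run_reactor(port)
  server  = TCPServer.new(port)
  clients = []
  loop do
    readable, = IO.select([server] + clients)
    readable.each do |io|
      if io == server
        clients << server.accept        # new connection: just register it
      else
        data = io.gets
        if data.nil?                    # client disconnected
          clients.delete(io)
          io.close
        else
          io.write("echo: #{data}")     # handle quickly, don't block the loop
        end
      end
    end
  end
end
```

The catch, as the next sections explain, is that every handler must return quickly: one slow or blocking handler stalls every connection, which is why evented servers demand a non-blocking programming style throughout.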
The limits of Ruby threads
Ruby 1.8 uses userspace threads, not operating system threads. This means that Ruby 1.8 can only utilize a single CPU core no matter how many Ruby threads you create. This is why one typically needs multiple Ruby processes to fully utilize one’s CPU cores. Ruby 1.9 finally uses operating system threads, but it has a global interpreter lock, which means that each time a Ruby 1.9 thread is running it will prevent other Ruby threads from running, effectively making it the same multicore-wise as 1.8. This is also explained in an earlier Igvita article, Concurrency is a Myth in Ruby.
On the bright side, not all is bad. Ruby 1.8 internally uses non-blocking I/O, while Ruby 1.9 releases the global interpreter lock while doing I/O. So if one Ruby thread is blocked on I/O, another Ruby thread can continue execution. Likewise, Ruby is smart enough to make things like sleep() and even waitpid() yield to other threads.
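You can observe this yourself with a quick experiment: threads that are sleeping (or blocked on I/O) overlap instead of queueing up behind one another, even with the global interpreter lock in place.

```ruby
# Even with a global interpreter lock, threads blocked on I/O or sleeping
# do not hold each other up: ten threads that each sleep 0.5 seconds
# finish in roughly 0.5 seconds of wall-clock time, not 5 seconds,
# because Ruby yields to other threads during the wait.
start   = Time.now
threads = 10.times.map { Thread.new { sleep 0.5 } }
threads.each(&:join)
puts format("10 overlapping 0.5s sleeps took %.2fs", Time.now - start)
```

If the sleeps were CPU-bound Ruby work instead, the GIL would serialize them and you would gain nothing from the extra threads on a multi-core machine.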
On the dark side however, Ruby internally uses the select() system call for multiplexing I/O. select() can only handle 1024 file descriptors on most systems so Ruby cannot handle more than this number of sockets per Ruby process, even if you are somehow able to spawn thousands of Ruby threads. EventMachine works around this problem by bypassing Ruby’s I/O code completely.
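Ruby exposes this multiplexing primitive directly as `IO.select`, which is a thin wrapper around the select() system call. The snippet below shows the basic usage; the 1024-descriptor ceiling comes from select()'s fixed-size bitmap (FD_SETSIZE) in the underlying C library, so descriptors numbered beyond that limit simply cannot be watched.

```ruby
# IO.select wraps the select() system call that Ruby 1.8 used internally
# for all of its I/O multiplexing. With a zero timeout it polls: nil means
# nothing is ready, otherwise it returns the ready descriptors. Because
# select() uses a fixed-size bitmap (FD_SETSIZE, typically 1024), file
# descriptors beyond that limit cannot be watched at all.
r, w = IO.pipe
puts IO.select([r], nil, nil, 0).inspect  # nothing readable yet
w.write("x")
ready_readers, = IO.select([r], nil, nil, 0)
puts ready_readers.include?(r)            # the pipe end is now readable
```

Event libraries like EventMachine sidestep the limit by using epoll or kqueue instead of select(), which is one reason they can juggle thousands of connections.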
Naive native extensions and third party libraries
So just run a couple of multi-threaded Ruby processes, one process per core and multiple threads per process, and all is fine, and we should be able to have a concurrency level of up to a couple hundred, right? Well, not quite; there are a number of issues hindering this approach:
- Some third party libraries and Rails plugins are not thread-safe. Some aren’t even reentrant. For example, Rails < 2.2 suffered from this problem. The app itself might not be thread-safe either.
- Although Ruby is smart enough not to let I/O block all threads, the same cannot be said of all native extensions. The MySQL extension is the most infamous example: while it is executing a query, no other threads can run.
Mongrel is actually multi-threaded, but in practice everybody uses it in multi-process mode (mongrel_cluster) exactly because of these problems. It is also the reason why Phusion Passenger has gone the multi-process route.
And even though Thin is evented, a typical Ruby web application running on Thin cannot handle thousands of concurrent users. This is because evented servers require a special evented programming style, such as the one seen in Node.js and EventMachine. A Ruby web app written in an evented style and running on Thin can definitely handle a large number of concurrent users.
When is limited application server concurrency actually a problem?
Igvita is clearly disappointed by all the issues that hinder Ruby web apps from achieving high concurrency. For many web applications, however, I would argue that limited concurrency is not a problem.
- Web applications that are slow, as in CPU-heavy, max out CPU resources pretty quickly, so increasing concurrency won’t help you.
- Web applications that are fast are typically quick enough at handling the load that even a large number of users won’t notice the limited concurrency of the server.
Having a concurrency level of 5 does not mean that the app server can only handle 5 requests per second; it’s not hard to serve hundreds of requests per second with only a couple of single-threaded processes.
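The arithmetic behind this claim is just Little's law. A quick back-of-the-envelope calculation with illustrative numbers (not a benchmark):

```ruby
# Little's law: throughput = concurrency / response time.
# 5 single-threaded processes that each answer a request in 20 ms
# sustain 250 requests per second, despite a concurrency level of only 5.
processes       = 5
response_time_s = 0.020
throughput      = processes / response_time_s
puts "#{throughput.round} requests per second"  # 250 requests per second
```

The same formula also shows where the model breaks down: if the response time balloons to 10 seconds because of a slow external call, those same 5 processes sustain only 0.5 requests per second, which is exactly the scenario discussed below.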
The problem becomes most evident for web applications that have to wait a lot on I/O (besides their own HTTP request/response cycle). Examples include:
- Apps that have to spend a lot of time waiting on the database.
- Apps that perform a lot of external HTTP calls that respond slowly.
- Chat apps. These apps typically have thousands of users, most of them doing nothing most of the time, but they all require a connection (unless your app uses polling, but that’s a whole different discussion).
We at Phusion have developed a number of web applications for clients that fall in the second category, the most recent one being a Hyves gadget. Hyves is the most popular social network in the Netherlands and they get thousands of concurrent visitors during the day. The gadget that we’ve developed has to query external HTTP servers very often, and these servers can take 10 seconds to respond in extreme cases. The servers are running Phusion Passenger with maybe a couple dozen processes. If every request to our gadget caused us to wait 10 seconds for the external HTTP call, then we’d soon run out of concurrency.
But even supposing that our app and Phusion Passenger could have a concurrency level of a couple of thousand, all of those visitors would still have to wait 10 seconds for the external HTTP calls, which is obviously unacceptable. This is another example that illustrates the difference between scalability and performance. We solved this problem by aggressively caching the results of the HTTP calls, minimizing the number of external HTTP calls that are necessary. The result is that even though the application’s concurrency is fairly limited, it can still comfortably serve many concurrent users with a reasonable response time.
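The caching idea can be sketched as follows. This is a hypothetical illustration, not the actual gadget code: remember each slow external response for a while, so that most incoming requests never touch the 10-second backend at all.

```ruby
require "net/http"
require "uri"

# Sketch of aggressive response caching (hypothetical API, not the real
# Hyves gadget code). Each URL's response is kept for `ttl_seconds`;
# within that window, repeat requests are served from memory instead of
# re-issuing the slow external HTTP call.
class CachedFetcher
  Entry = Struct.new(:body, :fetched_at)

  def initialize(ttl_seconds)
    @ttl   = ttl_seconds
    @cache = {}
    @mutex = Mutex.new
  end

  def fetch(url)
    @mutex.synchronize do
      entry = @cache[url]
      return entry.body if entry && Time.now - entry.fetched_at < @ttl
    end
    body = Net::HTTP.get(URI(url))  # the slow external call, on cache miss
    @mutex.synchronize { @cache[url] = Entry.new(body, Time.now) }
    body
  end
end
```

A production version would also need to collapse concurrent misses for the same URL (so a cache expiry doesn't trigger a stampede of identical slow calls), but the principle is the same: the cache turns a 10-second wait into a hash lookup for the vast majority of requests.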
This anecdote should explain why I believe that web apps can get very far despite having a limited concurrency level. That said, as Internet usage continues to increase and websites get more and more users, we may at some point need a much larger concurrency level than most of our current Ruby tools allow (assuming server capacity doesn’t scale quickly enough).
What was Igvita.com criticizing?
Igvita.com does not appear to be criticizing Ruby or Rails for being slow. It doesn’t even appear to be criticizing the lack of Ruby tools for achieving high concurrency. It appears to be criticizing these things:
- Rails and most Ruby web application servers don’t allow high concurrency by default.
- Many database drivers and libraries hinder concurrency.
- Although alternatives exist that allow concurrency, you have to go out of your way to find them.
- There appears to be little motivation in the Ruby community for making the entire stack of web framework + web app server + database drivers etc. scalable by default.
This is in contrast to Node.js where everything is scalable by default.
Do I understand Igvita’s frustration? Absolutely. Do I agree with it? Not entirely. The same thing that makes Node.js so scalable is also what makes it relatively hard to program for. Node.js enforces a callback style of programming, and this can eventually make your code look a lot more complicated and harder to read than regular code that uses blocking calls. Furthermore, Node.js is relatively young – of course you won’t find any Node.js libraries that don’t scale! But if people ever use Node.js for things other than high-concurrency server apps, then non-scalable libraries will at some point pop up. And then you will have to look harder to avoid these libraries. There is no silver bullet.
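The difference between the two styles is easy to see side by side. Below is the same two-step lookup written both ways in Ruby, against a hypothetical `db` API (the method names are made up for illustration): the blocking version reads top to bottom, while the evented version inverts the logic into nested callbacks, EventMachine/Node.js style.

```ruby
# Blocking style: linear and easy to read; the thread simply waits
# at each step until the call returns.
def fetch_user_blocking(db, id)
  user  = db.find_user(id)      # blocks until the query returns
  posts = db.find_posts(user)   # then blocks again
  [user, posts]
end

# Evented/callback style (hypothetical EventMachine-like db API):
# the same logic inverted into nested callbacks. Nothing blocks,
# but control flow fragments, and error handling gets harder as
# the nesting grows.
def fetch_user_evented(db, id, &done)
  db.find_user(id) do |user|
    db.find_posts(user) do |posts|
      done.call(user, posts)
    end
  end
end
```

Two steps of nesting is tolerable; real request handlers with branching, retries and error paths are where the callback style earns its reputation for unreadability.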
That said, all would be well if at least the preferred default stack could handle high concurrency by default. This means e.g. fixing the MySQL extension and having the fix published upstream. The mysqlplus extension fixes this, but for some reason its changes haven’t been accepted and published by the original author, so people end up with a multi-thread-killing database driver by default.
Is Node.js innovative? Is Ruby lacking innovation?
Igvita appears to be criticizing something other than Rails performance, despite what his article’s title implies.
I don’t think the concurrency levels that the Rails stack provides by default are that bad in practice. But as a fellow programmer, it does intuitively bother me that our laptops, which are a million times more powerful than supercomputers from two decades ago, cannot comfortably handle a couple of thousand concurrent users. We can definitely work towards something better, but in the meantime let’s not forget that the current stack is more than capable of Getting Work Done(tm).