The Road to Passenger 3: Technology Preview 1 – Performance
It has already been two years since we first released Phusion Passenger. Time sure flies, and we’ve come a long way since then. We were the first to implement a working Ruby web app deployment solution that integrates seamlessly into the web server, and all the features that we’ve developed over time (smart spawning and memory reduction, upload buffering, Nginx support, etc.) have served us well for a long time. Nevertheless, it is time to say goodbye to the old Phusion Passenger 2.2 codebase. In the past we focused primarily on three things:
- Ease of use.
- Stability.
- Robustness.
Notice that “performance” is not on the above list. We strived to make Phusion Passenger “fast enough”, i.e. not ridiculously slower than the alternatives. Lately, competitors appear to be focusing on performance once again, and of course we cannot afford to fall behind. We’ve been working on Phusion Passenger 3 for a while now. Today we begin unveiling the technology behind this new major Phusion Passenger version. This blog post is the first of several technology previews to come.
The performance test
It’s not very useful to benchmark Phusion Passenger performance using a Rails application because most of the time is spent in Rails and the application itself. Therefore we’ll benchmark with a simple Rack application. Consider the following hello world Rack application:
    app = proc do |env|
      [200, { "Content-Type" => "text/html" }, ["hello world\n"]]
    end
    run app
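The hello world app above can also be exercised without any web server, which makes it clear how little work the application itself does per request. The following is a minimal sketch under a few assumptions: `HELLO_APP`, `call_rack_app`, and the two-key environment hash are hypothetical names and values chosen for illustration, and `run` is left out because it only exists inside a `config.ru` file.

```ruby
# The same application as in the post, stored in a constant instead of
# being handed to `run` (which is only available inside config.ru).
HELLO_APP = proc do |env|
  [200, { "Content-Type" => "text/html" }, ["hello world\n"]]
end

# Hypothetical helper: calls a Rack app with a tiny stand-in environment.
# A real server passes many more keys than these two.
def call_rack_app(app, env = { "REQUEST_METHOD" => "GET", "PATH_INFO" => "/" })
  status, headers, body = app.call(env)

  # A Rack body only has to respond to #each, so collect the chunks that way.
  text = +""
  body.each { |chunk| text << chunk }

  [status, headers, text]
end

status, headers, text = call_rack_app(HELLO_APP)
puts "#{status} #{headers["Content-Type"]} #{text.inspect}"
```

Since the app returns a constant response, nearly all of the time measured in the benchmarks below is spent in the web server and in Phusion Passenger rather than in application code.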
How fast does this run on Phusion Passenger 2.2?
- Operating system: OS X Snow Leopard
- Benchmark command: ab -c 25 -n 10000 http://rack.test/, pool size 25
- Apache: 1628 req/sec
- Nginx: 1843 req/sec
Now let’s look at Phusion Passenger 3:
- Operating system: OS X Snow Leopard
- Apache: 2225 req/sec; 36% faster
- Nginx: 2864 req/sec; 55% faster
That’s right, the Nginx version is over 50% faster than 2.2 on OS X!
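The percentages follow directly from the requests-per-second figures. As a quick check, here is a small sketch that recomputes them; `RESULTS` and `percent_faster` are names made up for this example, and the numbers are the ones quoted above. The unrounded gains come out at roughly 36.7% for Apache and 55.4% for Nginx.

```ruby
# Requests per second as reported in this post (2.2 versus 3).
RESULTS = {
  "Apache" => { v2: 1628, v3: 2225 },
  "Nginx"  => { v2: 1843, v3: 2864 }
}.freeze

# Relative throughput gain of `new_rate` over `old_rate`, as a percentage.
def percent_faster(old_rate, new_rate)
  (new_rate - old_rate) * 100.0 / old_rate
end

RESULTS.each do |server, rates|
  gain = percent_faster(rates[:v2], rates[:v3])
  puts format("%-6s %.1f%% faster", server, gain)
end
```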
A graph is worth more than a thousand words
Suffice it to say, even though Phusion Passenger was already pretty fast, we believe we’ve made some pretty significant performance improvements, and it will be interesting to see how the final version of Phusion Passenger 3 stacks up against the competition. Needless to say, we’ve already run our own benchmarks and concluded that “the self-proclaimed fastest deployment solution” really isn’t the fastest deployment solution compared to Phusion Passenger 3. 😉 That said, benchmarks are lies, lies, lies, damn lies, of course, and your mileage may definitely vary, so we encourage you to run any kind of benchmark you’d like when we release 3. For us, the most important issue is still how much time you have to spend actually maintaining your setup, but as the graphs indicate, we’ve made some pretty monstrous improvements to performance as well.
How did we do it?
When it comes to optimizing software, there’s a saying that 20% of the code is responsible for 80% of the time. Not so with Phusion Passenger: we found no obvious performance bottlenecks. Even profilers turned out to be totally useless because all the times were so small and so close to each other. Phusion Passenger was already pretty fast.
Instead, we optimized the hard way: with lots and lots of micro-optimizations. 2% here, 3% there, and so on. In other words, blood, sweat, tears and lots of sleepless nights. The optimizations can be summed up as follows:
- Reducing system calls
  - System calls are pretty expensive compared to userspace computation. They require a context switch to the kernel. For example, all I/O operations (read(), write()) are system calls. We’ve performed an extensive code inspection and removed or coalesced a lot of redundant system calls.
- The beginning of a zero-copy I/O architecture
  - The CPU is very fast nowadays. In fact it is so fast that RAM speed cannot keep up with the CPU. This makes memory access very expensive. In the case of I/O-intensive applications such as web servers, one benefits from copying I/O data as little as possible. In order to optimize memory access, we’ve implemented the beginning of a zero-copy I/O architecture. This architecture covers both the C++ and the Ruby parts of Phusion Passenger.
- Less Ruby garbage production
  - The garbage collector in Ruby can be a significant bottleneck. We’ve heavily optimized the Ruby part of the request handler and reduced creation of Ruby objects to a minimum. This made the request handler significantly faster in our tests.
- Optimizing algorithms and optimizing Ruby code in C
  - Some algorithms have been optimized, e.g. some O(N) algorithms have been replaced by O(log N) or O(1) algorithms. Some key Ruby code has been replaced by C code. The former didn’t give us a lot of performance because the O(N) algorithms weren’t doing a lot of work in the first place, but the latter gave us a much more noticeable boost.
- Reducing context switches
  - Phusion Passenger is heavily multithreaded and consists of multiple threads and service processes. However, some communication between threads and processes required round trips, which caused more context switches than necessary. We’ve optimized our internal protocols and reduced context switching to a minimum.
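To make the “less Ruby garbage” point a bit more concrete, here is a minimal sketch of the general technique: reuse frozen objects on a hot path instead of allocating new ones on every request. This is not Phusion Passenger’s actual request handler. The constants and the `wasteful_response`, `frugal_response` and `allocations` helpers are hypothetical names invented for this illustration, and the sketch assumes a Ruby version where `GC.stat(:total_allocated_objects)` is available.

```ruby
# Frozen, preallocated objects that can be shared between requests.
CONTENT_TYPE = "Content-Type".freeze
TEXT_HTML    = "text/html".freeze
BODY         = ["hello world\n".freeze].freeze

# Allocates fresh strings, a fresh hash and two fresh arrays on every call.
def wasteful_response
  [200, { "Content-Type" => "text/html" }, ["hello world\n"]]
end

# Reuses the frozen strings and body. The header hash is still built per
# call because later layers (e.g. middleware) may want to mutate it.
def frugal_response
  [200, { CONTENT_TYPE => TEXT_HTML }, BODY]
end

# Number of Ruby objects allocated while the given block runs.
def allocations
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

wasteful = allocations { 1000.times { wasteful_response } }
frugal   = allocations { 1000.times { frugal_response } }
puts "wasteful: #{wasteful} objects, frugal: #{frugal} objects"
```

The exact counts depend on the Ruby version, but the second variant should allocate noticeably fewer objects per call, which means less work for the garbage collector under load.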
The future
As stated in this blog post, this is just a glimpse of what we’ve got in store for you. As you’ve come to expect from us, we want to make sure that our findings hold up in real-life scenarios as well. After close to two years of field testing Phusion Passenger 2 in some of our clients’ most demanding web hosting environments, we’ve spent the last few months forging that experience back into Phusion Passenger 3. Through beta testing in these high-demand Rails environments, we hope to ensure that it will give you the best experience, both in an enterprise environment and for your personal use.
This post has touched on performance. In the period leading up to the release of Phusion Passenger 3, we’ll unveil bit by bit what else we’ve been tinkering with for the last few months. In particular, we’re looking forward to seeing how the zero-copy I/O architecture and the other optimizations hold up in real-life scenarios. We’re not done optimizing yet, but we will likely hit a ceiling at some point where optimizations get harder and harder, especially if you want to retain features such as the ease of use that defines Phusion Passenger.
One thing is for sure: we want this release to be nothing less than stunning, so we encourage you to submit your wish list to us as well. We’ve likely implemented a lot of those wishes already, but we want to make sure that we’re not missing anything.