Uploading Files 21
Anyone who’s built a rails application that deals with large file uploads probably has a few horror stories to tell about it. While some people love to overstate the issues for their own purposes, it’s still something that can be quite challenging to do well.
What’s the Problem?
As I mentioned in the article on File Downloads, your rails processes are a scarce resource. You need them to be free to handle your applications’ requests, if they’re all busy, your users will be left waiting. When we optimised the download processes we made sure that we used our webservers instead of tying up a rails process to spoon feed the file out over the network to your users. Dealing with uploads has a similar problem.
When a browser uploads a file, it encodes the contents in a format called ‘multipart mime’ (it’s the same format that gets used when you send an email attachment). In order for your application to do something with that file, rails has to undo this encoding. To do this requires reading the huge request body, and matching each line against a few regular expressions. This can be incredibly slow and use a huge amount of CPU and memory.
While this parsing is happening your rails process is busy, and can’t handle other requests. Pity the poor user stuck behind a request which contains a 100M upload!
What’s not the problem?
Some people seem to think that the File Upload problem with rails is that the entire process is blocked while the browser sends the encoded body to you. This isn’t not true, and hasn’t been for a long time. Whether you’re using nginx + mongrel, apache + mongrel or apache + passenger, your web server buffers the entire request before rails locks itself for processing. So no matter how slow a user’s connection is, your application isn’t locked while they upload their file.
What can you do?
There are a number of unattractive options to work around this slow multipart-parsing. The most common I’ve seen is to send uploads to a non-rails process such as a CGI script or a merb/mongrel/rack application. CGI scripts have the obvious disadvantage that you need to write a script simple enough to start up quickly and featured enough to process your uploads. Doing it in rack leaves you relying on ruby’s threading to handle parallelism. This is probably not what you want and your throughput is probably much lower than it would be without that upload being processed.
What else can you do?
Because neither of these options were acceptable Pratik Naik and I have built a Mod Porter an Apache module that does the heavy lifting for your file uploads. All of the hard stuff is done by libapreq though, so you don’t have to worry about using C code written by two ruby programmers!
Porter is essentially the inverse of X-SendFile. It parses the multipart post in C inside your apache process and writes the files to disk. Once that work is done it changes the request to look like a regular form POST which contains pointers to the temp files on disk. To maintain system security it also signs the modified parameters so people can’t attack your system like those old PHP apps.
This means that your rails processes don’t have to deal with anything more than a regular form post which is nice and fast. In addition to the apache module, Porter also includes a Rails Plugin which hides all of this from you. It makes an upload handled by Porter, look just like a regular Rails Upload.
How fast is it?
The speed of upload parsing isn’t particularly relevant, the reduced locking is far more important. Your user’s internet connection is much more important for the round-trip upload performance than your upload handler’s parser.
Having said all that, Porter runs significantly faster than the equivalent pure-ruby parsing code. Depending on the size and number of uploads we’ve seen response times between 30 and 200 times as fast. That’s not just compared to rails’ upload parser, it’s that much faster than every other ruby mime parser we tried.
Isn’t this just like the Nginx module?
Kinda. We’ve been thinking about this module ever since we started using lighttpd’s X-SendFile header. When I saw the nginx module get released I decided to start planning the Apache equivalent. Porter is completely transparent to your application, you don’t need a special form action, and you don’t need to tell Porter what form fields to pass through to the web application. This means you can use porter in production, and mongrel or thin in development, without any changes to your application.
The biggest improvement from this is that you don’t need to change your nginx config every time you add a new input to a form, or a new file upload to your application. This is extremely tedious and error prone, especially when making these changes involves a support ticket with your hosting provider. The major goal we have with Porter is to make sure it always ‘Just Works’, so you can put a file upload into any form without having to worry about your web server.
Getting Started
Porter is still beta software, so you’re strongly advised to test it first, but you already knew that. The porter website has the installation instructions. Once you’ve got that done you’ll need to add the rails plugin, and configure them to share a nice secure secret. Then, hopefully, your application will Just Work but your uploads will be much less painful.
If you have any issues getting it running, leave us a note on the git hub issues page.
Storing Your Files 16
This is the second article in my series on file management, the third article will cover the challenges of handling uploads then we should be able to move on to some more advanced topics.
The second problem you’ll face when building an application to handle files is where and how to store them. Thankfully there are lots of well-supported options, each with their own pros and cons.
The local file system
If your application only runs on a single server, the simplest option is to store them on the local disk of your web/application server. This leaves you with very few moving parts, and you know that both your rails application and your webserver can see the same files, at the same location. But even though this is a simple option there are a few things that you need to be careful of.
A common mistake I see is to use a single directory to handle all of the users’ uploaded files. So your directory structure ends up looking something like this:
/home/railsway/uploads/koz_avatar.png
/home/railsway/uploads/dhh_avatar.png
/home/railsway/uploads/other_avatar.png
The first, and most obvious, problem with this structure is that unless you’re careful you could end up with users overwriting each other’s files. The second, and more painful problem is that you end up with too many files in a single directory which will cause you some pain when you try to do things like list the directory or start removing old files.
The best bet is to store the uploads in a directory which corresponds to the ID of the object which owns those files. But something like the following will also leave you with a huge directory:
/home/railsway/uploads/1/koz_avatar.png
/home/railsway/uploads/2/dhh_avatar.png
/home/railsway/uploads/3/other_avatar.png
The best bet is to partition that directory into a number of sub directories like this:
/home/railsway/uploads/000/000/001/koz_avatar.png
/home/railsway/uploads/000/000/002/dhh_avatar.png
/home/railsway/uploads/000/000/003/other_avatar.png
Thankfully both of the popular file management plugins have built in support for partitioned storage :id_partition in paper clip and :partition in attachment_fu.
NFS, GFS and friends
Once you’ve grown beyond a single app / web server, using the file-system gets a little more complicated. In order to ensure that all your app and web servers can see the same files you have to use a shared file system of some sort. Setting up and running a shared file system is beyond the scope of this site, but a few words of caution.
It’s deceptively easy to set up a simple NFS server for your network and just run your application as you did when it was on a single disk, but some things which are cheap on local disk are slow and expensive over NFS and friends. Make sure you stress test your file server and pay an expert to help you tune the system. The bigger problem I’ve had with NFS and GFS is the impact of downtime or difficulties on your application. Your NFS server becomes a single point of failure for your whole site, and a minor network glitch can render your application completely useless as all the processes get tied up waiting on a blocking read from an NFS mount that’s gone away.
You can solve all those kinds of problems by hiring a good sysadmin and / or spending a large amount of money on serious storage hardware. It’s not a path that I personally choose, but it’s definitely an option you should consider.
Amazon S3
It’s not really possible to write about storage without touching on Amazon S3. In case you’ve been living under a rock for a few years S3 is a hugely scalable, incredibly cheap storage service. There are several good gems to use with your applications and the major file management plugins provide semi-transparent S3 support.
S3 isn’t a file system so there are several things which you have to do differently, however there are alternatives for most of those operations. For instance instead of using X-Sendfile to stream the files to your user, you redirect them to the signed url on amazon’s own service. By way of example our download action from the earlier article would look like this if using S3 and marcel’s s3 library
def download
redirect_to S3Object.url_for('download.zip',
'railswayexample',
:expires_in => 3.hours)
end
But there are a few things you have to be careful with when using S3. The first is that uploading to s3 is much slower than simply writing your file to local disk. Unless you want your rails processes to be tied up for ages, you’ll probably want to have a background job running which transfers the files from your server up to amazon’s. Another factor is that when S3 errors occur your users will be greeted by a very ugly error page:

Finally there’s always the risk of amazon having another bad day which takes your application down for a few hours. Amazon’s engineers are pretty amazing, but nothing’s perfect.
Other options
There are a few options I’ve not used before, but you could investigate:
BLOBs in your database
I’ve never been a fan of using BLOBs to store large files, however some people swear by them. If you’re aware of great tutorial resources for BLOBs and rails, let me know and I’ll link to them from here.
Rackspace’s Cloud Files
When it was first announced Cloud Files from rackspace seemed like it was going to be a great competitor to S3. However there’s currently no equivalent to S3’s signed-url authentication option which means downloads become much harder. To use Cloud Files would require you to build a streaming proxy in your application, and use it to stream files from rackspace back out to the user. You’d also have to pay for the bandwidth twice, once from rackspace, and once from your hosting provider.
This makes it much more complicated than S3 but hopefully this will be addressed in a future release.
MogileFS
MogileFS is a really interesting option. It has some similarities to S3 in that it’s a write-once file storage system which operates over HTTP. But unlike S3 it’s open source software you can run on your own servers. Unfortunately MogileFS is really thinly documented and quite difficult to get up and running. If you know of a really good getting-started tutorial for MogileFS, let me know and I’ll link to it from here.
It also would require you to use perlbal for your load balancer or find an apache module that can support X-Reproxy-Url.
Conclusion
There are a bunch of different options you should consider when picking the storage for your file uploads. Generally my advice would be to start with simple on-disk partitioned storage and grow from there. Don’t rush straight to S3 because all the blogs tell you to, stay as simple as possible for as long you can.
File Downloads Done Right 26
Getting your file downloads right is one of the most important parts of your File Management functionality. A poorly implemented download function can make your application painful to use, not just for downloaders, but for everyone else too.
Thankfully it’s also one of the easiest things to get right.
The simple version
For the purposes of this article let’s assume that your application needs to provide access to a large zip file, but that access should be restricted to logged in users.
The first choice we have to make is where to store this file. In this case there’s really only one wrong answer, and that’s to store it in the public folder of your rails application. Every file stored in public will be served by our webserver without the involvement of our rails application. This makes it impossible for us to check that the user has logged in. Unless your files are completely public, you shouldn’t go anywhere near the public folder.
So let’s assume we’ve stored the zip file in:
/home/railsway/downloads/huge.zip
Next we need a simple download action to send the file to the user, thankfully rails has this built right in:
before_filter :login_required
def download
send_file '/home/railsway/downloads/huge.zip', :type=>"application/zip"
end
Now when our users click the download link, they’ll be asked to choose a location and then be able to view the file. The bad news is, there’s a catch here. The good news is it’s easy to fix.
What’s the catch?
The problem here is one of scarce resources, and that resource is your rails processes. Whether you’re using mongrel, fastcgi or passenger you have a limited number of rails processes available to handle application requests. When one of your users makes a request, you want to know that you either have a process free to handle the request, or that one will become free in short order. If you don’t, users will face an agonizing wait for pages to load, or see their browser sessions timeout entirely.
When you use the default behaviour of send_file to send the file out to the user, your rails process will read through the entire file, copying the contents of the file to the output stream as it goes. For small files like images this probably isn’t that big of a deal, but for something enormous like a 200M zip file, using send_file will tie up a process for a long time. Users on slow connections will soak up a rails process for correspondingly longer.
If you get a large number of downloads running, you may find all your rails processes taken up by downloaders, with none left to serve any other users. For all intents and purposes your site is down: you’ve instituted a denial of service attack against yourself.
What about threads?
Unfortunately threads in ruby won’t save us. The combination of blocking IO and green threads mean that even though you’re doing the work in a thread, it’s blocking the entire process most of the time anyway. JRuby users may get a performance improvement, but it’s still going to be a noticeable consumption of resources when compared to letting a web server stream the file.
Don’t believe everything you read on the internet, threads and ruby just won’t help you with most of this stuff.
So What’s the Solution?
Thankfully this problem was solved a long time ago by the guys at live journal. They used perl instead of ruby, but had the same problems. Downloading files would block their application processes for too long, and cause other users to have to wait. Their solution was elegant and simple. Instead of making the application processes stream the file to the user, they simply tell the webserver what file to send, and let the web server bother with the details of streaming the file out to the client.
Their particular solution is quite cumbersome to set up and use, but there’s a very similar solution available called X-Sendfile. It’s supported out of the box with later versions of lighttpd, and available as a module for apache.
The way it works is instead of sending the file to our users, our rails application will simply check they’re allowed to download it (using our login_required filter) then write the name of the file into a special response header then render an empty response. Once apache sees that response it will read the file from disk and stream it out to the user. So your headers will look something like:
X-Sendfile: /home/railsway/downloads/huge.zip
The apache module has a slightly annoying default setting that prevents it from sending files outside the public folder, so you’ll need to add the following configuration option:
XSendFileAllowAbove on
Thankfully for rails users x-sendfile support is built right in to rails, allowing us to make a few minor changes and we’re done.
before_filter :login_required
def download
send_file '/home/railsway/downloads/huge.zip', :type=>"application/zip", :x_sendfile=>true
end
With that, we’re done. Our rails process just make a quick authorization check and render a short response, and apache uses its own optimised file streaming code to send the file down to our users. Meanwhile, the rails process is free to go on to the next request.
Nginx users can use a similar header called X-AccelRedirect. This is a little more fiddly to set up, and requires your application to write a special internal URL to the http response rather than the full path, but in terms of scalability and resource contention, it’s just as great. There’s an intro to the nginx module available if you’re an nginx user. If only uploads were this easy!
Up Next
The next article in the series will cover my experiences when dealing with the storage of your files. Should you use S3? What about blobs, NFS, GFS or MogileFS?

The Rails Way is all about teaching "best practices"
in 