Storing Your Files
This is the second article in my series on file management. The third article will cover the challenges of handling uploads, and then we should be able to move on to some more advanced topics.
The second problem you’ll face when building an application to handle files is where and how to store them. Thankfully there are lots of well-supported options, each with their own pros and cons.
The local file system
If your application only runs on a single server, the simplest option is to store the files on the local disk of your web/application server. This leaves you with very few moving parts, and you know that both your Rails application and your web server can see the same files at the same location. But even though this is a simple option, there are a few things you need to be careful of.
A common mistake I see is to use a single directory to handle all of the users’ uploaded files. So your directory structure ends up looking something like this:
/home/railsway/uploads/koz_avatar.png
/home/railsway/uploads/dhh_avatar.png
/home/railsway/uploads/other_avatar.png
The first, and most obvious, problem with this structure is that unless you’re careful you could end up with users overwriting each other’s files. The second, and more painful, problem is that you end up with too many files in a single directory, which will cause you some pain when you try to do things like list the directory or start removing old files.
A better approach is to store the uploads in a directory which corresponds to the ID of the object which owns those files. But something like the following will still leave you with one huge directory, this time full of subdirectories:
/home/railsway/uploads/1/koz_avatar.png
/home/railsway/uploads/2/dhh_avatar.png
/home/railsway/uploads/3/other_avatar.png
The best bet is to partition that directory into a number of subdirectories like this:
/home/railsway/uploads/000/000/001/koz_avatar.png
/home/railsway/uploads/000/000/002/dhh_avatar.png
/home/railsway/uploads/000/000/003/other_avatar.png
Thankfully both of the popular file management plugins have built-in support for partitioned storage: :id_partition in Paperclip and :partition in attachment_fu.
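If you want to see what the partitioning amounts to, here is a minimal sketch of the 000/000/001 layout shown above. The partitioned_path helper is a hypothetical name of my own, not either plugin's API, and it assumes IDs of at most nine digits.

```ruby
# Hypothetical helper: builds a partitioned upload path from a record ID.
# The plugins mentioned above ship their own implementations; this is only
# an illustration of the layout.
def partitioned_path(root, id, filename)
  # Assumption: IDs fit in nine digits. Larger IDs would need more groups.
  raise ArgumentError, "id out of range" unless (0..999_999_999).cover?(id)

  # Zero-pad to nine digits, then split into three groups of three digits.
  partition = format("%09d", id).scan(/\d{3}/).join("/")
  File.join(root, partition, filename)
end

puts partitioned_path("/home/railsway/uploads", 1, "koz_avatar.png")
```

With three digits per level, no directory ever holds more than a thousand entries, which keeps listing and cleanup manageable.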
NFS, GFS and friends
Once you’ve grown beyond a single app / web server, using the file system gets a little more complicated. In order to ensure that all your app and web servers can see the same files, you have to use a shared file system of some sort. Setting up and running a shared file system is beyond the scope of this site, but here are a few words of caution.
It’s deceptively easy to set up a simple NFS server for your network and just run your application as you did when it was on a single disk, but some things which are cheap on local disk are slow and expensive over NFS and friends. Make sure you stress test your file server and pay an expert to help you tune the system. The bigger problem I’ve had with NFS and GFS is the impact of downtime or difficulties on your application. Your NFS server becomes a single point of failure for your whole site, and a minor network glitch can render your application completely useless as all the processes get tied up waiting on a blocking read from an NFS mount that’s gone away.
You can solve all those kinds of problems by hiring a good sysadmin and / or spending a large amount of money on serious storage hardware. It’s not a path that I personally choose, but it’s definitely an option you should consider.
Amazon S3
It’s not really possible to write about storage without touching on Amazon S3. In case you’ve been living under a rock for a few years, S3 is a hugely scalable, incredibly cheap storage service. There are several good gems to use with your applications, and the major file management plugins provide semi-transparent S3 support.
S3 isn’t a file system, so there are several things you have to do differently; however, there are alternatives for most of those operations. For instance, instead of using X-Sendfile to stream the files to your user, you redirect them to a signed URL on Amazon’s own service. By way of example, our download action from the earlier article would look like this if using S3 and Marcel’s S3 library:
def download
  redirect_to S3Object.url_for('download.zip',
                               'railswayexample',
                               :expires_in => 3.hours)
end
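If signed URLs are new to you, the general idea can be sketched in a few lines. To be clear, this is not Amazon's actual signing scheme: the secret, the parameter names and the helper methods below are all assumptions made up for illustration. The point is simply that the URL carries an expiry time plus a signature that only the holder of the secret could have produced.

```ruby
require "openssl"

# Hypothetical secret shared between the app that issues URLs and the
# service that checks them. NOT how S3 credentials actually work.
SECRET = "hypothetical-shared-secret"

# Sign the path together with its expiry time.
def signature_for(path, expires_at, secret = SECRET)
  OpenSSL::HMAC.hexdigest(OpenSSL::Digest.new("SHA256"), secret,
                          "#{path}\n#{expires_at}")
end

# Build a URL that stops working `expires_in` seconds after `now`.
def signed_url(base, path, expires_in, now = Time.now)
  expires_at = now.to_i + expires_in
  "#{base}#{path}?expires=#{expires_at}&signature=#{signature_for(path, expires_at)}"
end

# What the storage service would do when the user follows the link.
# A real verifier should use a constant-time comparison here.
def valid_signature?(path, expires_at, signature, now = Time.now)
  return false if now.to_i > expires_at

  signature_for(path, expires_at) == signature
end

puts signed_url("https://files.example.com", "/download.zip", 3 * 60 * 60)
```

Because the expiry time is part of what gets signed, a user can't extend the life of a link by editing the query string.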
But there are a few things you have to be careful with when using S3. The first is that uploading to S3 is much slower than simply writing your file to local disk. Unless you want your Rails processes to be tied up for ages, you’ll probably want to have a background job running which transfers the files from your server up to Amazon’s. Another factor is that when S3 errors occur your users will be greeted by a very ugly error page.

Finally, there’s always the risk of Amazon having another bad day which takes your application down for a few hours. Amazon’s engineers are pretty amazing, but nothing’s perfect.
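To make the background transfer idea concrete, here is a rough sketch using nothing but a thread and a queue. BackgroundUploader is a hypothetical class, and the store object is an assumption: anything that responds to call(path), which in a real application would wrap your S3 client. An in-process thread loses its queue if the process dies, so in production you'd more likely reach for a persistent job queue.

```ruby
# Hypothetical sketch: hand uploads to a background thread so the request
# that received the file can return straight away.
class BackgroundUploader
  attr_reader :failed

  # `store` is assumed to respond to call(path) and to raise on failure.
  def initialize(store, max_attempts: 3)
    @store = store
    @max_attempts = max_attempts
    @queue = Queue.new
    @failed = []
    @worker = Thread.new { work }
  end

  # Called from the request: cheap, returns immediately.
  def enqueue(path)
    @queue << path
  end

  # Stop accepting work and wait for queued uploads to finish.
  def shutdown
    @queue.close
    @worker.join
  end

  private

  # Queue#pop returns nil once the queue is closed and drained.
  def work
    while (path = @queue.pop)
      upload_with_retries(path)
    end
  end

  def upload_with_retries(path)
    attempts = 0
    begin
      attempts += 1
      @store.call(path)
    rescue StandardError
      retry if attempts < @max_attempts
      @failed << path # give up; leave the local copy in place
    end
  end
end
```

Keeping the file on local disk until the upload succeeds means a bad day at the storage provider delays the transfer rather than losing the file.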
Other options
There are a few options I’ve not used before, but you could investigate:
BLOBs in your database
I’ve never been a fan of using BLOBs to store large files, however some people swear by them. If you’re aware of great tutorial resources for BLOBs and rails, let me know and I’ll link to them from here.
Rackspace’s Cloud Files
When it was first announced, Cloud Files from Rackspace seemed like it was going to be a great competitor to S3. However, there’s currently no equivalent to S3’s signed-URL authentication option, which means downloads become much harder. Using Cloud Files would require you to build a streaming proxy in your application and use it to stream files from Rackspace back out to the user. You’d also have to pay for the bandwidth twice: once from Rackspace, and once from your hosting provider.
This makes it much more complicated than S3, but hopefully this will be addressed in a future release.
MogileFS
MogileFS is a really interesting option. It has some similarities to S3 in that it’s a write-once file storage system which operates over HTTP. But unlike S3 it’s open source software you can run on your own servers. Unfortunately MogileFS is really thinly documented and quite difficult to get up and running. If you know of a really good getting-started tutorial for MogileFS, let me know and I’ll link to it from here.
It would also require you to use Perlbal as your load balancer or find an Apache module that supports X-Reproxy-Url.
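For a feel of how the X-Reproxy-Url handoff works from the application's side, here is a rough Rack-style sketch. The STORAGE_URLS table is a hypothetical stand-in for asking the MogileFS tracker where a file lives, and the storage URL in it is made up. The application does its access checks, then replies with an empty body and the header; the proxy in front is what fetches the file and streams it to the user.

```ruby
# Hypothetical lookup table standing in for a MogileFS tracker query.
STORAGE_URLS = {
  "download.zip" => "http://storage.example.internal/hypothetical/download.zip"
}.freeze

# Rack-style application: call(env) returns [status, headers, body].
DownloadApp = lambda do |env|
  key = env["PATH_INFO"].to_s.sub(%r{\A/}, "")
  url = STORAGE_URLS[key]
  return [404, { "Content-Type" => "text/plain" }, ["Not found"]] unless url

  # No file data is sent from the app itself; the reproxy-aware load
  # balancer is expected to fetch `url` and send it to the client.
  [200,
   { "Content-Type" => "application/octet-stream", "X-Reproxy-Url" => url },
   []]
end

status, headers, _body = DownloadApp.call("PATH_INFO" => "/download.zip")
puts status
puts headers["X-Reproxy-Url"]
```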
Conclusion
There are a bunch of different options you should consider when picking the storage for your file uploads. Generally my advice would be to start with simple on-disk partitioned storage and grow from there. Don’t rush straight to S3 because all the blogs tell you to; stay as simple as possible for as long as you can.

Hello, I’m aware of links and resources for BLOB and Rails. I’m running this way with a simple cache system. That’s working well, but I’m afraid of MySQL’s capacity to deal with big BLOB tables.
MongoDB has nice support for storing files (or any large binary objects). The feature is called “GridFS” and is supported by all drivers.
Give it a look :
geir
Thanks for another great write-up. I just wanted to note that we’ve had good luck using the Paperclip plugin recently at my office. When using Amazon EC2 in combination with S3, a background process might not be necessary thanks to the speed of transfer between the two Amazon services. If things still aren’t fast enough, I’d recommend using the delayed_job plugin for offloading longer processes. The Paperclip plugin’s Google Group has some good tips for getting started with that.
@Trevor: I’m hoping to contrast Paperclip and AttachmentFu in another article in this series; I just want to cover the bases first. Fundamentally, my biggest issue with auto uploading like that is that the retry-on-error code in right_aws and friends means you’ll potentially be uploading several times to S3, not once.
However if it works, stick with it.
Delayed job is another topic and one I’m really keen on writing up; I just want to get file management all done first.
Some researchers from Microsoft Research made an interesting observation about BLOBs. Basically, anything over 1MB is terribly inefficient on storage and retrieval vs any filesystem storage.
I think the major reason for storing images and such in a database is for forward compatibility reasons; when database vendors implement a good way of image based searching, you’ll be ready.
Great article. In the first section, you might want to touch on the shared directory and setting up symlinks so you don’t lose uploaded files after deploying with Capistrano.
Hey Koz, great summary. You gave me an idea. What about using S3, but first writing files to disk so they’re instantly available, using a bg/scheduled job to do the S3 upload, and then letting the app know when the file is available at S3?
@Rich Mehta,
Another reason for storing images in database BLOBs is ease/simplicity of backup. You back up the database and the images get backed up too.
I struggled a lot with Amazon S3! Thanks for this great post
@John Topley Definitely, that’s why I chose that way.
You must be careful when finding the thumbnail: the full binary data is heavy and must not be loaded:
find_by_photo_id(photo_id, :select => "thumbnail_data").thumbnail_data
@Ivan: That’s exactly what the guys at 37signals do. If you upload an image to Campfire, for the first few minutes it’s stored locally.
@john @emilien: In addition to the backups, you can do transactional processing if you’re using blobs. A former client of mine uses them for the thumbnail of a user’s avatar, and swears by them. I’ve just never felt the need.
Koz, what’s the “transactional processing” you’re talking about?
@emilien: If you’re using BLOBs then when you roll back your transaction all changes to that blob are rolled back also. This means that you are guaranteed not to have files lying around that aren’t referenced by the database.
In practice this has never been a huge issue for me, but it could be important depending on your particular requirements. A leftover user avatar isn’t a huge deal, but a PDF contract left in an inconsistent state could be a big problem.
Koz, you might be able to resolve the issues you’re having with Cloud Files by making use of the built-in support for Limelight’s CDN. If you make the user uploaded files available via the CDN, then you can use the CDN URL when serving the files instead of having to proxy them. This has two advantages: you only pay for bandwidth once (from the CDN to the user), and you offload the work of serving those files to the CDN. We also offer a native Ruby API now that is fully supported by Rackspace.
@Andrew: That would only work with publicly available files though; user-secured zip files or the like wouldn’t work at all.
Signed urls really should be on your TODO list :)
Hi Koz,
We use MogileFS at YouDo for our file servers.
We use Nginx to process the X-Reproxy-Url. It works like a charm, and is a lot faster than Perlbal.
Cheers.