Using Varnish in front of your Amazon S3 static content

Many startups these days serve their static assets directly from Amazon S3. S3 is used as a simple CDN instead of more professional (and expensive) solutions (including Amazon’s own CloudFront) because it is very simple and cheap to use. Still, on a high-traffic site this stops being so cheap, since you pay for every request and for the bandwidth. If you still want S3 for its storage advantages (storing millions of files and treating it as effectively unlimited space) without your bill going up like crazy, you can put a reverse proxy or web accelerator in front of it to cache your assets locally and reduce the number of direct hits on S3. Squid or Varnish would both do the job; in this article I will show how to configure Varnish. We use Varnish with S3 on various projects and it works very well, simplifying the setup and saving a lot of money on the Amazon S3 bill.

Varnish is a state-of-the-art, high-performance HTTP accelerator. It uses the advanced features in Linux 2.6, FreeBSD 6/7 and Solaris 10 to achieve its high performance. I will not go over the installation of Varnish here, but I highly recommend using the latest version available at this time, 2.0.4, as older versions have various issues.

We could try to use something simple like this in a varnish vcl:

backend s3 {
  .host = "my_bucket.s3.amazonaws.com";
  .port = "80";
}

sub vcl_recv {
  if (req.url ~ "\.(css|gif|ico|jpg|jpeg|js|png|swf|txt)$") {
    set req.backend = s3;
    lookup;
  }
}

but unfortunately this will not work. The Amazon S3 servers look at the hostname passed in the request’s Host header, and this will most likely be different from the hostname of the Amazon bucket (it will be something like static.mydomain.com), so S3 will return 403 on any such request.

There are several solutions to make this work correctly. The first one I will present inserts the bucket name into the actual URL passed to the S3 backend. This looks like:

backend s3 {
   .host = "s3.amazonaws.com";
   .port = "80";
}

sub vcl_recv {
   if (req.url ~ "\.(css|gif|ico|jpg|jpeg|js|png|swf|txt)$") {
     set req.url = regsub(req.url, "^", "/my_bucket");
     set req.http.host = "localhost";
     set req.backend = s3;
     lookup;
   }
}

This works fine, inserting the bucket name into the actual URL passed to the backend. Still, I don’t like this solution very much, as it breaks the consistency between the URLs (the direct one and the forwarded one), so here is a much better solution:

backend s3 {
   .host = "s3.amazonaws.com";
   .port = "80";
}

sub vcl_recv {
   if (req.url ~ "\.(css|gif|ico|jpg|jpeg|js|png|swf|txt)$") {
     set req.http.host = "my_bucket.s3.amazonaws.com";
     set req.backend = s3;
     lookup;
   }
}

As we can see, we are setting the HTTP Host header to the one the Amazon S3 servers expect for our bucket. This way we keep the same URL and don’t mess with the actual link we are passing.
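If the same Varnish instance also fronts other sites, it may be safer to rewrite the Host header only for requests that arrive on the static hostname, rather than for every URL that looks like an asset. A minimal sketch, assuming the public hostname is static.mydomain.com (the example name used earlier) and the bucket is called my_bucket:

```vcl
backend s3 {
  .host = "s3.amazonaws.com";
  .port = "80";
}

sub vcl_recv {
  # Only handle requests addressed to the static hostname (assumed name).
  if (req.http.host == "static.mydomain.com") {
    # Present the Host header S3 expects for the bucket; the URL stays unchanged.
    set req.http.host = "my_bucket.s3.amazonaws.com";
    set req.backend = s3;
    lookup;
  }
}
```

Requests for any other hostname fall through to the rest of vcl_recv untouched.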

A complete varnish vcl configuration to use with the Amazon S3 backend might look like this:

backend s3 {
  .host = "s3.amazonaws.com";
  .port = "80";
}

sub vcl_recv {
  if (req.url ~ "\.(css|gif|ico|jpg|jpeg|js|png|swf|txt)$") {
      # Strip client headers we do not want forwarded to S3 for static assets.
      unset req.http.cookie;
      unset req.http.cache-control;
      unset req.http.pragma;
      unset req.http.expires;
      unset req.http.etag;
      unset req.http.X-Forwarded-For;

      # Send the request to S3 with the Host header our bucket expects.
      set req.backend = s3;
      set req.http.host = "my_bucket.s3.amazonaws.com";

      lookup;
  }
}

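sub vcl_fetch {
  # Remove S3-specific response headers that clients do not need.
  unset obj.http.X-Amz-Id-2;
  unset obj.http.X-Amz-Meta-Group;
  unset obj.http.X-Amz-Meta-Owner;
  unset obj.http.X-Amz-Meta-Permissions;
  unset obj.http.X-Amz-Request-Id;

  # Cache each object for one week, with a 30 second grace period after expiry.
  set obj.ttl = 1w;
  set obj.grace = 30s;
}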
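
To check that assets are really being served from the cache and not fetched from S3 every time, one option is to tag each response with a hit/miss marker. This is only a sketch: the X-Cache header name is my own choice, and it assumes your Varnish version exposes obj.hits in vcl_deliver.

```vcl
sub vcl_deliver {
  # obj.hits counts how many times this cached object has been delivered;
  # 0 means the object was just fetched from the backend.
  if (obj.hits > 0) {
    set resp.http.X-Cache = "HIT";
  } else {
    set resp.http.X-Cache = "MISS";
  }
}
```

Requesting the same asset twice (for example with curl -I) should then show MISS on the first response and HIT on the second.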

If you found this post interesting, stay tuned for future posts on Varnish and how to use it in more complex setups ;) .
