Collecting performance data with varnish and statsd

Posted by Unknown on Monday, February 3, 2014

One of our currently bigger projects is soon going online and one part of it is an API, which only delivers JSON and XML to a lot of possible clients. To speed things up, we use varnish in front of the application (a lot of requests are easily cacheable, since they only change once a day at most). We also use HHVM for some of the requests, for now just the ones which will have many misses (from livesearch requests for example). We don’t dare yet to use HHVM for all requests, we’d like to gather some experience first with it.

One important task for such a project is to monitor the performance of the application. We usually use New Relic for that, but there’s no New Relic for HHVM (yet), so we had to look into other solutions.

Since varnish is the entry point for all requests, be it PHP or HHVM or another backend, we searched for possibilities there. We found 2 vmods, libvmod-timers and libvmod-statsd, which make that possible.

Collecting response time and number of requests

It needs a statsd and a backend for collecting the data. We decided to test librato as the backend as we have heard good things about it (btw, if you have to configure many metrics in librato with the same attributes all over, use the API of librato, it's more powerful than their webfrontend and you can script it for later new metrics)

This setup allows us to collect detailed data about the performance of the different backends (and also monitor for example if there are too many 500 answers)

Compared to the examples in the blog posts about the vmods, I made some small changes, to get the correct data for us.

To be sure our API is "always" up, we use the fallback director of varnish, especially for the HHVM part (we have much more experience in running a PHP server than a HHVM, that's why). It looks like the following


director hhvm_api_fallback fallback {
  { .backend = hhvm_api; }
  { .backend = php_api; } // will only be used if hhvm_api is unhealthy.
}

This means, that varnish will normally use the hhvm_api backend, but if that is down, it will use the php_api backend. They run on the same server, but listen on different ports (of course, the whole thing is loadbalanced in front of it to have less single point of failures)

As we only route some requests to the hhvm_api, we have the following in vcl_recv:


if (req.url ~ "^/things.json?.*search=" ) {
    set req.backend = hhvm_api_fallback;
}

We can then later extend that URL matcher to include more requests.

The example of the author of the vmods uses req.http.X-Request-Backend for reporting which backend was used with statsd. Since with the fallback director defined above that is always hhvm_api_fallback, no matter which backend was used and we wanted to know when the fallback uses php_api instead of hhvm_api, we added the following to vcl_fetch:


set req.http.X-Stats-Backend = beresp.backend.name;

Now the backend that is actually used is reported later (at least for misses, for hits it's still hhvm_api_fallback, but that is ok, since it didn't use any backend at all, wthough I'm sure this could be changed as well somehow with some header trickery in the vcl).

The librato backend seems also to interpret some stats differently than for example the default graphite backend in statsd. Therefore we used the the following lines in vcl_deliver for sending the values to statsd:


### Key to use for statsd. Something like: config.hit.200
set req.http.X-Stats-Key =
    req.http.X-Stats-Backend + "." +
    req.http.X-Stats-HitMiss + "." +
    req.http.X-Stats-Status;

### Increment the amount of requests here, and how long this 
### particular backend took.
statsd.incr(   req.http.X-Stats-Key + ".count");
statsd.timing( req.http.X-Stats-Key + ".time", timers.req_response_time() );

We basically added .count and .time to the key as librato gets confuesd if you use the same key for counters and timers.

With this data we can now plot graphs for each backend and each request type in librato. We immediatly see, if there are problems (for example too many non-200 responses or if one backend is slower than expected). We also can compare hits vs misses and collect the load of the servers with collectd.

Loggin

Truncated by Planet PHP, read more at the original (another 2850 bytes)

Collecting performance data with varnish and statsd

Collecting response time and number of requests

Loggin

{ 0 comments... » Collecting performance data with varnish and statsd read them below or add one }

Post a Comment

Popular Posts

About

Tiaro - z nami nigdy nie stracisz

hottrend

My Blog List

Popular Posts

Search