At Bugsnag, we recently launched the Releases dashboard for tracking the health of releases. It was a large undertaking, but as we built out the backend to support it we paid particular attention to performance. One of the key areas we focused on was the latency associated with our backend service calls and, in the end, we chose to switch out REST for Google’s blisteringly fast gRPC framework.
In order to successfully migrate to gRPC, we first needed to rethink our load balancing strategy to ensure that it properly supported gRPC traffic. This blog post outlines how we ultimately arrived at our decision to add Lyft’s feature rich Envoy proxy into our stack and how it fits into Bugsnag’s architecture.
Behind the scenes, Bugsnag has a pipeline of microservices responsible for processing the errors we receive from our customers that are later displayed on the dashboard. This pipeline currently handles hundreds of millions of events per day. To support the new Releases dashboard, we needed to expand the pipeline to begin receiving user sessions - something that represented a massive increase in traffic. Performance would be key for this project, and is one of the main reasons we adopted the gRPC framework.
In terms of deployment, Bugsnag’s microservices are deployed as Docker containers in the cloud using Google’s excellent Kubernetes orchestration tool. Kubernetes has built in load balancing via its kube-proxy which works perfectly with HTTP/1.1 traffic, but things get interesting when you throw HTTP/2 into the mix.
gRPC uses the performance boosted HTTP/2 protocol. One of the many ways HTTP/2 achieves lower latency than its predecessor is by leveraging a single long-lived TCP connection and to multiplex request/responses across it. This causes a problem for layer 4 (L4) load balancers as they operate at too low a level to be able to make routing decisions based on the type of traffic received. As such, an L4 load balancer, attempting to load balance HTTP/2 traffic, will open a single TCP connection and route all successive traffic to that same long-lived connection, in effect cancelling out the load balancing.
Kubernetes’ kube-proxy is essentially an L4 load balancer so we couldn’t rely on it to load balance the gRPC calls between our microservices.
One of the options we explored was using gRPCs client load balancer which is baked into the gRPC client libraries. That way each client microservice could perform its own load balancing. However, the resulting clients were ultimately brittle and required a heavy amount of custom code to provide any form of resilience, metrification, or logging, all of which we would need to repeat several times for each of the different languages used in our pipeline.
What we really needed was a smarter load balancer.
We needed a layer 7 (L7) load balancer because they operate at the application layer and can inspect traffic in order to make routing decisions. Most importantly, they can support the HTTP/2 protocol.
There are many different options for L7 load balancers including NGINX and HAProxy, but most proved too heavyweight to easily drop into our microservice architecture. We whittled down the choice to two key contenders — Envoy and Linkerd. Both were developed with microservice architectures in mind and both had support for gRPC.
Whilst both proxies had many desirable features, our ultimate decision came down to the footprint of the proxy. For this, there was one clear winner. Envoy is tiny. Written in C++11, it has none of the enterprise weight that comes with Java based Linkerd.
Once we’d decided on Envoy, we started drilling down into its feature set, and there was a lot to like.
Envoy was written and open sourced by Lyft, and is the direct result of years of battling with complex routing issues that typically occur in microservice architectures. It was essentially designed to fit our problem and boasts:
That last one was a big one for us. It chimed with Bugsnag’s polyglot microservice architecture.
In Kubernetes, a group of one or more containers is known as a pod. Pods can be replicated to provide scaling and are wrapped in abstractions known as services which provide a stable IP address for accessing the underlying pods. Since Kubernetes 1.2 the default behaviour on hitting a service IP is that a random backend pod will be returned. However you can reconfigure your services to be headless so that the service IP will instead return the entire list of available pod IPs, allowing you to perform your own service discovery.
Envoy was designed to be run as a sidecar container where it sits alongside the client container, supplementing its functionality in a modular way. In Kubernetes, this translated to running the client container and the Envoy container within the same pod. We configured our services to be headless to provide endpoints for Envoy to use for service discovery. And thanks to the large amount of metrics output by Envoy, we were able to easily observe the round-robin load balancing of successive gRPC calls to confirm that it was working as expected.
Whilst we chose to run an Envoy sidecar for each of our gRPC clients, companies like Lyft run a sidecar Envoy for all of their microservices, forming a service mesh. This approach is incredibly powerful, allowing you to adjust traffic parameters at the domain level, and it is something we’ll look to capitalize on at Bugsnag.
Whilst Envoy fit our requirements well, there were a few alternative solutions worthy of mention. Some of these we explored, but they were either too immature or didn’t quite fit our architecture at the time:
Overall, we’ve been impressed with Envoy and will continue to explore its features as we build out and expand Bugsnag. One thing’s for sure, this space is hot in the DevOps world right now and we’re excited to see how the future landscape develops.
Bugsnag automatically monitors your applications for harmful errors and alerts you to them, giving you visibility into your software. You can think of us as mission control for software quality. Try out Bugsnag free for 14-days, including the new Releases dashboard for tracking the health of releases.
Read how Lyft uses Bugsnag to capture every error and prevent bugs from impacting their users.