Testing Microservices: You’re Thinking About (Environment) Isolation All Wrong

Daniel Bryant
Ambassador Labs
Jan 14, 2022

I’ve been keenly following the Lyft article series describing how they have scaled their development practices while adopting microservices and cloud native technologies, so I was super excited when I saw Cindy Sridharan tweet about the release of the latest installment. The entire article was a great read, but one sentence in particular, focusing on providing environments for testing, jumped out at me:

“We fundamentally shifted our approach for the isolation model: instead of providing fully isolated environments, we isolated requests within a shared environment.”

I instantly had a flashback to an early blog post I had written about “Microservice Testing: Coupling and Cohesion (All the Way Down).”

Isolating environments vs isolating requests: Striving for loose coupling

In my experience working as a software development consultant a few years back, many of the problems I encountered with testing microservices were related to coupling. Because organizations had accidentally created a highly coupled system (the dreaded “distributed monolith”), a lot of the folks I worked with had attempted to create what Lyft has called the “onebox” — typically a large VM that contained an isolated instance of the entire application for testing. These solutions had all of the advantages and disadvantages discussed by the Lyft team.

As I was thinking about this approach to testing microservices I became conscious of the inherent coupling between developer and environment. What the Lyft team presents in their latest article is reducing the scope of the coupling from the developer and test environment to the developer and a test request. And once you’ve got this looser coupling it’s much easier to operate, maintain, and scale test environments.

Test environment goals: low latency, high realism, no conflicts, and wide shareability

What the Lyft team is pitching is very valuable when performing “outer development loop” component and end-to-end style tests that would typically be conducted against staging. And I would argue it’s even more important when working within the inner development loop on a microservices-based system. When you’re working here you want to be able to quickly test assumptions against external dependencies and verify your implementation integrates correctly with other components and services. You also want to be able to collaborate with your teammates on the work without requiring the burden of too much coordination or conflict resolution.

Three things are important for effective microservice development testing (particularly in the inner dev loop):

  • Time-to-feedback (testing latency) is low. This includes making code changes, building, deploying, verifying, observing, etc.
  • Quality (production realism) of the environment is high. We want to minimize the WTFs when finally deploying to prod.
  • Work in progress can be shared securely and widely across the team without causing conflicts or breaking other developers’ WIP.*

* For the third bullet point, the spectrum of solutions ranges from using a single shared environment with easy access to all but with constant breakage, to individual isolated environments where it’s tricky to coordinate deploys and share access.

Many folks that I look to for guidance in the software world, including the aforementioned Cindy Sridharan, Charity Majors, and Kelsey Hightower, advocate for testing in production. I support this as well. However, in my anecdotal experience, the number of organizations capable of doing this is relatively small (although growing). A lot of organizations like the safety blanket of the staging environment.

This brings us back to the dilemma of creating a shareable low-latency, highly realistic staging environment.

If we invest our energy and resources into creating a single staging environment with SLAs similar to production (as the Lyft team has done), can this be used and shared effectively while maintaining inner dev loop speed? In other words, can we implement the right level of isolation for both the request and the specific service we are working on?

Creating “copy-on-test” services that support request isolation

In their blog post, the Lyft team introduced the concept of “offload deployments”: a user deploys a service to the staging environment, but this service isn’t visible via service discovery. The service can access external dependencies, talk to databases, etc., but it can’t be discovered by other services.
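The discovery behavior described above can be sketched with a toy service registry. This is not Lyft’s implementation; the names (`register`, `resolve`, the `offloaded` flag) are purely illustrative, assuming a registry that tracks which deployments are hidden from normal lookups:

```python
# Toy service registry: offloaded deployments are registered but excluded
# from normal discovery, so baseline traffic never sees them.

class ServiceRegistry:
    def __init__(self):
        # service name -> list of (address, offloaded) pairs
        self._instances = {}

    def register(self, name, address, offloaded=False):
        self._instances.setdefault(name, []).append((address, offloaded))

    def resolve(self, name):
        """Return only the discoverable (non-offloaded) instances."""
        return [addr for addr, offloaded in self._instances.get(name, [])
                if not offloaded]


registry = ServiceRegistry()
registry.register("payments", "payments-main:8080")                         # baseline
registry.register("payments", "payments-offload-abc:8080", offloaded=True)  # under test

# Normal discovery only ever returns the baseline instance.
print(registry.resolve("payments"))
```

The offloaded copy is still fully deployed (it can reach databases and external dependencies), but nothing resolving `payments` will route to it by accident.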

The only way an offloaded service can be reached by an external request is via an “override header”. This header is injected manually and propagated through the application’s service call chain, and a service mesh proxy is configured to redirect (reroute) the overridden requests to the offloaded service. This enables an engineer to get into their inner dev loop and send test requests that target the isolated service, without other engineers encountering the service under test when making requests against the shared application. With a bit of care and coordination, the engineer can also share their “override header” with teammates to enable a preview of their work, or for pair programming, reviewing, etc.
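A minimal sketch of the proxy-side routing decision might look like the following. The header name `x-override-destination` and the routing table are hypothetical stand-ins, not Lyft’s actual configuration:

```python
# Toy model of the mesh proxy's routing decision: requests carrying the
# override header are rerouted to the matching offloaded deployment;
# all other traffic goes to the baseline deployment.

BASELINE = "payments-main:8080"

# Maps an override token to the offloaded deployment it targets.
OVERRIDE_ROUTES = {
    "alice-feature-123": "payments-offload-alice:8080",
}

def route(headers):
    """Pick an upstream for a request based on its headers."""
    token = headers.get("x-override-destination")
    return OVERRIDE_ROUTES.get(token, BASELINE)


# A normal request from another engineer hits the baseline service...
assert route({}) == BASELINE
# ...while a test request carrying the override header reaches the copy under test.
assert route({"x-override-destination": "alice-feature-123"}) == "payments-offload-alice:8080"
```

Because unknown or missing tokens fall through to the baseline, other engineers’ traffic is unaffected even while several offloaded copies coexist in the same environment.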

This approach is somewhat like a service-level feature flag. In the past, I’ve also looked at this method as something similar to “copy-on-write” semantics, but probably better named as “copy-on-test” in this situation. Any service that is under test within the context of the larger application (collection of services) is “copied” within the environment and routed to for testing. This could mean that the service you are actively developing is either: 1) copied and deployed in the remote cluster alongside a mechanism for syncing the changing code files or resulting service binary/artifacts between local and remote dev environments; or 2) copied and deployed locally with a two-way proxy connecting your local dev environment and the cluster — effectively “putting your local development machine in the cluster”.
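Whichever “copy-on-test” option is used, request isolation only works if every service in the chain forwards the override header on its downstream calls. A sketch of that propagation step, assuming the same hypothetical header name as above:

```python
# Sketch of override-header propagation: each service copies the override
# header from its inbound request onto every outbound call it makes, so
# the mesh can reroute any hop in the chain to an offloaded copy.

OVERRIDE_HEADER = "x-override-destination"  # hypothetical header name

def propagate(inbound_headers, outbound_headers=None):
    """Build headers for a downstream call, carrying the override along."""
    headers = dict(outbound_headers or {})
    if OVERRIDE_HEADER in inbound_headers:
        headers[OVERRIDE_HEADER] = inbound_headers[OVERRIDE_HEADER]
    return headers


inbound = {"x-override-destination": "alice-feature-123", "accept": "application/json"}
downstream = propagate(inbound, {"content-type": "application/json"})
assert downstream["x-override-destination"] == "alice-feature-123"
```

In practice this is usually handled by distributed-tracing-style context propagation in shared client libraries, so individual service teams don’t have to remember to do it by hand.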

The way forward

Creating a copy of the service you’re working on is easy enough. The key challenge, of course, is how to enable request isolation when testing while meeting our requirements of low time-to-feedback, high quality of the environment (without stepping on other devs’ toes), and the ability to securely share our work in progress widely across the team.

There are several challenges here. Some, like enabling request isolation, can be overcome with technology like the CNCF tool Telepresence. But some issues will require guardrails and lightweight coordination, typically around mutating state and accessing external third-party APIs. I’ll share my thoughts on this in the next blog post in this series. Stay tuned by following us on Medium and on Twitter!
