Microservices and Chaos Testing

In my previous blog (Microservices and Testing - Lessons Learned), I talked about variation in the testing pyramid and how new flavors of Component Tests (with the same good characteristics as Unit Tests) have gained momentum and become the primary way to test the entire microservice stack in isolation. In this blog I will talk about how we use the same Component-Test setup to run local Chaos tests!

Microservice architecture is very susceptible to intermittent infrastructure/networking failures, which makes Chaos-Engineering a very important aspect of a mature stack. Chaos-Engineering sounds a bit chaotic, but it is surprisingly easy to do in a microservice architecture, especially the local chaos scenarios. To run a full-blown Chaos Test, you will need to set aside dedicated time, bring members from different teams together, and run the experiments (Game days!). Local chaos scenarios, on the other hand, are the ones that you can run locally against your own service stack in isolation. The good thing about a microservice stack is that for a given service, we only care about two things: 1) immediate consumers and 2) immediate dependencies. Which, in this case, also means that "local" chaos scenarios should only test immediate dependency failures (not the entire dependency graph or cascading failures!). These tests are not about testing the entire workflow or user experience, but only about how a service handles its immediate dependency failures.

Quick recap from the previous blog - the Component-Test suite typically has the setup shown below. In the center, we have the service under test, with its own data container as part of the same stack. It talks to a mock container for all its dependency needs. The mock container serves a snapshot that was taken at some point in time in the past as the mock. And finally, there is a test-client container that makes requests to the service with different parameters and asserts on the result. (Read the blog here if this sounds alien)


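To make that concrete, here is a minimal sketch of what such a mock container could look like - assuming a small Flask app and a JSON snapshot file keyed by method and path (the file name, format, and port are illustrative assumptions, not our actual implementation):

```python
# mock_server.py - minimal snapshot-serving mock (illustrative sketch)
import json

from flask import Flask, jsonify, request

app = Flask(__name__)

# Dependency responses recorded at some point in the past,
# keyed by "METHOD /path" (hypothetical file name and format).
with open("snapshot.json") as f:
    SNAPSHOT = json.load(f)


@app.route("/<path:path>", methods=["GET", "POST"])
def serve_from_snapshot(path):
    key = f"{request.method} /{path}"
    if key not in SNAPSHOT:
        return jsonify({"error": f"no snapshot entry for {key}"}), 404
    return jsonify(SNAPSHOT[key])


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

The service under test only needs the mock container's host and port as its dependency endpoint; nothing in the service changes between talking to a real dependency and talking to the mock.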
We use the same setup to run our local chaos tests as well! Let's take an example - imagine I want to write a microservice that calculates profit/loss for stocks based on either a real-time pricing feed or an SOD (start of day) feed. This choice is stored as a user preference, so my service has a dependency on the /userpreferences GET API. In terms of workflow, this service should receive a request with stocks as input. It should then call the /userpreferences API to get the user preferences. Based on the preferences, it should use the correct pricing feed to calculate the profit/loss.
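As a rough sketch of that happy path (the endpoint names other than /userpreferences, the base URL, and the response fields are assumptions for illustration):

```python
# pnl_service.py - illustrative sketch of the profit/loss workflow
import requests

DEPENDENCY_BASE = "http://localhost:8080"  # points at the mock container in tests


def get_pricing_feed(user_id: str) -> str:
    """Call /userpreferences to find out which pricing feed this user wants."""
    resp = requests.get(f"{DEPENDENCY_BASE}/userpreferences",
                        params={"userId": user_id}, timeout=2)
    resp.raise_for_status()
    return resp.json()["pricingFeed"]  # e.g. "REALTIME" or "SOD"


def fetch_prices(feed: str, symbols: list[str]) -> dict:
    """Fetch prices from the chosen feed (endpoint name is made up)."""
    resp = requests.get(f"{DEPENDENCY_BASE}/prices/{feed.lower()}",
                        params={"symbols": ",".join(symbols)}, timeout=2)
    resp.raise_for_status()
    return resp.json()


def calculate_pnl(user_id: str, stocks: list[dict]) -> dict:
    """Happy path: preferences decide the feed, the feed prices the stocks."""
    feed = get_pricing_feed(user_id)
    prices = fetch_prices(feed, [s["symbol"] for s in stocks])
    return {s["symbol"]: (prices[s["symbol"]] - s["costBasis"]) * s["quantity"]
            for s in stocks}
```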

Now I want to add chaos tests, but before I can add them, I need to go through the Chaos-Engineering process, which is a very simple three-step process:
  1. Set up a meeting with the entire team (including PMs).
  2. In the meeting, discuss the workflow and create a mind-map of the dependencies that are called during execution (the happy path!). For each dependency, discuss the possible failures and try to come up with a default fallback mechanism. The questions you will need to answer are:
    • Is this information/dependency critical - meaning, if we fail to get this information, is it critical enough that we should stop processing the request? Can we survive without this information?
    • If not - is there any fallback value that can be used?
  3. Decide how the team will monitor and be alerted when this scenario is invoked.
Here is an example of mind-maps for one of the services we own. As a team, we meet twice a month, take one service at a time, discuss its workflows and dependencies, and try to answer the above questions. PMs also participate in these meetings; they get to see things from a developer's point of view, and together we figure out acceptable fallback behaviors in case of failures.


For the example use case above, let's say the team decides that user preferences are critical information, but there is a possible default behavior, which is to use the SOD pricing feed. Meaning, if we fail to get /userpreferences (or if it is not configured properly), we should use the SOD pricing feed to calculate profit/loss and let the workflow proceed instead of failing it. Awesome! It's now time to capture this behavior in terms of tests so it remains well documented. I am a big fan of BDDs, so I would write a test like this:
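In pytest-flavored form, the spirit of that BDD test is something like the sketch below - the payload, the response field, and the assumption that the service echoes back which pricing feed it used are all illustrative:

```python
# test_chaos_userpreferences.py - BDD-style chaos test (illustrative sketch)
import requests

SERVICE_URL = "http://localhost:9090/profitloss"  # service under test
STOCKS = [{"symbol": "AAPL", "quantity": 10, "costBasis": 150.0}]


def define_chaos_scenario(scenario: str) -> dict:
    """'Given' step: the scenario travels to the mock via the X-Request-Id header."""
    return {"X-Request-Id": scenario}


def test_falls_back_to_sod_feed_when_userpreferences_returns_500():
    # Given the chaos scenario "/userpreferences GET return 500"
    headers = define_chaos_scenario("/userpreferences GET return 500")

    # When I request profit/loss for my stocks
    resp = requests.post(SERVICE_URL,
                         json={"userId": "u123", "stocks": STOCKS},
                         headers=headers, timeout=5)

    # Then the request still succeeds and the SOD pricing feed is used
    assert resp.status_code == 200
    assert resp.json()["pricingFeedUsed"] == "SOD"
```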

The key here is the step that defines the Chaos Scenario as "/userpreferences GET return 500". This is a string that uniquely identifies the chaos scenario we want to invoke. Using this information, the mock container is now a little more intelligent: it first checks whether the incoming call is running under some scenario - if so, it does something other than just returning the result from the snapshot (here, it returns a 500 status code). But this information needs to be propagated all the way from the test-client container to the mock container without any hacks. To accomplish this, we hijack a key field in the HTTP request headers: X-Request-Id. When the test-client container makes the request to the service container, it populates X-Request-Id with this unique string. The service container doesn't really care about this field (except for logging) but makes sure to pass this header on to any downstream calls. The mock container then receives the request and first checks the header to figure out what to do! Simple, right?! Here is the diagram to show how it works:

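In code, continuing the illustrative Flask mock sketched earlier, the mock container's scenario check could look roughly like this:

```python
# mock_server.py, extended - check for a chaos scenario before serving the snapshot
import json

from flask import Flask, jsonify, request

app = Flask(__name__)

with open("snapshot.json") as f:
    SNAPSHOT = json.load(f)


@app.route("/<path:path>", methods=["GET", "POST"])
def serve_from_snapshot(path):
    key = f"{request.method} /{path}"
    # Is this call running under a chaos scenario? The test client put the
    # scenario string into X-Request-Id and the service passed it through.
    scenario = request.headers.get("X-Request-Id", "")
    if scenario == f"/{path} {request.method} return 500":
        return jsonify({"error": "chaos: injected failure"}), 500
    if key not in SNAPSHOT:
        return jsonify({"error": f"no snapshot entry for {key}"}), 404
    return jsonify(SNAPSHOT[key])
```

On the service side, the only change to the earlier sketch is to copy the incoming X-Request-Id header onto every outgoing downstream request.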
Chaos Engineering is fun and a great team-building exercise where the entire team (including PMs) comes together to create a robust service offering. There is no bigger reward than watching chaos scenarios get triggered live in production and seeing the service continue to work as expected!

Happy Testing!
