Tuesday, October 2, 2018

Why do microservices fail?

Adopting microservices is not easy, and I have had my own share of painful experiences during the initial phase of the transition. In this blog, I will go over some of the factors that can be critical in determining the outcome of such a transition.

Lack of Engineering Culture - What's an engineering culture anyway? It's a culture that practices lean principles and enables osmosis-style learning for everyone. There is recognition for great work, and people are willing to learn from each other. It's a culture where everyone is empowered and feels responsible for participating in, evolving and improving all aspects of the SDLC, irrespective of their primary role. You can have specialized teams for the different parts, but everyone's feedback is welcomed. Processes are not followed like a religion but evolve throughout the lifecycle.

Adopting microservices will be a long learning curve for everyone, and if an organization doesn't have the right engineering mindset, avoid microservices - you have even bigger battles to fight first.
Patterns to look for:
  • Not enough buy-in from other specialized teams, or it takes significant effort to get them involved.
  • A hand-off model between PM -> Dev -> QA -> Release.
  • Processes are followed like a religion, with minimal to no input from other teams.
Microservices are just another architecture - The mindset that microservices are just another architecture that only concerns the dev team is completely wrong. They have a widespread impact on the entire SDLC and all the different teams involved. They will impact how you design, build, test, deploy and even market your product. Successful adoption requires significant organizational restructuring in order to bring the different specialized teams closer together and embrace change.
Patterns to look for:
  • Process teams over product teams. 
  • Lack of decision-making power to make structural changes within the organization.

Lack of Expertise - A few months back, I was working on a home project of converting my basement's carpeted floor into hardwood. It was my first attempt and I made some horrible mistakes. I was sharing my experience with a colleague and he said something very interesting: "If you are doing a home project for the first time, the first attempt should be at your enemy's house, because you will make lots of mistakes while learning the tricks. Then do it at your friend's house - you may have learned the lessons, but you are not an expert yet. The third attempt should be on your own house, and by then you will be a pro." This is very true for microservices as well. It's an acquired skill, and I am sure teams will make lots of mistakes when doing it for the first time. Don't try to solve your hardest problem in the first attempt - start with the easy areas and move on as you gain more confidence. The lack of expertise could also be at the domain level, meaning if you are exploring a new product for market fit, don't start with microservices. Getting a hang of the domain and understanding boundaries with bounded contexts will take time.
Patterns to look for:
  • Lack of domain expertise (Explorers versus Settlers). 
  • Level of expertise in building microservices 
    • General rule of thumb - only after a team has crossed the mark of five services in its portfolio does it know something about microservices.
Developer and Testing Experience - I have seen lots of examples, in blog posts and videos, where companies tried to adopt microservices, had some good initial success, but later abandoned the effort midway. The main complaints so far have been around developer experience and the lack of good isolated Component Tests. If you are doing microservices and require a developer to deploy the entire cloud on their local machine before they can be productive, you are doing it wrong! If you cannot test your service in isolation, you will never gain enough confidence, and things will fall apart when it all comes together. Invest in developer experience early in the process. Create services that have a well-defined set of responsibilities and can be tested in isolation.
Patterns to look for:
  • Time span between getting the latest code and having a working system (ideally a couple of minutes).
  • Team's confidence level in pushing code to prod intraday.
  • Team's confidence level in the automated testing in place.
DevOps at the core of the team's work ethic - DevOps can be sliced and diced in different ways, but at its core it requires developers to be ready to support their code in production at any time of the day. Gone are the days when developers were only responsible for writing the code. Today, we need to think more like service providers and be ready to support both internal and external consumers with the same priority. Teams also need to monitor their services in real time and clearly identify the good/bad/normal operating scenarios. Or, to put it in simple terms, "You build it, you run it" - Werner Vogels.
  • Teams can monitor real-time system health and performance.
  • Real-time alerts are set up for errors.
  • Space for blame-free postmortems.

Beyond the cultural implications, there are more complex technical problems to deal with around distributed transactions and embracing eventual consistency. So why adopt microservices when it's so hard? I think of microservices as a level of maturity. In the past, when the industry moved from 2-tier (client-DB) applications to SOA, it probably went through the same sort of arguments. In my mind, this is the next level of maturity in building modern software. If adopted successfully, you are well on your way to transforming the industry at an accelerated pace!

Monday, September 17, 2018

Microservice Practices - Dark Launches and Feature Toggles

Dark launches and feature toggles are two great practices for deploying code to production faster and getting early feedback. Combined with a microservices architecture, they become very powerful tools, and once adopted, there is no turning back! You will be thrilled by the speed with which you can build and ship software in a safe manner! In this blog, I will try to explain what these terms mean.

Dark Launch
A dark launch is a practice where you can test/validate your new functionality in production without having to worry about disturbing the original workflow. This technique is very useful when writing a new version of an existing functionality, or when adding more functionality that can be triggered off an existing workflow. The steps to achieve it are very simple:
  1. Deploy the new functionality in prod. 
  2. In case it is a new version of an existing functionality and you want to test correctness:
    • Clone the request going into the v1 logic -> call the v1 logic -> clone the v1 result.
    • Now, in the background (fire and forget), call the new logic with the cloned v1 request and compare the v2 result with the v1 result. Since this happens in the background, there is no impact on the original workflow.
  3. In case you want to test new functionality invoked off an existing workflow:
    • During the execution of the workflow, assemble the information you need in order to call the new functionality.
    • In the background, make the call to the new functionality and start collecting metrics.
  4. If storage is involved, you can ignore it (if it's dumb storage) or do cleanup later (in case there is some logic involved). Dumb storage is preferred!


This practice allows you to collect some key metrics around correctness, robustness and performance of the new functionality even before the code is in use for real. Since the call to the new functionality happens in the background, it does not impact the original workflow! How awesome is that! You can ship code to production faster and get meaningful insights without breaking any of your clients!
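To make this concrete, here is a minimal sketch of the clone-and-compare flow in a TypeScript handler. v1Logic, v2Logic and the metrics client are illustrative stand-ins, not code from our stack:

// Minimal sketch - v1Logic, v2Logic and metrics are hypothetical stand-ins
// for the existing logic, the new logic, and a metrics client.
type PnlRequest = { userId: string; stocks: string[] };
type PnlResult = { total: number };

const metrics = { increment: (name: string) => console.log('metric:', name) };

async function v1Logic(req: PnlRequest): Promise<PnlResult> {
  return { total: req.stocks.length * 100 }; // placeholder for the existing logic
}

async function v2Logic(req: PnlRequest): Promise<PnlResult> {
  return { total: req.stocks.length * 100 }; // placeholder for the rewritten logic
}

export async function handleRequest(req: PnlRequest): Promise<PnlResult> {
  // Original workflow: call v1 as usual; this is what the caller gets back.
  const v1Result = await v1Logic(req);

  // Dark launch: clone the request and result so the background work cannot mutate them.
  const clonedReq: PnlRequest = JSON.parse(JSON.stringify(req));
  const clonedResult: PnlResult = JSON.parse(JSON.stringify(v1Result));

  // Fire and forget: the comparison never blocks or breaks the original workflow.
  void v2Logic(clonedReq)
    .then((v2Result) => {
      const match = JSON.stringify(v2Result) === JSON.stringify(clonedResult);
      metrics.increment(match ? 'darklaunch.match' : 'darklaunch.mismatch');
    })
    .catch(() => metrics.increment('darklaunch.error'));

  return v1Result;
}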

Feature Toggles
Feature toggling is the practice of enabling a new piece of functionality for a selected group of consumers at a time. This technique allows you to do selective rollouts and also restricts the blast radius when things do go wrong with newly deployed code. You can treat a dark launch as a feature as well and selectively enable it for a set of clients. Toggles are created and stored externally to the application, and the development team has full control over them. Feature toggles and branch-by-abstraction (which is well documented here by Martin Fowler) go hand in hand. In simple terms, before invoking a new piece of functionality, first check whether the user has the toggle turned on. If yes, call the new path; otherwise, continue to call the old path.
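In code, the check can be as small as this sketch (the toggle store and toggle name are hypothetical; any externally stored toggle would do):

// Minimal sketch of branch-by-abstraction behind a toggle.
// toggleStore is a hypothetical client for toggles stored outside the application.
const toggleStore = {
  async isEnabled(toggle: string, userId: string): Promise<boolean> {
    // In reality this would call an external toggle service or config store.
    return userId.endsWith('-beta');
  },
};

async function calculatePnlV1(stocks: string[]): Promise<number> {
  return stocks.length * 100; // old path
}

async function calculatePnlV2(stocks: string[]): Promise<number> {
  return stocks.length * 100; // new path
}

export async function calculatePnl(userId: string, stocks: string[]): Promise<number> {
  // Before invoking the new functionality, check whether this user has the toggle on.
  if (await toggleStore.isEnabled('pnl-v2', userId)) {
    return calculatePnlV2(stocks); // new path
  }
  return calculatePnlV1(stocks); // old path
}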

I work in the stock-market domain, where one mistake could potentially cost our clients millions of dollars, and I have used these techniques for all sorts of work - from small rewrites and features to the complex task of migrating storage from SQL Server to MySQL. The results have been amazing, and in the end we have delivered a robust, stable and performant product offering.




Tuesday, September 11, 2018

Microservices and Chaos Testing

In my previous blog (Microservices and Testing - Lessons Learned), I talked about the variation in the testing pyramid and how new flavors of Component Tests (with the same good characteristics as Unit Tests) have gained momentum and become the primary means of testing the entire µ-service stack in isolation. In this blog, I will talk about how we use the same Component-Test setup to run local chaos tests!

Microservice architecture is very susceptible to intermittent infrastructure/networking failures, which makes chaos engineering a very important aspect of a mature stack. Chaos engineering sounds a bit chaotic, but it is surprisingly easy to do in a microservices architecture, especially the local chaos scenarios. To run a full-blown chaos test, you will need to set aside some dedicated time, bring members from different teams together and run the experiments (game days!). On the other hand, local chaos scenarios are the ones you can run locally for your own service stack in isolation. The good thing about a microservice stack is that, for a given service, we only care about two things: 1) its immediate consumers and 2) its immediate dependencies. Which, in this case, also means that "local" chaos scenarios should only test immediate dependency failures (without worrying about the entire dependency graph or cascading failures!). These tests are not about testing the entire workflow or user experience, but only about how a service handles its immediate dependency failures.

Quick recap from the previous blog - the Component-Test suite typically has the setup shown below. In the center, we have the service under test, with its own data container as part of the same stack. It talks to a mock container for all its dependency needs. The mock container serves a snapshot, taken at some point in the past, as the mock. And finally, there is a test-client container that makes requests to the service with different parameters and asserts on the results. (Read that blog first if this sounds alien.)


We use the same setup to run our local chaos tests as well! Let's take an example - imagine I want to write a microservice that calculates profit/loss for stocks based on either a real-time pricing feed or an SOD (start-of-day) feed. This choice is stored as a user preference, so my service has a dependency on the /userpreferences GET API. In terms of workflow, the service receives a request with stocks as input. It then calls the /userpreferences API to get the user's preferences and, based on those preferences, uses the correct pricing feed to calculate the profit/loss.

Now I want to add chaos tests, but before I can add them, I need to go through the chaos-engineering process, which is a very simple three-step process:
  1. Set up a meeting with the entire team (including PMs).
  2. In the meeting, discuss the workflow and create a mind-map of the dependencies that are called during execution (the happy path!). For each dependency, discuss the possible failures and try to come up with a default fallback mechanism. The questions you will need to answer are:
    • Is this information/dependency critical - meaning, in case we fail to get this information, is it critical enough that we should stop processing the request? Can we survive without it?
    • If not, is there any fallback value that can be used?
  3. How will the team monitor and be alerted when this scenario is invoked?
Here is an example of a mind-map for one of the services we own. As a team, we meet twice a month, take one service at a time, discuss its workflows and dependencies, and try to answer the questions above. PMs also participate in these meetings; they get to see it from a developer's point of view and together help us figure out acceptable fallback behaviors in case of failures.


For the example use case above, let's say the team decides that user preferences are critical information, but there is a possible default behavior, which is to use the SOD pricing feed. Meaning, in case we fail to get /userpreferences (or if it's not configured properly), we should use the SOD pricing feed to calculate the profit/loss and let the workflow continue instead of failing it. Awesome! OK, it's now time to capture this behavior in tests so it remains well documented. I am a big fan of BDD, so I would write a test like this:
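A sketch of what such cucumber-js steps could look like - the step wording, URL and response fields here are illustrative, not the exact test from our stack (assumes Node 18+ for the global fetch and the @cucumber/cucumber package):

// Feature file (illustrative):
//   Scenario: Fall back to SOD pricing when user preferences are unavailable
//     Given chaos scenario "/userpreferences GET return 500"
//     When I request the profit and loss for "AAPL,MSFT"
//     Then the SOD pricing feed is used as the fallback

import { Given, When, Then } from '@cucumber/cucumber';
import assert from 'node:assert';

Given('chaos scenario {string}', function (this: any, scenario: string) {
  // Remember the scenario string; it is propagated via the X-Request-Id header.
  this.chaosScenario = scenario;
});

When('I request the profit and loss for {string}', async function (this: any, stocks: string) {
  this.response = await fetch('http://service:8080/profitloss?stocks=' + stocks, {
    headers: { 'X-Request-Id': this.chaosScenario ?? 'component-test' },
  });
  this.body = await this.response.json();
});

Then('the SOD pricing feed is used as the fallback', function (this: any) {
  assert.strictEqual(this.response.status, 200);
  assert.strictEqual(this.body.pricingFeed, 'SOD');
});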

The key here is the step that defines the chaos scenario as "/userpreferences GET return 500". This is a string that uniquely identifies the chaos scenario we want to invoke. Using this information, the mock container is now a little more intelligent: it first checks whether the incoming call is running under some scenario - if yes, then it does something different than just returning the result from the snapshot (here, return a 500 status code). But this information needs to be propagated all the way from the test-client container to the mock container without any hacks. To accomplish this, we hijack some key fields in the HTTP request header, like X-Request-Id. When the test-client container makes the request to the service container, it populates X-Request-Id with this unique string. The service container doesn't really care about this field (except for logging) but makes sure to pass the header on to any downstream calls. The mock container then receives the request and first checks the header to figure out what to do! Simple, right? Here is the diagram to show how it works:
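Alongside the diagram, a minimal sketch of the header check inside the mock container might look like this (assuming an Express-based mock; the snapshot path and port are illustrative):

// Hedged sketch of the mock container: serve the snapshot by default,
// but honor a chaos scenario passed through the X-Request-Id header.
import express from 'express';
import { readFileSync } from 'node:fs';

const app = express();

// Snapshot of the real /userpreferences response, checked in with the service.
const snapshot = JSON.parse(readFileSync('./snapshots/userpreferences.json', 'utf8'));

app.get('/userpreferences', (req, res) => {
  const scenario = req.header('X-Request-Id') ?? '';

  // If the test client asked for a chaos scenario, do something different
  // from simply replaying the snapshot - here, fail with a 500.
  if (scenario === '/userpreferences GET return 500') {
    res.status(500).json({ error: 'simulated dependency failure' });
    return;
  }

  // Normal component-test path: replay the snapshot.
  res.json(snapshot);
});

app.listen(9090, () => console.log('mock container listening on 9090'));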

Chaos engineering is fun and a great team-building exercise where the entire team (including PMs) comes together and tries to create a robust service offering. There is no bigger reward than watching chaos scenarios trigger live in production and seeing the service continue to work as expected!

Happy Testing!

Monday, August 27, 2018

Microservices and Testing - Lessons Learned

In my previous blog (Microservices and Testing pyramid), I talked about the variation in the testing pyramid, which is more visible in a microservice stack. In this blog, I want to provide some more context on exactly what is contributing to this variation and share some of the lessons learned in evolving our testing strategy.

A traditional testing pyramid has Unit Tests at the bottom, followed by Component Tests and finally Integration and UI Tests. There are pictures with further breakdowns available all over the internet, but this is the crux of it - heavy on Unit Tests, then Component Tests, and finally Integration and UI Tests.

Since Unit Tests carry the heaviest weight in the testing pyramid, any significant variation to the pyramid has to touch the Unit Tests, and anything that touches Unit Tests has to have the same good characteristics that we love about them:
  1. They are Fast and Stable (here, stable means they break only if there is a real failure, not because of some flakiness in the environment setup).
  2. Easy to write and maintain (especially if you do it during development).
  3. Trigger on every check-in.
  4. Easy Setup (every language has natural support for Unit Tests).
  5. Nice code-coverage reports.
To better understand this in the context of microservices, let's take the most common use case: a service that receives an incoming request. In order to process this request, it requires some data from its dependent APIs. It makes the calls to get the data, uses it to further process the request, finally stores the result in its own storage and serves the result back to its caller.
What are the challenges? When we write any test, we are essentially doing AAA - Arrange, Act and Assert. Arrange is essentially the mocking: we need to make sure all the dependencies are satisfied before we can call the piece of code that we want to test. Act is calling that code with the parameters we want to test the functionality with, and Assert is the validation we do on the result. Arguably, Arrange is the most difficult of all. Mocking is difficult!
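As a tiny illustration of AAA, with the Arrange step doing the mocking (the price-feed example here is made up):

import assert from 'node:assert';

// Code under test: needs a dependency that can fetch a price.
type PriceFeed = { getPrice(symbol: string): Promise<number> };

async function portfolioValue(symbols: string[], feed: PriceFeed): Promise<number> {
  let total = 0;
  for (const symbol of symbols) total += await feed.getPrice(symbol);
  return total;
}

async function test() {
  // Arrange: mock the dependency so the test needs nothing external.
  const fakeFeed: PriceFeed = { getPrice: async () => 10 };

  // Act: call the code under test with the parameters we care about.
  const value = await portfolioValue(['AAPL', 'MSFT'], fakeFeed);

  // Assert: validate the result.
  assert.strictEqual(value, 20);
}

test().then(() => console.log('ok'));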

Now let's look at some new techniques for Arrange/mocking that are becoming more popular these days. For any microservice, there are only two things we care about: 1) its immediate consumers (the incoming requests) and 2) its immediate dependencies. We don't have to worry about the entire dependency graph, only the immediate dependencies! To accomplish this, teams are getting into the practice of creating consistent snapshots. There are many ways to do it; tools like a crawler (which can crawl the immediate dependencies and download data in JSON format - more suitable for GET-style operations) or a recorder (which can record an API call and replay it when needed - more suitable for POST/PUT) are becoming more popular. These snapshots are then checked in to source control with the service. So, when you download the service code, you are not only getting the code but also a snapshot of all its dependencies taken at some point in time. These snapshots are then used for mocking. The key here is being able to take consistent snapshots. Let's say a service has a dependency on an API and consumes two fields from it. If tomorrow the API exposes one more field that the service needs to consume, the service will take a new snapshot, and this new snapshot ideally should not override the values of the two earlier fields. Otherwise, assertions would start to fail!
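A rough sketch of the crawler idea - hit an immediate dependency once and save the response as a snapshot to be checked in with the service (the URL and file layout are illustrative, and it assumes Node 18+ for the global fetch):

// Hedged sketch: take a snapshot of an immediate dependency (GET-style) and
// save it next to the service so it can be checked in and served as a mock later.
import { mkdirSync, writeFileSync } from 'node:fs';

async function snapshot(name: string, url: string): Promise<void> {
  const response = await fetch(url);
  const body = await response.json();
  mkdirSync('./snapshots', { recursive: true });
  // Pretty-print so diffs stay reviewable when the snapshot is refreshed.
  writeFileSync(`./snapshots/${name}.json`, JSON.stringify(body, null, 2));
}

// Only the immediate dependencies of this service, nothing else.
snapshot('userpreferences', 'http://shared-env/userpreferences?userId=test-user')
  .then(() => console.log('snapshot updated'));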

In more abstract terms, what we are doing is essentially creating a well-known dataset in the form of snapshots of the immediate dependencies. Your team should have full control over it, and no one else should touch it. These datasets are usually small, because we only care about the immediate dependencies and not the entire environment! Once we have these snapshots/datasets, it becomes very easy to create a Component-Test suite, which looks something like this:
The service under test is the middle container and comes with its own data container as part of the same stack. The only difference between the service running in production versus in test mode is that in test mode it is talking to a mock container instead of a real environment. The mock container serves the snapshots we downloaded in the earlier step as mocks. And finally, there is a test-client container, which is responsible for calling the service API with different sets of parameters and further validating the response produced by the service.

Here is a screenshot of the test-run result of an upcoming microservice stack I am currently working on. I am a big fan of BDD, and the test-client container is written with the cucumber-js framework. From the screenshot, it ran 195 test cases (with 962 steps) in about 19 seconds. It spits out this nice code-coverage report when all the tests pass! Yes, it's possible to get code coverage from a running service - it just needs a bit of different configuration/hacking depending on the language.

So now let’s compare this Component-Test setup with the Unit Tests:

  1. They are Fast and Stable - Well, not as fast as unit tests, but since it's a microservice already, the number of test cases in its own pipeline is much smaller compared to a monolith. Even with a large number of test cases, let's say 5,000, they can still run in under a few minutes. So comparatively fast enough! Stable, because there are no external dependencies! These are all local containers talking to each other.
  2. Easy to write and maintain - When we first started adopting this idea of consistent snapshots, I wasn't sure how long we could run with it. But it has been a few years now, and in practice it has turned out to be easy to adopt and maintain. It makes mocking very fun!
  3. Trigger on every check-in - All our Jenkins pipelines are set up in such a way that they kick off the Component-Test suite on every check-in and fail the build if any test case fails.
  4. Easy Setup - Anyone who knows containers knows how easy it is to create a setup like the one above and make these containers talk to each other.
  5. Nice code-coverage reports - As shown in the screenshot, it is possible to get code coverage reports from a running service. 
So finally, the variation in the testing pyramid looks something like this: 
In a way, it's really this powerful set of tools and techniques coming together to create a Component-Test suite that has much the same good characteristics as Unit Tests. In terms of adoption, it has been practically exponential within our organization.

Lessons Learned

Variation in the testing pyramid
As explained above, we are seeing more and more stacks come up that are very heavy on Component Tests as the primary means of testing, followed by some Unit Tests, and then the rest of the pyramid follows.

Component Test Suite should be easy to spin up and tear down locally
Two things can boost a developer's productivity to the next level: 1) a very simple local dev setup (check out the Microservices and Developer Experience blog) and 2) a very simple way to test your service in isolation without worrying about the entire environment. Snapshot-style mocks help in creating fast and reliable tests. They also help with cross-team development: if I need to create a small PR for another team's service, both the reviewer and I feel much more confident that it won't break anything else when all their existing use cases pass in a fast and reliable way.

Reliable and Fast Component Tests pay off well in terms of Ease of Refactoring and Portability
In a microservice architecture, when evolving a new version of an API, it's a very common thing to refactor an entire service's code or rewrite it from scratch in a different language. I have personally done it so many times! In such fast-paced development, these Component Tests become very important because they are portable! You can completely rewrite a service in a different language and still reuse the existing tests by simply swapping out the old service container for the new one! In traditional stacks that are heavy on Unit Tests only, this level of refactoring becomes very difficult, since Unit Tests are non-portable in nature.

Consumer-driven tests are becoming more important than before
Every time we build a new version of the API, it is very important to make sure the existing consumers continue to work as expected. For this, we allow consumers to submit their actual use cases into our pipelines as Component Tests. This helps in two ways: 1) we know which parts of our API are actually used, which helps with refactoring (if you have a GraphQL API, it becomes even simpler to track), and 2) on every check-in, we can make sure we never break their original use cases.

In the next blog, I will discuss how we have evolved our Component Tests to reliably test chaos scenarios.
Happy Coding!

Friday, August 17, 2018

Microservices and Developer Experience - Lessons Learned


In this blog, I want to discuss how the adoption of microservices as our mainstream development practice has impacted our developer experience, and share some of the observations and lessons learned. We have come a long way in evolving our tech stack from a C#/.NET-heavy monolithic enterprise shop to lean microservices. Today we have hundreds of microservices written in different stacks, with the most popular languages being Golang, Python, Node.js and C# (.NET Core), and everything gets deployed as a container!

Looking back at the enterprise/monolithic days, let's say I wanted to work on a story; my developer experience would look something like this:


I would come to the office in the morning and drink my coffee (the most important part of my day!). Download the monolithic code + build it + restore the DB + register services and all... basically trying to get to a point where I have a working system. With a monolith, you could spend anywhere between a few hours and an entire day from the point where you download the latest code to the point where you have a working system. Personally, I have spent hours and hours of my day on this messy and time-consuming process. I would take it even further: I would not even download the latest code for as long as I didn't have to, and that could be for weeks. I would stay on the old version of the code and work from there, because I didn't want to kick off this time-consuming process again.

Since the adoption of microservices, things are different! Today, when I want to work on a story, my developer experience looks something like this:


I come to the office -> drink my coffee -> download the stack I want to work on -> run docker-compose up -> work on the story! And at this point, I am still drinking my morning coffee!
That's it! I have a working system in just a few minutes. This is hugely different and super awesome! Earlier, I would not even download the latest code for as long as I didn't have to, and if I did, it would take me hours to bring up a working system. But today the first thing I do is get the latest code, and getting a working system is as simple as docker-compose up (a few teams prefer bash, so they provide a script like service-up.sh or similar). Here, the definition of a working system is that the stack I want to work on is deployed locally on my laptop and connected to a shared environment for all its dependencies. You also need to figure out routing from the app: if you are logged in yourself, then all requests are routed in such a way that they are sent to the local instance of the service (for the APIs it serves) and use the shared environment for the rest of the dependencies.
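A hedged sketch of that routing idea - a lookup that sends a logged-in developer's requests for the service they are working on to their local instance and everything else to the shared environment (the override map and URLs are made up):

// Hedged sketch of per-developer routing: requests for the service I am working on
// go to my local instance; all other dependencies still use the shared environment.
const sharedEnv = 'https://shared-env.internal';

// Hypothetical override map, e.g. populated when a developer logs in to the app.
const devOverrides: Record<string, Record<string, string>> = {
  alice: { 'profitloss-service': 'http://localhost:8080' },
};

export function resolveServiceUrl(userId: string, serviceName: string): string {
  const override = devOverrides[userId]?.[serviceName];
  return override ?? `${sharedEnv}/${serviceName}`;
}

// resolveServiceUrl('alice', 'profitloss-service') -> 'http://localhost:8080'
// resolveServiceUrl('alice', 'userpreferences')    -> 'https://shared-env.internal/userpreferences'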

Lessons Learned:

Getting to a simple local setup is difficult but consider it MISSION-CRITICAL!

Your local setup has to be as simple as docker-compose up or service-up; otherwise it's too much productivity lost. Consider it mission critical! What about debugging? When the local setup is also containers, it's hard to debug and to attach an IDE to the process. One obvious answer is to not run the local service inside a container and continue to use your standard way of debugging. Another, better option is to learn to debug from logs. It helps you in two ways:
    • If teams are paying attention to logs at development time, it means the logs are clean and meaningful.
    • During a time of crisis, you don't have to go back and open the service code to see what it is that your service actually logs. You probably have a very good idea already.
No big up-front designs.

This is an interesting one. More and more teams are moving away from big upfront design and instead taking an evolutionary approach. They quickly spin up a service that solves their current need and evolve it as the need arises. What's the problem with big upfront design anyway? The way I think about it is that by investing too much in big upfront design, you are emotionally attaching yourself to taking care of this service for years and years to come, because you are spending so much time on future-proof designs. Evolutionary design is a much better approach. We have teams that are on the 5th version of their API in a short amount of time, and each new version is completely refactored or rewritten in a different stack compared to the previous one. And this is OK, because they are able to move faster! By the way, this practice of massively refactoring or rewriting a service is a very common thing in the microservice world. I have personally done it so many times! Good luck getting permission for this in an enterprise, but it is a very common practice in microservice architecture.

Onboarding a new developer is much easier on a µ-service stack.

This is sort of a no-brainer. If you have a microservice with a well-bounded context, then it's easy for a new developer to wrap their head around what it is that the service actually does.

Massive increase in cross-team PRs!

In a way, this is a variation of the earlier one. We are noticing a massive increase in cross-team PRs, especially for small things. Let's say I have a dependency on an API and I consume two fields from it. Now, if I need one more field, I will get that team's stack locally, see how they are doing it, and if it is simple enough, I will just create a PR for them. Earlier this could have been a small project in itself, but today it's just a PR away!

The single most important thing that can boost a developer's productivity is a simple developer experience for everything one has to deal with on a daily basis. The adoption of microservices has helped us take this experience to the next level, where teams are more productive, able to make quick local decisions and move faster. As managers/leaders, investing in developer experience is probably the best thing we can do for our teams!

Happy Coding!