This post reflects my three-year journey of learning and experimenting with microservices development. Many of you software developers will relate to it, and I hope it offers tips for those who are new to the microservices battle.
Excitement and energy were at their peak in this phase: we were learning cool cutting-edge technologies like Docker, AWS, and Kubernetes. Microservices were the big buzz. We were cherry-picked to explore a new way of development, bashing monolithic services and dreaming about how to create small enough microservices.
This was the time when, if it was a CRUD, it had to be a microservice. All reads and writes had to go through the same service. If we were using Couchbase, each bucket had to be guarded by its own microservice.
With little visibility into the requirements and no architecture runway, every requirement was getting translated into a microservice.
A few folks believed in creating a microservice per API, while some of us clubbed multiple APIs, handling different businesses, into one microservice.
Even minimalistic functions were getting translated into microservices.
And in a few months, we had our very first galaxy of microservices :) So pretty, we drew it ourselves in our HE CoP meeting. Tears in the eyes!
Now the struggle began: we had services in place on our new cloud platform, and it was time to test. Every single microservice was fully functionally tested and quality assured. We had hundreds of unit test cases, an average of 80–90% code coverage, and our Sonar was happy with a green glow, but the load-testing graphs were bleeding red, not happy at all! For a few months we had long days and sleepless nights; our bonding was at its highest, and I loved and debated my fellow mates the most at this time. SLAs were loud and noisy, our biggest enemies. Questioning and finger-pointing at architecture and requirements, we started a deeper dive to identify the loose ends.
Network IOs were cruel and there were too many of them; we had many hops to serve a request. Service A was calling B, B was calling C, C was calling D, D was calling E, and so on; in some cases the call chain ran even deeper. Every network IO was adding another 5–10 ms.
Where were the timeouts? Long waits for calls to return were hogging all the shared resources; timeouts were missing.
CPU was high and memory was crashing, and even adding more CPU and memory did not help much. Services were scaling up like crazy, with too many pods needed to meet the load. Oh, I loved the memory and CPU profiling of services at this time.
Eyes turned to fallbacks and circuit breakers. I took this really hard; a fallback felt like an insult to my work. I understood it was needed, but I couldn't accept it without ensuring we had addressed the underlying issues.
We had abused Couchbase N1QL queries. For some of us they were new and exciting to use, but they disheartened us.
We also got questioned on connection pooling and keep-alive; this most fundamental need was a big miss.
We were at war: Platform, SDK, and services developers, all wondering whom to blame. In the end we were a family, and at last we could breathe; at least we had identified the areas to improve.
Getting a grip
Everyone was busy sharpening their tools to harden the Platform, SDK, and services. We were busy re-designing, re-factoring, and in some cases re-writing the services. They say 80% of writing ends up in re-writing, and it applied in our case. With better visibility into the architecture and requirements, we now knew what to develop. A few of the much-needed improvements were obvious too.
We began applying the bulkhead design pattern, slicing services (APIs) per load, business type, processing time, and availability requirements. This worked like a charm, and we could isolate failures, going from no availability to significant availability of our services.
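At the code level, a bulkhead is often just a bounded pool of slots per downstream, so one slow dependency cannot exhaust every worker in the service. A minimal Go sketch of that idea (the `Bulkhead` type and its fail-fast behavior are my illustration, not our actual SDK):

```go
package main

import (
	"errors"
	"fmt"
)

// Bulkhead caps concurrent calls into one downstream so a slow or failing
// dependency cannot exhaust the whole service's workers.
type Bulkhead struct {
	slots chan struct{}
}

func NewBulkhead(size int) *Bulkhead {
	return &Bulkhead{slots: make(chan struct{}, size)}
}

var ErrBulkheadFull = errors.New("bulkhead full: rejecting call")

// Do runs fn if a slot is free; otherwise it fails fast instead of queueing,
// keeping the failure isolated to this one dependency.
func (b *Bulkhead) Do(fn func() error) error {
	select {
	case b.slots <- struct{}{}:
		defer func() { <-b.slots }()
		return fn()
	default:
		return ErrBulkheadFull
	}
}

func main() {
	bh := NewBulkhead(2)
	err := bh.Do(func() error {
		fmt.Println("call admitted")
		return nil
	})
	fmt.Println("error:", err)
}
```

Giving each downstream its own bulkhead mirrors the slicing above: a flood of slow calls to one dependency is rejected at that bulkhead, while traffic to healthy dependencies keeps flowing.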
To respect "a service owns its data" and to overcome the slow response times of the data-guarding adapters, we replaced the adapters to allow multiple readers and dedicated writers (data aggregators). This gave us the flexibility to achieve fast reads and optimized writes. All data writing and optimization were done by the dedicated data collector and aggregator services. These were the data weavers running in the background, listening to data updates and keeping our data up to date. This way, the services serving customer requests were only reading data.
We started to reduce network IOs, avoided deep hierarchies of calls, and sized the microservices to be more than just tiny functions.
We also applied timeouts, connection keep-alive, and fallbacks for network failures.
We optimized data for lookups to avoid Couchbase N1QL queries. Some good brains in my team proposed leveraging secondary documents: pre-computed, optimized data in secondary documents so real-time customer calls could use only lookups by id.
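The shape of the idea, independent of Couchbase: a background job pre-computes a document keyed by exactly the id the customer path will ask for, so the hot path is a single key lookup rather than an ad-hoc query. A hedged Go sketch, with an in-memory map standing in for the bucket and the `orders::` key scheme being purely illustrative:

```go
package main

import "fmt"

// Store stands in for the bucket: secondary documents keyed for direct lookup.
type Store map[string]string

// precompute is the background job: it builds one optimized document per
// customer, keyed by the id the real-time path will query with.
func precompute(ordersByCustomer map[string][]string) Store {
	s := Store{}
	for customer, orders := range ordersByCustomer {
		s["orders::"+customer] = fmt.Sprint(orders)
	}
	return s
}

// lookup is the real-time customer path: one get by key, no query engine.
func (s Store) lookup(customerID string) (string, bool) {
	v, ok := s["orders::"+customerID]
	return v, ok
}

func main() {
	s := precompute(map[string][]string{"c42": {"o1", "o7"}})
	fmt.Println(s.lookup("c42"))
}
```

The trade-off is the classic one: the write path does more work (and the secondary document can lag the source), in exchange for predictable, index-free reads on the customer path.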
We leveraged caching where data staleness was not much of a concern: look up from the cache and call downstream only on a cache miss. We kept a watch on data TTLs and refreshed entries before they expired.
By now our platform and services were finally in a stable state. They were taking the load they were intended to (woohoo!), but more challenges were on the horizon. During this time, upper management was worried about the stability of the environment and started putting up quality gates and other boilerplate processes. No one liked these quality gates, but they seemed essential until we achieved the GOD stability we strived for. Yeah, I like to call it GOD because it's the hardest, if not impossible, state to achieve.
Now we are in a very explorative time. We question every aspect of the microservices world and constantly look for better approaches to reduce cost and achieve better stability. As microservices developers, it wasn't enough for us to write a service and put it in the cloud. We were the ones debugging and troubleshooting the bottlenecks of the services, so we had to learn and contribute to our SDK and platform improvements. The thin line between SDK and service developers started to disappear, and we, as a community, began attacking problems as a team.
After stabilizing the services, our operational costs were still very high, so we started exploring cost-optimization solutions. Go was quickly identified as one of the sweetest languages, one that almost inherently lowers CPU usage and memory requirements. We started a POC and produced data points to test our theory. To no one's surprise, Go shined in most of the comparisons. Kudos to Go's designers, Robert Griesemer, Rob Pike, and Ken Thompson, for giving us such a great piece of art.
Functions like AWS Lambda are also a favorite of mine to explore. On-demand functions sound catchy in theory, but they give away more control to cloud internals. I am scared I might end up writing some bug that would be costlier with functions.
Our services are still bloated with connection pooling, timeout handling, and circuit breakers, so we started marching towards better service discovery. There are already a few great solutions to remove this bloatware, such as Istio, Linkerd, and Envoy. We have yet to assess these tools, or we may come up with our own home-grown service discovery.
Data caching and pre-computation are great strategies that must be considered while designing services. We have seen tremendous performance improvements by leveraging caching: reading from caches and keeping them up to date in the background. We also used CDN solutions to hold the cache and let clients read data from the CDN cache.
Data pre-computation solutions are awesome when done right. Optimizing data for fast lookups is great and requires some good thinking. There are tools out there, such as graph databases like Neptune. Neptune's read capabilities shine, but I am still not convinced that writing with Neptune is optimal.
Fallbacks seem to be the key ingredient for achieving service availability and stability; we developed fallbacks, and even fallbacks for existing fallbacks. Triggering functions, such as AWS Lambdas, on an "as needed" basis keeps costs low. These functions are effectively a zero-cost solution when not in use, eliminating the cost of maintaining and running a service that is rarely used.
A new style of development requires new approaches and an adaptive way of thinking. Microservices development comes with a new set of challenges. Reflecting on what we have achieved, I feel stability is still far away. Microservices give us the flexibility to develop and roll out fast. Their code is generally easy to read and enhance, but they are inherently tied to the fallacies of distributed computing. Many network hops and unreliable networks lead to fallbacks and circuit breakers. Caching also appears critical for achieving the much-needed fast responses, but it comes with data staleness and inconsistency in a distributed environment.
So to conclude: those of us doing microservices development are running fast but also losing our grip! But we will keep marching, for a better future!
Thanks to Harman Patial, Robert Kronebusch and Syed Rizvi for review.