I went to some interesting presentations about microservices at #qconlondon this week, including one by @tammersaleh of Pivotal, who gave a great talk on Microservice Antipatterns, though the topic was less “design” anti-patterns and more “things that can go wrong, and how to avoid them”. I thought I’d share some of what I heard along with my own thoughts and comments.
Start with a monolith, and extract services at the level of granularity you need, when you need it. I’m not sure I entirely agree with this, but it was a recurring comment I heard at QCon, and I guess it’s an antidote to the perennial behavioural trait of developers to “over-model” a domain into a theoretically “pure” but practically over-complex implementation (think endless object hierarchies!). I think the point is not to assume “more micro is better”, and to recognise that there’s a significant operational cost to running and deploying services, so the incremental cost of adding a service had better be worth the benefit of splitting out a more fine-grained service.
Microservices themselves define a schema (i.e. their API, their payload schemas, etc.), even if you’ve abstracted away database or persistence schemas behind them in the classic one-database-per-service (aka Gatekeeper) model. This is kinda stating the obvious as far as I’m concerned, and the natural consequence is that you are then forced into coordinated deployments whenever you want to make a breaking change, which destroys half of the power and flexibility of a microservice-oriented architecture. There are a couple of good patterns to consider to help mitigate this. Firstly, use semantic versioning (see http://semver.org/) to make it very clear when you expect a change to break backwards compatibility. Semantic versioning uses a version number of the form MAJOR.MINOR.PATCH, incrementing: the MAJOR version when you make incompatible API changes, the MINOR version when you add functionality in a backwards-compatible manner, and the PATCH version when you make backwards-compatible bug fixes.
@tammersaleh summed this up beautifully and far more succinctly with the alternative definition of a semantic version number: mybad.shiny.oops !!!
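To make the contract concrete, here’s a minimal sketch (my own, in Python – not from the talk) of the compatibility rule that semantic versioning gives consumers of a service:

```python
# Illustration only: a client can safely talk to any service version with the
# same MAJOR number and a MINOR.PATCH at least as new as the one it was built
# against. Version strings here are made-up examples.
def is_compatible(client_version: str, service_version: str) -> bool:
    c_major, c_minor, c_patch = (int(x) for x in client_version.split("."))
    s_major, s_minor, s_patch = (int(x) for x in service_version.split("."))
    if c_major != s_major:                            # MAJOR bump: incompatible API change
        return False
    return (s_minor, s_patch) >= (c_minor, c_patch)   # MINOR/PATCH: backwards-compatible

assert is_compatible("1.4.2", "1.5.0")       # new functionality, still compatible
assert not is_compatible("1.4.2", "2.0.0")   # breaking change, coordination required
```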
Secondly, there’s no need to decommission an old version of a service, so instead run multiple versions and give your client applications and services plenty of time to migrate to the latest version. You can monitor usage to help drive migrations and of course to know when it’s ok to kill the old service without impact.
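As an illustration of what that might look like, here’s a hypothetical sketch (route paths and handler names are made up) of serving two API versions side by side while counting usage per version, so you know who still needs to migrate:

```python
# Two versions of the same endpoint behind one router, with per-version
# usage counts to drive (and confirm) migration before retiring v1.
from collections import Counter

usage = Counter()

def handle_orders_v1(request: dict) -> dict:
    return {"orders": [], "api_version": 1}              # old contract, still served

def handle_orders_v2(request: dict) -> dict:
    return {"data": {"orders": []}, "api_version": 2}    # current contract

ROUTES = {"/v1/orders": handle_orders_v1, "/v2/orders": handle_orders_v2}

def route(path: str, request: dict) -> dict:
    usage[path] += 1                                      # who is calling which version?
    return ROUTES[path](request)

route("/v1/orders", {})
print(dict(usage))   # e.g. {'/v1/orders': 1} – non-zero means it's not safe to kill v1 yet
```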
If you’re designing for a system with any sense of variable load, you can reduce your resource requirements (servers, etc.) by putting queues and worker processes between microservices and their dependencies (e.g. databases) to spread the load over time (“amortize the load to smooth the traffic”). This of course introduces complexity into the equation, if only because there are more components in the overall architecture, but it also forces clients to be asynchronous in their expectations. I think this is a good thing – it’s so difficult to guarantee good performance with synchronous behaviour that it’s surely better to assume asynchronous behaviour and achieve a more graceful degradation if you’re unable to meet your hoped-for performance.
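Here’s a minimal sketch of the queue-and-worker idea using only the Python standard library (the sleep stands in for a slow database write – it’s illustrative, not a real persistence layer):

```python
# A queue and a worker sit between the request path and a slow dependency,
# so bursts are absorbed by the queue and the dependency sees a steady rate.
import queue
import threading
import time

work_queue: queue.Queue = queue.Queue()

def accept_write(record: dict) -> None:
    """Called in the request path: enqueue the work and return immediately."""
    work_queue.put(record)

def worker() -> None:
    """Drains the queue at a pace the downstream database can sustain."""
    while True:
        record = work_queue.get()
        time.sleep(0.05)              # stand-in for the real (slow) database write
        work_queue.task_done()

threading.Thread(target=worker, daemon=True).start()

for i in range(100):                  # a burst of incoming requests...
    accept_write({"id": i})
work_queue.join()                     # ...drained steadily by the worker
```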
This is another one – hard-coding the locations of the services you depend on – that I struggle to believe people genuinely do, but I guess it must be more common than I thought. Anyway, there are a couple of alternatives: either use a discovery service (which requires every service to be aware of the discovery service itself, and to be coded to use it), or make use of a centralised router, such as DNS or a DNS-aware load balancer.
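For the DNS-style option, the essence is just resolving a stable name at call time rather than baking in an address – something like this sketch (the hostname is a made-up internal name, not a real convention):

```python
# Resolve a service by a stable name at call time; the name stays fixed while
# the addresses behind it change as instances are replaced, scaled or moved.
import socket

def service_address(name: str) -> str:
    return socket.gethostbyname(name)

# orders_ip = service_address("orders.internal.example")   # hypothetical internal name
```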
The assertion (I think correct) is that when one microservice goes down, others that depend upon it in turn become unavailable or unresponsive, ultimately forcing a collapse of the entire ecosystem of services. @tammersaleh suggests that this can be addressed by use of a circuit-breaker, which can be tripped when one service is found to be unresponsive by another. He proposes that this could work in conjunction with a discovery service to prevent new calls being made to the failing service, potentially in quite sophisticated ways with back-off management before re-tests of the service, etc. While I can see the issue and the concern, it strikes me that the relative complexity of having to operate a discovery service (see above), and the need for that in turn to be even more complex to manage circuit-breaker functionality, make this only worthwhile when you reach a certain level of complexity and scale in terms of the number of services. The natural extension, I think, would be a discovery service that will start a service that it expects to be running and which it can’t find, or which has been marked as unavailable; very quickly this model evolves into full lifecycle management of services. Of course @tammersaleh is promoting @cloudfoundry, so if this sounds like too much effort… well, why reinvent the wheel? Just download his tool!
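For what it’s worth, the core of a circuit-breaker is small; here’s a deliberately simple sketch (my own – not Cloud Foundry’s, nor any particular library’s) of the trip-then-retest behaviour:

```python
# After enough consecutive failures, stop calling the downstream service for a
# cool-off period, then let one retest call through ("half-open").
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None          # timestamp when the breaker tripped

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: not calling downstream service")
            self.opened_at = None      # half-open: allow a single retest
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
        self.failures = 0              # a success resets the failure count
        return result
```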
With loads of services involved in the response to a single client request, understanding issues and tracking behaviour through the logs quickly becomes a nightmare, even if you use log aggregation to gather them all into one place. There’s a simple solution to this: add a unique correlation-id to the incoming request, and ensure that it is used by all subordinate services in their logs, making it simple to group actions by correlation-id. As a general principle, if a service receives a request without a correlation-id it should generate one and use it (in its own logs and in all subordinate calls), and if it receives a request already populated with a correlation-id it should use the one provided. This model allows services to act both as initial service providers and as subordinate service providers.
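The rule is easy to express in code; here’s a sketch of it (the header name is a common choice rather than a standard, and the downstream call is hypothetical):

```python
# Reuse an incoming correlation-id if present, otherwise generate one, and
# propagate it to our logs and to every subordinate call.
import uuid

CORRELATION_HEADER = "X-Correlation-Id"   # common, but not universal, header name

def correlation_id(incoming_headers: dict) -> str:
    return incoming_headers.get(CORRELATION_HEADER) or str(uuid.uuid4())

def handle_request(incoming_headers: dict) -> None:
    cid = correlation_id(incoming_headers)
    print(f"correlation_id={cid} handling request")    # same id appears in our logs...
    downstream_headers = {CORRELATION_HEADER: cid}     # ...and in all subordinate calls
    # call_subordinate_service(downstream_headers)     # hypothetical downstream call

handle_request({})                                     # initial provider: id generated
handle_request({CORRELATION_HEADER: "abc-123"})        # subordinate role: id reused
```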
Here’s an interesting one: each consuming team of a service has to write their own mocks and stubs, which is wasteful. Of course, what often happens is that the authoring team of a service writes their own mock and shares it. @tammersaleh suggests that this increases the service’s “surface area”, which I guess is true, and he proposes that the authoring team instead write language-specific bindings for their consumers, and in turn ensure those bindings provide mocks when a request or the client harness is parameterised with a mock flag. He even goes on to suggest that this means the transport protocol could be changed by the authoring team from one serialisation technology to another (JSON to Thrift, say). Now this seems like madness to me – the authoring team will end up supporting a bunch of static client libraries covering a multitude of old versions. It seems a fundamental shift away from the API being RESTful at all… what then is the incentive for exposing the HTTP API publicly at all?
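For clarity, here’s the binding-plus-mock idea as I understood it – a sketch with class and method names of my own invention, where the same client returns canned responses when constructed with a mock flag:

```python
# The authoring team ships the client binding; consumers never write their own
# stubs, they just construct the client with mock=True in their tests.
class OrdersClient:
    def __init__(self, base_url: str, mock: bool = False):
        self.base_url = base_url
        self.mock = mock

    def get_order(self, order_id: str) -> dict:
        if self.mock:
            return {"id": order_id, "status": "shipped"}   # canned test response
        # The real implementation would make the remote call here; because the
        # transport is hidden behind the binding, it could be HTTP/JSON today
        # and Thrift tomorrow without consumers changing their code.
        raise NotImplementedError("real transport omitted from this sketch")

test_client = OrdersClient("https://orders.example.internal", mock=True)
assert test_client.get_order("42")["status"] == "shipped"
```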
Pretty straightforward, this: lots of services, lots of interactions, lots of data, and it’s difficult to understand. So graph everything you can, and make sure there’s shared visibility of this information, so that the team learns – qualitatively and quantitatively – what the expected behaviour of the system looks like in terms of those metrics, and thus what anomalous behaviour looks like. That way they can spot slowly changing metrics and trends, like the slowly boiling frog!
Snowflakes are unique, and snowflake servers and services are also unique. This is a Bad Thing™. Servers should be completely commoditised and standardised, and that should be reinforced with regular reboots and restarts… the danger of a long-uptime server being that you’re frightened to restart it in case there have been so many (or such complex) changes to its dependencies that you can’t reliably recreate the snowflake. Differences should be eliminated, and all builds should be from a golden image that is shared across the organisation. From a service perspective too, services should exist in multiples, and dependent services should be agnostic to them being started and stopped (see the circuit-breaker discussion above).
The sort of deploy that everyone postpones out of fear; that people take days out of the office to avoid, etc. This happens when deploys are rare… so the solution is to deploy all the time; make the pipeline predictable and routine; deploy continuously!
Using a different technology stack for every service leads to operational hell, so stick to a limited set of tech and automate everything, to give yourself the bandwidth to experiment carefully. This point was also made by @aviranm, who gave a talk about DevOps and Microservices and suggested experimenting on non-critical services, and taking full production ownership until a new technology has bedded in.