Microservice Caching Patterns

What is a cache?

A cache is, in my definition, a store of some data. This data could be the actual copy or a facsimile of some data stored somewhere else and held for a duration

A caching solution implementation hold this information in memory (volatile) or offload to a persistent store. When not holding the true copy of the information, invalidation or refreshing becomes an issue

The internet was built on caches and if you do microservices then you should know about how to use one to save your life! For example, Web Portals implementations die an early death during performance testing because the backend response to requests “took a long time”

Example of two microservices for a Portal channel serving straight from the system of record

Know how to implement caches, especially for responses and when to use what type of cache to build an effective solution and a snappy application!

Cache types and patterns

  • Persistent vs Volatile: Does it remain when the power is turned off?
  • Cache-aside vs Operational data: Is it the source of truth or does it periodically refresh this information from another source
Cache Storage types

Invalidating a cache

Simply the hardest problem in Computer Science. Knowing when to clear your copy of the data is key, especially when you do not master the information

Some of the invalidation strategies are

  • Events from the master: Subscribe to refresh events or full updates from master system
  • Periodic refreshes: Use a timer to refresh your cache (especially if there are known update cycles in the master data system)
  • Explicit invalidation: Use an API to clear your cache
Cache implementation patterns

Know when to cache ’em

  1. Availability: It is 2020 and your users want a response now! Drop-downs need to be snappy, lookups in O(1) and fewer network calls across the pond
  2. Reliability: You like your consumers but not enough to let them smash the heck out of your core business systems. Self-service portals/mobile-apps backed by APIs are more vulnerable to scripted attacks and if your service is lazy and always going to the system-of-record then you are risking an outage
  3. De-Coupling: Separation of concerns – Command vs Query. You want to isolate the process that accepts a request to “creating/updating” vs “reading” data to reduce coupling the two contexts. This is to prevent scenarios where a sudden rush of users trying to read the state of a transaction do not block the creation of new transactions. For example, Order booking can continue even if there is a flood of requests for order queries
Applying Caching on Queries

Know when not to cache

  1. Consistency: The application needs to serve the current true state of the information (How much money to I have now vs my transaction history)
  2. Cache Key ComplexitySearches are hard to cache because they generate large and complex set of search keys for the results. For example,  consider implementing a cache for a type-ahead search where each word typed is a callout to a search API. The result set for each word would require a large memory footprint and is notoriously hard to size for. A better approach is to only cache the individual item (resources) returned in the result set or array and not cache the search result array
  3. Ambiguity: This relates to consistency. If you do know know when to refresh your cache, especially if you are not the master system or if this information changes in real-time then look for other solution options. For example, a website has a system that updates a user’s account balance in real-time (betting and gambling?) and the Account API for the user is looking to scale to hundreds or thousands of user requests per second (Melbourne cup day?) – would you cache the user’s account (money) information or look at some other strategy (streaming, HTTP push)


One of the API anti-patterns is going straight to the system of record data, especially for retrieving data in public facing web applications. The best way to serve information is to understand where on the spectrum of static to dynamic does it sit and then implement a solution to serve the data with the highest degree of consistency

Static non-changing resources are served by CDNs. The next layer is reliant on your APIs and how effectively you implement caching. Hope you got a taste of the types of cache and strategies in this post. There is certainly a lot more to caching than I have talked about here, the internet is a giant knowledge cache – happing searching!

Open APIs != Open API Specification

RESTful APIs can be internal (your company’s only) or public facing (Twitter). Thus internal APIs are called “Private APIs” and open to the public APIs are called “Open APIs”

Now, while building an API accelerator for our clients I was asked by a well meaning colleague if this was an Open API; the intent was right but there was a subtle error in the language semantics. I believe what he meant was “did you write the API using Open API Specification” and not “is this API open to the world wide web?”

Open API specification or OAS is derived from Swagger and the de-facto standard for writing APIs. You can write public facing or internal APIs with OAS. Simply picking this style does not make your API open to the public i.e. you need to host this API on a public portal to make your Open API Specification based API Open to the public


Just because I wrote my APIs using RAML (the other standard) does not make it closed or non-standard. The Open API Specification is a good standard and you can convert from RAML to OAS

It is important to write an API specification and do it well regardless of the specification language

Implementing Stateful Process Adapters: Embedded BPMN or AWS Step Functions

Just based on recent experience, I am going to put this out there – AWS Step Functions are great for technical state machines which move from one-activity to another but not really designed for stateful process orchestration and definitely not for implementing SAGA

Serverless Step Functions from AWS or BPMN Engines?

When building microservices, the Mulesoft type platform lets you do a lot of the “stateless” request/response or async interfaces really well. But for “stateful” things, especially ones where we need the following, I think AWS Step functions are a half-baked option

This is because there are good embedded BPMN engines that can do the following:

  • Do stateful end-to-end flows and show them in a dashboard
  • Do stateful flows with activities with Synchronous or Asynchronous (request/response i.e. one-way request and then a wait for a message) actions (with AWS step functions, you code your way out of this)
  • Do out-of-the box RESTful APIs for starting a process, getting the tasks state for a process or pushing the state forward etc
  • Do business friendly diagrams
  • Do operational views with real-time “per process” view of current state or amazing historical views with heat-maps 
  • Easy to manage and maintain by the lowest common denominator in your team – lets face it, the cost of maintenance depends on the cost of your resource supporting it and not everyone is AWS skilled and cheap

The only argument I had heard for AWS was that it was better than the embedded BPM engines because we did not need to manage a database. We threw that argument out when our Step Functions had to use DynamoDB to handle storing the complex state

Screen Shot 2020-08-19 at 3.36.40 pm

Comparing the two offerings


Given my experience at a few clients with embedded BPM Engines and AWS Steps in implementing Long Running processes, I have found that Step Functions are great at doing simple state transitions but not easily maintainable and operable with issues around handling async activities and roll-backs – they can be done but you need to code for it!

The existing light weight BPM engines like Camunda offer a better alternative with self-managed and even hosted options and I love they way they present the process states visually especially the heat-maps with historical information

If you want a lot of simple state machines with scale – pick the serverless option but if you want a solid orchestration option, my preference is using BPMN engines like Camunda


Integration Entropy: Identifying and computing the hidden complexity within your systems

Hello! This one is going to be short and less formal a post. I want to get these questions out there before they eluded me and then come back later to this post (or another) and answer some of these

I have been thinking about how we are putting out more integrated solutions now than 10 years ago and how two implementations with the same number of systems grow differently over time to be more or less nimble and more important grow to more or less chaos with issues around Data integrity, Operations etc

I want to apply formal analysis to compute the amount of “hidden information” (entropy) in our implementations both from the systems and integrations perspective and from the data flowing through it

It is like calculating how ordered the arrangement of a set of marbles will be bouncing over a platform supported by jointed arms

Screen Shot 2020-08-19 at 2.50.44 pm

Source: https://www.nutsvolts.com/magazine/article/the_flying_marbellos

What’s with the Topic?

Okay. Entropy is close to me because I love Time Travel and in reality you cannot time travel become of the 3rd Law of Thermodynamics (darn you Boltzman) which says “Entropy Always Increases”

While entropy has to do with heat, it actually talks about the amount of order in a system and the fact that you cannot go back to order from a chaotic arrangement (broken cup) without expending energy

Thinking about this a little deep, if we consider the systems and integrations we build then there is a state where they are in perfect order and with each transaction integrated solutions move towards chaos (or disorder)

Fun Experiment: Try doing this with post-it notes and 2 / more kids as the nodes – ask one child to tell the other to do something and pass the notes as information etc. Things will get chaotic over time!

So what?

More nodes, edges, data in edges = high degree of freedom or possible states for the system to be in and this is the entropy. I argue that well integrated enterprise systems feel easier to manage and operate because they requires less energy to bring to order – this is for the integration platform. I am trying to think about what this means for the data that these integrations bring – one hypothesis is that the type of integrations determine the quality of data as coupling contexts could cause data chaos

Wait what?

Yep. I think given the same set of enterprise systems and same integration product – the initial conditions in data (what the systems have) and the integration design determine how much Data Entropy an organisation has over time. The Integration Entropy may be the same but the Data Entropy could be different

What’s next?

We want to come up with a way to measure this complexity for a proposed or existing solution

Here are some of my notes and questions, feel free to ping me your thoughts

  • Integrations add complexity
  • We are doing more integrations because the Web has come to the Enterprise
  • 10 years ago we were struggling to deliver more than 5-10 end-to-end integrations in a 12 month period due to environments, team structure, protocol issues etc. We have accelerated with DevOps, PaaS, APIs, Contracts and better tools to deliver end-to-end solutions ( we built 100 interfaces in 10 months )
  • More end-to-end contextual micro-services
  • More moving parts
  • Greater degree of freedom in which clients use these services and integrate
  • What is the amount of Entropy or hidden information about the integration complexity now vs before?
  • Does the adopting of RESTful pattern to explicitly show context in APIs simplify this vs RPC style where the context is implicit in the data?
  • What determines the complexity then?
    • Number of nodes
    • Number of edges
    • Number of contexts per edge (nuanced or direct use)
    • Amount of data per edge
    • The type of edge – sync request/response, sync one-way, async one-way (event), async request/response

Better Digital Products using Domain Oriented APIs: The Shopping Mall Metaphor

APIs are the abstractions over technical services. Good APIs mirror strategic thinking in an organisation and lead to better customer experience by enabling high-degree of connectivity via secure mechanisms

Too much focus is on writing protocols & semantics with the desire to design good APIs and too little on business objectives. Not enough questions are asked early on and the focus is always on system-system integration. I believe thinking about what a business does and aligning services to leads us to product centric thinking with reuseable services

As an ardent student of software design and engineering principles, I have been keen on Domain Driven Design (DDD) and had the opportunity to apply these principles in the enterprise business context in building reusable and decoupled microservices. I believe the best way to share this experience is through a metaphor and I use a “Shopping mall” metaphor with “Shops” to represent a large enterprise with multiple lines of businesses and teams

Like all metaphors – mine breaks beyond a point but it helps reason about domains, bounded contexts, APIs, events and microservices. This post does not provide a dogmatic point-of-view or a “how to guide”; rather it aims to help you identify key considerations when designing solutions for an enterprise and is applicable upfront or during projects

I have been designing APIs and microservices in Health and Insurance domains across multiple lines of business, across varying contexts over the past 5-8 years. Through this period, I have seen architects (especially those without Integration domain knowledge) struggle to deliver strategic, product centric, business friendly APIs. The solutions handed to us always dealt with an “enterprise integration” context with little to no consideration for future “digital contexts” leading to brittle, coupled services and frustration from business teams around cost of doing integration ( reckon this is why IT transformation is hard )

This realisation led me to asking questions around some of our solution architecture practices and support them through better understanding and application of domain modeling and DDD (especially strategic DDD ). Thought this practice, I was able to design and deliver platforms for our client which were reusable and yet not coupled

Domain Queries 

In one implementation, my team delivered around 400 APIs and after 2 years the client has been able to make continuous changes & add new features without compromising the overall integrity of the connected systems or their data

Though my journey with DDD in the Enterprise, I discovered some fundamental rules about applying these software design principles in a broader enterprise context but first we had to step in to our customer’s shoes and ask some fundamental questions about their business and they way they function

The objective is to key aspects of the API ecosystem you are designing for, below are some of the questions you need to answer through your domain queries

  • What are your top-level resources leading to a product centric design?
  • When do you decide what they are? Way up front or in a project scrum?
  • What are the interactions between these domain services?
  • How is the quality and integrity of your data impacted through our design choices?
  • How do you measure all of this “Integration entropy” – the complexity introduced by our integration choices between systems?

The Shopping Mall example

Imagine being asked to implement the IT system for a large shopping complex or shopping mall. This complex has a lot of shops which want to use the system for showing product information, selling them, shipping them etc

There are functions that are common to all the shops but with nuanced difference in the information they capture – for example, the Coffee Shop does “Customer Management” function with their staff, while the big clothes retail store needs to sell its own rewards point and store the customer’s clothing preferences and the electronics retail does its customer management function through its own points system

You have to design the core domains for the mall’s IT system to provide services they can use (and reuse) for their shops and do so while being able to change aspects of a shop/business without impacting other businesses

Asking Domain and Context questions

  • What are your top-level “domains” so that your can build APIs to link the Point-of-Sale (POS), CRM, Shipping and other systems?
  • Where do you draw the line? Is a service shared by all businesses or to businesses of a certain type or not shared at all?
  • Bounded contexts? What contexts do you see as they businesses do their business?
  • APIs or Events? How do you share information across the networked systems to achieve optimal flow of information while providing the best customer experience? Do you in the networked systems pick consistency or availability?


Though my journey with DDD in the Enterprise, I discovered some fundamental rules about applying these software design principles in a broader enterprise context. I found it useful to apply the Shopping Mall metaphor to a Business Enterprise when designing system integrations

It is important to understand the core business lines, capabilities (current and target state), business products, business teams, terminologies then do analysis on any polysemy across domains and within domain contexts leading to building domains, contexts and interactions

We then use this analysis to design our solution with APIs, events and microservices to maximise reuse and reduce crippling coupling

A Pandemic, Open APIs and Citizen Science: Its 2020 baby!

Human societies have been hit by pandemics through the ages and relied on the central governing authorities to manage the crisis and disseminate information. I believe this time around with COVID-19, our societies have access to more information from our governments because we have the internet

If this pandemic is an evolutionary challenge, then our response as a species to survive it will come through innovations in medicine, healthcare and technology. Not only will we improve on our lead time to develope vaccines as responses to viruses evolving and but also accelerate key technologies which will help us respond to global challenges as a whole

The internet has allowed governing agencies to share information the spread of COVID-19 in our communities through APIs a common channel in a clean, standardised, versioned, structured and self-describing manner leading to easier consumption by citizen and fuelling the rise of “citizen data scientists

I argue this democratisation of pandemic data via APIs and its consumption leads to new learning opportunities, increased awareness of the spread of the disease, verification of information, better social response and innovation through crowdsourcing

Open Data: NSW Health

The https://data.nsw.gov.au/ provides access to state health data in NSW Australia and in March 2020 provided information about COVID-19 cases on their site here. This website provides a very standardised approach to sharing this information, with metadata in JSON, RDF and XML for different consumers and links to the actual data within the metadata documents

Here is a screen shot of the actual data site

I particularly loved the structure of the JSON metadata because it is quite self-describing, leading to a link to the document with the COVID-19 data

Rise of the consumer: Our Citizen Scientist

It did not take long for someone to come along, parse that information and present to us in a portal we can all relate to. During the early days of the pandemic, I was hooked onto https://www.covid19data.com.au/ and it provided me with Australia / NSW wide information about the spread, categories etc

However it was Ethan’s site that I loved the most as a “local consumer” to see what is happening in my postcode – the website here https://covid19nsw.ethan.link/ is a brilliant example of the citizen science and is sourced from the NSW Health open data link above

Notice the URL and NSW Health agency did not built the portal – it was a this amazing person named Ethan


2020 is an interesting time in our history. We have the pandemic of our era but also better tools to understand its spread. The internet and standardised APIs are at the front and center of this information sharing and consumption

Everyone has the ability now to download information about the spread of the pandemic with time, geolocation and size embedded in this dataset. Everyone now have the ability to write programs to parse this dataset and do their own science on it

Observing distributed systems: Monitoring, Logging, Auditing and Historical analysis

“Knowing the quality of your services at any given moment in time before your customers do and using this information to continuously improve customer experience is part of modern software delivery and critical to the success of organisations”

In this post, we present why it is important to observe and manage our systems and solutions proactively and present the various mechanisms available for observing and reacting. We discuss distributed services observability through monitoring, logging, tracing and contextual heatmaps

TL;DR: 5 key takeaways in this post

  1. Observing distributed system components is important for operational efficiency
  2. Observations types vary based on observation context and need
  3. Monitoring, Alerting, Logging, Auditing, Tracing and Historical views are types of observations we can make
  4. Observation ability is available out-of-the-box for platforms (AWS, Mulesoft etc) and in 3rd party products (Dynatrace, AppDynamics, New Relic etc). Some things you still need to bake into your API or Microservices framework or code to achieve über monitoring
  5. Use observations not just to react now but also to improve and evolve your technology platform


Thanks to efficient software delivery practices we are delivering more integrated solution features and bolting on more integrated systems to accelerate the digital transformations. This means a lot of old internal systems and external services are being wired onto shiny new enterprise services over a standard platform to enable the flow of data back and forth

Keeping the lights on is simply not enough then, we need to know if the fuse is going to blow before the party guests get here!

Businesses therefore need to

  • proactively observe their systems for fraud and malicious
  • watch and act at actively and through passive means
  • regularly interact with their systems as their users would to discover faults before users do
  • track a single interaction end-to-end over simple and complex transactions for faster resolution of complaints and issues
  • and evolve their features by listening to their systems over a period of time

Screen Shot 2020-03-13 at 1.31.20 pm

Observation contexts and approach

I have, over time, realised that how we observe depends-on what we want to observe and when. There are multiple ways to observe, most of us are familiar with terms like Monitoring, Alerting, Logging, Distributed Tracing etc. but these are useful within an observation context. These contexts are real-time active or passive, incident management, historical analysis etc.

Let us look at some of these contexts in detail:

  • To know at any instant if platform or services are up or down then we use a Monitoring approach
  • If we want to be notified of our monitored components hitting some threshold (CPU, heap, response time etc.) then we use Alerting
  • If we want the system to take some action based on monitoring thresholds (scale-out, deny requests, circuit-break etc.) then we use Alert Actioning
  • If we want more contextual, focussed deep-dive for tracking an incident or defect then we use Logging and Tracking (with tracking IDs)
  • If we want to track activity (user or system) due to concerns around information security or privacy then we implement Log Auditing 
  • If we want to detect bottlenecks, find trends, look for hot spots, improve and evolve the architecture etc. then we use Historical Logs, Distributed Tracing and Contextual flow maps 


Monitoring enables operators of a system to track metrics and know the status at any given point in time. It can be provided via out-of-box plugins or external products and enabled on all levels of an integrated solution: bottom-up from Platform to Services and side-to-side from a client-specific service to domain services, system services etc

Screen Shot 2020-03-13 at 3.18.39 pm

A key thing to note here is that monitoring in the traditional sense was driven by simply “watching the moving parts” but with modern monitoring products, we can “interact” with the services as a “hypothetical user” to detect issues before they do. This is called synthetic transaction monitoring and in my experience has been invaluable at delivering a proactive response to incidents and improving customer experience

For example:

  • Cloud Service Provider Monitoring: AWS Monitoring offers monitoring of its cloud platform and the AWS services   [ Example: https://docs.aws.amazon.com/step-functions/latest/dg/procedure-cw-metrics.html ]
  • Platform As A Service (PaaS) Provider Monitoring: Mulesoft offers an “Integration Services” platform as a service and provides monitoring for its on-prem or cloud-offerings which includes monitoring for the platform and its runtime components (mule applications) [Example: https://www.mulesoft.com/platform/api/monitoring-anypoint]
  • Monitoring Products: Products like New Relic, Dynatrace, App dynamics etc. work great if your enterprise spans a variety of cloud or on-prem services, needs a centralised monitoring solution and requires advanced features such as synthetic transactions, custom plugins etc

Alerting and Actions

Alerting allows users to be notified when monitored resources cross a threshold ( or trip some management-rule. Alerting depends on monitoring and is a proactive approach to knowing how systems are performing at any point in time

While alerts can be great, they can quickly overwhelm a human if there is too many.  One strategy is for the system to take automatic action if there is an alert threshold reach and let the human know it has done something to mitigate a situation. For example:

  • If the API is overloaded (504 – Gateway timeout) but still processing requests, then spin up a new instance of the component to serve the API from a new runtime
  • If downstream service has gone down (500 – Service unavailable) or is timing out (408 – Request timeout) then trip the circuit breaker i.e return 504 from this API
  • If there is a known issue with the runtime heap memory which causes the application to become unresponsive every 20ish hours, then start a new instance when heap reaches a certain threshold and restart this se

Screen Shot 2020-03-13 at 1.52.26 pm

A Sample Dynatrace is shown below with the status of microservices and metrics over time per instance

Screen Shot 2020-03-03 at 9.42.02 am

Logging, Auditing and Transaction Tracking

This tells us about a specific functional context at a point-in-time and given by Logging solutions over our microservices and end systems. Generally, this type of information is queried from the logs using a transaction id or some customer detail and happens after an issue or defect is detected in the system. This is achieved through logging or distributed tracing 


  • Use log levels – DEBUG, INFO, ERROR and at each level log only what you need to avoid log streams filling up quickly and call from your friendly enterprise logging team
  • Avoid logging personally identifiable (PI) information ( name, email, phone, driver’s licence etc) – imagine this was your data flowing through someone’s logs, what would you like them to store and see?
  • Log HTTP method and path if your framework does not do that by default


  • Is logging user actions for tracking access especially for protected resources
  • Involves logging information about “who”, “when” and “which resource”
  • Is compact and concise to enable faster detection (less noise in the logs the better)
  • Usually, separate from functional logs but can be combined if it suits


  • Useful for looking at things end-to-end, User Interface to the backend systems
  • Uses trackingIDs to track transactions with each point forwarding the trackingID to the next point downstream
  • Each downstream point must respond back with the same trackingID to close the loop
  • The entry-point, i.e. service client (Mobile app, Web app etc) must generate the trackingID. If this is not feasible then the first service accepting the request must generate this unique ID and pass it along

Screen Shot 2020-03-13 at 4.36.53 pm


Heatmaps and historical views

This type of view is constructed from looking at long term data across a chain of client-microservices-provider interactions. Think of a heatmap of flows and errors which emerge over time through traces in the system. This information obviously is available after a number of interactions and highly useful in building strategies to detect bottlenecks in the solution and improve service quality to the consumers

A historical view with heatmaps is achieved through aggregated logs overlayed on visual flow maps grouped by some processID or scenarioID

One example of this in the view below from a tool called Camunda Cockpit. Camunda a lightweight embedded BPMN engine and used for orchestrating services in a distributed transaction context (learn more from Bernd Rucker here https://blog.bernd-ruecker.com/saga-how-to-implement-complex-business-transactions-without-two-phase-commit-e00aa41a1b1b)




  1. Observing distributed system components is important for operational efficiency
  2. Observations types vary based on observation context and need
  3. Monitoring, Alerting, Logging, Auditing, Tracing and Historical views are types of observations we can make
  4. Observation ability is available out-of-the-box for platforms (AWS, Mulesoft etc) and in 3rd party products (Dynatrace, AppDynamics, New Relic etc). Some things you still need to bake into your API or Microservices framework or code to achieve über monitoring
  5. Use observations not just to react now but also to improve and evolve your technology platform

Tackling complexity: Using Process maps to improve visibility of integrated system features

“Entropy always increases– second law of thermodynamics

Enterprise systems are similar to isolated physical systems, where the entropy or hidden-information always increases. As the business grows, our technology footprint grows as new systems are implemented, new products and cross-functional features are imagined and an amazing network of integrations emerge

Knowing how information flows and managing the chaos is therefore critical organisations are to move beyond “Functional-1.0” into “Lean-2.0” and “Strategic-3.0”  in their implementations. We discuss how current documentation and technical registries simply “tick the box” and there is a better approach to manage increasing complexity through better context

Enterprise Integration Uses Integrations

Current state: Integration Interface Registries with little context

The network of integrations/interfaces (blue-circles above) are often captured in a technically oriented document called “Interface Registry” in a tabular form by teams performing systems integration. While these tables provide details around “who” (producer/consumer details) and “how” (the type of integration) they cannot describe “when” and “why” (use case).  As projects grow and interfaces grow or are re-used the number of when and whys increase over time and entropy (hidden information) around these interfaces grows; this leads to chaos as teams struggle to operate, manage and change them without proper realisation of the end-to-end picture

As a result, only maintaining a technical integration Interface registry leads to poor traceability (business capability to technical implementation), increased maintenance-cost of interfaces ( hard to test for all scenarios) and leads to duplication of effort over time ( as change becomes complex, teams rewrite)

Screen Shot 2019-11-21 at 6.22.21 pm

Integration Interface Repository

Therefore without proper context around Integration Interfaces, organisations will struggle to manage and map cross-functional features leading to slower lead-time, recovery etc over time. We propose that documenting integration use-cases, in a business-friendly visual language and related them to technical interface lists and enterprise capabilities is the key to mastering the chaos

Mastering the chaos: Building a context map

Context is key as it

  1. It drives product-centric thinking vs project-based thinking
  2. It makes our solution more operable, maintainable and re-useable 

In order to  provide better context and do it in a clear visually-oriented format, we believe documenting integration user-stories as technical process flows is a good start

Consider the following use-case: “As a user, I must be able to search/register/update etc in a system”.  Use-cases begin all start with some activation point – a user, timer or notification and then involve orchestration of services or choreography of events resulting in actions within microservices or end-systems eventually delivering some value through a query or command. We can render such a use-case into a map showing the systems, interfaces and actions in them (activation point, services, orchestrations, value) and do so in a standard manner

Screen Shot 2019-11-21 at 5.58.33 pm

For example, we leveraged the Business Process Management Notation – BPMN 2.0 standards to map integration technical use-case flows where we used general concepts like “swim-lanes” for user and systems, “arrows” for Interfaces (solid for request-response interfaces, dotted-lines for async messages) etc.

The Picture below shows this concept along with the “Interface” lines and “Messages” connecting the boxes (actions) between systems. Each interface or message then was linked to the Integration Interface Registry so that it was easy to trace reuse and dependencies

Screen Shot 2019-11-21 at 4.26.04 pm

It was also important that the context picture above is fairly lean as it avoids documenting too much to avoid becoming a single giant end-to-end picture with everything on it. It is best to stay within a bounded-context and only refer to a specific use-case such as “User Registration” or “Order submission” or “Customer Management” etc. This has the added advantage of helping teams which speak a ubiquitous language talk to a collection of pictures belonging to their domain and integration-practitioners to identify a collection of such teams (bounded-contexts)

Building a library and related to EA

The journey to improve visibility and maintenance of integration artefacts then involves capturing these integration use-case context maps, storing them in a version-controlled repository, relating them to other technical and business repositories

This collection of context maps would contain similar information to a “high-level enterprise system integration view” but with a greater degree of clarity

Screen Shot 2020-02-24 at 7.02.43 pm

This collection can also be linked to the Enterprise Architecture (EA) Repository for full end-to-end traceability of Business Capabilities into Technical Implementations. In fact, the TOGAF framework describes an external Business Architecture repository pattern as part of Solution building blocks (see TOGAF structural framework )

We imagine the Integration Context Map repository linked to the Enterprise Architecture Repository and the Integration Interface repository as shown below – this would provide immense value to cross-functional teams and business stakeholders, allowing both to see a common picture

Screen Shot 2019-11-19 at 7.54.49 pm

Sequence flows or process flows?

Sequence diagrams can also be used to document technical use-cases with systems and interfaces, however similar to the Integration interface list, they then to be difficult to consume for the non-technical users and lack the clarity provided by process maps

Screen Shot 2019-11-19 at 6.06.42 pm

As a general rule of thumb we found the following segregation to be useful:

  1. What: Technical process flows for end-to-end visibility, especially useful in complex long-running distributed features.  Sequence diagrams for technical component designs, best for describing how classes or flows/sub-flows (in Mule, for example) interact
  2. Who:  Context maps by Business Analysts (BA) or Architect and Sequence flows by Developers
  3. When: Context maps by Business Analysts (BA) as early as during project Discovery, providing inputs to sizing and visual map of what-is-to-be (sketch?). Sequence flows by Developers, as a task in Development story
Screen Shot 2019-11-14 at 6.43.16 pm

Let us talk tools

There are a variety of tools that can help document process context maps in a standard BPMN 2.0 format. The key criteria here is to produce a standard artefact – a BPMN 2.0 diagram so that it can be managed by standard version-control tools and rendered to documents, team wikis etc. though tools/plugins

Below is a list of tools you can try, we recommend not getting too hung up on tools and instead focus on the practice of documenting integration use-cases



  1. As enterprise projects deliver more integrated solutions, it becomes harder to manage and change integration interfaces without proper traceability
  2. Improve traceability of a single end-to-end use-case through a context map
  3. You can use BPMN 2.0 for a standardised notation to do this and use tools to generate these context maps as .bpmn files
  4. You can version control these .bpmn  files and build a collection of context maps
  5. You can link these context maps to Integration Interface registry and Enterprise Business capability registry for increased traceability across the enterprise
  6. There are many tools to help you write the .bpmn files, don’t get hung up on the tools. Start documenting and linking to the interface registry


The context map collection then becomes very useful for enterprise architecture, integration operations, new project teams, testing etc. as a common visual artefact as it relates to the users, systems and interfaces they use 

Enterprise Integration process maps then become a powerful tool over time as they greatly improve visibility across the landscape and help teams navigate a complex eco-system through a contextual and meaningful visual tool; this leads to better open and maintainable integration products leading to reuse and cost-efficiency 


Complex Form Evaluation with Drools


Complex business rules are best implemented using a ‘Rules Engine’. Drools is an open source Business Rules Management Product. See here

Screen Shot 2016-08-10 at 5.03.21 PM.png


In this blog we will cover a few basics of using the Drools rule engine, specifically using a Domain Specific Language (DSL) which is a language that is more user focused. The blog comes with a demo project which can be downloaded and used along with this document.

Demo Use Case

Our demo use case will cover evaluating an ‘Application Form‘ with multiple ‘Sections

Each form section has a ‘rule‘ which the current form evaluators (manual task) use to evaluate the ‘Questions‘ in the form. Each form question has one or more ‘option‘ selected.

For example:

   - Section1
        - Question1
            - Option1
        - Question2
            - OptionA,OptionB
   - Section2
        - Question1
            - OptionX,OptionY

Now let us assume a use case with a few simple questions and conditions associated with a particular form, for example, a ‘weekend work approval’ form. We can as a few simple questions


  • Section1:
    • Question1: “Is this necessary work”
      • options: [Yes, No]
      • rule: “Approved if Yes is selected”
  • Section2:
    • Question1: “When is this work to be done”
      • options: [Weekend Work, Regular time]
      • rule: “Manager approval required if this is weekend work
  • Section3:
    • Question1: “Is this an emergency”
      • options: [Non-Emergency, Emergency]
      • rule: “If this is an emergency fix then it is approved

As you can see in our sample use case, we have only one question per section but this can be more.

Screen Shot 2016-08-10 at 5.04.18 PM.png

Code Repository

You can download the source code from here using


Execution Instructions

Run the form.demo.rules.RulesExecutor java main to run the demo

First Steps – a simple condition

A Simple Rule is implemented in file called rule.dslr

package form.demo.rules;
import form.demo.rules.facts.*
import function form.demo.rules.RulesLogger.log

expander rule.dsl

// ----------------------------------------------------
// Rule #1
// ----------------------------------------------------
rule "Section1 Rule1.1"
  Form has a section called "Section1" 
  Section outcome is "No Further Review Required" 

The DSLR file imports facts from the package form.demo.rules.facts. There is a function called log defined in form.demo.rules.RulesLogger class. There is an expander for the DSLR that converts the expander rule.dsl is defined as the expander

The DSL for the rule is in the rule.dsl file

#  Rule DSL
[when]Form has a section called {name}=$form:FormFact(getSection({name}) != null)
[when] And = and
[when] OR = or
[then]Section outcome is {outcome}=$form.getSection({name}).setOutcome({outcome});log(drools,"Section:"+{name}+", Outcome:"+{outcome}+", Rule Applied:"+ drools.getRule().getName() );

When executed against a set of Facts

  FormFact formWithFacts = new FormFact();
  formWithFacts.addSection("Section1", "Question1", "Yes");
  FormAssessmentInfo assessmentInfo = new RulesExecutor().eval(formWithFacts);
Dec 01, 2015 3:49:09 PM form.demo.rules.RulesLogger log
INFO: Rule:"Section1 Rule1.1", Matched --> [ Section:Section1, Outcome:No Further Review Required, Rule Applied:Section1 Rule1.1]
Not Evaluated
     Section1->No Further Review Required

Adding a second condition

– If some option is selected in a section then set the outcome to a value

// ----------------------------------------------------
// Rule #1
// ----------------------------------------------------
rule "Section1 Rule1.1"
  Form has a section called "Section1"
  -"Yes" is ticked in "Question1"
  Section outcome is "No Further Review Required" 
#  Rule DSL
[when]Form has a section called {name}=$form:FormFact(getSection({name}) != null)
[when]-{option} is ticked in {question}=eval($form.getSection({name}).has({question},{option}))
[when] And = and
[when] OR = or
[then][Form]Section outcome is {outcome}=$form.getSection({name}).setOutcome({outcome});log(drools,"Section:"+{name}+", Outcome:"+{outcome}+", Rule Applied:"+ drools.getRule().getName() );

Adding Global Rules

  • If a section acts as a global flag (for example: Emergency Approval) then ignore all outcomes and select this
  • If there is no global flag then if any of the sections have outcome ‘foo’ then set the form outcome to ‘bar’ otherwise set the form outcome to ‘baz’

In the Rule DSL we add the following, notice how a new instance of the FormFact is created – this time without matching a section name

[when]The Form=$form:FormFact()
[when]-has a section with outcome {outcome}=eval($form.hasSectionWithOutcome({outcome}))
[when]-has no section with outcome {outcome}=eval($form.hasSectionWithOutcome({outcome}) == false)
[then]Form outcome is {outcome}=$form.setOutcome({outcome});log(drools,"Form Outcome Set to "+{outcome});

In the DSLR we implement a few global rules

// ----------------------------------------------------
// Global Rule #1
// ----------------------------------------------------
rule "Global Rule1.1"
  The Form 
  -has a section with outcome "Emergency Work"
  Form outcome is "Approved" 

// ----------------------------------------------------
// Global Rule #2.1
// ----------------------------------------------------
rule "Global Rule2.1"
  The Form 
  -has no section with outcome "Emergency Work"
  -has a section with outcome "Manager Review Required"
  Form outcome is "Manager Review Required" 

// ----------------------------------------------------
// Global Rule #2.2
// ----------------------------------------------------
rule "Global Rule2.2"
  The Form 
  -has no section with outcome "Emergency Work"
  -has no section with outcome "Manager Review Required"
  Form outcome is "Manager Review Required" 

Why is this not an API contract?

Why is this … my Swagger UI, generated from code not a contract? It describes my service, therefore it must be a Service Provider Contract. No? 

This was a common theme for a few of our clients with mobile/web teams as consumers of enterprise services.  Service providers generated contracts, and would sometimes create a contract in the API portal as well.

Service consumers would then read the contract from the API portal and consume the service. What the problem then?  …


….the problem is that the red box i.e the Contract – is generated after the service is implemented and not vice-versa.

Why is this a problem then?  It is a problem because the contract is forced upon the consumer and worse there are 2 versions of this document.

So what? Well as you can imagine, changes to the service implementation over time will generate the provider contract (red box), while consumers continue to read the out-of-sync contract.

so? A contract is an agreement between the two parties – consumer and provider. In the above use-case though, this is not the case.

Key Points:

  • Generated Swagger UI is a documentation and not a contract
  • A contract is a collaborative effort between Providers and Consumers
  • A product (API Gateway) cannot solve this problem, it is cultural
  • The above process will create 3 layers of responsibility – Service provider, Service consumer and middleware provider
  • The 3 layers of responsibility makes it harder to test APIs

Side note: I believe this was a big problem with SOA – the “Enterprise Business Service (EBS)” was owned by the middleware team and “Application Business Services (ABS)” was owned by the services teams.

The fix?

Collaborative contracts that help define what a service should do!

This contract is used by Consumers to build their client-code and more importantly the providers use the contract to build the service and test it!