Modern software is built over the network with systems hooked-up either privately within the internal enterprise eco-system or with a trusted partner or via a public channel
Traditional system integration (ESB) engineers working within project directives would have delivered interfaces for system to system or partner to system integration while working closely with the consuming teams
This would have created, despite our best effort, a very “leaky abstraction” of the provider system’s service implementation in the form of internal ids, system namespaces, validation logic etc in our API to the consumer
Impact of a leaky abstraction
APIs are a means to decouple the implementation by providing a good abstraction (we can never have a perfect abstraction) and our past experience with system integration has shown that thinking only in a single context (EAI) in a project sense leads to a leaky abstraction (and tight coupling agnostic of technology used). Thus not reasoning for all contexts early on and not thinking of your APIs as Products for Digital, B2B etc can lead to an improper abstraction
The impacts of a leaky abstraction are
– Tighter coupling between service provider and consumer:
– Security: Attacker can guess internal implementation and launch a data hack by manipulating requests which would look valid to the server
Where should we focus?
– Service naming: API URI leaks something about the system serving it and not about the domain product, its business context and resource
– Resource Identifier: Does your Create Object return an internal database row ID?
– Date Time: Internal server datetime representation vs a standard causing usual datetime issues and logic to be implemented by consumers to handle your server issues
– Language implementation details in data: Ever see a java.util.List come back in a payload?
– Server headers: HTTP response headers from internal servers often do not get the full treatment. We either skip them or block them all. When sending back backend server headers, try to reason why it is necessary or obfuscate (hash) them so that someone cannot guess a sequence etc
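The resource identifier and date-time points above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical in-memory id map and a made-up internal record shape (`row_id`, `created_epoch`, `items`); none of these names come from a real system.

```python
import uuid
from datetime import datetime, timezone

# Hypothetical in-memory mapping between internal row ids and opaque public ids.
_internal_to_public = {}
_public_to_internal = {}


def public_id_for(internal_id):
    """Return a stable opaque id for an internal database row id."""
    if internal_id not in _internal_to_public:
        public_id = uuid.uuid4().hex  # random, so the sequence cannot be guessed
        _internal_to_public[internal_id] = public_id
        _public_to_internal[public_id] = internal_id
    return _internal_to_public[internal_id]


def to_public_order(row):
    """Map an internal record (assumed shape) to the payload the consumer sees."""
    created = datetime.fromtimestamp(row["created_epoch"], tz=timezone.utc)
    return {
        "id": public_id_for(row["row_id"]),   # never the database row id
        "createdAt": created.isoformat(),      # ISO 8601 in UTC, not a server-local format
        "items": list(row["items"]),           # plain JSON array, no language-specific type
    }


order = to_public_order({"row_id": 1042, "created_epoch": 0, "items": ("latte",)})
```

The consumer only ever sees the opaque id and a standard timestamp, so the provider can change its database or server settings without breaking anyone.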
A cache is, in my definition, a store of some data. This data could be the actual copy or a facsimile of some data stored somewhere else and held for a duration
A caching solution implementation holds this information in memory (volatile) or offloads it to a persistent store. When not holding the true copy of the information, invalidation or refreshing becomes an issue
The internet was built on caches and if you do microservices then you should know about how to use one to save your life! For example, Web Portal implementations die an early death during performance testing because the backend response to requests “took a long time”
Know how to implement caches, especially for responses and when to use what type of cache to build an effective solution and a snappy application!
Cache types and patterns
Persistent vs Volatile: Does it remain when the power is turned off?
Cache-aside vs Operational data: Is it the source of truth or does it periodically refresh this information from another source
Invalidating a cache
Simply the hardest problem in Computer Science. Knowing when to clear your copy of the data is key, especially when you do not master the information
Some of the invalidation strategies are
Events from the master: Subscribe to refresh events or full updates from master system
Periodic refreshes: Use a timer to refresh your cache (especially if there are known update cycles in the master data system)
Explicit invalidation: Use an API to clear your cache
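Two of the strategies above, time-based refresh and explicit invalidation, can be sketched with a small cache-aside class. This is an illustration only: `TTLCache`, `load_from_master` and the injectable `clock` are hypothetical names, and a real solution would more likely use an existing caching product.

```python
import time


class TTLCache:
    """Cache-aside store: entries expire after ttl seconds or on explicit invalidation."""

    def __init__(self, loader, ttl, clock=time.monotonic):
        self._loader = loader    # fetches from the master system on a miss
        self._ttl = ttl
        self._clock = clock      # injectable so expiry can be exercised without sleeping
        self._entries = {}       # key -> (expires_at, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and self._clock() < entry[0]:
            return entry[1]                       # fresh hit
        value = self._loader(key)                 # miss or expired: go to the master
        self._entries[key] = (self._clock() + self._ttl, value)
        return value

    def invalidate(self, key):
        """Explicit invalidation, e.g. called from an API or a master-data event handler."""
        self._entries.pop(key, None)


calls = []
now = [0.0]


def load_from_master(key):
    calls.append(key)
    return f"value-for-{key}"


cache = TTLCache(load_from_master, ttl=60, clock=lambda: now[0])
cache.get("country-codes")         # miss: loads from the master
cache.get("country-codes")         # hit: served from memory
now[0] = 61.0
cache.get("country-codes")         # expired: reloads
cache.invalidate("country-codes")
cache.get("country-codes")         # invalidated: reloads
```

An event subscription from the master would simply call `invalidate` (or write the new value) when a change arrives.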
Know when to cache ’em
Availability: It is 2020 and your users want a response now! Drop-downs need to be snappy, lookups in O(1) and fewer network calls across the pond
Reliability: You like your consumers but not enough to let them smash the heck out of your core business systems. Self-service portals/mobile-apps backed by APIs are more vulnerable to scripted attacks and if your service is lazy and always going to the system-of-record then you are risking an outage
De-Coupling: Separation of concerns – Command vs Query. You want to isolate the process that accepts requests for “creating/updating” data from the one “reading” data to reduce coupling between the two contexts. This ensures that a sudden rush of users trying to read the state of a transaction does not block the creation of new transactions. For example, Order booking can continue even if there is a flood of requests for order queries
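The Command vs Query split from the last point can be sketched as two small classes. This is a simplified illustration with hypothetical names (`OrderCommands`, `OrderQueries`); the read copy is updated in-process here, whereas a real system would typically feed it through events or a cache.

```python
class OrderQueries:
    """Read side: answers status queries from its own copy of the data."""

    def __init__(self):
        self._view = {}

    def project(self, order_id, order):
        self._view[order_id] = dict(order)   # keep a copy, not a reference

    def status(self, order_id):
        return self._view.get(order_id, {}).get("status", "UNKNOWN")


class OrderCommands:
    """Write side: books orders in the system of record, then updates the read copy."""

    def __init__(self, read_model):
        self._orders = {}
        self._read_model = read_model

    def book(self, order_id, item):
        self._orders[order_id] = {"item": item, "status": "BOOKED"}
        self._read_model.project(order_id, self._orders[order_id])


queries = OrderQueries()
commands = OrderCommands(queries)
commands.book("o-1", "flat white")

# A flood of reads only touches the read side.
statuses = [queries.status("o-1") for _ in range(1000)]
```

Because queries never reach the write side, the two can be scaled and protected independently.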
Know when not to cache
Consistency: The application needs to serve the current true state of the information (How much money do I have now vs my transaction history)
Cache Key Complexity: Searches are hard to cache because they generate a large and complex set of search keys for the results. For example, consider implementing a cache for a type-ahead search where each word typed is a callout to a search API. The result set for each word would require a large memory footprint and is notoriously hard to size for. A better approach is to only cache the individual items (resources) returned in the result set or array and not cache the search result array
Ambiguity: This relates to consistency. If you do not know when to refresh your cache, especially if you are not the master system or if this information changes in real-time, then look for other solution options. For example, a website has a system that updates a user’s account balance in real-time (betting and gambling?) and the Account API for the user is looking to scale to hundreds or thousands of user requests per second (Melbourne cup day?) – would you cache the user’s account (money) information or look at some other strategy (streaming, HTTP push)?
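The Cache Key Complexity point can be made concrete with a small sketch: the search itself is never cached, only the individual items it returns. `CATALOGUE`, `search_ids` and `get_item` are hypothetical stand-ins for the system of record and its APIs.

```python
CATALOGUE = {1: "flat white", 2: "flat bread", 3: "latte"}   # stand-in for the system of record

item_cache = {}             # keyed by item id only: small and easy to size
backend_item_fetches = []   # records which items had to be fetched


def search_ids(term):
    """Hypothetical search backend: returns matching ids and is never cached."""
    return [item_id for item_id, name in CATALOGUE.items() if term in name]


def get_item(item_id):
    """Cache-aside lookup for a single resource."""
    if item_id not in item_cache:
        backend_item_fetches.append(item_id)
        item_cache[item_id] = {"id": item_id, "name": CATALOGUE[item_id]}
    return item_cache[item_id]


def type_ahead(term):
    return [get_item(item_id) for item_id in search_ids(term)]


type_ahead("fl")    # fetches items 1 and 2
type_ahead("fla")   # same items, now served from the item cache
type_ahead("la")    # matches all three names; only item 3 is fetched
```

The cache holds at most one entry per resource, no matter how many distinct search terms the users type.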
One of the API anti-patterns is going straight to the system of record data, especially for retrieving data in public facing web applications. The best way to serve information is to understand where on the spectrum of static to dynamic does it sit and then implement a solution to serve the data with the highest degree of consistency
Static non-changing resources are served by CDNs. The next layer is reliant on your APIs and how effectively you implement caching. Hope you got a taste of the types of cache and strategies in this post. There is certainly a lot more to caching than I have talked about here, the internet is a giant knowledge cache – happy searching!
RESTful APIs can be internal (your company’s only) or public facing (Twitter). Thus internal APIs are called “Private APIs” and open to the public APIs are called “Open APIs”
Now, while building an API accelerator for our clients I was asked by a well meaning colleague if this was an Open API; the intent was right but there was a subtle error in the language semantics. I believe what he meant was “did you write the API using Open API Specification” and not “is this API open to the world wide web?”
Open API specification or OAS is derived from Swagger and is the de-facto standard for writing APIs. You can write public facing or internal APIs with OAS. Simply picking this style does not make your API open to the public i.e. you need to host this API on a public portal to make your Open API Specification based API Open to the public
Just because I wrote my APIs using RAML (the other standard) does not make it closed or non-standard. The Open API Specification is a good standard and you can convert from RAML to OAS
It is important to write an API specification and do it well regardless of the specification language
Just based on recent experience, I am going to put this out there – AWS Step Functions are great for technical state machines which move from one-activity to another but not really designed for stateful process orchestration and definitely not for implementing SAGA
Serverless Step Functions from AWS or BPMN Engines?
When building microservices, the Mulesoft type platform lets you do a lot of the “stateless” request/response or async interfaces really well. But for “stateful” things, especially ones where we need the following, I think AWS Step functions are a half-baked option
This is because there are good embedded BPMN engines that can do the following:
Do stateful end-to-end flows and show them in a dashboard
Do stateful flows with activities with Synchronous or Asynchronous (request/response i.e. one-way request and then a wait for a message) actions (with AWS step functions, you code your way out of this)
Do out-of-the box RESTful APIs for starting a process, getting the tasks state for a process or pushing the state forward etc
Do business friendly diagrams
Do operational views with real-time “per process” view of current state or amazing historical views with heat-maps
Easy to manage and maintain by the lowest common denominator in your team – let’s face it, the cost of maintenance depends on the cost of your resource supporting it and not everyone is AWS skilled and cheap
The only argument I had heard for AWS was that it was better than the embedded BPM engines because we did not need to manage a database. We threw that argument out when our Step Functions had to use DynamoDB to handle storing the complex state
Comparing the two offerings
Given my experience at a few clients with embedded BPM Engines and AWS Steps in implementing Long Running processes, I have found that Step Functions are great at doing simple state transitions but not easily maintainable and operable with issues around handling async activities and roll-backs – they can be done but you need to code for it!
The existing lightweight BPM engines like Camunda offer a better alternative with self-managed and even hosted options and I love the way they present the process states visually, especially the heat-maps with historical information
If you want a lot of simple state machines with scale – pick the serverless option but if you want a solid orchestration option, my preference is using BPMN engines like Camunda
Hello! This one is going to be a short and less formal post. I want to get these questions out there before they elude me and then come back later to this post (or another) and answer some of these
I have been thinking about how we are putting out more integrated solutions now than 10 years ago and how two implementations with the same number of systems grow differently over time to be more or less nimble and, more importantly, grow to more or less chaos with issues around Data integrity, Operations etc
I want to apply formal analysis to compute the amount of “hidden information” (entropy) in our implementations both from the systems and integrations perspective and from the data flowing through it
It is like calculating how ordered the arrangement of a set of marbles will be bouncing over a platform supported by jointed arms
Okay. Entropy is close to me because I love Time Travel and in reality you cannot time travel because of the 2nd Law of Thermodynamics (darn you Boltzmann) which says “Entropy Always Increases”
While entropy has to do with heat, it actually talks about the amount of order in a system and the fact that you cannot go back to order from a chaotic arrangement (broken cup) without expending energy
Thinking about this a little deep, if we consider the systems and integrations we build then there is a state where they are in perfect order and with each transaction integrated solutions move towards chaos (or disorder)
Fun Experiment: Try doing this with post-it notes and 2 or more kids as the nodes – ask one child to tell the other to do something and pass the notes as information etc. Things will get chaotic over time!
More nodes, edges, data in edges = high degree of freedom or possible states for the system to be in and this is the entropy. I argue that well integrated enterprise systems feel easier to manage and operate because they require less energy to bring to order – this is for the integration platform. I am trying to think about what this means for the data that these integrations bring – one hypothesis is that the type of integrations determine the quality of data as coupling contexts could cause data chaos
Yep. I think given the same set of enterprise systems and same integration product – the initial conditions in data (what the systems have) and the integration design determine how much Data Entropy an organisation has over time. The Integration Entropy may be the same but the Data Entropy could be different
We want to come up with a way to measure this complexity for a proposed or existing solution
Here are some of my notes and questions, feel free to ping me your thoughts
Integrations add complexity
We are doing more integrations because the Web has come to the Enterprise
10 years ago we were struggling to deliver more than 5-10 end-to-end integrations in a 12 month period due to environments, team structure, protocol issues etc. We have accelerated with DevOps, PaaS, APIs, Contracts and better tools to deliver end-to-end solutions ( we built 100 interfaces in 10 months )
More end-to-end contextual micro-services
More moving parts
Greater degree of freedom in which clients use these services and integrate
What is the amount of Entropy or hidden information about the integration complexity now vs before?
Does the adopting of RESTful pattern to explicitly show context in APIs simplify this vs RPC style where the context is implicit in the data?
What determines the complexity then?
Number of nodes
Number of edges
Number of contexts per edge (nuanced or direct use)
Amount of data per edge
The type of edge – sync request/response, sync one-way, async one-way (event), async request/response
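As a first, very rough way to put numbers on the factors listed above, the sketch below counts nodes, edges and contexts and computes the Shannon entropy of the edge types. This is only one possible starting point and an assumption on my part, not an established measure; the landscape in `edges` is made up for illustration.

```python
import math
from collections import Counter

# Hypothetical integration landscape: (source, target, edge type, contexts on the edge).
edges = [
    ("portal", "crm", "sync request/response", 2),
    ("pos", "crm", "async one-way", 1),
    ("crm", "billing", "async one-way", 1),
    ("portal", "billing", "sync request/response", 1),
]


def edge_type_entropy(edges):
    """Shannon entropy (in bits) of the distribution of edge types."""
    counts = Counter(edge_type for _, _, edge_type, _ in edges)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def summary(edges):
    """Collect the raw counts named in the list above plus the entropy figure."""
    nodes = {name for src, dst, _, _ in edges for name in (src, dst)}
    return {
        "nodes": len(nodes),
        "edges": len(edges),
        "contexts": sum(contexts for _, _, _, contexts in edges),
        "edge_type_entropy_bits": edge_type_entropy(edges),
    }


report = summary(edges)
```

A landscape with a single edge type scores 0 bits on the last figure; the more evenly the edge types are mixed, the higher it goes. Amount of data per edge is left out here and would need its own term.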
Modern software engineering is oriented towards building networked distributed features for a highly connected and web savvy customer base in varying contexts. Traditional team structures within the enterprise have evolved from technical SME cliques as engineers who “Ate lunch together wrote Software together”
Good product strategy requires thinking about how product features are built by engineering teams because Conway’s Law drives the outcome – to build, maintain and change great functional products we need to deliberately fix the team organisation
Conway’s Law and Layered Architecture
Melvin Conway described a law which states that the structure of a software system mirrors the communication structure of the organisation that builds it
“Organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations.” – Melvin Conway 
“If you have four groups working on a compiler, you’ll get a 4-pass compiler” – Eric S Raymond 
We can think of the modern teams oriented around the customer and end-systems to deliver integrated products in the following manner (circa 2020)
Thus, broadly speaking, teams orient around technical layers due to the cohesiveness of the “technical domain” expertise & customer’s (software client) needs.
If you have been part of similar teams in the past you must be familiar with the “layered architectural style” that gives rise to this team
Technical Teams = Technical Product
While the layered teams do a great job of grouping the teams by technical expertise, business needs for new end-to-end features are contextual, especially in a large enterprise. Things are more contextual in the real-world
While the layered team structure may work initially for one context, we find adding new features and domain contexts makes it harder to maintain software components across a layer. This leads to slower software delivery due to various reasons
It becomes increasingly difficult for a single team to know and own aspects of a technical layer used in various contexts
Conway’s law: With layered teams, we build layered software components focussed on the technical correctness vs end-to-end functionality
Techno-functional Teams = Functional Products
Applying the “reverse Conway manoeuvre” – fixing our team organisation to shape our software – we break the technical teams into smaller functional teams. This process naturally builds a more “Domain Centric Product team” which can focus on functional needs and stay true to the end-to-end feature
This organisation allows for a closer relationship across technology layers as we group experts in different technical domains to deliver a common outcome and if done well, allows a smallish technical team to own a contextual solution
Pitfalls of Functional Teams
Functional teams are not the perfect solution. Functional teams are highly agile and can own a specific end-to-end solution through the entire software lifecycle but they operate in their own bubble
When organisations have different “technical practices” within their IT with different software engineering standards (one for the Portal team, one for the Integration team, one for the CRM team etc) it becomes harder to enforce technical standards within a technical context when a member is “outsourced” to another team
Good product strategy requires thinking about breaking up the technical groups into highly effective functional teams and keeping the “band together” through the lifecycle of the software (the end-to-end product)
APIs can be used to query information or command changes in system of record. Even the most efficient & fast real-time web service still struggles with having to wait for an “ack” (acknowledgement) leading to coupling between the invoker & provider. If this happens within a cohesive business function then the coupling is necessary, however this can be a showstopper if done across business contexts!
APIs with requests waiting on responses are notorious for blocking consumer applications
Dealing with slow or unresponsive APIs can be tolerated in certain circumstances however band-aid solutions can end up hurting user experience and damage your product’s reputation, pushing customers away
Good API design can help remediate this but it requires upfront understanding of how the customer facing systems are interacting with the internal systems & what their needs are
For example, simply throwing a cache & reading slightly stale information might be sufficient in a Query context – but how do you fix a Command API that is slow or unresponsive?
We look at some key concepts to break apart interactions and solve for X
Know your Contexts
Imagine you are in the business of selling coffees to happy customers and all you need is a point-of-sale system (let us imagine you have an endless tap of fresh coffee). Your business is all about selling then and you only operate in a single context
You may also define this as “We interact with the customer in a single context”, but let me warn you – you might be stretching the definition of a customer here (at least from a technology implementation standpoint). I would argue they are mere transactional contacts vs true customers in the system (we will see why later)
Let us now imagine your coffee shop needs to send out birthday coupons to your customers via email/sms etc. You may even need to call them and wish them (if they permit you) and so on
This act of recognising someone who is transacting with you and asking them for their preferences, interacting with them, offering them rewards for being a customer etc is different from selling coffee – welcome to the Customer Management context
Okay so your business has two contexts, Selling Coffee and Managing Customers. Are they the same or different? What actions can you not do in both without impacting efficiency? How about capturing information (even if it is on a bit of paper) in the two contexts about a person?
Most of us, who have never run a busy business, mix the two and have things work in them interchangeably. This works most of the time but it can lead to serious issues down the line if we do not draw boundaries and divide the actions and information across the contexts.
Ever been to a coffee shop where you waited a long time because the 1 worker was busy on the phone talking to a customer about their birthday plans? The worker was doing their job well in the customer management context but failed in the selling / transactional context
Contexts help define the actions, language and information while boundaries help define interactions. For engineers, this is vital as it helps design the connectivity, as we often have to choose between consistency of information or availability of a service within networked or distributed systems
So why could your API be killing your Customer Experience?
APIs designed without consideration of context boundaries can end up coupling two contexts. Imagine an API for Contact creation in CRM which adds a direct dependency in our coffee shop example. This API is sending data in real-time between a Selling context and a Customer Management context (fine) and can block or error the selling transaction if there are network delays or data issues in the CRM record (not fine!)
We need to carefully consider interactions between context and introduce non-blocking interactions (and anti-corruption layers) to let business happen. Make Systems Integration Great Again!
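A non-blocking interaction between the two coffee shop contexts could look like the sketch below. It is an illustration under assumptions: an in-process `queue.Queue` stands in for a message broker, and `sell_coffee`, `crm_worker` and the `CoffeeSold` event are hypothetical names.

```python
import queue

crm_events = queue.Queue()   # stands in for a message broker between the contexts
sales = []


def sell_coffee(customer_email, item):
    """Selling context: record the sale and publish an event without waiting for the CRM."""
    sale = {"item": item, "customer": customer_email}
    sales.append(sale)
    crm_events.put({"type": "CoffeeSold", **sale})   # fire and forget
    return "SOLD"


def crm_worker(create_contact):
    """Customer Management context: drains events at its own pace; failures stay here."""
    failed = []
    while not crm_events.empty():
        event = crm_events.get()
        try:
            create_contact(event["customer"])
        except Exception:
            failed.append(event)   # retry or park later; the sale has already completed
    return failed


def flaky_crm(email):
    raise TimeoutError("CRM is slow today")


result = sell_coffee("sam@example.com", "flat white")
failed = crm_worker(flaky_crm)
```

The sale succeeds even though the CRM call fails; the failed event can be retried later without the customer ever waiting at the counter.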
Summary
In summary, services within a context have high cohesion and services across contexts should have low coupling. Bad API design (no not the RAML/OAS stuff but the solid/dotted arrow stuff) can couple two independent business contexts
If we carefully model our interactions by consciously recognising our contexts and boundaries then we can build for business transactions more efficiently and open up possibilities for future connectivity and growth
APIs are the abstractions over technical services. Good APIs mirror strategic thinking in an organisation and lead to better customer experience by enabling high-degree of connectivity via secure mechanisms
Too much focus is on writing protocols & semantics with the desire to design good APIs and too little on business objectives. Not enough questions are asked early on and the focus is always on system-system integration. I believe thinking about what a business does and aligning services to it leads us to product centric thinking with reusable services
History
As an ardent student of software design and engineering principles, I have been keen on Domain Driven Design (DDD) and had the opportunity to apply these principles in the enterprise business context in building reusable and decoupled microservices. I believe the best way to share this experience is through a metaphor and I use a “Shopping mall” metaphor with “Shops” to represent a large enterprise with multiple lines of businesses and teams
Like all metaphors – mine breaks beyond a point but it helps reason about domains, bounded contexts, APIs, events and microservices. This post does not provide a dogmatic point-of-view or a “how to guide”; rather it aims to help you identify key considerations when designing solutions for an enterprise and is applicable upfront or during projects
I have been designing APIs and microservices in Health and Insurance domains across multiple lines of business, across varying contexts over the past 5-8 years. Through this period, I have seen architects (especially those without Integration domain knowledge) struggle to deliver strategic, product centric, business friendly APIs. The solutions handed to us always dealt with an “enterprise integration” context with little to no consideration for future “digital contexts” leading to brittle, coupled services and frustration from business teams around cost of doing integration ( reckon this is why IT transformation is hard )
This realisation led me to asking questions around some of our solution architecture practices and supporting them through better understanding and application of domain modeling and DDD (especially strategic DDD). Through this practice, I was able to design and deliver platforms for our client which were reusable and yet not coupled
In one implementation, my team delivered around 400 APIs and after 2 years the client has been able to make continuous changes & add new features without compromising the overall integrity of the connected systems or their data
Through my journey with DDD in the Enterprise, I discovered some fundamental rules about applying these software design principles in a broader enterprise context but first we had to step in to our customer’s shoes and ask some fundamental questions about their business and the way they function
The objective is to identify key aspects of the API ecosystem you are designing for. Below are some of the questions you need to answer through your domain queries
What are your top-level resources leading to a product centric design?
When do you decide what they are? Way up front or in a project scrum?
What are the interactions between these domain services?
How is the quality and integrity of your data impacted through our design choices?
How do you measure all of this “Integration entropy” – the complexity introduced by our integration choices between systems?
The Shopping Mall example
Imagine being asked to implement the IT system for a large shopping complex or shopping mall. This complex has a lot of shops which want to use the system for showing product information, selling them, shipping them etc
There are functions that are common to all the shops but with nuanced difference in the information they capture – for example, the Coffee Shop does “Customer Management” function with their staff, while the big clothes retail store needs to sell its own rewards point and store the customer’s clothing preferences and the electronics retail does its customer management function through its own points system
You have to design the core domains for the mall’s IT system to provide services they can use (and reuse) for their shops and do so while being able to change aspects of a shop/business without impacting other businesses
Asking Domain and Context questions
What are your top-level “domains” so that you can build APIs to link the Point-of-Sale (POS), CRM, Shipping and other systems?
Where do you draw the line? Is a service shared by all businesses or to businesses of a certain type or not shared at all?
Bounded contexts? What contexts do you see as the businesses do their business?
APIs or Events? How do you share information across the networked systems to achieve optimal flow of information while providing the best customer experience? Do you in the networked systems pick consistency or availability?
Through my journey with DDD in the Enterprise, I discovered some fundamental rules about applying these software design principles in a broader enterprise context. I found it useful to apply the Shopping Mall metaphor to a Business Enterprise when designing system integrations
It is important to understand the core business lines, capabilities (current and target state), business products, business teams, terminologies then do analysis on any polysemy across domains and within domain contexts leading to building domains, contexts and interactions
We then use this analysis to design our solution with APIs, events and microservices to maximise reuse and reduce crippling coupling
Human societies have been hit by pandemics through the ages and relied on the central governing authorities to manage the crisis and disseminate information. I believe this time around with COVID-19, our societies have access to more information from our governments because we have the internet
If this pandemic is an evolutionary challenge, then our response as a species to survive it will come through innovations in medicine, healthcare and technology. Not only will we improve on our lead time to develop vaccines as responses to evolving viruses but we will also accelerate key technologies which will help us respond to global challenges as a whole
The internet has allowed governing agencies to share information about the spread of COVID-19 in our communities through APIs, a common channel, in a clean, standardised, versioned, structured and self-describing manner leading to easier consumption by citizens and fuelling the rise of “citizen data scientists”
I argue this democratisation of pandemic data via APIs and its consumption leads to new learning opportunities, increased awareness of the spread of the disease, verification of information, better social response and innovation through crowdsourcing
Open Data: NSW Health
The https://data.nsw.gov.au/ portal provides access to state health data in NSW, Australia, and in March 2020 it published information about COVID-19 cases on the site. This website provides a very standardised approach to sharing this information, with metadata in JSON, RDF and XML for different consumers and links to the actual data within the metadata documents
Released in March 2020
Standardised to the NSW Health Data portal (see other fun datasets)
Self-describing with 3 formats of metadata – JSON, RDF & XML
I particularly loved the structure of the JSON metadata because it is quite self-describing, leading to a link to the document with the COVID-19 data
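Following a link from metadata to data takes only a few lines of code. The sketch below uses a made-up metadata document: the field names (`title`, `resources`, `format`, `url`) and the example.org URLs are assumptions for illustration and not the actual NSW Health schema.

```python
import json

# Hypothetical metadata document; field names and URLs are assumptions for illustration.
metadata_json = """
{
  "title": "COVID-19 cases by location",
  "resources": [
    {"format": "CSV", "url": "https://example.org/covid-cases.csv"},
    {"format": "JSON", "url": "https://example.org/covid-cases.json"}
  ]
}
"""


def data_links(metadata, wanted_format):
    """Return the data URLs advertised in the metadata for one format."""
    return [
        resource["url"]
        for resource in metadata.get("resources", [])
        if resource.get("format") == wanted_format
    ]


metadata = json.loads(metadata_json)
csv_links = data_links(metadata, "CSV")
```

Because the metadata describes itself, a consumer can discover the data location instead of hard-coding it.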
Rise of the consumer: Our Citizen Scientist
It did not take long for someone to come along, parse that information and present to us in a portal we can all relate to. During the early days of the pandemic, I was hooked onto https://www.covid19data.com.au/ and it provided me with Australia / NSW wide information about the spread, categories etc
However it was Ethan’s site that I loved the most as a “local consumer” to see what is happening in my postcode – the website here https://covid19nsw.ethan.link/ is a brilliant example of citizen science and is sourced from the NSW Health open data link above
Notice the URL: the NSW Health agency did not build the portal – it was this amazing person named Ethan
2020 is an interesting time in our history. We have the pandemic of our era but also better tools to understand its spread. The internet and standardised APIs are at the front and center of this information sharing and consumption
Everyone has the ability now to download information about the spread of the pandemic with time, geolocation and size embedded in this dataset. Everyone now has the ability to write programs to parse this dataset and do their own science on it
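To give a feel for how little code such a program needs, here is a minimal sketch that counts cases per postcode. The rows are made up and the column names (`notification_date`, `postcode`) are assumptions, not the real dataset’s schema.

```python
import csv
import io
from collections import Counter

# Made-up sample rows; column names are assumptions, not the real dataset's schema.
sample_csv = """notification_date,postcode
2020-03-01,2000
2020-03-01,2010
2020-03-02,2000
"""


def cases_by_postcode(csv_text):
    """Count rows per postcode in a CSV document held in memory."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["postcode"] for row in reader)


counts = cases_by_postcode(sample_csv)
```

Swapping the in-memory string for a downloaded file is the only change needed to run this against a real export, provided the column names are adjusted to match.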
Stateful microservices hold state while performing tasks with longer-than-normal execution times. They have the following characteristics
They have an API to start a new instance and an API to read the current state of a given instance
They orchestrate a bunch of actions that may be part of a single end-to-end transaction. It is not necessary to have these steps as a single transaction
They have tasks which wrap callouts to external APIs, DBs, messaging systems etc.
Their Tasks can define error handling and rollback conditions
They store their current state and details about completed tasks
Stateless microservice requests are generally optimised for short-lived request-response type applications. There are scenarios where long-running one-way request handling is required along with the ability to provide the client with the status of the request and the ability to perform distributed transaction handling and rollback (because XA sucked!)
So you need stateful because
there are a group of tasks that need to be done together as a step that is asynchronous with no guaranteed response-time or asynchronous one-way with a response notification due later
or there are a group of tasks where each step individually may have a short response time but aggregated response-time is large
or there are a group of tasks which are part of a single distributed transaction: if one fails you need to roll back all
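The last case, rolling back all completed tasks when one fails, can be sketched as a tiny orchestrator with compensating actions. This is a minimal in-memory illustration with hypothetical names (`run_saga` and the signup-style steps); a real implementation would persist its state and usually rely on a workflow or BPM product.

```python
def run_saga(steps):
    """Run (name, action, compensation) steps in order; on failure undo completed ones in reverse."""
    completed = []
    log = []
    for name, action, compensate in steps:
        try:
            action()
            completed.append((name, compensate))
            log.append(f"done:{name}")
        except Exception as error:
            log.append(f"failed:{name}:{error}")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"undone:{done_name}")
            return "ROLLED_BACK", log
    return "COMPLETED", log


state = []   # stands in for the external systems the tasks change


def fail_email():
    raise RuntimeError("mail server down")


steps = [
    ("create_account", lambda: state.append("account"), lambda: state.remove("account")),
    ("create_profile", lambda: state.append("profile"), lambda: state.remove("profile")),
    ("send_welcome_email", fail_email, lambda: None),
]

status, log = run_saga(steps)
```

The log doubles as the stored history of completed tasks, which is what a status query would read.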
Stateful microservice API
Microservices implementing this pattern generally provide two endpoints
An endpoint to initiate: for example, HTTP POST which responds with a status code of “Created” or “Accepted” (depending on what you do with the request) and responds back with a location
An endpoint to query request state: for example, HTTP GET using the process id from the initiate process response. The response is then the current state of the process with information about the past states
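The two endpoints can be sketched without any web framework as two methods that return HTTP-like status codes. Everything here is hypothetical: the class name, the `/signups/...` path and the state names are assumptions for illustration. The process id handed to the client is a random public id mapped to the internal one.

```python
import uuid


class SignupProcessApi:
    """In-memory stand-in for the two endpoints; no web framework involved."""

    def __init__(self):
        self._next_internal_id = 1
        self._processes = {}    # internal id -> list of states, oldest first
        self._public_ids = {}   # public id -> internal id

    def post_process(self, payload):
        """Initiate: behaves like HTTP POST answering 202 Accepted with a Location header."""
        internal_id = self._next_internal_id
        self._next_internal_id += 1
        self._processes[internal_id] = ["RECEIVED"]   # payload handling omitted in this sketch
        public_id = uuid.uuid4().hex                  # the internal id is never exposed
        self._public_ids[public_id] = internal_id
        return 202, {"Location": f"/signups/{public_id}"}

    def advance(self, public_id, new_state):
        """Called by the orchestration as tasks complete."""
        self._processes[self._public_ids[public_id]].append(new_state)

    def get_process(self, public_id):
        """Query: behaves like HTTP GET returning the current state and past states."""
        internal_id = self._public_ids.get(public_id)
        if internal_id is None:
            return 404, {}
        history = self._processes[internal_id]
        return 200, {"state": history[-1], "history": list(history)}


api = SignupProcessApi()
code, headers = api.post_process({"email": "sam@example.com"})
public_id = headers["Location"].rsplit("/", 1)[-1]
api.advance(public_id, "EMAIL_VERIFIED")
status_code, body = api.get_process(public_id)
```

The client fires the command, keeps the location it was given and polls it for status.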
Sample use case: User Signup
The process of signing-up or registering a new user requires multiple steps and interaction looks like this [Command]
The client can then check the status of the registration periodically [Query]
While the pattern is simple, I have seen the implementation vary with some key anti-patterns. These anti-patterns make the end solution brittle over time leading to issues with stateful microservice implementation and management
Enterprise business process orchestration: Makes it complex, couples various contexts. Keep it simple!
Hand rolling your own orchestration solution: Unlike regular services, operating long-running services requires additional tools for end-to-end observability and handling errors
Implementing via a stateless service platform and bootstrapping a database: The database can become the bottleneck and prevent your stateful services from scaling. Use available services/products as they have optimised their datastores to make them highly scalable and consistent
Leaking internal process id: Your end consumer should see some mapped id not the internal id of the stateful microservice. This abstraction is necessary for security (malicious user cannot guess different ids and query them) and dependency management
Picking a state machine product without “rollback”: Given that distributed transaction rollback and error-handling are two big things we are going to need to implement this pattern, it is important to pick a product that lets you do this. A lightweight BPM engine is great for this, otherwise you may need to hack around to achieve this in other tools
Using stateful process microservices for everything: Just don’t! Use the stateless pattern as they are optimal for the short-lived request/responses use cases. I have, for example, implemented request/response services with a BPEL engine (holds state) and lived to regret it
Orchestrate when Choreography is needed: If the steps do not make sense within a single context, do not require a common transaction boundary/rollback or the steps have no specific ordering with action rules in other microservices then use event-driven choreography
Stateful microservices are a thing! Welcome to my world. They let you orchestrate long-running or a bunch of short-running tasks and provide an abstraction over the process to allow clients to fire-and-forget and then come back to ask for status
Like everything, it is easy to fall into common traps when implementing this pattern and the best-practice is to look for a common boundary where orchestration makes sense