Implementing Stateful Process Adapters: Embedded BPMN or AWS Step Functions

Just based on recent experience, I am going to put this out there – AWS Step Functions are great for technical state machines which move from one-activity to another but not really designed for stateful process orchestration and definitely not for implementing SAGA

Serverless Step Functions from AWS or BPMN Engines?

When building microservices, the Mulesoft type platform lets you do a lot of the “stateless” request/response or async interfaces really well. But for “stateful” things, especially ones where we need the following, I think AWS Step functions are a half-baked option

This is because there are good embedded BPMN engines that can do the following:

  • Do stateful end-to-end flows and show them in a dashboard
  • Do stateful flows with activities with Synchronous or Asynchronous (request/response i.e. one-way request and then a wait for a message) actions (with AWS step functions, you code your way out of this)
  • Do out-of-the box RESTful APIs for starting a process, getting the tasks state for a process or pushing the state forward etc
  • Do business friendly diagrams
  • Do operational views with real-time “per process” view of current state or amazing historical views with heat-maps 
  • Easy to manage and maintain by the lowest common denominator in your team – lets face it, the cost of maintenance depends on the cost of your resource supporting it and not everyone is AWS skilled and cheap

The only argument I had heard for AWS was that it was better than the embedded BPM engines because we did not need to manage a database. We threw that argument out when our Step Functions had to use DynamoDB to handle storing the complex state

Screen Shot 2020-08-19 at 3.36.40 pm

Comparing the two offerings

Summary

Given my experience at a few clients with embedded BPM Engines and AWS Steps in implementing Long Running processes, I have found that Step Functions are great at doing simple state transitions but not easily maintainable and operable with issues around handling async activities and roll-backs – they can be done but you need to code for it!

The existing light weight BPM engines like Camunda offer a better alternative with self-managed and even hosted options and I love they way they present the process states visually especially the heat-maps with historical information

If you want a lot of simple state machines with scale – pick the serverless option but if you want a solid orchestration option, my preference is using BPMN engines like Camunda

heatmap

Stateful microservices pattern

What are stateful microservices?

Microservices holding state while performing some longer-than-normal execution time type tasks. They have the following characteristics

  1. They have an API to start a new instance and an API to read the current state of a given instance
  2. They orchestrate a bunch of actions that may be part of a single end-to-end transaction. It is not necessary to have these steps as a single transaction
  3. They have tasks which wrap callouts to external APIs, DBs, messaging systems etc.
  4. Their Tasks can define error handling and rollback conditions
  5. They store their current state and details about completed tasks

Screen Shot 2020-03-13 at 7.52.57 pm

Why stateful?

Stateless microservice requests are generally optimised for short-lived request-response type applications.  There are scenarios where long-running one-way request handling is required along with the ability to provide the client with the status of the request and the ability to perform distributed transaction handling and rollback (because XA sucked!)

So you need stateful because

  • there are a group of tasks that need to be done together as a step that is asynchronous with no guaranteed response-time or asynchronous one-way with a response notification due later
  • or there are a group of tasks where each step individually may have a short response time but  aggregated response-time is large
  • or there are a group of tasks which are part of a single distributed transaction if one fails you need to rollback all

Stateful microservice API

Microservices implementing this pattern generating provide two endpoints

  1. An endpoint to initiate: for example, HTTP POST which responds with a status code of “Created” or “Accepted” (depending on what you do with the request) and responds back with a location
  2. An endpoint to query request state: for example, HTTP GET using the process id from the initiate process response. The response is then the current state of the process with information about the past states

Sample use case: User Signup

  1. The process of signing-up or registering a new user requires multiple steps and interaction looks like this [Command]
  2. The client can then check the status of the registration periodically [Query]

Command

POST /registrations HTTP/1.1Content-Type: application/jsonHost: myapi.org

{ "firstName": "foo","lastName":"bar",email:"foo@bar.com" }
HTTP/1.1 201 Created  
Location: /registrations/12345

Query

GET /registrations/12345 HTTP/1.1Content-Type: application/jsonHost: myapi.org

{ "firstName": "foo","lastName":"bar",email:"foo@bar.com" }
HTTP/1.1 200 Ok  

{ "id":"12345", "status":"Pending", "data": { "firstName": "foo","lastName":"bar",email:"foo@bar.com" }}

Screen Shot 2020-03-13 at 7.38.41 pm

Anti-patterns

While the pattern is simple, I have seen the implementation vary with some key anti-patterns. These anti-patterns make the end solution brittle over time leading to issues with stateful microservice implementation and management

  1. Enterprise business process orchestration: Makes it complex, couples various contexts. Keep it simple!
  2. Hand rolling your own orchestration solution: Unlike regular services, operating long-running services requires additional tools for end-to-end observability and handling errors
  3. Implementing via a stateless service platform and bootstrapping a database: The database can become the bottleneck and prevent your stateful services from scaling. Use available services/products as they optimised their datastores to make them highly scalable and consistent
  4. Leaking internal process id: Your end consumer should see some mapped id not the internal id of the stateful microservice. This abstraction is necessary for security (malicious user cannot guess different ids and query them) and dependency management
  5. Picking a state machine product without “rollback”: Given that distributed transaction rollback and error-handling are two big things we are going need to implement this pattern, it is important to pick a product that lets you do this. A lightweight BPM engine is great for this otherwise you may need to hack around to achieve this in other tools
  6. Using stateful process microservices for everything: Just don’t! Use the stateless pattern as they are optimal for the short-lived request/responses use cases. I have, for example, implemented request/response services with a BPEL engine (holds state) and lived to regret it
  7. Orchestrate when Choreography is needed: If the steps do not make sense within a single context, do not require a common transaction boundary/rollback or the steps have no specific ordering with action rules in other microservices then use event-driven choreography

Summary

Stateful microservices are a thing! Welcome to my world. They let you orchestrate long-running or a bunch of short-running tasks and provide an abstraction over the process to allow clients to fire-and-forget and then come back to ask for status

Screen Shot 2020-03-13 at 8.37.14 pm

Like everything, it is easy to fall into common traps when implementing this pattern and the best-practice is to look for a common boundary where orchestration makes sense

Screen Shot 2020-03-13 at 8.33.59 pm

 

Camunda BPM – Manual Retry

The Camunda BPM is a lightweight, opensource BPM platform (See here: https://camunda.com/bpm/features).

The “Cockpit” application within Camunda is the admin dashboard where deployed processes can viewed at a glance and details of running processes are displayed by the process instance ids. Clicking on a process instance id reveals runtime details while the process is running – process variables, task variables etc. If there is an error thrown from any of the services in a flow, the Cockpit application allows “retrying” the service manually after the coded automatic-retries have completed.

The steps below show how to retry a failed process from the Cockpit administrative console

Manual Retry

From the Process Detail Screen select a failed process (click on the GUID link)

Then on the Right Hand side you should see under “Runtime” a “semi-circle with arrow” indicating “recycle” or “retry” – click on this

This launches a pop-up with check-boxes with IDs for the failed service tasks and option to replay them. Notice these are only tasks for “this” instance of the “flow”

After the “Retry” is clicked another pop-up indicates the status of the action (as in “was the engine able to process the retry request”. The actual Services may have failed again)