Event-driven : SAGA Compensation

Vijay Redkar
Jul 30, 2021

Business Objective -

“BankNext” employed event driven choreography to enhance its processing capacity manifold without compromising on system flexibility. Things were good until Business noticed abnormal system behavior when reconciling customer and account information. The requirement was that every active customer in the system must have a valid account. It was noticed that there were many active customers without any account.

Problem Scenarios-

1- Dangling Customers - Active Customers without any valid Account
2- Linkage not created - Valid Customers and Accounts present but linkage between these entities is missing

How did that happen?

BankNext’s current event driven choreography (w/o SAGA) can be found here

Event-Driven without SAGA

Back to the Drawing Board-

After careful analysis, the Engineering team observed that -

1- BankNext’s event driven system is composed of multiple msvcs that independently manage data persistence
2- If any of the msvcs in this transactional flow fails then there is no provision to correct/rollback the data that was already persisted by an upstream msvc
3- This was the root cause of severe data integrity problems in the system

Technical Challenge -

1- The “CustomerMgt” msvc successfully creates the customer entity in the “customer” table in the RDBMS
2- Next, “AccountMgt” msvc creates the account entity in the “account” DB table, after receiving the customer creation event on the Kafka topic it is subscribed to
3- In the last step, “EntityAggregation” msvc links these 2 entities by creating an entry in the “customer_account_map” table in the RDBMS
4- The process is considered successful if & only if the above 3 steps are atomic i.e. either all 3 steps complete successfully or none do
5- If/when the “AccountMgt” or “EntityAggregation” msvc fails for any reason then the data integrity problems occur

Example scenario -1
-Customer entity created successfully.
-AccountMgt msvc fails the request per the business validation rule (e.g. only “SAVINGS” & “CURRENT” are allowed but “OVERDRAFTS” was received)
-Account entity creation aborts.
-Customer is left without an Account. Thus Dangling Customer.
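
Scenario 1 can be reproduced with a small in-memory simulation. This is only a sketch under stated assumptions: the class name, method names and maps are hypothetical stand-ins for the msvcs and their DB tables, and direct method calls replace the Kafka events of the real system.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for the three msvcs and their tables.
class DanglingCustomerDemo {

    // Each map plays the role of one DB table (illustrative only).
    static final Map<String, String> customerTable = new HashMap<>();
    static final Map<String, String> accountTable = new HashMap<>();
    static final Map<String, String> customerAccountMap = new HashMap<>();

    // Step 1: "CustomerMgt" persists the customer and commits immediately.
    static void createCustomer(String customerId) {
        customerTable.put(customerId, "ACTIVE");
    }

    // Step 2: "AccountMgt" persists the account, or fails on validation.
    static String createAccount(String customerId, String accountType) {
        if (!accountType.equals("SAVINGS") && !accountType.equals("CURRENT")) {
            throw new IllegalArgumentException("Unsupported account type: " + accountType);
        }
        String accountId = "ACC-" + customerId;
        accountTable.put(accountId, accountType);
        return accountId;
    }

    // Step 3: "EntityAggregation" links the two entities.
    static void linkEntities(String customerId, String accountId) {
        customerAccountMap.put(customerId, accountId);
    }

    // Runs the three steps in order, with no compensation on failure.
    static void onboard(String customerId, String accountType) {
        createCustomer(customerId);
        try {
            String accountId = createAccount(customerId, accountType);
            linkEntities(customerId, accountId);
        } catch (IllegalArgumentException e) {
            // Without SAGA nobody undoes step 1, so the customer row stays.
            System.out.println("Account creation failed: " + e.getMessage());
        }
    }

    public static void main(String[] args) {
        onboard("C1", "SAVINGS");
        onboard("C2", "OVERDRAFTS");
        System.out.println("customers = " + customerTable.keySet());
        System.out.println("links     = " + customerAccountMap);
    }
}
```

After the run, customer C2 is still present in the customer map but has no account and no link, which is exactly the Dangling Customer state.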

Example scenario -2
-Customer entity created successfully.
-Account entity created successfully.
-EntityAggregation msvc fails the request per the business validation rule (e.g. for “USA” the only type allowed is “SAVINGS” but “CURRENT” was received)
-Customer and Account linking aborts
-Active Customer and Account are created but are not linked
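
The country-specific rule from scenario 2 could be expressed as a simple lookup. The sketch below is an assumption about how such a rule might look, not the actual BankNext code: the class name, the map and the method isLinkAllowed are hypothetical, and countries without an entry are treated as unrestricted purely for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical validation used by "EntityAggregation" before linking.
class CountryRuleDemo {

    // Allowed account types per country. Only the "USA" rule comes from
    // the scenario above; any other entry would be an assumption.
    static final Map<String, Set<String>> ALLOWED_TYPES_BY_COUNTRY = new HashMap<>();
    static {
        ALLOWED_TYPES_BY_COUNTRY.put("USA", Collections.singleton("SAVINGS"));
    }

    // Returns true when the customer/account pair may be linked.
    // Countries with no entry are not restricted in this sketch.
    static boolean isLinkAllowed(String country, String accountType) {
        Set<String> allowed = ALLOWED_TYPES_BY_COUNTRY.get(country);
        return allowed == null || allowed.contains(accountType);
    }

    public static void main(String[] args) {
        System.out.println(isLinkAllowed("USA", "SAVINGS")); // rule satisfied
        System.out.println(isLinkAllowed("USA", "CURRENT")); // scenario 2: rejected
    }
}
```

Note that by the time this check rejects the request, both the customer and the account rows are already committed by the upstream msvcs, which is why a rejection here leaves two unlinked entities behind.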

Solution Approach : SAGA compensation

Engineering team concluded that the architecture lacked a self correcting/compensating mechanism

1- In such a failure scenario, the solution requires that all the data persisted by upstream msvcs be rolled back
2- Thus the system data is brought back to the state it was in before this txn started
3- SAGA approach is recommended to accomplish this task

Technical Implementation : SAGA compensation (Github)

Event-Driven with SAGA

1- New Kafka topics are created to capture failure messages -
customer_failure_topic
account_failure_topic
entitymap_failure_topic

2- When a msvc fails, a failure message is published to the corresponding failure topic
3- When “AccountMgt” publishes to the “account_failure_topic”, “CustomerMgt”, which is subscribed to this topic, will be notified and will initiate the DB rollback activity.
4- When “EntityAggregation” publishes to the “entitymap_failure_topic”, “AccountMgt” & “CustomerMgt”, which are both subscribed to this topic, will be notified and will each initiate the DB rollback activity.
5- Thus SAGA compensation achieves the effect of txn atomicity (either all steps complete or the completed ones are undone) and restores the original system state
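
The compensation flow in steps 1 to 5 can be sketched without a Kafka broker. This is a minimal illustration under stated assumptions, not the code from the Github repo: a synchronous in-memory EventBus stands in for Kafka, the success topic names customer_topic and account_topic are hypothetical (only the failure topic names come from the list above), the tables are maps keyed by customer id, and rollback is modelled as a plain delete.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

class SagaCompensationDemo {

    // Event payload shared by all topics (hypothetical shape).
    static class Event {
        final String customerId, accountType, country;
        Event(String customerId, String accountType, String country) {
            this.customerId = customerId;
            this.accountType = accountType;
            this.country = country;
        }
    }

    // Minimal synchronous publish/subscribe bus standing in for Kafka.
    static class EventBus {
        private final Map<String, List<Consumer<Event>>> subscribers = new HashMap<>();
        void subscribe(String topic, Consumer<Event> handler) {
            subscribers.computeIfAbsent(topic, t -> new ArrayList<>()).add(handler);
        }
        void publish(String topic, Event event) {
            for (Consumer<Event> h : subscribers.getOrDefault(topic, new ArrayList<>())) {
                h.accept(event);
            }
        }
    }

    // In-memory stand-ins for the three tables, keyed by customer id.
    static final Map<String, String> customerTable = new HashMap<>();
    static final Map<String, String> accountTable = new HashMap<>();
    static final Map<String, String> customerAccountMap = new HashMap<>();

    static EventBus wire() {
        EventBus bus = new EventBus();
        // "AccountMgt": create the account or publish a failure message.
        bus.subscribe("customer_topic", e -> {
            if (!"SAVINGS".equals(e.accountType) && !"CURRENT".equals(e.accountType)) {
                bus.publish("account_failure_topic", e);
                return;
            }
            accountTable.put(e.customerId, e.accountType);
            bus.publish("account_topic", e);
        });
        // "EntityAggregation": link the entities or publish a failure message.
        bus.subscribe("account_topic", e -> {
            if ("USA".equals(e.country) && !"SAVINGS".equals(e.accountType)) {
                bus.publish("entitymap_failure_topic", e);
                return;
            }
            customerAccountMap.put(e.customerId, "ACC-" + e.customerId);
        });
        // Compensation handlers: each msvc undoes only its own write.
        bus.subscribe("account_failure_topic", e -> customerTable.remove(e.customerId));
        bus.subscribe("entitymap_failure_topic", e -> accountTable.remove(e.customerId));
        bus.subscribe("entitymap_failure_topic", e -> customerTable.remove(e.customerId));
        return bus;
    }

    // "CustomerMgt": persist the customer, then announce it.
    static void onboard(EventBus bus, Event e) {
        customerTable.put(e.customerId, "ACTIVE");
        bus.publish("customer_topic", e);
    }

    public static void main(String[] args) {
        EventBus bus = wire();
        onboard(bus, new Event("C1", "SAVINGS", "USA"));    // all steps succeed
        onboard(bus, new Event("C2", "OVERDRAFTS", "USA")); // example scenario 1
        onboard(bus, new Event("C3", "CURRENT", "USA"));    // example scenario 2
        System.out.println("customers = " + customerTable.keySet());
        System.out.println("accounts  = " + accountTable);
        System.out.println("links     = " + customerAccountMap);
    }
}
```

After the run only C1 remains in all three maps; C2 and C3 are fully rolled back. In a real deployment the handlers would be Kafka consumers, and because delivery is asynchronous the rollback happens some time after the failure rather than immediately as in this synchronous sketch.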

SAGA Interactions

State of the DB -

SAGA-Customer-Rollback
SAGA-Account-Rollback
CustomerAccountIntegration-Success

New Architecture Positives -

1- SAGA enables robustness and graceful system integrity restoration
2- Provides a sturdy & flexible mechanism to accomplish process atomicity
3- Eliminates tight coupling via publish & subscribe, as needed

New Architecture Negatives -

1- SAGA adds to system complexity and maintenance
2- Observability, system debugging and tracing capabilities need to be significantly ramped up
3- There is a danger that the rollback operation triggered by the SAGA may itself fail.
4- Retries may be needed in such scenarios which further complicate matters
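
One common way to soften points 3 and 4 is to make each rollback idempotent and retry it a bounded number of times. The sketch below is an assumption about how that could look, not part of the BankNext implementation: the Compensation interface, runWithRetry and the linear backoff are all hypothetical choices.

```java
import java.util.HashMap;
import java.util.Map;

class CompensationRetryDemo {

    // Hypothetical abstraction of one rollback action, e.g. deleting a row.
    interface Compensation {
        void run() throws Exception;
    }

    // Runs the compensation up to maxAttempts times with linear backoff.
    // Returns true as soon as one attempt succeeds, false otherwise.
    static boolean runWithRetry(Compensation compensation, int maxAttempts, long backoffMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                compensation.run();
                return true;
            } catch (Exception e) {
                System.out.println("Rollback attempt " + attempt + " failed: " + e.getMessage());
            }
            if (attempt < maxAttempts && !sleepQuietly(backoffMillis * attempt)) {
                return false; // interrupted while waiting, give up
            }
        }
        return false; // caller should escalate, e.g. flag for manual repair
    }

    // Sleeps, restoring the interrupt flag if the thread is interrupted.
    static boolean sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        Map<String, String> customerTable = new HashMap<>();
        customerTable.put("C2", "ACTIVE");
        int[] calls = {0};

        // Simulated flaky rollback: fails twice, then succeeds.
        // Removing a key is idempotent, so repeating it is safe.
        Compensation deleteCustomer = () -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("DB unavailable");
            }
            customerTable.remove("C2");
        };

        boolean rolledBack = runWithRetry(deleteCustomer, 5, 10);
        System.out.println("rolled back = " + rolledBack + ", customers = " + customerTable.keySet());
    }
}
```

Idempotency matters here because a retried or redelivered failure message may trigger the same rollback more than once. When all attempts fail, the false return value gives the caller a place to raise an alert instead of silently leaving inconsistent data behind.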

Summary : Architecture, TechStack & Rationale
Github : https://github.com/vijayredkar/event-driven-platform-with-saga

Vijay Redkar

15+ years Java professional with extensive experience in Digital Transformation, Banking, Payments, eCommerce, Application architecture and Platform development