Event-driven : SAGA Compensation
Business Objective -
“BankNext” employed event-driven choreography to increase its processing capacity manifold without compromising system flexibility. Things went well until Business noticed abnormal system behavior when reconciling customer and account information. The requirement was that every active customer in the system must have a valid account. Yet many active customers were found without any account.
Problem Scenarios-
1- Dangling Customers - Active Customers without any valid Account
2- Linkage not created - Valid Customers and Accounts present but linkage between these entities is missing
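Both problem scenarios can be detected with simple reconciliation queries. The following is a minimal sketch using an in-memory SQLite database. Only the three table names come from this article; the columns (including a plain customer_id column on the account table), the sample rows and the function names are assumptions made for illustration.

```python
import sqlite3

# Hypothetical schema: only the table names are taken from the article.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE account (id INTEGER PRIMARY KEY, customer_id INTEGER, type TEXT);
    CREATE TABLE customer_account_map (customer_id INTEGER, account_id INTEGER);
    -- Customer 1 is healthy, customer 2 is dangling, customer 3 is unlinked.
    INSERT INTO customer VALUES (1, 'ACTIVE'), (2, 'ACTIVE'), (3, 'ACTIVE');
    INSERT INTO account VALUES (10, 1, 'SAVINGS'), (30, 3, 'CURRENT');
    INSERT INTO customer_account_map VALUES (1, 10);
""")

def dangling_customers(conn):
    """Problem 1: active customers that have no account at all."""
    rows = conn.execute("""
        SELECT c.id FROM customer c
        LEFT JOIN account a ON a.customer_id = c.id
        WHERE c.status = 'ACTIVE' AND a.id IS NULL
        ORDER BY c.id
    """)
    return [row[0] for row in rows]

def unlinked_accounts(conn):
    """Problem 2: accounts with no entry in customer_account_map."""
    rows = conn.execute("""
        SELECT a.customer_id, a.id FROM account a
        LEFT JOIN customer_account_map m
          ON m.customer_id = a.customer_id AND m.account_id = a.id
        WHERE m.account_id IS NULL
        ORDER BY a.id
    """)
    return [tuple(row) for row in rows]

print("Dangling customers:", dangling_customers(conn))
print("Unlinked (customer, account) pairs:", unlinked_accounts(conn))
```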
How did that happen?
BankNext’s current event driven choreography (w/o SAGA) can be found here
Back to the Drawing Board-
After careful analysis, the Engineering team observed that -
1- BankNext’s event-driven system is composed of multiple microservices (msvcs) that each manage their own data persistence independently
2- If any msvc in this transactional flow fails, there is no provision to correct or roll back the data already persisted by an upstream msvc
3- This was the root cause of severe data integrity problems in the system
Technical Challenge -
1- The “CustomerMgt” msvc successfully creates the customer entity in the “customer” table in the RDBMS
2- Next, the “AccountMgt” msvc creates the account entity in the “account” DB table after receiving the event on the Kafka topic it subscribes to
3- In the last step, the “EntityAggregation” msvc links these 2 entities by creating an entry in the “customer_account_map” table in the RDBMS
4- The process is considered successful if & only if the above 3 steps are atomic, i.e. either all 3 steps complete successfully or none does
5- If the “AccountMgt” or “EntityAggregation” msvc fails for any reason, data integrity problems occur
Example scenario -1
-Customer entity created successfully.
-AccountMgt msvc rejects the request per the business validation rule (e.g. only “SAVINGS” & “CURRENT” are allowed but “OVERDRAFTS” is received)
-Account entity creation aborts.
-Customer is left without an Account, i.e. a Dangling Customer.
Example scenario -2
-Customer entity created successfully.
-Account entity created successfully.
-EntityAggregation msvc rejects the request per the business validation rule (e.g. for “USA” the only type allowed is “SAVINGS” but “CURRENT” is received)
-Customer and Account linking aborts.
-Active Customer and Account are created but are not linked.
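The two example scenarios can be reproduced with a small simulation. This is a minimal sketch, not BankNext’s code: plain Python dicts stand in for the three tables, the three functions stand in for the msvcs, and the request fields (customer_id, account_id, account_type, country) are hypothetical names. The validation rules are the ones quoted in the examples above.

```python
# In-memory stand-ins for the three tables; no Kafka or RDBMS is involved.
db = {"customer": {}, "account": {}, "customer_account_map": set()}

ALLOWED_ACCOUNT_TYPES = {"SAVINGS", "CURRENT"}
# Assumption: countries not listed here have no extra restriction.
ALLOWED_TYPES_BY_COUNTRY = {"USA": {"SAVINGS"}}

def customer_mgt(req):
    db["customer"][req["customer_id"]] = "ACTIVE"

def account_mgt(req):
    if req["account_type"] not in ALLOWED_ACCOUNT_TYPES:
        raise ValueError("account type not allowed")
    db["account"][req["account_id"]] = req["account_type"]

def entity_aggregation(req):
    allowed = ALLOWED_TYPES_BY_COUNTRY.get(req["country"])
    if allowed is not None and req["account_type"] not in allowed:
        raise ValueError("account type not allowed for country")
    db["customer_account_map"].add((req["customer_id"], req["account_id"]))

def run_choreography(req):
    """Run the 3 steps in order; stop at the first failure WITHOUT any rollback."""
    for step in (customer_mgt, account_mgt, entity_aggregation):
        try:
            step(req)
        except ValueError as err:
            return f"{step.__name__} failed: {err}"
    return "ok"

# Scenario 1: the account is rejected, the customer stays behind.
scenario_1 = run_choreography(
    {"customer_id": 1, "account_id": 10, "account_type": "OVERDRAFTS", "country": "UK"})
# Scenario 2: the linkage is rejected, customer and account stay behind unlinked.
scenario_2 = run_choreography(
    {"customer_id": 2, "account_id": 20, "account_type": "CURRENT", "country": "USA"})

print(scenario_1, scenario_2, db, sep="\n")
```

After both runs, customer 1 has no account (scenario 1) and customer 2 has an account that was never linked (scenario 2), because nothing undoes the upstream writes.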
Solution Approach : SAGA compensation
The Engineering team concluded that the architecture lacked a self-correcting/compensating mechanism -
1- In such a failure scenario, the solution requires that all the data persisted by upstream msvcs be rolled back
2- Thus the system data is brought back to the state it was in before this txn started
3- The SAGA approach is recommended to accomplish this task
Technical Implementation : SAGA compensation (GitHub)
1- New Kafka topics are created for capturing failure messages in case of failure scenarios -
customer_failure_topic
account_failure_topic
entitymap_failure_topic
2- When a msvc fails, a failure message is published to the corresponding failure topic
3- When “AccountMgt” publishes to the “account_failure_topic”, “CustomerMgt”, which is subscribed to this topic, is notified and initiates the DB rollback.
4- When “EntityAggregation” publishes to the “entitymap_failure_topic”, “AccountMgt” & “CustomerMgt”, which are both subscribed to this topic, are notified and each initiates its own DB rollback.
5- Thus SAGA compensation achieves txn atomicity by restoring the original system state
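The compensation flow described in the steps above can be sketched as follows. This is an illustrative simulation under stated assumptions, not the code from the GitHub repository: a tiny synchronous Broker class stands in for Kafka, dicts stand in for the DB tables, the success topics customer_topic and account_topic are hypothetical names, and a rollback is assumed to delete the row. Only the failure topic names and the subscriptions come from the steps above.

```python
from collections import defaultdict

class Broker:
    """Tiny synchronous stand-in for Kafka topics (assumption, not a Kafka client)."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in list(self.subscribers[topic]):
            handler(event)

broker = Broker()
db = {"customer": {}, "account": {}, "customer_account_map": set()}

# --- CustomerMgt ---
def create_customer(event):
    db["customer"][event["customer_id"]] = "ACTIVE"
    broker.publish("customer_topic", event)  # hypothetical success topic

def rollback_customer(event):
    db["customer"].pop(event["customer_id"], None)  # assumed rollback = delete

# --- AccountMgt ---
def create_account(event):
    if event["account_type"] not in {"SAVINGS", "CURRENT"}:
        broker.publish("account_failure_topic", event)
        return
    db["account"][event["account_id"]] = event["account_type"]
    broker.publish("account_topic", event)  # hypothetical success topic

def rollback_account(event):
    db["account"].pop(event["account_id"], None)

# --- EntityAggregation ---
def create_link(event):
    if event["country"] == "USA" and event["account_type"] != "SAVINGS":
        broker.publish("entitymap_failure_topic", event)
        return
    db["customer_account_map"].add((event["customer_id"], event["account_id"]))

# Forward flow.
broker.subscribe("customer_topic", create_account)
broker.subscribe("account_topic", create_link)
# Compensation flow (steps 3 and 4 above).
broker.subscribe("account_failure_topic", rollback_customer)
broker.subscribe("entitymap_failure_topic", rollback_account)
broker.subscribe("entitymap_failure_topic", rollback_customer)

create_customer({"customer_id": 1, "account_id": 10,
                 "account_type": "OVERDRAFTS", "country": "UK"})   # scenario 1
create_customer({"customer_id": 2, "account_id": 20,
                 "account_type": "CURRENT", "country": "USA"})     # scenario 2
create_customer({"customer_id": 3, "account_id": 30,
                 "account_type": "SAVINGS", "country": "USA"})     # happy path
print(db)
```

In this sketch only the third request leaves data behind; the two failing requests are fully compensated. A real Kafka setup is asynchronous, so the intermediate state stays visible until the failure message has been consumed.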
State of the DB -
New Architecture Positives -
1- SAGA enables robustness and graceful system integrity restoration
2- Provides a sturdy & flexible mechanism to accomplish process atomicity
3- Eliminates tight coupling via publish & subscribe, as needed
New Architecture Negatives -
1- SAGA adds to system complexity and maintenance effort
2- Observability, system debugging and tracing capabilities need to be significantly ramped up
3- There is a danger that the rollback operation triggered by the SAGA may itself fail
4- Retries may be needed in such scenarios, which further complicates matters
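One common way to handle a failing rollback is to make the rollback idempotent and retry it with a backoff. The following is a minimal sketch of that idea. The helper retry, its parameters and the simulated DB outage are all hypothetical and are not taken from the BankNext implementation.

```python
import time

def retry(operation, attempts=3, base_delay=0.01):
    """Call operation until it succeeds; re-raise the last error when attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            time.sleep(base_delay * 2 ** (attempt - 1))

customers = {42: "ACTIVE"}
calls = {"count": 0}

def rollback_customer_42():
    """Simulated rollback: the first two calls fail, later calls succeed."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("DB temporarily unavailable")
    # pop with a default is idempotent: repeating the rollback is harmless.
    customers.pop(42, None)
    return True

rolled_back = retry(rollback_customer_42)
print(rolled_back, customers, calls["count"])
```

Idempotency matters here because a failure message may be delivered or retried more than once. If all attempts fail, the error propagates to the caller, which would then need a fallback such as an alert or manual reconciliation.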
Summary : Architecture, TechStack & Rationale
GitHub : https://github.com/vijayredkar/event-driven-platform-with-saga