BankNext Case Study - Troubleshoot Production w/ ServiceMesh Istio Metrics - Part 2

Vijay Redkar
5 min read · Oct 9, 2022

Troubleshoot real production scenarios w/ ServiceMesh Istio metrics

BankNext demonstrates its expedited troubleshooting techniques :
- Real production problems are usually accompanied w/ ambiguities.
- Reported degraded performance may be intermittent in nature.
- We may not even have the specific problematic txn ids to start with.
- Finding a pattern to start the troubleshooting becomes elusive.
- Continued SLA breaches inflict significant financial cost on the org.
- Expedited root cause analysis capability becomes extremely crucial.

This article details systematic approaches to recover from real-life production breakdowns.

Production Workflow : Expected Behavior

  • Application Setup -
    a. Part-1 details the BankNext production application setup steps.
    b. On completion, the view below shows the successful credit-check workflow (a quick verification sketch follows it).
    c. Txns flow between 3 mSvcs deployed in 3 separate namespaces.
    d. Interactions involve Kafka message publish & consumption.
    e. Mongo is used to persist & fetch data.
    f. GitHub - troubleshoot real production scenarios
Expected Healthy System Interactions
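Before troubleshooting, it helps to confirm this baseline. The commands below are a minimal sanity check, assuming the consumer and advanced namespaces referenced later in this article (the third namespace comes from the Part-1 setup): every workload pod should be Running with its Istio sidecar injected, i.e. 2/2 containers ready.
#verify workload pods & Istio sidecars (expect 2/2 containers ready per pod)
kubectl get pods -n consumer
kubectl get pods -n advanced
#confirm automatic sidecar injection is enabled on the namespaces
kubectl get namespaces -L istio-injection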

Challenge 1 : Latency Unexplained

  • Problem Statement -
    a. Consumers reported unusually high latencies over the past week.
    b. Critical SLA breaches observed.
    c. Operations team unable to find a pattern to troubleshoot.
    d. Engineering team not provided w/ specific problematic txn ids.
    e. Escalations going through the roof.
  • Root Cause Analysis w/ Kiali -
    a. Scouring vast log dumps to find signs of anomalies is impractical.
    b. Instead we can use Kiali’s focused span logs view.
  • Span Logs -
    a. This view lists the durations of all the txns over a period of time.
    b. It clearly shows txns that, at times, took more than 10 secs.
    c. The same type of txn completed within a second at other times.
    d. We can now narrow down to the problematic high-latency txns (a sample latency query is sketched after the view below).
Kiali - Span Logs View
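The same high-latency pattern can also be surfaced directly from Istio's standard Prometheus metrics, independent of the Kiali UI. The query below is a sketch; the destination_service value assumes the kyc-credit-check-advanced service used in this article's scripts.
#Grafana Prometheus query - p95 request latency per destination service
histogram_quantile(0.95, sum(rate(istio_request_duration_milliseconds_bucket{destination_service="kyc-credit-check-advanced.advanced.svc.cluster.local"}[5m])) by (le, destination_service))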
  • Application Logs -
    a. Enabling the application logs view provides detailed logs for this txn.
    b. The cause of the latency is clearly evident in these logs.
    c. The delay is due to Kafka unavailability during those instances (a quick broker check is sketched after the view below).
Kiali - Application Logs View
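Since the logs point at Kafka unavailability, a quick broker health check narrows it down further. This is a sketch; the app=kafka label is an assumption, adjust it to the Part-1 setup reinstated by 2-operations_Istio_Kafka_script_run.sh.
#quick check - are the Kafka broker pods up? (the app=kafka label is an assumption)
kubectl get pods -A -l app=kafka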
  • Scenario Simulation -
    a. If you wish to simulate this latency, run the below script.
    b. Watch the span logs, as explained above.
#generate latency error - kafka unavailable
cd kyc-k8-docker-istio/networking
./6B-operations_Istio_Test_Latency_Error_Scenario_script_run.sh
#after the simulation, to reinstate working Kafka
./2-operations_Istio_Kafka_script_run.sh

Challenge 2 : Transactional Failures

  • Problem Statement -
    a. Consumers unable to fetch expected records from Mongo.
    b. This causes downstream services to fail on compliance reporting.
    c. The bulk insertion process is periodic & automated.
    d. Any Mongo operation failures do not surface immediately.
    e. This issue does not occur consistently/predictably.
    f. Narrowing down the window of insertion errors is challenging.
    g. For every compliance failure, the org incurs a heavy financial penalty.
  • Root Cause Analysis w/ Kiali -
    a. Tracking the signs of error from logs of >100 mSvcs is impractical.
    b. Instead we can use Kiali’s unified graph + trace + span + metrics view.
  • Graph View -
    a. Workloads Graph provides a snapshot of all production mSvcs.
    b. Red edges indicate potential failures of interest (an error-rate query is sketched after the view below).
Problematic System Interactions
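The red edges can also be quantified with Istio's standard request metric. The query below is a sketch of a 5xx error-rate panel per destination service:
#Grafana Prometheus query - 5xx error rate per destination service (the red edges)
sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service)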
  • Traces View -
    a. Traces show all the txn hits/errors over a period of time.
    b. Selecting an individual trace opens the span details view.
Kiali - Traces View
  • Span View -
    a. Span shows details of child txns in the sequence of execution.
    b. Displays granular details of each child txn.
    c. Shows the point of failure & navigation to the application logs.
Kiali - Span Details View
  • Application Logs -
    a. This takes us to the exact location in the vast application logs.
    b. The cause of this txn failure is clearly evident in these logs.
    c. The txn failed due to a Mongo unique-constraint error (a matching log filter is sketched after the view below).
Kiali - Root Cause w/ Application Logs View
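The same evidence can be confirmed straight from the pod logs. This is a sketch; the app label and container name follow the naming pattern of this article's scripts and are assumptions, while E11000 is Mongo's standard duplicate-key error code.
#confirm the duplicate-key failure in the service logs (E11000 = Mongo duplicate key error)
kubectl logs -n advanced -l app=kyc-credit-check-advanced -c kyc-credit-check-advanced --tail=500 | grep -i "E11000"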
  • Scenario Simulation -
    a. If you wish to simulate this Mongo txn failure, run the command below.
    b. Watch the traces view, as explained above.
#generate Mongo unique constraint error
kubectl exec "$(kubectl get pod -l app=kyc-aggregator-mgt -n consumer -o jsonpath={.items..metadata.name})" -c kyc-aggregator-mgt -n consumer -- curl http://kyc-credit-check-advanced.advanced:8080/credit-check/advanced?triggerMongoError=Y -s -o /dev/null -w "%{http_code}\n"

Challenge 3 : Abrupt POD restarts - Heap Memory Leak

  • Problem Statement -
    a. Operations team reported frequent application restarts.
    b. There was no recent production deployment for this application.
    c. This application has been stable over a substantial period of time.
    d. The restarts are abrupt and irregular.
    e. Crucial in-flight txns are lost due to these unanticipated restarts.
    f. Overall system consistency is gravely threatened.
  • Root Cause Analysis w/ Kiali + Grafana -
    a. The Kiali application logs view shows instances of out-of-memory errors.
    b. The restarts could be a consequence of these errors (a quick check of the POD termination reason is sketched after the view below).
    c. The Grafana view provides detailed statistics on POD memory usage.
Kiali - Application Logs View: Heap Memory Overloading
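Kubernetes itself records why the container was last terminated, which complements the Kiali logs. The commands below are a sketch; the label is an assumption following this article's naming. A Reason of OOMKilled indicates the container memory limit was hit, whereas pure JVM heap exhaustion surfaces as java.lang.OutOfMemoryError in the application logs above.
#check restart counts and the last termination reason
kubectl get pods -n advanced
kubectl describe pod -n advanced -l app=kyc-credit-check-advanced | grep -i -A5 "last state"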
  • Grafana View -
    a. The Grafana view below shows the Heap memory utilization of the POD.
    b. The graph clearly shows memory usage increasing linearly.
    c. This is an indicator of a memory leak i.e. memory is not released.
    d. This causes the POD to crash and consequently restart (a complementary restart-count query is sketched after the views below).
Grafana view - POD Memory Leak Indicator
Grafana view - Linear Increase in POD Memory Usage
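To correlate the rising heap with the restarts, a restart-count panel can sit next to the memory graph. The query below is a sketch and assumes kube-state-metrics is scraped by the same Prometheus; the labels match the memory query used in the simulation section.
#Grafana Prometheus query - container restart count (assumes kube-state-metrics)
kube_pod_container_status_restarts_total{namespace="advanced", container="kyc-credit-check-advanced"}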
  • Scenario Simulation -
    a. If you wish to simulate this memory leak failure, run the below script.
    b. Watch the Grafana memory usage view, as explained above.
#generate memory leak - Heap memory overload
cd kyc-k8-docker-istio/networking
./6D-operations_Istio_Test_Heap_Memory_Breach_Scenario_script_run.sh
#Grafana Prometheus query - display POD memory usage graphs
container_memory_working_set_bytes{image!="", container_name!="POD", namespace="advanced", container="kyc-credit-check-advanced"}

Scenario 4 : CPU Usage Overload

To be continued in Part 3.

Scenario 5 : Peak Utilization Monitoring

To be continued in Part 3.

Scenario 6 : Threads Deadlocked

To be continued in Part 3.

Conclusion :

- The most critical real-life production breakdowns were analyzed in detail.
- Demonstrated efficient strategies to narrow down failure points.
- Tools & dashboard setup elaborated for expedited troubleshooting.


Vijay Redkar

15+ years Java professional with extensive experience in Digital Transformation, Banking, Payments, eCommerce, Application architecture and Platform development