BankNext Case Study - Troubleshoot Production w/ ServiceMesh Istio Metrics - Part 2

Vijay Redkar
5 min read · Oct 9, 2022

Troubleshoot real production scenarios w/ ServiceMesh Istio metrics

BankNext demonstrates its expedited troubleshooting techniques :
- Real production problems are usually accompanied w/ ambiguities.
- Reported degraded performance may be intermittent in nature.
- We may not even have the specific problematic txn ids to start with.
- Finding a pattern to start the troubleshooting becomes elusive.
- Continued SLA breaches inflict significant financial cost on the org.
- Expedited root cause analysis capability becomes extremely crucial.

This article details systematic approaches to recover from real-life production breakdowns.

Production Workflow : Expected Behavior

  • Application Setup -
    a. Part-1 details the BankNext production application setup steps.
    b. On completion, the view below shows the successful credit-check workflow (a quick verification sketch follows it).
    c. Txns flow between 3 mSvcs deployed in 3 separate namespaces.
    d. Interactions involve Kafka message publish & consumption.
    e. Mongo is used to persist & fetch data.
    f. GitHub - troubleshoot real production scenarios
Expected Healthy System Interactions
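Before troubleshooting, it helps to confirm this baseline. The commands below are a minimal sanity check, assuming the consumer and advanced namespaces referenced later in this article (the third namespace comes from the Part-1 setup): every workload pod should be Running with its Istio sidecar injected, i.e. 2/2 containers ready.
#verify workload pods & Istio sidecars (expect 2/2 containers ready per pod)
kubectl get pods -n consumer
kubectl get pods -n advanced
#confirm automatic sidecar injection is enabled on the namespaces
kubectl get namespaces -L istio-injection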

Challenge 1 : Latency Unexplained

  • Problem Statement -
    a. Consumers reported unusually high latencies over the past week.
    b. Critical SLA breaches observed.
    c. Operations team unable to find a pattern to troubleshoot.
    d. Engineering team not provided w/ specific problematic txn ids.
    e. Escalations going through the roof.
  • Root Cause Analysis w/ Kiali -
    a. Scouring vast log dumps to find signs of anomalies is impractical.
    b. Instead we can use Kiali’s focused span logs view.
  • Span Logs -
    a. This view lists the durations of all the txns over a period of time.
    b. It clearly shows txns that, at times, took more than 10 secs.
    c. The same type of txn completed within a second at other times.
    d. We can now narrow down to the problematic high-latency txns (a sample latency query is sketched after the view below).
Kiali - Span Logs View
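The same high-latency pattern can also be surfaced directly from Istio's standard Prometheus metrics, independent of the Kiali UI. The query below is a sketch; the destination_service value assumes the kyc-credit-check-advanced service used in this article's scripts.
#Grafana Prometheus query - p95 request latency per destination service
histogram_quantile(0.95, sum(rate(istio_request_duration_milliseconds_bucket{destination_service="kyc-credit-check-advanced.advanced.svc.cluster.local"}[5m])) by (le, destination_service))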
  • Application Logs -
    a. Enabling the application logs view provides detailed logs for this txn.
    b. The cause of the latency is clearly evident in these logs.
    c. The delay is due to Kafka unavailability during those instances (a quick broker check is sketched after the view below).
Kiali - Application Logs View
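Since the logs point at Kafka unavailability, a quick broker health check narrows it down further. This is a sketch; the app=kafka label is an assumption, adjust it to the Part-1 setup reinstated by 2-operations_Istio_Kafka_script_run.sh.
#quick check - are the Kafka broker pods up? (the app=kafka label is an assumption)
kubectl get pods -A -l app=kafka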
  • Scenario Simulation -
    a. If you wish to simulate this latency, run the below script.
    b. Watch the span logs, as explained above.
#generate latency error - kafka unavailable
cd kyc-k8-docker-istio/networking
./6B-operations_Istio_Test_Latency_Error_Scenario_script_run.sh
#after the simulation, to reinstate working Kafka
./2-operations_Istio_Kafka_script_run.sh

Challenge 2 : Transactional Failures

  • Problem Statement -
    a. Consumers unable to fetch expected records from Mongo.
    b. This causes downstream services to fail on compliance reporting.
    c. The bulk insertion process is periodic & automated.
    d. Any Mongo operation failures do not surface immediately.
    e. This issue does not occur consistently/predictably.
    f. Narrowing down the window of insertion errors is challenging.
    g. For every compliance failure, the org incurs a heavy financial penalty.
  • Root Cause Analysis w/ Kiali -
    a. Tracking the signs of error from logs of >100 mSvcs is impractical.
    b. Instead we can use Kiali’s unified graph + trace + span + metrics view.
  • Graph View -
    a. Workloads Graph provides a snapshot of all production mSvcs.
    b. Red edges indicate potential failures of interest (an error-rate query is sketched after the view below).
Problematic System Interactions
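The red edges can also be quantified with Istio's standard request metric. The query below is a sketch of a 5xx error-rate panel per destination service:
#Grafana Prometheus query - 5xx error rate per destination service (the red edges)
sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service)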
  • Traces View -
    a. Traces show all the txn hits/errors over a period of time.
    b. Selecting an individual trace opens the span details view.
Kiali - Traces View
  • Span View -
    a. Span shows details of child txns in the sequence of execution.
    b. Displays granular details of each child txn.
    c. Shows the point of failure & navigation to the application logs.
Kiali - Span Details View
  • Application Logs -
    a. This takes us to the exact location in the vast application logs.
    b. The cause of this txn failure is clearly evident in these logs.
    c. The txn failed due to a Mongo unique-constraint error (a matching log filter is sketched after the view below).
Kiali - Root Cause w/ Application Logs View
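The same evidence can be confirmed straight from the pod logs. This is a sketch; the app label and container name follow the naming pattern of this article's scripts and are assumptions, while E11000 is Mongo's standard duplicate-key error code.
#confirm the duplicate-key failure in the service logs (E11000 = Mongo duplicate key error)
kubectl logs -n advanced -l app=kyc-credit-check-advanced -c kyc-credit-check-advanced --tail=500 | grep -i "E11000"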
  • Scenario Simulation -
    a. If you wish to simulate this Mongo txn failure, run the command below.
    b. Watch the traces view, as explained above.
#generate Mongo unique constraint error
kubectl exec "$(kubectl get pod -l app=kyc-aggregator-mgt -n consumer -o jsonpath={.items..metadata.name})" -c kyc-aggregator-mgt -n consumer -- curl http://kyc-credit-check-advanced.advanced:8080/credit-check/advanced?triggerMongoError=Y -s -o /dev/null -w "%{http_code}\n"

Challenge 3 : Abrupt POD restarts - Heap Memory Leak

  • Problem Statement -
    a. Operations team reported frequent application restarts.
    b. There was no recent production deployment for this application.
    c. This application has been stable over a substantial period of time.
    d. The restarts are abrupt and irregular.
    e. Crucial in-flight txns are lost due to these unanticipated restarts.
    f. Overall system consistency is gravely threatened.
  • Root Cause Analysis w/ Kiali + Grafana -
    a. The Kiali application logs view shows instances of out-of-memory errors.
    b. The restarts could be a consequence of these errors (a quick check of the POD termination reason is sketched after the view below).
    c. The Grafana view provides detailed statistics on POD memory usage.
Kiali - Application Logs View: Heap Memory Overloading
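Kubernetes itself records why the container was last terminated, which complements the Kiali logs. The commands below are a sketch; the label is an assumption following this article's naming. A Reason of OOMKilled indicates the container memory limit was hit, whereas pure JVM heap exhaustion surfaces as java.lang.OutOfMemoryError in the application logs above.
#check restart counts and the last termination reason
kubectl get pods -n advanced
kubectl describe pod -n advanced -l app=kyc-credit-check-advanced | grep -i -A5 "last state"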
  • Grafana View -
    a. The Grafana view below shows the Heap memory utilization of the POD.
    b. The graph clearly shows memory usage increasing linearly.
    c. This is an indicator of a memory leak i.e. memory is not released.
    d. This causes the POD to crash and consequently restart (a complementary restart-count query is sketched after the views below).
Grafana view - POD Memory Leak Indicator
Grafana view - Linear Increase in POD Memory Usage
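To correlate the rising heap with the restarts, a restart-count panel can sit next to the memory graph. The query below is a sketch and assumes kube-state-metrics is scraped by the same Prometheus; the labels match the memory query used in the simulation section.
#Grafana Prometheus query - container restart count (assumes kube-state-metrics)
kube_pod_container_status_restarts_total{namespace="advanced", container="kyc-credit-check-advanced"}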
  • Scenario Simulation -
    a. If you wish to simulate this memory leak failure, run the below script.
    b. Watch the Grafana memory usage view, as explained above.
#generate memory leak - Heap memory overload
cd kyc-k8-docker-istio/networking
./6D-operations_Istio_Test_Heap_Memory_Breach_Scenario_script_run.sh
#Grafana Prometheus query - display POD memory usage graphs
container_memory_working_set_bytes{image!="", container_name!="POD", namespace="advanced", container="kyc-credit-check-advanced"}

Scenario 4 : CPU Usage Overload

To be continued in Part 3.

Scenario 5 : Peak Utilization Monitoring

To be continued in Part 3.

Scenario 6 : Threads Deadlocked

To be continued in Part 3.

Conclusion :

- The most critical real-life production breakdowns were analyzed in detail.
- Demonstrated efficient strategies to narrow down failure points.
- Tools & dashboard setup elaborated for expedited troubleshooting.


Vijay Redkar

15+ years Java professional with extensive experience in Digital Transformation, Banking, Payments, eCommerce, Application architecture and Platform development