Chaos Engineering With Litmus: A CNCF Incubating Project

Problem statement: Ensuring the resilience of a microservices-based e-commerce platform.

System resilience is a core requirement for an e-commerce platform: services must stay available and performant even as the system scales. Our microservices-based platform experiences sporadic failures under heavy traffic. Kubernetes pod crashes, resource exhaustion, and network disruptions during peak shopping seasons degrade service availability and directly affect revenue.

We plan to use Litmus, a CNCF incubating project, to assess and improve the platform's resilience. Litmus lets us inject real-world failure modes such as pod terminations, network latency, and resource pressure, so controlled failure tests make the system's weak points visible. The experiments also let us validate our autoscaling automation, test disaster recovery procedures, and tune Kubernetes settings for overall reliability.

The goal is a platform that withstands failures and absorbs peak traffic without degrading the user experience. Applying chaos engineering proactively to our infrastructure reduces risk, improves observability, and lets us build automated recovery mechanisms that keep the platform resilient under any operating condition.

Set Up the Chaos Experiment Environment

Install LitmusChaos in your Kubernetes cluster:

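A minimal sketch of the install, assuming the standalone chaos-operator manifest; the version tag in the URL is a placeholder, so substitute the Litmus release you intend to run (a Helm install of the litmuschaos/litmus chart works as well):

Shell

# Install the Litmus CRDs and the chaos-operator (version tag is a placeholder)
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.13.8.yaml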

Verify installation:

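For example, assuming the operator was installed into the litmus namespace:

Shell

# The chaos-operator pod should be Running
kubectl get pods -n litmus

# The Litmus CRDs (chaosengines, chaosexperiments, chaosresults) should be present
kubectl get crds | grep chaos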

Note: Ensure the Litmus CRDs are installed and the target application is healthy before running chaos experiments.

Define the Chaos Experiment

Create a ChaosExperiment YAML file to simulate a Pod Delete scenario.

Example (pod-delete.yaml):

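A trimmed sketch of the experiment definition; in practice you would usually pull the full pod-delete experiment for your Litmus version from the ChaosHub, since the image tag and permissions below are simplified placeholders:

YAML

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: pod-delete
  labels:
    name: pod-delete
spec:
  definition:
    scope: Namespaced
    # Trimmed permission set; the ChaosHub version lists the full RBAC rules
    permissions:
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["create", "delete", "get", "list", "patch", "update", "deletecollection"]
    image: "litmuschaos/go-runner:latest"
    imagePullPolicy: Always
    args:
      - -c
      - ./experiments -name pod-delete
    command:
      - /bin/bash
    env:
      - name: TOTAL_CHAOS_DURATION
        value: "15"
      - name: CHAOS_INTERVAL
        value: "5"
      - name: FORCE
        value: "false"
    labels:
      name: pod-delete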

Install ChaosOperator and Configure Service Account

Deploy ChaosOperator to manage experiments:

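If you applied the operator manifest during setup, the chaos-operator is already running; the commands below are a sketch that confirms the operator and registers the pod-delete experiment in the application namespace (ecommerce is an assumed namespace name):

Shell

# Confirm the chaos-operator deployment is up
kubectl get pods -n litmus

# Register the pod-delete ChaosExperiment in the application namespace
kubectl apply -f pod-delete.yaml -n ecommerce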

Note: Create a ServiceAccount, along with a Role and RoleBinding, to grant the experiment the permissions it needs.
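
A sketch of the RBAC objects, modeled on the pod-delete permissions published in the Litmus docs; the pod-delete-sa name and ecommerce namespace are assumptions, and the rules should be verified against your Litmus version:

YAML

apiVersion: v1
kind: ServiceAccount
metadata:
  name: pod-delete-sa
  namespace: ecommerce
  labels:
    name: pod-delete-sa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-delete-sa
  namespace: ecommerce
rules:
  - apiGroups: [""]
    resources: ["pods", "events"]
    verbs: ["create", "list", "get", "patch", "update", "delete", "deletecollection"]
  - apiGroups: [""]
    resources: ["pods/exec", "pods/log", "replicationcontrollers"]
    verbs: ["create", "list", "get"]
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets", "replicasets", "daemonsets"]
    verbs: ["list", "get"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "list", "get", "delete", "deletecollection"]
  - apiGroups: ["litmuschaos.io"]
    resources: ["chaosengines", "chaosexperiments", "chaosresults"]
    verbs: ["create", "list", "get", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-delete-sa
  namespace: ecommerce
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-delete-sa
subjects:
  - kind: ServiceAccount
    name: pod-delete-sa
    namespace: ecommerce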

Inject Chaos into the Target Application

Annotate the target application so that Litmus is permitted to inject chaos into it:

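A sketch, assuming the storefront runs as a Deployment named frontend in the ecommerce namespace (both names are placeholders); this annotation is what the ChaosEngine's annotationCheck looks for:

Shell

# Mark the target Deployment as eligible for chaos injection
kubectl annotate deploy/frontend litmuschaos.io/chaos="true" -n ecommerce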

Deploy a ChaosEngine to trigger the experiment:

Example (chaosengine.yaml):

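A sketch that reuses the assumed ecommerce namespace, frontend Deployment, and pod-delete-sa ServiceAccount from the previous steps:

YAML

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: ecommerce-chaos
  namespace: ecommerce
spec:
  appinfo:
    appns: "ecommerce"
    applabel: "app=frontend"
    appkind: "deployment"
  annotationCheck: "true"
  engineState: "active"
  chaosServiceAccount: pod-delete-sa
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            # Run chaos for 30s, deleting a target pod every 10s
            - name: TOTAL_CHAOS_DURATION
              value: "30"
            - name: CHAOS_INTERVAL
              value: "10"
            - name: FORCE
              value: "false"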

Apply the ChaosEngine:

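Applying the manifest above starts the experiment:

Shell

kubectl apply -f chaosengine.yaml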

Monitor the Experiment

View the progress:

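A sketch, using the assumed ecommerce-chaos engine name and ecommerce namespace:

Shell

# Engine status, experiment status, and related events
kubectl describe chaosengine ecommerce-chaos -n ecommerce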

Check the status of the chaos pods:

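Litmus spins up a runner pod and an experiment job next to the application pods; watching the namespace (the assumed ecommerce namespace below) shows the target pods being deleted and rescheduled:

Shell

# Watch the chaos runner, the experiment job, and the application pods
kubectl get pods -n ecommerce -w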

Analyze the Results

Post-experiment, review logs and metrics to determine if the application recovered automatically or failed under stress.
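
One quick check, sketched with the assumed names from earlier, is the ChaosResult object (named <engine>-<experiment>), whose verdict reports whether the experiment and its probes passed:

Shell

# "Verdict: Pass" means the application stayed healthy while pods were deleted
kubectl describe chaosresult ecommerce-chaos-pod-delete -n ecommerce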

Here are some metrics to monitor:

  • Application response time
  • Error rates during and after the experiment
  • Time taken for pods to recover

Solution

Root cause identified: During high traffic, pods failed due to an insufficient number of replicas in the deployment and improper resource limits.

Fixes applied:

  • Increased the number of replicas in the deployment to handle higher traffic
  • Configured proper resource requests and limits for CPU and memory in the pod specification
  • Implemented a Horizontal Pod Autoscaler (HPA) to handle traffic spikes dynamically (a sketch of the deployment and HPA changes follows this list)
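
A sketch of these changes, reusing the assumed frontend Deployment in the ecommerce namespace; the replica counts, resource figures, image, and CPU target are illustrative placeholders, not tuned values:

YAML

apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
  namespace: ecommerce
spec:
  replicas: 3              # raised baseline replica count
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
        - name: frontend
          image: example.com/frontend:1.0   # placeholder image
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: frontend
  namespace: ecommerce
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: frontend
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70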

Conclusion

By using LitmusChaos to simulate pod failures, we identified key weaknesses in the e-commerce platform’s Kubernetes deployment. The chaos experiment demonstrated that resilience can be significantly improved with scaling and resource allocation adjustments. Chaos engineering enabled proactive system hardening, leading to better uptime and customer satisfaction.

Source:
https://dzone.com/articles/chaos-engineering-litmus-cncf-incubating-project