Up-scale with Dynatrace (Release 0.6.2)
Demonstrates how to use the self-healing mechanisms of Keptn to solve a problem in a production stage by providing automated upscaling.
In this tutorial, you will learn how to use the capabilities of Keptn to provide self-healing for an application without modifying code. Specifically, Keptn will scale up the pods of an application when the application experiences heavy CPU saturation.
To inform Keptn about any issues in a production environment, monitoring has to be set up. The Keptn CLI helps with the automated setup and configuration of Dynatrace as the monitoring solution running in the Kubernetes cluster.
To add the remediation file to Keptn and to automatically configure Dynatrace, execute the following commands:
Make sure you are in the correct folder of your examples directory:
cd examples/onboarding-carts
Configure remediation actions for up-scaling based on Dynatrace alerts:
keptn add-resource --project=sockshop --stage=production --service=carts --resource=remediation.yaml --resourceUri=remediation.yaml
Configure Dynatrace with the Keptn CLI:
keptn configure monitoring dynatrace --project=sockshop
The file that has been added is remediation.yaml:

remediations:
- name: response_time_p90
  actions:
  - action: scaling
    value: +1
- name: Response time degradation
  actions:
  - action: scaling
    value: +1
Configure Dynatrace problem detection with a fixed threshold: For the sake of this demo, we will configure Dynatrace to detect problems based on fixed thresholds rather than on automatic baselining.
Log in to your Dynatrace tenant and go to Settings > Anomaly Detection > Services.
Within this menu, select the option Detect response time degradations using fixed thresholds, set the limit to 1000ms, and select Medium for the sensitivity.
Note: You can configure those fixed thresholds per service instead of globally.
To simulate user traffic that causes unhealthy behavior in the carts service, execute the following script. It adds special items to the shopping cart that trigger expensive calculations.
Move to the correct folder:
cd ../load-generation/bin
Start the load generation script depending on your OS (replace _OS_ with linux, mac, or win):
./loadgenerator-_OS_ "http://carts.sockshop-production.$(kubectl get cm keptn-domain -n keptn -o=jsonpath='{.data.app_domain}')" cpu
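If you prefer not to replace _OS_ by hand, the suffix can be derived from the output of `uname -s`. The following is a minimal sketch: the function name `loadgen_suffix` is hypothetical, and the mapping of Git Bash, MSYS, and Cygwin kernel names to `win` is an assumption about how the script is run on Windows.

```shell
#!/bin/sh
# Map the kernel name reported by `uname -s` to the suffix used by the
# load generator binaries (linux, mac, or win).
# loadgen_suffix is a hypothetical helper, not part of the Keptn examples.
loadgen_suffix() {
  case "$1" in
    Linux*)               echo "linux" ;;
    Darwin*)              echo "mac" ;;
    MINGW*|MSYS*|CYGWIN*) echo "win" ;;   # assumption: POSIX-like shell on Windows
    *)
      echo "unsupported OS: $1" >&2
      return 1
      ;;
  esac
}

# Usage (run from the load-generation/bin folder):
#   suffix="$(loadgen_suffix "$(uname -s)")"
#   domain="$(kubectl get cm keptn-domain -n keptn -o=jsonpath='{.data.app_domain}')"
#   "./loadgenerator-${suffix}" "http://carts.sockshop-production.${domain}" cpu
```

The function only prints the suffix, so the actual load generator call stays identical to the command above.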
Optional: Verify the load in Dynatrace
In your Dynatrace tenant, inspect the Response Time chart of the service entity that corresponds to the carts microservice. Hint: You can find the service in Dynatrace more easily by selecting the management zone Keptn: sockshop production.
As you can see in the time series chart, the load generation script causes a significant increase in the response time.
After approximately 10-15 minutes, Dynatrace will send out a problem notification because of the response time degradation.
After receiving the problem notification, the dynatrace-service will translate it into a Keptn CloudEvent. This event will eventually be received by the remediation-service that will look for a remediation action specified for this type of problem and, if found, execute it.
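To make the lookup step more concrete, the sketch below matches a problem title against the entries of remediation.yaml and prints the configured action and value. It is only an illustration of the idea, not the remediation-service's actual implementation: the function name `find_remediation_actions` is hypothetical, and the awk parsing assumes the simple layout of the remediation.yaml used in this tutorial (one action per remediation entry).

```shell
#!/bin/sh
# Print "<action> <value>" for the remediation entry whose name equals the
# given problem title. Returns a non-zero status if no entry matches.
# Hypothetical helper for illustration; assumes the simple file layout
# used in this tutorial rather than general YAML.
find_remediation_actions() {
  problem="$1"
  file="$2"
  awk -v want="$problem" '
    # Remember the name of the remediation entry we are currently in.
    /^- name:/ { sub(/^- name:[ ]*/, ""); current = $0; next }
    # Inside the matching entry, remember the action ...
    /- action:/ && current == want { sub(/.*- action:[ ]*/, ""); action = $0; next }
    # ... and print it together with its value.
    /value:/ && current == want { sub(/.*value:[ ]*/, ""); print action, $0; found = 1 }
    END { exit (found ? 0 : 1) }
  ' "$file"
}

# Usage (run from examples/onboarding-carts):
#   find_remediation_actions "Response time degradation" remediation.yaml
```

With the file from this tutorial, a problem titled Response time degradation resolves to the scaling action with the value +1, which is why an additional pod is started.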
In this tutorial, the number of pods will be increased to remediate the issue of the response time increase.
Check the executed remediation actions by executing:
kubectl get deployments -n sockshop-production
You can see that the carts-primary deployment is now served by two pods:
NAME            DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
carts-db        1         1         1            1           37m
carts-primary   2         2         2            2           32m
In addition, you should see an additional pod running when you execute:
kubectl get pods -n sockshop-production
NAME                             READY   STATUS    RESTARTS   AGE
carts-db-57cd95557b-r6cg8        1/1     Running   0          38m
carts-primary-7c96d87df9-75pg7   2/2     Running   0          33m
carts-primary-7c96d87df9-78fh2   2/2     Running   0          5m
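Because the problem notification can take several minutes to arrive, it may be convenient to poll the deployment instead of re-running the commands manually. The following is a minimal sketch under that assumption: `wait_for_replicas` is a hypothetical helper, and the default of 30 attempts with a 10-second delay is an arbitrary choice.

```shell
#!/bin/sh
# Poll a deployment until it reports at least the wanted number of
# available replicas, or give up after a number of attempts.
# wait_for_replicas is a hypothetical helper; attempts and delay
# defaults are arbitrary.
wait_for_replicas() {
  deployment="$1"
  namespace="$2"
  wanted="$3"
  attempts="${4:-30}"
  delay="${5:-10}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # availableReplicas is empty while no replica is available yet.
    available="$(kubectl get deployment "$deployment" -n "$namespace" \
      -o jsonpath='{.status.availableReplicas}' 2>/dev/null || true)"
    if [ "${available:-0}" -ge "$wanted" ]; then
      echo "$deployment has $available available replicas"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out waiting for $deployment" >&2
  return 1
}

# Usage:
#   wait_for_replicas carts-primary sockshop-production 2
```

The function returns a non-zero status on timeout, so it can be chained with other commands in a script.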
To get an overview of the actions that got triggered by the response time SLO violation, you can use the Keptn Bridge. If you have not exposed the Bridge yet, execute the following command:
keptn configure bridge --action=expose
The Keptn Bridge is then available at: https://bridge.keptn.YOUR.DOMAIN/
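If you do not remember your domain, the Bridge URL can be assembled from the same config map that the load generation command reads. This is a sketch that assumes the `app_domain` value of the keptn-domain config map is the domain referred to as YOUR.DOMAIN; the helper name `bridge_url` is hypothetical.

```shell
#!/bin/sh
# Build the Bridge URL from a Keptn domain.
# bridge_url is a hypothetical helper; it assumes the Bridge is exposed
# under the bridge.keptn subdomain of the given domain.
bridge_url() {
  if [ -z "$1" ]; then
    echo "usage: bridge_url DOMAIN" >&2
    return 1
  fi
  printf 'https://bridge.keptn.%s/\n' "$1"
}

# Usage:
#   bridge_url "$(kubectl get cm keptn-domain -n keptn -o=jsonpath='{.data.app_domain}')"
```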
In this example, the Bridge shows that the remediation-service triggered an update of the configuration of the carts service by increasing the number of replicas to 2. Once the additional replica was available, the wait-service waited for 10 minutes for the remediation action to take effect. Afterwards, an evaluation by the lighthouse-service was triggered to check whether the remediation action resolved the problem. In this case, increasing the number of replicas achieved the desired effect, since the evaluation of the service level objectives was successful.
Furthermore, you can see how the response time of the service decreased by viewing the time series chart in Dynatrace.
As before, go to the response time chart of the ItemsController service. Here you will see that the additional instance has helped to bring down the response time. Eventually, the problem that was detected earlier will be closed automatically.