You can auto-scale an LLM inference service in Kubernetes by configuring a HorizontalPodAutoscaler based on CPU or custom metrics.
The manifests below show one way to set this up.
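The following is a minimal sketch rather than a production-ready configuration: the image name (my-registry/llm-server:latest), container port, and CPU/memory values are placeholders you would replace with your own.

```yaml
# Deployment: runs the LLM inference server pods.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
        - name: llm-server
          image: my-registry/llm-server:latest  # placeholder image
          ports:
            - containerPort: 8080               # placeholder port
          resources:
            requests:
              cpu: "2"        # the HPA computes utilization against this request
              memory: 8Gi
            limits:
              cpu: "4"
              memory: 16Gi
---
# HorizontalPodAutoscaler: scales the Deployment between 2 and 10 replicas
# based on average CPU utilization across its pods.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU usage exceeds 70% of requests
```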

The key points in this configuration are:
- A Deployment manages the lifecycle of the LLM inference pods and defines their CPU resource requests and limits.
- A HorizontalPodAutoscaler (HPA) dynamically scales the number of pods between 2 and 10 replicas.
- CPU utilization is the scaling metric, targeting 70% average usage across the pods.
Taken together, this configuration scales the LLM inference service up and down with real-time load while guaranteeing a baseline of two replicas.
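To try it out, you could save both manifests to a single file, apply them with `kubectl apply -f <file>`, and then observe the autoscaler's current metrics and replica count with `kubectl get hpa`. Note that CPU-based scaling assumes the Kubernetes Metrics Server (or another provider of the resource metrics API) is installed in the cluster, since the HPA relies on it for utilization data.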