How should you configure your real-time inference endpoint during the deployment? Which configuration options matter the most?
Let's start with the existing configuration parameters and what they affect:
The number of pods is the number of instances deployed in Kubernetes. The load balancer splits the traffic between the pods.
Within the pod, we can control the number of vCPUs (virtual CPUs), the amount of RAM, and whether we have a GPU (and, if so, the GPU type).
Inside the pod, we run the ML inference process; the number of workers attribute controls how many worker processes run. Every worker is a separate, forked process.
Additionally, we can control the maximum batch size, which caps how many rows of data the DataFrame received by the model's predict function can contain.
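Taken together, these knobs could be captured in a small deployment config. A minimal sketch (the class and field names are illustrative, not from any specific platform):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EndpointConfig:
    """Illustrative real-time endpoint configuration (names are hypothetical)."""
    pods: int = 2             # Kubernetes instances; the load balancer splits traffic between them
    vcpus_per_pod: int = 2    # virtual CPUs available inside each pod
    ram_gb_per_pod: int = 4   # memory per pod; every forked worker loads its own copy of the model
    workers: int = 4          # forked inference processes per pod (no shared memory)
    max_batch_size: int = 32  # max number of rows in the DataFrame passed to predict()
    gpu_type: Optional[str] = None  # e.g. "T4"; None means CPU-only

config = EndpointConfig(pods=3, vcpus_per_pod=2, workers=4)
```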
Increase the number of pods when:
- You can't handle the traffic anymore by modifying the pod configuration.
- You want to use a large number of cheaper pods.
- You expect a spike in traffic, so you need more instances for a few hours and can return to the previous value later.
Increase the number of vCPUs if you want to use more workers and handle multiple requests in parallel.
Don't waste vCPUs!
If you don't increase the number of workers but increase vCPUs, you will waste resources! Those additional vCPUs won't be used.
In general, ML inference is a CPU-bound process, so we should follow the rule of thumb of one vCPU per two worker processes. Of course, if you run a simple model, you may try increasing the number of workers per vCPU.
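The rule of thumb above (two workers per vCPU for CPU-bound inference) can be sketched as a tiny helper; the function name is mine, not from any framework:

```python
def recommended_workers(vcpus: int, workers_per_vcpu: int = 2) -> int:
    """Two worker processes per vCPU is a reasonable default for
    CPU-bound inference; simple models may tolerate a higher ratio."""
    return vcpus * workers_per_vcpu

# A pod with 4 vCPUs would get 8 workers under the default ratio.
workers = recommended_workers(4)
```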
Increase the amount of RAM when you increase the number of workers.
Every worker runs as a separate forked process, so there is no shared memory. In every worker, you have to load the inference service and the model.
Increase the number of workers if you need to handle more traffic and your pods still have some unused CPU capacity and RAM.
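Because every forked worker loads its own copy of the inference service and the model, the pod's RAM requirement grows roughly linearly with the worker count. A back-of-the-envelope estimate (the numbers and function name are illustrative):

```python
def estimated_ram_mb(workers: int, model_mb: float,
                     service_overhead_mb: float = 200.0) -> float:
    """No shared memory between forked workers: each one pays for
    the model plus the inference service's own footprint."""
    return workers * (model_mb + service_overhead_mb)

# 4 workers with a 500 MB model and 200 MB of service overhead
# need roughly 2800 MB of RAM in the pod.
ram_needed = estimated_ram_mb(workers=4, model_mb=500.0)
```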
Increase the number of vCPUs when you have increased the max batch size per prediction request, you constantly send enough data to fill the entire batch, and your CPUs can't keep up anymore.
Don't waste GPUs!
Don't deploy a GPU instance if you process requests one by one! GPUs exist to parallelize the computation. When you process a batch of size 1, a GPU won't give you any performance improvement.
Use a GPU only when:
- Your code in the predict function and the model can handle more than one value at a time (preferably without iterating over them in the predict function).
- You can group requests into batches (you have enough data to send, and the client application can handle that).
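A predict function that benefits from batching operates on the whole DataFrame at once instead of looping over its rows. A minimal sketch with a stand-in model (multiplying by a coefficient is just a placeholder for real inference):

```python
import pandas as pd

COEFFICIENT = 2.0  # stand-in for a real model's parameters

def predict(batch: pd.DataFrame) -> pd.DataFrame:
    """Vectorized over the whole batch: no Python-level loop over rows,
    so a GPU (or SIMD on a CPU) can actually parallelize the work."""
    return batch.assign(prediction=batch["feature"] * COEFFICIENT)

result = predict(pd.DataFrame({"feature": [1.0, 2.0, 3.0]}))
```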