Serving real-time predictions from machine learning models requires an inference server. For inference on a CPU, the thread-count configuration has a major impact on system performance: with appropriate settings, it is not uncommon for throughput to improve severalfold.
In this session, we explain how to tune the number of threads for an inference server, starting from the fundamentals. Using Triton Inference Server as an example, we show how inference servers allocate CPU threads and why this creates a trade-off between latency and throughput. The goal is for participants to understand these mechanisms well enough to tune thread counts with confidence.
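As a taste of the kind of settings the session covers, the sketch below shows roughly where thread-related knobs live in a Triton model configuration (config.pbtxt). It assumes the ONNX Runtime backend, whose intra_op_thread_count and inter_op_thread_count parameters are set per model; the model name and the specific counts are placeholders, and suitable values depend on the hardware and workload.

# config.pbtxt -- a minimal sketch, assuming the ONNX Runtime backend;
# the model name and all counts below are placeholder values.
name: "resnet50_cpu"
backend: "onnxruntime"
max_batch_size: 8

# Number of model instances that serve requests in parallel on the CPU.
instance_group [
  { count: 2 kind: KIND_CPU }
]

# Threads used within a single inference call (ONNX Runtime backend parameters).
parameters { key: "intra_op_thread_count" value: { string_value: "4" } }
parameters { key: "inter_op_thread_count" value: { string_value: "1" } }

Roughly speaking, giving each request more intra-op threads tends to shorten the latency of that request, while running more instances with fewer threads each tends to raise aggregate throughput; this is the trade-off the session examines in detail.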