Serving real-time predictions from machine learning models requires an inference server. For inference on a CPU, the thread-count configuration has a major impact on system performance: with appropriate settings, it is not uncommon for throughput to improve severalfold.
In this session, we explain how to tune the number of threads for an inference server, starting from the fundamentals. Using Triton Inference Server as an example, we show how inference servers allocate CPU threads and why this creates a trade-off between latency and throughput. The goal is for participants to understand these mechanisms well enough to tune thread counts with confidence.
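As a taste of the kind of settings the session covers, the sketch below shows roughly where thread-related knobs live in a Triton model configuration (config.pbtxt). It assumes the ONNX Runtime backend, whose intra_op_thread_count and inter_op_thread_count parameters are set per model; the model name and the specific counts are placeholders, and suitable values depend on the hardware and workload.

# config.pbtxt -- a minimal sketch, assuming the ONNX Runtime backend;
# the model name and all counts below are placeholder values.
name: "resnet50_cpu"
backend: "onnxruntime"
max_batch_size: 8

# Number of model instances that serve requests in parallel on the CPU.
instance_group [
  { count: 2 kind: KIND_CPU }
]

# Threads used within a single inference call (ONNX Runtime backend parameters).
parameters { key: "intra_op_thread_count" value: { string_value: "4" } }
parameters { key: "inter_op_thread_count" value: { string_value: "1" } }

Roughly speaking, giving each request more intra-op threads tends to shorten the latency of that request, while running more instances with fewer threads each tends to raise aggregate throughput; this is the trade-off the session examines in detail.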