Neural personalized recommendation is the cornerstone of many cloud services and products, and it imposes heavy compute demand on cloud infrastructure. Thus, improving its execution efficiency translates directly into infrastructure-capacity savings. In this paper, we propose DeepRecSched, a recommendation-inference scheduler that maximizes latency-bounded throughput by taking into account inference-query size and arrival patterns, model architectures, and underlying hardware. By carefully optimizing task- versus data-level parallelism, DeepRecSched doubles the system throughput of server-class CPUs across eight industry-representative models. We deployed and evaluated this optimization in a production data center, reducing end-to-end tail latency by 30% for a wide variety of recommendation models. Finally, DeepRecSched demonstrates how specialized AI hardware can optimize system-level performance (QPS) and power efficiency (QPS per watt) for recommendation inference.
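The task- versus data-level parallelism trade-off above can be illustrated with a minimal sketch: small queries run as single tasks so many can execute concurrently, while large queries are split into sub-batches that fan out across workers. The threshold, worker count, and function names here are illustrative assumptions, not the scheduler's actual policy.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def run_inference(batch):
    # Stand-in for a real recommendation-model forward pass.
    return [x * 2 for x in batch]

def schedule_query(query, batch_size_threshold=64, workers=4):
    """Hypothetical per-query scheduling decision (illustrative only).

    Small queries favor task-level parallelism (run whole, alongside
    other queries); large queries favor data-level parallelism
    (split into sub-batches processed in parallel).
    """
    if len(query) <= batch_size_threshold:
        return run_inference(query)
    # Split the large query into roughly equal sub-batches.
    n = math.ceil(len(query) / workers)
    chunks = [query[i:i + n] for i in range(0, len(query), n)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(run_inference, chunks)
    return [y for chunk in results for y in chunk]
```

In practice the optimal split point depends on query size and arrival rate, model architecture, and hardware, which is why DeepRecSched tunes it per deployment rather than using a fixed threshold.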

To enable our design-space exploration of custom recommendation systems, we built and validated an end-to-end modeling infrastructure, DeepRecInfra. It enables studies across numerous recommendation use cases by taking into account at-scale effects, such as query-arrival patterns and recommendation-query sizes, observed in a production data center. DeepRecInfra also considers industry-representative models and tail-latency targets.
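The at-scale effects DeepRecInfra captures can be sketched as a synthetic query trace: Poisson arrivals paired with a heavy-tailed distribution of query sizes. The specific distributions and parameters below are illustrative assumptions, not the ones used in the paper.

```python
import random

def generate_query_trace(num_queries, qps, seed=0):
    """Illustrative query trace: (arrival_time, query_size) pairs.

    Assumes Poisson arrivals (exponential inter-arrival times at the
    given QPS) and log-normally distributed query sizes, i.e. the
    number of candidate items scored per query.
    """
    rng = random.Random(seed)
    t = 0.0
    trace = []
    for _ in range(num_queries):
        t += rng.expovariate(qps)                         # Poisson arrivals
        size = max(1, int(rng.lognormvariate(4.0, 1.0)))  # heavy-tailed sizes
        trace.append((t, size))
    return trace
```

Replaying such a trace against an inference server exercises the same latency-bounded throughput regime that tail-latency targets are measured under.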

DeepRecSys has been released! Click here to see the source code.

People: Udit Gupta and Samuel Hsia


  • Udit Gupta, Samuel Hsia, Vikram Saraph, Xiaodong Wang, Brandon Reagen, Gu-Yeon Wei, Hsien-Hsin S. Lee, David Brooks, Carole-Jean Wu. DeepRecSys: A System for Optimizing End-To-End At-scale Neural Recommendation Inference. In: The IEEE/ACM International Symposium on Computer Architecture (ISCA 2020).