DeepRecSys

Neural personalized recommendation is the cornerstone of many cloud services and products, and it imposes heavy compute demand on cloud infrastructure, so improving its execution efficiency translates directly into infrastructure-capacity savings. In this paper, we propose DeepRecSched, a recommendation-inference scheduler that maximizes latency-bounded throughput by taking into account inference-query size and arrival patterns, model architectures, and the underlying hardware. By carefully trading off task- versus data-level parallelism, DeepRecSched doubles the system throughput of server-class CPUs across eight industry-representative models. We also deployed and evaluated this optimization in a production data center, reducing end-to-end tail latency by 30% for a wide variety of recommendation models. Finally, DeepRecSched demonstrates how specialized AI hardware can improve system-level performance (QPS) and power efficiency (QPS per watt) for recommendation inference.
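To make the task- versus data-level parallelism trade-off concrete, here is a minimal Python sketch of the core scheduling idea. All names (split_query, tune_threshold, the toy cost model, and the candidate thresholds) are illustrative assumptions, not taken from the released DeepRecSys code: each query is split into batches no larger than a threshold, and the threshold is tuned offline against a tail-latency target.

```python
import random

# Hypothetical sketch of the DeepRecSched idea (names are illustrative):
# split each inference query into batches of size <= batch_threshold,
# trading task-level parallelism (fewer, larger batches) against data-level
# parallelism (more, smaller batches), and pick the threshold that meets a
# tail-latency target.

def split_query(query_size, batch_threshold):
    """Split a query of `query_size` items into per-core batches."""
    batches = []
    remaining = query_size
    while remaining > 0:
        batch = min(remaining, batch_threshold)
        batches.append(batch)
        remaining -= batch
    return batches

def simulated_latency_ms(batches, cores=8):
    """Toy cost model: batches run round-robin across `cores`; each batch
    pays a fixed launch overhead plus time linear in its size."""
    per_core = [0.0] * cores
    for i, b in enumerate(batches):
        per_core[i % cores] += 0.5 + 0.02 * b  # overhead + per-item cost
    return max(per_core)  # the query finishes when its slowest core does

def p95(xs):
    """95th-percentile of a list of latencies."""
    return sorted(xs)[int(0.95 * len(xs)) - 1]

def tune_threshold(query_sizes, latency_target_ms, candidates):
    """Pick the largest batch-size threshold whose p95 latency still meets
    the target; larger thresholds mean fewer batches and less overhead."""
    best = candidates[0]
    for t in candidates:
        lat = [simulated_latency_ms(split_query(q, t)) for q in query_sizes]
        if p95(lat) <= latency_target_ms:
            best = t
    return best

if __name__ == "__main__":
    random.seed(0)
    # Heavy-tailed query sizes, loosely mimicking production traffic.
    sizes = [min(1000, int(random.lognormvariate(4, 1)) + 1) for _ in range(2000)]
    t = tune_threshold(sizes, latency_target_ms=5.0,
                       candidates=[16, 32, 64, 128, 256])
    print("chosen batch-size threshold:", t)
```

Under this toy cost model, small thresholds expose more data-level parallelism per query but pay more per-batch overhead, while large thresholds do the opposite; the sweet spot shifts with the model and hardware, which is why DeepRecSched tunes the split per configuration rather than fixing it.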

To enable our design-space exploration of custom recommendation systems, we built and validated an end-to-end modeling infrastructure, DeepRecInfra. It supports studies across numerous recommendation use cases by capturing at-scale effects observed in a production data center, such as query-arrival patterns and recommendation-query sizes, alongside industry-representative models and tail-latency targets.
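The at-scale effects DeepRecInfra captures can be approximated with a small synthetic trace generator. The sketch below is an illustration under stated assumptions, not the released implementation: it assumes Poisson query arrivals and a lognormal query-size distribution (the real infrastructure replays distributions measured in production), and it evaluates a p99 tail-latency target.

```python
import random

# Illustrative stand-in for DeepRecInfra's load generator (assumed
# distributions, hypothetical names): exponential inter-arrival gaps
# (a Poisson process) plus heavy-tailed query sizes, evaluated against
# a tail-latency target.

def generate_trace(num_queries, arrival_rate_qps, seed=1):
    """Yield (arrival_time_s, query_size) pairs for a synthetic trace."""
    rng = random.Random(seed)
    t = 0.0
    for _ in range(num_queries):
        t += rng.expovariate(arrival_rate_qps)        # Poisson arrivals
        size = max(1, int(rng.lognormvariate(4, 1)))  # heavy-tailed sizes
        yield t, size

def tail_latency_ms(latencies_ms, percentile=99):
    """Return the given tail percentile, e.g. p99, of measured latencies."""
    xs = sorted(latencies_ms)
    return xs[min(len(xs) - 1, int(percentile / 100 * len(xs)))]

if __name__ == "__main__":
    trace = list(generate_trace(num_queries=10000, arrival_rate_qps=500))
    # Stand-in service model: latency grows linearly with query size.
    lats = [0.5 + 0.02 * size for _, size in trace]
    print("p99 latency (ms): %.2f" % tail_latency_ms(lats, 99))
```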

DeepRecSys has been released! Click here to see the source code.

People: Udit Gupta and Samuel Hsia

Publications

  • Udit Gupta, Samuel Hsia, Vikram Saraph, Xiaodong Wang, Brandon Reagen, Gu-Yeon Wei, Hsien-Hsin S. Lee, David Brooks, and Carole-Jean Wu. DeepRecSys: A System for Optimizing End-To-End At-Scale Neural Recommendation Inference. In Proceedings of the ACM/IEEE International Symposium on Computer Architecture (ISCA), 2020.