Publications by Author: Kim Hazelwood

2020
Liu Ke, Udit Gupta, Carole-Jean Wu, Benjamin Cho, Mark Hempstead, Brandon Reagen, Xuan Zhang, David Brooks, Vikas Chandra, Utku Diril, Amin Firoozshahian, Kim Hazelwood, Bill Jia, Hsien-Hsin Lee, Meng Li, Bert Maher, Dheevatsa Mudigere, Maxim Naumov, Martin Schatz, Mikhail Smelyanskiy, and Xiaodong Wang. 5/30/2020. “RecNMP: Accelerating Personalized Recommendation with Near-Memory Processing.” In The 47th IEEE/ACM International Symposium on Computer Architecture (ISCA 2020). Publisher's Version. Abstract:
Personalized recommendation systems leverage deep learning models and account for the majority of data center AI cycles. Their performance is dominated by memory-bound sparse embedding operations with unique irregular memory access patterns that pose a fundamental challenge to acceleration. This paper proposes a lightweight, commodity-DRAM-compliant, near-memory processing solution to accelerate personalized recommendation inference. The in-depth characterization of production-grade recommendation models shows that embedding operations with high model-, operator-, and data-level parallelism lead to memory bandwidth saturation, limiting recommendation inference performance. We propose RecNMP, which provides a scalable solution to improve system throughput, supporting a broad range of sparse embedding models. RecNMP is specifically tailored to production environments with heavy co-location of operators on a single server. Several hardware/software co-optimization techniques such as memory-side caching, table-aware packet scheduling, and hot entry profiling are studied, resulting in up to 9.8x memory latency speedup over a highly-optimized baseline. Overall, RecNMP offers 4.2x throughput improvement and 45.8% memory energy savings.
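To make the bottleneck concrete: the operator class RecNMP targets is embedding pooling, where each lookup gathers tens of rows from a large table at effectively random addresses, so performance is set by DRAM rather than compute. The sketch below is a minimal illustration of such a SparseLengthsSum-style operator; the table size, pooling factor, and batch size are assumptions for illustration, not figures from the paper.

```python
# Minimal sketch of a memory-bound embedding-pooling operator
# (SparseLengthsSum-style). Sizes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

num_rows, dim = 100_000, 64                  # embedding table: rows x dim
table = rng.standard_normal((num_rows, dim), dtype=np.float32)

def sparse_lengths_sum(table, indices, lengths):
    """Gather rows by sparse indices and sum them per lookup segment.

    Each gather touches an essentially random table row, so the
    operator is dominated by memory latency/bandwidth, not FLOPs.
    """
    out = np.empty((len(lengths), table.shape[1]), dtype=table.dtype)
    offset = 0
    for i, n in enumerate(lengths):
        out[i] = table[indices[offset:offset + n]].sum(axis=0)
        offset += n
    return out

# One batch of 128 lookups, each pooling ~40 random rows.
lengths = np.full(128, 40)
indices = rng.integers(0, num_rows, size=lengths.sum())
pooled = sparse_lengths_sum(table, indices, lengths)
print(pooled.shape)  # (128, 64)
```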
Udit Gupta, Carole-Jean Wu, Xiaodong Wang, Maxim Naumov, Brandon Reagen, David Brooks, Bradford Cottel, Kim Hazelwood, Bill Jia, Hsien-Hsin Lee, Andrey Malevich, Dheevatsa Mudigere, Mikhail Smelyanskiy, Liang Xiong, and Xuan Zhang. 2020. “The Architectural Implications of Facebook's DNN-based Personalized Recommendation.” In The 26th IEEE International Symposium on High-Performance Computer Architecture (HPCA 2020). Publisher's Version. Abstract:
The widespread application of deep learning has changed the landscape of computation in the data center. In particular, personalized recommendation for content ranking is now largely accomplished leveraging deep neural networks. However, despite the importance of these models and the amount of compute cycles they consume, relatively little research attention has been devoted to systems for recommendation. To facilitate research and to advance the understanding of these workloads, this paper presents a set of real-world, production-scale DNNs for personalized recommendation coupled with relevant performance metrics for evaluation. In addition to releasing a set of open-source workloads, we conduct in-depth analysis that underpins future system design and optimization for at-scale recommendation: Inference latency varies by 60% across three Intel server generations, batching and co-location of inferences can drastically improve latency-bounded throughput, and the diverse composition of recommendation models leads to different optimization strategies.
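As a rough structural illustration of the model class the paper characterizes (a bottom MLP over dense features, embedding lookups over sparse features, a feature-interaction stage, and a top MLP), here is a toy forward pass. All layer sizes, the pairwise-dot interaction, and the activation choices are assumptions for illustration, not the production configuration.

```python
# Toy DLRM-style recommendation forward pass; shapes are assumptions.
import numpy as np

rng = np.random.default_rng(1)
relu = lambda x: np.maximum(x, 0.0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

dim = 16                                        # shared embedding/MLP width
tables = [rng.standard_normal((1000, dim)) for _ in range(8)]  # sparse feats
w_bot = rng.standard_normal((13, dim))          # 13 dense features -> dim
n_vec = 1 + len(tables)                         # 1 dense vector + 8 embeddings
n_pairs = n_vec * (n_vec - 1) // 2              # pairwise interactions
w_top = rng.standard_normal((dim + n_pairs, 1))

def forward(dense, sparse_ids):
    z = relu(dense @ w_bot)                          # bottom MLP
    vecs = [z] + [t[i] for t, i in zip(tables, sparse_ids)]
    V = np.stack(vecs)                               # (n_vec, dim)
    dots = (V @ V.T)[np.triu_indices(n_vec, k=1)]    # feature interaction
    feat = np.concatenate([z, dots])
    return sigmoid(feat @ w_top)                     # top MLP -> click estimate

ctr = forward(rng.standard_normal(13), rng.integers(0, 1000, size=8))
print(ctr.item())
```

The embedding lookups in `tables` are the memory-bound half of the model; the two MLPs are the compute-bound half, which is why batching and co-location affect them so differently.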
2019
Peter Mattson, Christine Cheng, Cody Coleman, Greg Diamos, Paulius Micikevicius, David Patterson, Hanlin Tang, Gu Wei, Peter Bailis, Victor Bittorf, David Brooks, Dehao Chen, Debo Dutta, Udit Gupta, Kim Hazelwood, Andy Hock, Xinyuan Huang, Daniel Kang, David Kanter, Naveen Kumar, Jeffery Liao, Deepak Narayanan, Tayo Oguntebi, Gennady Pekhimenko, Lillian Pentecost, Vijay Janapa Reddi, Taylor Robie, Tom St John, Carole-Jean Wu, Lingjie Xu, Cliff Young, and Matei Zaharia. 10/2/2019. “MLPerf Training Benchmark.” arXiv preprint arXiv:1910.01500. Publisher's Version. Abstract:
Machine learning (ML) needs industry-standard performance benchmarks to support design and competitive evaluation of the many emerging software and hardware solutions for ML. But ML training presents three unique benchmarking challenges absent from other domains: optimizations that improve training throughput can increase the time to solution, training is stochastic and time to solution exhibits high variance, and software and hardware systems are so diverse that fair benchmarking with the same binary, code, and even hyperparameters is difficult. We therefore present MLPerf, an ML benchmark that overcomes these challenges. Our analysis quantitatively evaluates MLPerf's efficacy at driving performance and scalability improvements across two rounds of results from multiple vendors.
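The core measurement rule is time-to-solution: train until a target quality is reached and report the wall-clock time, aggregating several runs because training is stochastic. The harness below is an illustrative sketch of that rule, not MLPerf reference code; the toy job, target, and run count are invented.

```python
# Illustrative time-to-solution harness in the spirit of the MLPerf
# Training methodology; the job here is a stand-in, not a real model.
import random
import statistics
import time

def train_to_quality(train_step, evaluate, target, max_steps=100_000):
    """Run training steps until evaluate() reaches the target quality."""
    start = time.perf_counter()
    for _ in range(max_steps):
        train_step()
        if evaluate() >= target:
            return time.perf_counter() - start
    raise RuntimeError("target quality not reached")

def benchmark(make_job, target, runs=5):
    """Time-to-solution has high variance, so report the median over runs."""
    times = []
    for _ in range(runs):
        train_step, evaluate = make_job()   # fresh random init per run
        times.append(train_to_quality(train_step, evaluate, target))
    return statistics.median(times)

def make_job():
    # Toy "job": quality improves by a random increment each step.
    state = {"q": 0.0}
    def train_step(): state["q"] += random.uniform(0.0, 1e-3)
    def evaluate(): return state["q"]
    return train_step, evaluate

print(f"median time-to-solution: {benchmark(make_job, target=0.5):.3f}s")
```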
2015
Svilen Kanev, Juan Darago, Kim Hazelwood, Tipp Moseley, Gu Wei, and David Brooks. 6/13/2015. “Profiling a Warehouse-Scale Computer.” In International Symposium on Computer Architecture (ISCA). Publisher's Version. Abstract:
With the increasing prevalence of warehouse-scale (WSC) and cloud computing, understanding the interactions of server applications with the underlying microarchitecture becomes ever more important in order to extract maximum performance out of server hardware. To aid such understanding, this paper presents a detailed microarchitectural analysis of live datacenter jobs, measured on more than 20,000 Google machines over a three-year period, and comprising thousands of different applications. We first find that WSC workloads are extremely diverse, breeding the need for architectures that can tolerate application variability without performance loss. However, some patterns emerge, offering opportunities for co-optimization of hardware and software. For example, we identify common building blocks in the lower levels of the software stack. This “datacenter tax” can comprise nearly 30% of cycles across jobs running in the fleet, which makes its constituents prime candidates for hardware specialization in future server systems-on-chips. We also uncover opportunities for classic microarchitectural optimizations for server processors, especially in the cache hierarchy. Typical workloads place significant stress on instruction caches and prefer memory latency over bandwidth. They also stall cores often, but compute heavily in bursts. These observations motivate several interesting directions for future warehouse-scale computers.
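The “datacenter tax” result reduces to a simple aggregation: attribute sampled cycles to software components and sum the share spent in shared low-level services. The sketch below illustrates that bookkeeping with made-up sample data; the component names and numbers are assumptions, not fleet measurements.

```python
# Toy "datacenter tax" aggregation over profiler-style cycle samples.
# Components and cycle counts are invented for illustration.
from collections import Counter

# (component, sampled cycles) pairs, as a fleet profiler might emit them.
samples = [
    ("protobuf_serialization", 9.0), ("rpc_transport", 7.5),
    ("compression", 5.5), ("memory_allocation", 4.0),
    ("hashing", 3.0), ("app_logic", 55.0), ("kernel_other", 16.0),
]
# Shared low-level services counted as "tax" rather than application work.
TAX = {"protobuf_serialization", "rpc_transport", "compression",
       "memory_allocation", "hashing"}

cycles = Counter()
for component, c in samples:
    cycles[component] += c

total = sum(cycles.values())
tax = sum(c for comp, c in cycles.items() if comp in TAX)
print(f"datacenter tax: {100 * tax / total:.1f}% of cycles")
```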
2014
Svilen Kanev, Kim Hazelwood, Gu Wei, and David Brooks. 10/26/2014. “Tradeoffs between Power Management and Tail Latency in Warehouse-Scale Applications.” In 2014 IEEE International Symposium on Workload Characterization (IISWC), Pp. 31–40. IEEE. Publisher's Version. Abstract:
The growth in datacenter computing has increased the importance of energy-efficiency in servers. Techniques to reduce power have brought server designs close to achieving energy-proportional computing. However, they stress the inherent tradeoff between aggressive power management and quality of service (QoS) - the dominant metric of performance in datacenters. In this paper, we characterize this tradeoff for 15 benchmarks representing workloads from Google's datacenters. We show that 9 of these benchmarks often toggle their cores between short bursts of activity and sleep. In doing so, they stress sleep selection algorithms and can cause tail latency degradation or missed potential for power savings of up to 10% on important workloads like web search. However, improving sleep selection alone is not sufficient for large efficiency gains on current server hardware. To guide the direction needed for such large gains, we profile datacenter applications for susceptibility to dynamic voltage and frequency scaling (DVFS). We find the largest potential in DVFS which is cognizant of latency/power tradeoffs on a workload-per-workload basis.
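The sleep-selection tension the paper characterizes can be stated as a break-even rule: a deeper sleep state saves more idle power but costs more wake-up latency, so it only pays off when the idle period is long enough to amortize the wake-up, and a wrong guess costs either tail latency or savings. The toy model below illustrates that rule; the state parameters, margin, and idle periods are invented, not measured server values.

```python
# Toy C-state selection by break-even time; all numbers are assumptions.
# (state name, sleep power in watts, wake-up latency in microseconds)
STATES = [("C1", 2.0, 1.0), ("C3", 1.0, 50.0), ("C6", 0.3, 200.0)]
ACTIVE_W = 10.0  # power if the core simply stays awake through the idle gap

def choose_state(predicted_idle_us):
    """Pick the deepest state whose wake-up cost the idle period amortizes."""
    best = STATES[0]
    for state in STATES[1:]:
        if predicted_idle_us > 2 * state[2]:  # crude 2x break-even margin
            best = state
    return best

for idle_us in (5, 150, 1_000):
    name, sleep_w, wake_us = choose_state(idle_us)
    saved_uj = (ACTIVE_W - sleep_w) * idle_us  # watts * microseconds = uJ
    print(f"idle {idle_us:>5} us -> {name}: +{wake_us} us wake-up latency, "
          f"~{saved_uj:.0f} uJ saved vs. staying awake")
```

Workloads that toggle between short activity bursts and sleep stress exactly the prediction step in `choose_state`: overshoot and the wake-up latency lands on the request tail; undershoot and the power savings are left on the table.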
2010
Vijay Reddi, Simone Campanoni, Meeta Gupta, Michael Smith, Gu Wei, David Brooks, and Kim Hazelwood. 9/2010. “Eliminating voltage emergencies via software-guided code transformations.” ACM Transactions on Architecture and Code Optimization (TACO), 7, 2, Pp. 1-28. Publisher's Version. Abstract:
In recent years, circuit reliability in modern high-performance processors has become increasingly important. Shrinking feature sizes and diminishing supply voltages have made circuits more sensitive to microprocessor supply voltage fluctuations. These fluctuations result from the natural variation of processor activity as workloads execute, but when left unattended, these voltage fluctuations can lead to timing violations or even transistor lifetime issues. In this paper, we present a hardware-software collaborative approach to mitigate voltage fluctuations. A checkpoint-recovery mechanism rectifies errors when voltage violates maximum tolerance settings, while a run-time software layer reschedules the program’s instruction stream to prevent recurring violations at the same program location. The run-time layer, combined with the proposed code rescheduling algorithm, removes 60% of all violations with minimal overhead, thereby significantly improving overall performance. Our solution is a radical departure from the ongoing industry-standard approach of circumventing the issue altogether by optimizing for the worst-case voltage flux, which severely compromises power and performance efficiency, especially looking ahead to future technology generations. Existing conservative approaches will have severe implications for the ability to deliver efficient microprocessors. The proposed technique recasts a traditional reliability problem as a runtime performance optimization problem, thus allowing us to design processors for typical-case operation by building intelligent algorithms that can prevent recurring violations.
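Schematically, the collaborative loop in the abstract has two halves: hardware checkpoint-recovery handles each emergency instance, and the software runtime provides the long-term fix by rescheduling code at locations where emergencies recur. The sketch below illustrates only that control flow; the event interface and the rescheduling stub are invented for illustration, not the authors' implementation.

```python
# Schematic hardware/software loop for voltage emergencies; the event
# format and reschedule hook are illustrative assumptions.
from collections import Counter

emergency_counts = Counter()   # program location -> emergencies observed
rescheduled = set()            # locations already transformed
RESCHEDULE_THRESHOLD = 2       # act only on recurring violations

def on_voltage_emergency(pc, rollback, reschedule_region):
    """Hardware signals an emergency at program counter `pc`."""
    rollback()                 # checkpoint-recovery fixes this instance
    emergency_counts[pc] += 1
    # The runtime layer supplies the long-term fix: once a location
    # recurs, reorder its instruction stream to smooth current demand.
    if emergency_counts[pc] >= RESCHEDULE_THRESHOLD and pc not in rescheduled:
        reschedule_region(pc)
        rescheduled.add(pc)

# Simulated event stream: location 0x4242 keeps violating.
for pc in (0x4242, 0x1000, 0x4242, 0x4242):
    on_voltage_emergency(
        pc,
        rollback=lambda: None,
        reschedule_region=lambda p: print(f"rescheduling region at {p:#x}"),
    )
```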
2004
Kim Hazelwood and David Brooks. 8/2004. “Eliminating voltage emergencies via microarchitectural voltage control feedback and dynamic optimization.” In Proceedings of the 2004 International Symposium on Low Power Electronics and Design, Pp. 326–331. ACM. Publisher's Version. Abstract:

Microprocessor designers use techniques such as clock gating to reduce power dissipation. An unfortunate side-effect of these techniques is the processor current fluctuations that stress the power-delivery network. Recent research has focused on hardware-only mechanisms to detect and eliminate these fluctuations. While the solutions have been effective at avoiding operating-range violations, they have done so at a performance penalty to the executing program.

Compilers are well equipped to rearrange instructions such that current fluctuations are less dramatic, with minimal performance implications. Furthermore, a dynamic optimizer can eliminate the problem at run time, avoiding the difficult task of statically predicting voltage emergencies.

This paper proposes complementing existing hardware solutions with additional run-time software to address problematic code sequences that cause recurring voltage swings. Our proposal extends existing hardware techniques to additionally provide feedback to a dynamic optimizer, which can provide a long-term solution, often without impacting the performance of the executing application.

We found that recurring voltage fluctuations do exist in the SPEC2000 benchmarks, and that given very little information from the hardware, a dynamic optimizer can locate and correct many of the recurring voltage emergencies.
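One way to picture the rescheduling idea: treat each instruction as having a current draw and reorder the stream so that consecutive draws change gradually, reducing the di/dt stress that excites the power-delivery network. The greedy toy below illustrates the effect on a bursty sequence; it ignores data dependences, which any real compiler or dynamic optimizer must of course preserve.

```python
# Toy instruction rescheduling to smooth current steps; the per-op
# current values and greedy policy are illustrative assumptions.
def max_current_step(ops):
    """Largest change in current draw between consecutive ops."""
    return max(abs(b - a) for a, b in zip(ops, ops[1:]))

def smooth_schedule(ops):
    """Greedy reorder: always pick the op closest to the current level."""
    pool, out = sorted(ops), []
    level = pool.pop(0)
    out.append(level)
    while pool:
        nxt = min(pool, key=lambda c: abs(c - level))
        pool.remove(nxt)
        out.append(nxt)
        level = nxt
    return out

# Bursty sequence: alternating high- and low-current ops.
trace = [9, 1, 8, 2, 9, 1, 8, 2]
print("before:", max_current_step(trace))                   # 8
print("after: ", max_current_step(smooth_schedule(trace)))  # reduced
```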
