Publications by Author: Mark Hempstead

2020
Liu Ke, Udit Gupta, Carole-Jean Wu, Benjamin Cho, Mark Hempstead, Brandon Reagen, Xuan Zhang, David Brooks, Vikas Chandra, Utku Diril, Amin Firoozshahian, Kim Hazelwood, Bill Jia, Hsien-Hsin Lee, Meng Li, Bert Maher, Dheevatsa Mudigere, Maxim Naumov, Martin Schatz, Mikhail Smelyanskiy, and Xiaodong Wang. 5/30/2020. “RecNMP: Accelerating Personalized Recommendation with Near-Memory Processing.” In . The 47th IEEE/ACM International Symposium on Computer Architecture (ISCA 2020). Publisher's VersionAbstract
Personalized recommendation systems leverage deep learning models and account for the majority of data center AI cycles. Their performance is dominated by memory-bound sparse embedding operations with unique irregular memory access patterns that pose a fundamental challenge to accelerate. This paper proposes a lightweight, commodity DRAM compliant, near-memory processing solution to accelerate personalized recommendation inference. The in-depth characterization of production-grade recommendation models shows that embedding operations with high model-, operator- and data-level parallelism lead to memory bandwidth saturation, limiting recommendation inference performance. We propose RecNMP which provides a scalable solution to improve system throughput, supporting a broad range of sparse embedding models. RecNMP is specifically tailored to production environments with heavy co-location of operators on a single server. Several hardware/software co-optimization techniques such as memory-side caching, table-aware packet scheduling, and hot entry profiling are studied, resulting in up to 9.8x memory latency speedup over a highly-optimized baseline. Overall, RecNMP offers 4.2x throughput improvement and 45.8% memory energy savings.
RecNMP: Accelerating Personalized Recommendation with Near-Memory Processing
2012
Michael Lyons, Mark Hempstead, Gu Wei, and David Brooks. 1/2012. “The accelerator store: A shared memory framework for accelerator-based systems.” ACM Transactions on Architecture. and Code Optimization, 8, 4, Pp. 1-22. Publisher's VersionAbstract

This paper presents the many-accelerator architecture, a design approach combining the scalability of homogeneous multi-core architectures and system-on-chip's high performance and power-efficient hardware accelerators. In preparation for systems containing tens or hundreds of accelerators, we characterize a diverse pool of accelerators and find each contains significant amounts of SRAM memory (up to 90% of their area). We take advantage of this discovery and introduce the accelerator store, a scalable architectural component to minimize accelerator area by sharing its memories between accelerators. We evaluate the accelerator store for two applications and find significant system area reductions (30%) in exchange for small overheads (2% performance, 0%--8% energy). The paper also identifies new research directions enabled by the accelerator store and the many-accelerator architecture.

The accelerator store: A shared memory framework for accelerator-based systems
Michael Lyons, Mark Hempstead, Gu Wei, and David Brooks. 1/2012. “The Accelerator Store: a shared memory framework for accelerator-based systems.” Transactions on Architecture and Code Optimization (TACO). Publisher's VersionAbstract
This paper presents the many-accelerator architecture, a design approach combining the scalability of homogeneous multi-core architectures and system-on-chip’s high performance and power-efficient hardware accelerators. In preparation for systems containing tens or hundreds of accelerators, we characterize a diverse pool of accelerators and find each contains significant amounts of SRAM memory (up to 90% of their area). We take advantage of this discovery and introduce the accelerator store, a scalable architectural component to minimize accelerator area by sharing its memories between accelerators. We evaluate the accelerator store for two applications and find significant system area reductions (30%) in exchange for small overheads (2% performance, 0%–8% energy). The paper also identifies new research directions enabled by the accelerator store and the many-accelerator architecture.
The accelerator store: A shared memory framework for accelerator-based systems
2011
Mark Hempstead, David Brooks, and Gu Wei. 7/2011. “An Accelerator-Based Wireless Sensor Network Processor in 130 nm CMOS.” Emerging and Selected Topics in Circuits and Systems, IEEE Journal on, 1, 2, Pp. 193–202. Publisher's VersionAbstract
Networks of ultra-low-power nodes capable of sensing, computation, and wireless communication have applications in medicine, science, industrial automation, and security. Reducing power consumption requires the development of system-on-chip implementations that must provide both energy efficiency and adequate performance to meet the demands of the long deployment lifetimes and bursts of computation that characterize wireless sensor network (WSN) applications. Therefore, this work argues that designers should evaluate the design in terms of average power for an entire workload, including active and idle periods, not just the metric of energy-per-instruction.
An Accelerator-Based Wireless Sensor Network Processor in 130 nm CMOS
2010
Michael Lyons, Mark Hempstead, Gu Wei, and David Brooks. 2/2010. “The Accelerator Store framework for high-performance, low-power accelerator-based systems.” IEEE Computer Architecture Letters, 9, 2, Pp. 53-56. Publisher's VersionAbstract
Hardware acceleration can increase performance and reduce energy consumption. To maximize these benefits, accelerator- based systems that emphasize computation on accelerators (rather than on general purpose cores) should be used. We introduce the “accelerator store,” a structure for sharing memory between accelerators in these accelerator-based systems. The accelerator store simplifies accelerator I/O and reduces area by mapping memory to accelerators when needed at runtime. Preliminary results demonstrate a 30% system area reduction with no energy overhead and less than 1% performance overhead in contrast to conventional DMA schemes.
The Accelerator Store framework for high-performance, low-power accelerator-based systems
2009
Mark Hempstead, Gu Wei, and David Brooks. 2009. “Navigo: An early-stage model to study power-contrained architectures and specialization.” Workshop on Modeling, Benchmarking, and Simulation.Abstract
As the number of transistors double, it becomes difficult to power all of them within a strict power budget and still achieve the performance gains of that the industry has achieved historically. This work presents, Navigo, a modeling framework for architecture exploration across future process technology generations. The model includes support for voltage and frequency scaling based on ITRS and PTM models. This work is designed to aid architects in the planning stages of next generation microprocessors, by addressing the space between early-stage back-of-the-envelope calculations and later stage cycle accurate simulators. Using parameters from existing commercial processor cores, we show how power consumption limits the theoretical throughput of future processors. Navigo shows that specialization is the answer to circumvent the power density limit that curbs performance gains and resume traditional 1.58x performance growth trends. We present analysis, using next generation of process technologies, that shows the fraction of area that must be allocated for specialization to maintain performance growth must increase with each new generation of process technology.
Navigo: An early-stage model to study power-contrained architectures and specialization
2008
Mark Hempstead, Gu Wei, and David Brooks. 5/18/2008. “System design considerations for sensor network applications.” In 2008 IEEE International Symposium on Circuits and Systems (ISCAS), Pp. 2566–2569. Seattle, WA: IEEE. Publisher's VersionAbstract
Systems research in the emerging space of wireless sensor networks has exploded. Researchers have deployed nodes composed of a wireless radio, MEMS sensors and low power computation for applications from medical sensing to volcanic monitoring. We must consider several requirements - including the need for inexpensive, long-lasting, highly reliable devices coupled with very low performance requirements - when designing devices for wireless sensor networks. An untethered, fully-integrated node that operates off of energy scavenged from the ambient environment is the ultimate goal. We take an application-driven approach to the design of a wireless sensor network node. Our approach addresses the event-driven nature that is characteristic of many sensor network workloads. We have completed a detailed architectural analysis of this space using a full-system simulator and RTL model. From this analysis, we chose to implement a design that best achieves the power goals and performance requirements of wireless sensor network applications.
System design considerations for sensor network applications
Mark Hempstead, Michael Lyons, David Brooks, and Gu Wei. 4/2008. “Survey of hardware systems for wireless sensor networks.” Journal of Low Power Electronics, 4, 1, Pp. 11–20. Publisher's VersionAbstract
Wireless sensor networks have been gaining interest as a platform that changes how we interact with the physical world. Applications in medicine, military, inventory management, structural and environmental monitoring, and the like can benefit from low-power wireless nodes that communicate data collected via a variety of sensors. Current deployments of wireless sensor networks (WSN) rely on off-the-shelf commodity-based microcontrollers, but the unoptimized energy consumption of these systems can limit the effective lifetimes. Ideally, researchers would like to deeply embed wireless sensor network nodes in the physical world, relying on energy scavenged from the ambient environment. This paper provides a survey of ultra low power processors specifically designed for WSN applications that have begun to emerge from research labs, which require detailed understanding of tradeoffs between application space, architecture, and circuit techniques to implement these low-power systems.
Survey of hardware systems for wireless sensor networks
2007
Mark Hempstead, Nikhil Tripathi, Patrick Mauro, Gu Wei, and David Brooks. 2007. “Ultra low power system for sensor network applications.” 32nd International Symposium on Computer Architecture (ISCA'05). Publisher's VersionAbstract
Recent years have seen a burgeoning interest in embedded wireless sensor networks with applications ranging from habitat monitoring to medical applications. Wireless sensor networks have several important attributes that require special attention to device design. These include the need for inexpensive, long-lasting, highly reliable devices coupled with very low performance requirements. Ultimately, the "holy grail" of this design space is a truly untethered device that operates off of energy scavenged from the ambient environment. In this paper, we describe an application-driven approach to the architectural design and implementation of a wireless sensor device that recognizes the event-driven nature of many sensor-network workloads. We have developed a full-system simulator for our sensor node design to verify and explore our architecture. Our simulation results suggest one to two orders of magnitude reduction in power dissipation over existing commodity-based systems for an important class of sensor network applications. We are currently in the implementation stage of design, and plan to tape out the first version of our system within the next year.
Ultra low power system for sensor network applications
2006
Mark Hempstead, Gu Wei, and David Brooks. 10/2006. “Architecture and circuit techniques for low-throughput, energy-constrained systems across technology generations.” In Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems, Pp. 368–378. ACM. Publisher's VersionAbstract

Rising interest in the applications of wireless sensor networks has spurred research in the development of computing systems for low-throughput, energy-constrained applications. Unlike traditional performance oriented applications, sensor network nodes are primarily constrained by operation lifetime, which is limited by power consumption. Advanced CMOS process technologies provide ever increasing transistor density and improved performance characteristics. However, shrinking feature size and decreasing threshold voltages also lead to significant increases in leakage current, which is especially troublesome for applications with significant idle times. This work investigates tradeoffs between leakage and active power for low-throughput applications. We study these issues across a range of process technologies on a computing architecture that provides explicit support for fine-grain leakage-control techniques such as Vdd-gating and adaptive body bias. We present a methodology for selecting design parameters, including choice of process technology, that makes the optimal tradeoff between active power and leakage power for a given workload. Our results show that leakage power will dominate the selection of process technology, and architectures that support advanced leakage control techniques at the circuit level will be essential. We argue that without advanced low-power architectures future nano-scale process technologies will not be suited for sensor network applications.

Architecture and circuit techniques for low-throughput, energy-constrained systems across technology generations
2005
Yingmin Li, Mark Hempstead, Patrick Mauro, David Brooks, Zhigang Hu, and Kevin Skadron. 8/2005. “Power and thermal effects of SRAM vs. Latch-Mux design styles and clock gating choices.” In Proceedings of the 2005 international symposium on Low power electronics and design, Pp. 173–178. ACM. Publisher's VersionAbstract

This paper studies the impact on energy efficiency and thermal behavior of design style and clock-gating style in queue and array structures. These structures are major sources of power dissipation, and both design styles and various clock gating schemes can be found in modern, high-performance processors. Although some work in the circuits domain has explored these issues from a power perspective, thermal treatments are less common, and we are not aware of any work in the architecture domain.We study both SRAM and latch and multiplexer ("latch-mux") designs and their associated clock-gating options. Using circuit-level simulations of both design styles, we derive power-dissipation ratios which are then used in cycle-level power/performance/thermal simulations. We find that even though the "unconstrained" power of SRAM designs is always better than latch-mux designs, latch-mux designs dissipate less power in practice when a structure's average occupancy is low but access rate is high, especially when "stall gating" is used to minimize switching power. We also find that latch-mux designs with stall gating are especially promising from a thermal perspective, because they exhibit lower power density than SRAM designs. Overall, when combined with implementation and verification challenges for SRAMs, latch-mux designs with stall gating appear especially promising for designs with thermal constraints. This paper also shows the importance of considering the interaction between architectural and circuit-design choices when performing early-stage design exploration

Power and thermal effects of SRAM vs. Latch-Mux design styles and clock gating choices
Mark Hempstead, Nikhil Tripathi, Patrick Mauro, Gu Wei, and David Brooks. 6/4/2005. “An ultra low power system architecture for sensor network applications.” In ACM SIGARCH Computer Architecture News, 33: Pp. 208–219. Madison, WI, USA: IEEE Computer Society. Publisher's VersionAbstract
Recent years have seen a burgeoning interest in embedded wireless sensor networks with applications ranging from habitat monitoring to medical applications. Wireless sensor networks have several important attributes that require special attention to device design. These include the need for inexpensive, long-lasting, highly reliable devices coupled with very low performance requirements. Ultimately, the "holy grail" of this design space is a truly untethered device that operates off of energy scavenged from the ambient environment. In this paper, we describe an application-driven approach to the architectural design and implementation of a wireless sensor device that recognizes the event-driven nature of many sensor-network workloads. We have developed a full-system simulator for our sensor node design to verify and explore our architecture. Our simulation results suggest one to two orders of magnitude reduction in power dissipation over existing commodity-based systems for an important class of sensor network applications. We are currently in the implementation stage of design, and plan to tape out the first version of our system within the next year.
An ultra low power system architecture for sensor network applications
2004
Mark Hempstead, Matt Welsh, and David Brooks. 11/16/2004. “TinyBench: The case for a standardized benchmark suite for TinyOS based wireless sensor network devices.” In Local Computer Networks, 11/16/2004. 29th Annual IEEE International Conference on, Pp. 585–586. Tampa, FL, USA: IEEE. Publisher's VersionAbstract
The growing wireless sensor network research community lacks a standard method for evaluating hardware platforms. Traditional benchmark suites do not sufficiently address the needs of sensor network designers. This work provides motivation for a benchmark suite and details an approach for benchmarking TinyOS compatible hardware. To aid the development of future hardware architectures, we propose the creation of a standard single node benchmark suite, based on both real applications and "stressmarks." We present sample benchmark results and call for further work in this area.
TinyBench: The case for a standardized benchmark suite for TinyOS based wireless sensor network devices