The memory wall bottleneck is a key challenge across many data-intensive applications. Multi-level FeFET-based embedded non-volatile memories are a promising solution for denser and more energy-efficient on-chip memory. However, reliable multi-level cell storage requires careful optimizations to minimize the design overhead costs. In this work, we investigate the interplay between FeFET device characteristics, programming schemes, and memory array architecture, and explore different design choices to optimize performance, energy, area, and accuracy metrics for critical data-intensive workloads. From our cross-stack design exploration, we find that we can store DNN weights and social network graphs at a density of over 8 MB/mm² and sub-2 ns read access latency without loss in application accuracy.
Bayesian modeling and inference is a class of machine learning that is useful for problems where data is scarce and prior knowledge about the application allows better conclusions to be drawn. However, Bayesian models often require computing high-dimensional integrals, and finding the posterior distribution can be intractable. One of the most commonly used approximate methods for Bayesian inference is Gibbs sampling, a Markov chain Monte Carlo (MCMC) technique for estimating a target stationary distribution. The idea in Gibbs sampling is to generate posterior samples by iterating through the variables and sampling each one from its conditional distribution with all other variables held fixed. While Gibbs sampling is a popular method for probabilistic graphical models such as Markov Random Fields (MRFs), the plain algorithm is slow because it visits the variables sequentially. In this work, we describe a binary-label MRF Gibbs sampling inference architecture and extend it to a 64-label version capable of running multiple perceptual applications, such as sound source separation and stereo matching. The described accelerator employs chromatic scheduling of variables to parallelize all the conditionally independent variables across 257 samplers, implemented on the FPGA portion of a CPU-FPGA SoC. For a real-time streaming sound source separation task, we show that the hybrid CPU-FPGA implementation is 230x faster than a commercial mobile processor, while maintaining a recommended latency under 50 ms. The 64-label version showed 137x and 679x speedups for binary-label and 64-label MRF Gibbs sampling inference, respectively.
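As an illustration of the chromatic scheduling idea described in the abstract above, the following is a minimal software sketch, not the accelerator's implementation. It assumes a binary-label pairwise MRF on a 4-connected grid with an Ising-style energy. The function names, the `coupling` and `data_weight` parameters, and the checkerboard two-coloring are illustrative assumptions that do not come from the paper.

```python
import math
import random


def color_of(r, c):
    """Checkerboard 2-coloring: 4-connected grid neighbors never share a color."""
    return (r + c) % 2


def prob_label_one(labels, obs, r, c, coupling, data_weight):
    """P(x[r][c] = 1 | neighbors, observation) under an assumed Ising-style energy."""
    rows, cols = len(labels), len(labels[0])
    # Map labels {0, 1} to spins {-1, +1} and accumulate the local field.
    field = data_weight * (2 * obs[r][c] - 1)
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols:
            field += coupling * (2 * labels[nr][nc] - 1)
    # Logistic of 2 * field, written to avoid overflow for large |field|.
    if field >= 0:
        return 1.0 / (1.0 + math.exp(-2.0 * field))
    e = math.exp(2.0 * field)
    return e / (1.0 + e)


def chromatic_gibbs_sweep(labels, obs, coupling, data_weight, rng):
    """One full Gibbs sweep, updating one color class at a time."""
    rows, cols = len(labels), len(labels[0])
    for color in (0, 1):
        # Sites of the same color are conditionally independent given the
        # other color, so this batch could be drawn by parallel samplers.
        batch = []
        for r in range(rows):
            for c in range(cols):
                if color_of(r, c) == color:
                    p = prob_label_one(labels, obs, r, c, coupling, data_weight)
                    batch.append((r, c, 1 if rng.random() < p else 0))
        # Commit the whole batch only after every site in it has been sampled.
        for r, c, value in batch:
            labels[r][c] = value
    return labels


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical noisy-free observation: left half 0, right half 1.
    obs = [[1 if c >= 3 else 0 for c in range(6)] for _ in range(4)]
    labels = [[rng.randint(0, 1) for _ in range(6)] for _ in range(4)]
    for _ in range(20):
        chromatic_gibbs_sweep(labels, obs, coupling=0.8, data_weight=1.5, rng=rng)
    for row in labels:
        print(row)
```

In this sketch the two color classes are processed one after the other, but every site within a class depends only on sites of the other class, which is what lets a hardware design assign them to separate samplers. A 64-label model would replace the single probability with a 64-way categorical draw per site.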
The recent surge of machine learning has motivated computer architects to focus intently on accelerating related workloads, especially in deep learning. Deep learning has been the pillar of progress in supervised learning, that is, learning patterns from vast amounts of labeled data. However, for unsupervised learning, Bayesian methods often work better than deep learning. Bayesian modeling and inference work well with unlabeled or limited data, can leverage informative priors, and yield interpretable models. Despite being an important branch of machine learning, Bayesian inference has generally been overlooked by the architecture and systems communities. In this paper, we facilitate the study of Bayesian inference with the development of BayesSuite, a collection of seminal Bayesian inference workloads. We characterize the power and performance profiles of BayesSuite across a variety of current-generation processors and find significant diversity. Manually tuning and deploying Bayesian inference workloads requires a deep understanding of the workload characteristics and hardware specifications. To address these challenges and provide high-performance, energy-efficient support for Bayesian inference, we introduce a scheduling and optimization mechanism that can be plugged into a system scheduler. We also propose a computation elision technique that further improves the performance and energy efficiency of the workloads by skipping computations that do not improve the quality of the inference. Our proposed techniques increase Bayesian inference performance by 5.8× on average over the naive assignment and execution of the workloads.
Developing the Robobee was a multi-disciplinary, 5-year project funded by a National Science Foundation Expeditions in Computing award with the goal of achieving autonomous flight with a bee-sized micro-robot. The intent of the research was to help re-stabilize the declining bee population, a decline that researchers have shown could have devastating effects on the Earth's ecosystem. Bees are remarkably efficient: their skeleton weighs almost nothing, requiring minimal lift to take off and sustain flight, and their brains are small, pre-programmed with a minimal set of instincts necessary for the colony's survival. Their capabilities under such stringent weight and compute limitations make them a prime target for pushing what modern robotics and computer systems can do. The weight and power limits require that a custom System-on-Chip (SoC) be built. Conventional off-chip voltage regulators are heavy and bulky, and thus cannot fit within the weight and form factor of the robotic bee. Commercial off-the-shelf (COTS) microcontrollers consume too much power to perform the required computation for autonomous flight. The solution is to pack as much IP as possible onto a single die. SoCs have been the trend across semiconductor companies over the past decade, from mobile and embedded to server-grade solutions. In this paper, we recount our experiences designing such a chip. We highlight the major challenges faced when designing for such a unique form factor, how designs and specifications were set by each collaborating lab, the difficulties of integrating a plethora of IP consisting of in-house digital and analog blocks, and the design flows we used. We also discuss how invaluable high-level synthesis (HLS) was in reducing the engineering burden, focusing design efforts at higher levels of abstraction, and achieving an overall successful tape-out.
Recent years have seen an increased interest in Micro Air Vehicles (MAVs), with applications ranging from search-and-rescue to mimicking insect behavior. MAVs have several challenging design requirements that impact processor design, including real-time processing demands and severe power and weight budgets. In this paper, we describe the characteristics of MAV applications and propose hardware acceleration to improve the power, performance, and portability of MAV system designs.