Hartung-Gorre Verlag

Inh.: Dr. Renate Gorre

D-78465 Konstanz

Fon: +49 (0)7533 97227

Fax: +49 (0)7533 97228

www.hartung-gorre.de


Series in Microelectronics

edited by Luca Benini, Qiuting Huang, Taekwang Jang, Mathieu Luisier, Christoph Studer, Hua Wang

Vol. 246

Matheus Cavalcante

Fighting Back the von Neumann Bottleneck with Small- and Large-Scale Vector Microprocessors

 

1st Edition 2023. XXVIII, 194 pages. € 64,00.
ISBN 978-3-86628-801-0
Contents

 

Abstract:

 

In his seminal Turing Award Lecture, Backus discussed the issues stemming from the word-at-a-time style of programming inherited from the von Neumann computer. More than forty years later, in a context where Moore’s Law and Dennard scaling are no longer reliable sources of computer performance and energy efficiency gains, computer architects must be creative to amortize the von Neumann Bottleneck (VNB): the cost of fetching and decoding instructions that keep the datapath busy for only a very short time.

 

The strong emergence of embarrassingly parallel workloads, such as data analytics and machine learning, created a major window of opportunity for architectures that effectively exploit Data Level Parallelism (DLP) to achieve energy efficiency. Unfortunately, no single architecture can effectively answer the increasing computing requirements of all modern workloads. Parallel multi-core computing is the de facto standard for exploiting parallelism while keeping a flexible programming model. However, multi-core architectures built from scalar cores do not exploit the workloads’ DLP characteristics and fail to address the VNB effectively.

 

In this thesis, we will assess architectural models that relax the constraints imposed by the VNB. We will focus on architectures based on a cluster of simple cores sharing low-latency L1 memory. Whether in the streaming multiprocessors of Graphics Processing Units (GPUs), in large-scale manycore clusters, or in the computing clusters of embedded processors, the shared-L1 cluster is a simple albeit ubiquitous architectural element. Therefore, maximizing its efficiency is the key challenge for improving the overall system’s energy efficiency.

 

Vector processors promise to tackle the VNB by amortizing the energy overhead of instruction fetching and decoding over several chunks of data, while keeping the flexibility of a programmable architecture. With Ara, we exploit vector Single Instruction, Multiple Data (SIMD) in the application-class domain with a large-scale Vector Processing Unit (VPU) that can orchestrate up to 16 double-precision Floating Point Units (FPUs). Ara is part of a larger revival of interest in vector architectures after decades of neglect and was one of the first open-source vector processing units based on the RISC-V Vector Extension (RVV) Instruction Set Architecture (ISA). Furthermore, we propose Spatz, a modern, compact vector machine that targets DLP, before integrating it into a shared-L1 cluster. According to our mathematical model of the cluster’s energy consumption, only a very small Vector Register File (VRF) is needed to optimally balance the energy cost of the traffic between the L0 VRF and the L1 Scratchpad Memory (SPM). Despite using a VRF much smaller than those of typical vector processors, we achieve state-of-the-art performance.

 

In the embedded domain, there is rarely the need for large computational units such as double-precision FPUs. However, there is a growing need for flexible edge computation. Instead of replicating the shared-L1 cluster—which usually only scales to the low tens of cores—and connecting the replicas with some latency-tolerant interconnect, we tried to push the core count of the cluster to its limit. Our objective with MemPool, a shared-L1 cluster with 256 cores and 1 MiB of L1 memory, is a highly flexible architecture that can also meet the ever-stricter computational requirements of edge computing.

 

Unfortunately, MemPool’s flexibility has a price, and using scalar cores as its computation units makes the design a prime target for VNB-related limitations. In this thesis, we explore for the first time vector processing as an option to build small and efficient Processing Elements (PEs) for large-scale shared-L1 clusters. As the PE of MemPool, Spatz alleviates the cluster’s VNB by adding a VRF that acts as an L0 memory level, thereby reducing the traffic on the interconnects.

 

To summarize, this thesis explores the vector processing abstraction as an answer to the VNB in modern computational systems, from the High-Performance Computing (HPC) domain to the edge. As part of a revival of interest in the vector-SIMD abstraction, our open-source accelerators highlight the soundness of the vector approach without resorting to overspecialized computing architectures.

 

 

Keywords: vector processing, high-performance computing, vertical integration, VLSI, open-source accelerators, von Neumann bottleneck

 

 

About the Author:

 

Matheus de Araújo Cavalcante was born on 6 November 1995 in Campina Grande, Paraíba, Brazil. He received the M.Sc. degree in Integrated Electronic Systems from the Grenoble Institute of Technology (Phelma), Grenoble, France, in 2018. Mr. Cavalcante subsequently joined the research group of Prof. Dr. Luca Benini at the Integrated Systems Laboratory (IIS) as a Ph.D. candidate. His research interests include high-performance computing systems, with a particular focus on vector processors and manycore systems. He furthermore works on architectural co-optimization with emerging VLSI technologies, such as two-and-a-half- and three-dimensional integrated circuits.

 

Series in Microelectronics

Direkt bestellen bei / to order directly from:

Hartung-Gorre Verlag / D-78465 Konstanz / Germany

Telefon: +49 (0) 7533 97227  Telefax: +49 (0) 7533 97228
http://www.hartung-gorre.de   eMail: verlag@hartung-gorre.de