Hartung-Gorre Verlag
Inh.: Dr. Renate Gorre
D-78465 Konstanz
Phone: +49 (0)7533 97227, Fax: +49 (0)7533 97228
www.hartung-gorre.de
Series in Microelectronics
edited by Luca Benini, Qiuting Huang, Taekwang Jang, Mathieu Luisier, Christoph Studer, Hua Wang
Matheus Cavalcante
Fighting Back the von Neumann Bottleneck with Small- and Large-Scale Vector Microprocessors
1st Edition 2023. XXVIII, 194 pages. € 64,00.
ISBN 978-3-86628-801-0
Abstract:
In his seminal Turing Award Lecture, Backus discussed
the issues stemming from the word-at-a-time style of programming inherited from
the von Neumann computer. More than forty years later, in a context where Moore's Law and Dennard scaling are no longer reliable sources of computer performance and energy-efficiency gains, computer architects must be creative to mitigate the von Neumann Bottleneck (VNB): the cost of fetching and decoding instructions that keep the datapath busy for only a very short period of time.
The strong emergence of embarrassingly-parallel
workloads, such as data analytics and machine learning, created a major window of
opportunity for architectures that effectively exploit Data Level Parallelism
(DLP) to achieve energy efficiency. Unfortunately, no single architecture can
effectively answer the increasing computing requirements of all modern
workloads. Parallel multi-core computing is the de facto standard for exploiting parallelism while keeping a flexible programming model. However, multi-core architectures do not exploit the workloads' DLP characteristics and fail to address the VNB effectively.
In this thesis, we assess architectural models that relax the constraints imposed by the VNB. We focus on architectures based
on a cluster of simple cores sharing low-latency L1 memory. Whether in the streaming multiprocessors of Graphics Processing Units (GPUs), in large-scale manycore clusters, or in the compute clusters of embedded processors, the shared-L1 cluster is a simple yet ubiquitous architectural element. Maximizing its efficiency is therefore key to improving the overall system's energy efficiency.
Vector processors promise to tackle the VNB by amortizing the energy overhead of instruction fetching and decoding over several chunks of data, while retaining the flexibility of a programmable architecture. With Ara, we exploit vector Single Instruction, Multiple Data (SIMD) execution in the application-class domain with a large-scale Vector Processing Unit (VPU) that can orchestrate up to 16 double-precision Floating Point Units (FPUs). Ara is part of a broader revival of interest in vector architectures after decades of neglect, and was one of the first open-source vector processing units based on the RISC-V Vector Extension (RVV) Instruction Set Architecture (ISA).
Furthermore, we propose Spatz, a modern, compact vector machine that targets DLP, and integrate it into a shared-L1 cluster. According to our mathematical model of the cluster's energy consumption, only a very small Vector Register File (VRF) is needed to optimally balance the energy cost of the traffic between the L0 VRF and the L1 Scratchpad Memory (SPM). Despite using a VRF much smaller than those of typical vector processors, we achieve state-of-the-art performance.
In the embedded domain, there is rarely the need for
large computational units such as double-precision FPUs. However, there is a
growing need for flexible edge computation. Instead of replicating the shared-L1 cluster (which usually scales only to the low tens of cores) and connecting the replicas through a latency-tolerant interconnect, we push the core count of a single cluster to its limit. Our objective with MemPool, a shared-L1 cluster with 256 cores and 1 MiB of L1 memory, is a highly flexible architecture that can also meet the ever-stricter computational requirements of edge computing.
Unfortunately, MemPool’s
flexibility has a price, and using scalar cores as its computation units makes
the design a prime target for VNB-related limitations. In this thesis, we explore, for the first time, vector processing as an option for building small and efficient Processing Elements (PEs) for large-scale shared-L1 clusters. As MemPool's PE, Spatz alleviates the cluster's VNB by adding a VRF that acts as an L0 memory level, thereby reducing the traffic on the interconnects.
To summarize, this thesis explores the vector processing abstraction as an answer to the VNB in modern computing systems, from the High-Performance Computing (HPC) domain to the edge. As part of a revival of interest in the vector-SIMD abstraction, our open-source accelerators highlight the soundness of the vector approach without resorting to overspecialized computing architectures.
Keywords: vector processing, high-performance computing, vertical integration, VLSI, open-source accelerators, von Neumann Bottleneck
About the Author:
Matheus de Araújo Cavalcante was born on 6 November 1995 in Campina Grande, Paraíba, Brazil. He received the M.Sc. degree in Integrated Electronic Systems from the Grenoble Institute of Technology (Phelma), Grenoble, France, in 2018. He subsequently joined the research group of Prof. Dr. Luca Benini at the Integrated Systems Laboratory (IIS) as a Ph.D. candidate. His research interests include high-performance computing systems, with a particular focus on vector processors and manycore systems. He furthermore works on architectural co-optimization with emerging VLSI technologies, such as 2.5D and 3D integrated circuits.
Order directly from:
Hartung-Gorre Verlag, D-78465 Konstanz, Germany
Phone: +49 (0)7533 97227, Fax: +49 (0)7533 97228
http://www.hartung-gorre.de, e-mail: verlag@hartung-gorre.de