Hartung-Gorre Verlag

Owner: Dr. Renate Gorre

D-78465 Konstanz

Phone: +49 (0)7533 97227

Fax: +49 (0)7533 97228



Series in Microelectronics

edited by Qiuting Huang

Andreas Schenk

Mathieu Maurice Luisier

Bernd Witzigmann

Vol. 243





Fabian Thomas Schuiki


Streaming Architectures for Extreme Energy Efficiency

in High-Performance Computing


2021. XVI, 312 pages. € 64,00.

ISBN 978-3-86628-725-9


















The end of Moore’s law and the breakdown of Dennard scaling have prompted a paradigm shift in the way we approach computer architecture design. Performance at low power has become the key ingredient in achieving high utilization of available hardware in order to mitigate the effect of limited frequency and overcome dark silicon. The von Neumann bottleneck is one of the key challenges in this field: instruction fetches compete with data accesses for memory bandwidth. This bottleneck also applies to the instruction pipeline of a processor, where load-store and control instructions compete with compute instructions for issue slots.
A popular way to overcome this bottleneck is to implement dedicated accelerators for a specific problem. This approach has grown ever more popular with the recent rise of machine learning. It is based on the observation that, all other things being equal, specialization in hardware always wins. However, the complementary conclusion also holds: the lack of general programmability limits the accelerator’s use to a specific problem. In a time of fast-moving algorithms, today’s hardware accelerator cannot compute tomorrow’s algorithm.
General-purpose processors have evolved to mitigate the von Neumann bottleneck as well. One example of this is the CISC-to-RISC translation in modern processors, which can act as an instruction compression scheme. Similarly, SIMD and SIMT paradigms offer a fixed increase in computations per instruction, while Cray-style vectorization offers a more dynamic and potentially higher increase. Among the algorithms that lend themselves particularly well to such acceleration is the class of data-oblivious algorithms. Their control flow does not depend on the data being processed, and the class includes many relevant algorithms from linear algebra, machine learning, and scientific computing.
This thesis develops the concept of hardware address generation and direct memory streaming as a method to mitigate the von Neumann bottleneck. It applies the concept to in-order single-issue processors, allowing them to achieve full utilization of compute resources; introduces pseudo-dual-issue execution with dedicated compute hardware loops; and distills these extensions into an architectural template for high-performance computers capable of concentrating a significant part of their energy footprint in the arithmetic units.
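
The issue-slot argument can be made concrete with a small model. The Python sketch below counts instructions for a dot product on an in-order single-issue core, first with explicit loads and loop bookkeeping, then with operands delivered by address generators and the loop run by a hardware loop. It is an illustration only: the per-iteration instruction mix, the three setup instructions, and the names `affine_stream`, `dot_baseline`, and `dot_streaming` are assumptions and are not taken from the thesis. The access pattern is fixed before the loop starts, which is what makes a data-oblivious kernel such as this one a natural fit.

```python
from typing import Iterator, List, Tuple


def affine_stream(memory: List[float], base: int, stride: int, count: int) -> Iterator[float]:
    """Hypothetical address generator: yields memory[base + i * stride]."""
    for i in range(count):
        yield memory[base + i * stride]


def dot_baseline(memory: List[float], a_base: int, b_base: int, n: int) -> Tuple[float, int, int]:
    """Load-store loop. Returns (result, issued instructions, compute instructions)."""
    acc, issued, compute = 0.0, 0, 0
    for i in range(n):
        a = memory[a_base + i]
        b = memory[b_base + i]
        issued += 2          # two loads
        acc += a * b
        issued += 1          # one fused multiply-add
        compute += 1
        issued += 3          # assumed: two pointer increments and one branch
    return acc, issued, compute


def dot_streaming(memory: List[float], a_base: int, b_base: int, n: int) -> Tuple[float, int, int]:
    """Streaming loop. Operands arrive from address generators, so only
    the multiply-add occupies an issue slot inside the loop."""
    acc, issued, compute = 0.0, 0, 0
    a_stream = affine_stream(memory, a_base, 1, n)
    b_stream = affine_stream(memory, b_base, 1, n)
    issued += 3              # assumed: configure two streams and one hardware loop
    for a, b in zip(a_stream, b_stream):
        acc += a * b
        issued += 1          # one fused multiply-add, no loads or branch
        compute += 1
    return acc, issued, compute


def utilization(issued: int, compute: int) -> float:
    """Fraction of issue slots spent on compute instructions."""
    return compute / issued if issued else 0.0
```

Under these assumptions the baseline spends one issue slot in six on arithmetic regardless of vector length, while the streaming variant approaches full utilization as the vector grows, because its setup cost is paid once.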




Keywords: energy-efficient, high-performance computing




To order directly from:

Hartung-Gorre Verlag / D-78465 Konstanz / Germany

Phone: +49 (0) 7533 97227  Fax: +49 (0) 7533 97228
http://www.hartung-gorre.de   Email: verlag@hartung-gorre.de