Publication:
Reduced-precision datatypes have become essential to the efficient training and deployment of Deep Neural Networks (DNNs). A recent development in the field has been the emergence of block-scaled datatypes: tensor representation formats derived from floating point that share a common exponent across multiple elements. While DNN-specific inference accelerators increasingly adopt and optimise for these formats, their potential benefits for training workloads on general-purpose (GP) vector processors have yet to be thoroughly explored. This work presents a benchmarked implementation of block-scaled general matrix multiplication (GeMM) for DNN training at the edge using a commercially available vector instruction set (ARM SVE). Using this implementation, we highlight an accuracy-speed trade-off involving the shape of shared-exponent blocks.
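To illustrate the idea of a shared exponent per block and how it enters a GeMM, the following is a minimal NumPy sketch, not the paper's ARM SVE implementation. The function names, the block size of 32, and the 7-bit mantissa are illustrative assumptions; the real kernel would operate on packed reduced-precision mantissas with vector instructions rather than on float64 arrays.

```python
import numpy as np

def quantize_block_scaled(x, block_size=32, mantissa_bits=7):
    """Quantize a 2-D array so each run of `block_size` elements along the
    last axis shares one power-of-two exponent (illustrative only)."""
    rows, cols = x.shape
    assert cols % block_size == 0
    blocks = x.reshape(rows, cols // block_size, block_size)
    # Shared exponent per block: large enough to cover the biggest magnitude.
    max_abs = np.max(np.abs(blocks), axis=-1, keepdims=True)
    exp = np.ceil(np.log2(np.maximum(max_abs, np.finfo(float).tiny)))
    scale = 2.0 ** exp
    # Round the normalized mantissas to a fixed number of fractional bits.
    step = 2.0 ** -mantissa_bits
    mant = np.round(blocks / scale / step) * step
    return mant.reshape(rows, cols), np.squeeze(scale, axis=-1)

def block_scaled_gemm(a, b, block_size=32):
    """C = A @ B with both operands block-scaled along the reduction (K) axis:
    mantissas are multiplied per block, then rescaled by the shared exponents."""
    a_q, a_s = quantize_block_scaled(a, block_size)        # blocks along K of A
    b_q, b_s = quantize_block_scaled(b.T, block_size)      # blocks along K of B
    m, k = a.shape
    n = b.shape[1]
    c = np.zeros((m, n))
    for blk in range(k // block_size):
        ks = slice(blk * block_size, (blk + 1) * block_size)
        partial = a_q[:, ks] @ b_q[:, ks].T
        # One rescale per block: outer product of the two shared scales.
        c += partial * np.outer(a_s[:, blk], b_s[:, blk])
    return c

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((8, 64))
    B = rng.standard_normal((64, 16))
    err = np.max(np.abs(block_scaled_gemm(A, B) - A @ B))
    print(f"max abs error vs. full precision: {err:.3e}")
```

The `block_size` parameter in the sketch stands in for the block-shape choice the abstract refers to: smaller blocks track local magnitudes more closely (better accuracy) but require more scale handling per GeMM (lower speed).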