Publication:
Reduced-precision datatypes have become essential to the efficient training and deployment of Deep Neural Networks (DNNs). A recent development in the field has been the emergence of block-scaled datatypes: tensor representation formats derived from floating point that share a common exponent across multiple elements. While DNN-specific inference accelerators increasingly adopt and optimise for these formats, their potential benefits for training workloads on general-purpose (GP) vector processors have yet to be thoroughly explored. This work presents a benchmarked implementation of block-scaled general matrix multiplication (GeMM) for DNN training at the edge using a commercially available vector instruction set (ARM SVE). Using this implementation, we highlight an accuracy-speed trade-off involving the shape of shared-exponent blocks.
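To illustrate the idea of a shared exponent per block and how it enters a GeMM, the following is a minimal NumPy sketch, not the paper's ARM SVE implementation. The function names, the block size of 32, and the 7-bit mantissa are illustrative assumptions; the real kernel would operate on packed reduced-precision mantissas with vector instructions rather than on float64 arrays.

```python
import numpy as np

def quantize_block_scaled(x, block_size=32, mantissa_bits=7):
    """Quantize a 2-D array so each run of `block_size` elements along the
    last axis shares one power-of-two exponent (illustrative only)."""
    rows, cols = x.shape
    assert cols % block_size == 0
    blocks = x.reshape(rows, cols // block_size, block_size)
    # Shared exponent per block: large enough to cover the biggest magnitude.
    max_abs = np.max(np.abs(blocks), axis=-1, keepdims=True)
    exp = np.ceil(np.log2(np.maximum(max_abs, np.finfo(float).tiny)))
    scale = 2.0 ** exp
    # Round the normalized mantissas to a fixed number of fractional bits.
    step = 2.0 ** -mantissa_bits
    mant = np.round(blocks / scale / step) * step
    return mant.reshape(rows, cols), np.squeeze(scale, axis=-1)

def block_scaled_gemm(a, b, block_size=32):
    """C = A @ B with both operands block-scaled along the reduction (K) axis:
    mantissas are multiplied per block, then rescaled by the shared exponents."""
    a_q, a_s = quantize_block_scaled(a, block_size)        # blocks along K of A
    b_q, b_s = quantize_block_scaled(b.T, block_size)      # blocks along K of B
    m, k = a.shape
    n = b.shape[1]
    c = np.zeros((m, n))
    for blk in range(k // block_size):
        ks = slice(blk * block_size, (blk + 1) * block_size)
        partial = a_q[:, ks] @ b_q[:, ks].T
        # One rescale per block: outer product of the two shared scales.
        c += partial * np.outer(a_s[:, blk], b_s[:, blk])
    return c

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((8, 64))
    B = rng.standard_normal((64, 16))
    err = np.max(np.abs(block_scaled_gemm(A, B) - A @ B))
    print(f"max abs error vs. full precision: {err:.3e}")
```

The `block_size` parameter in the sketch stands in for the block-shape choice the abstract refers to: smaller blocks track local magnitudes more closely (better accuracy) but require more scale handling per GeMM (lower speed).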