Publication:

Adaptive block-scaled GeMMs on vector processors for DNN training at the edge

 
cris.virtual.orcid: 0000-0003-3495-9263
cris.virtual.orcid: 0000-0002-3599-8515
cris.virtual.orcid: 0000-0003-0181-8069
cris.virtual.orcid: 0000-0002-1592-755X
cris.virtual.orcid: 0000-0001-6561-8934
dc.contributor.author: Satya Murthy, Nitish
dc.contributor.author: Laubeuf, Nathan
dc.contributor.author: Bhattacharjee, Debjyoti
dc.contributor.author: Catthoor, Francky
dc.contributor.author: Verhelst, Marian
dc.date.accessioned: 2026-01-14T10:57:32Z
dc.date.available: 2026-01-14T10:57:32Z
dc.date.issued: 2024
dc.description.abstract: Reduced-precision datatypes have become essential to the efficient training and deployment of Deep Neural Networks (DNNs). A recent development in the field is the emergence of block-scaled datatypes: tensor representation formats derived from floating point that share a common exponent across multiple elements. While these formats are being broadly adopted and optimized for by DNN-specific inference accelerators, their potential benefits for training workloads on general-purpose (GP) vector processors have yet to be thoroughly explored. This work proposes a benchmarked implementation of block-scaled general matrix multiplications (GeMMs) for DNN training at the edge using commercially available vector instruction sets (ARM SVE). Using this implementation, we highlight an accuracy-speed trade-off governed by the shape of the shared-exponent blocks: vectors or squares. We exploit this result to optimize the training of fully connected networks by dynamically adapting the shared-exponent block shapes during training. This strategy yields on average around 1.95x faster training with a 2x lower memory footprint compared to standard IEEE 32-bit floating point (FP32), while achieving similar accuracy.
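Illustrative note: the block-scaled representation the abstract describes (one shared exponent across a block of elements, with vector-shaped or square-shaped blocks) can be sketched in a few lines of NumPy. This is a minimal emulation for intuition only, not the paper's SVE implementation; the function name block_scale, the 8-bit element width, and the headroom rule are assumptions of this sketch, not details from the record.

    import numpy as np

    def block_scale(x, block_shape, elem_bits=8):
        """Emulate block-scaled quantization: one shared power-of-two
        exponent per block, low-precision mantissas per element.
        (Hypothetical sketch; not the authors' SVE kernel.)"""
        rows, cols = x.shape
        br, bc = block_shape
        assert rows % br == 0 and cols % bc == 0, "tensor must tile evenly"
        out = np.empty_like(x)
        qmax = 2 ** (elem_bits - 1) - 1  # e.g. 127 for 8-bit signed mantissas
        for i in range(0, rows, br):
            for j in range(0, cols, bc):
                blk = x[i:i + br, j:j + bc]
                max_abs = np.max(np.abs(blk))
                if max_abs == 0.0:
                    out[i:i + br, j:j + bc] = 0.0
                    continue
                # Shared exponent = exponent of the largest magnitude in the block.
                shared_exp = np.floor(np.log2(max_abs))
                # Step size leaves headroom so every element fits in elem_bits.
                step = 2.0 ** (shared_exp - (elem_bits - 2))
                q = np.clip(np.round(blk / step), -qmax - 1, qmax)
                out[i:i + br, j:j + bc] = q * step
        return out

    # Same element count per block, different shapes: vector (1x16) blocks
    # track row-wise dynamic range; square (4x4) blocks track local 2-D range.
    a = np.random.randn(64, 64).astype(np.float32)
    a_vec = block_scale(a, (1, 16))
    a_sq = block_scale(a, (4, 4))

The shape choice drives the accuracy-speed trade-off highlighted in the abstract: vector blocks map naturally onto vector-lane layouts, while square blocks can follow local dynamic range more closely.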
dc.identifier: 10.1109/VLSI-SOC62099.2024.10767806
dc.identifier.doi: 10.1109/VLSI-SOC62099.2024.10767806
dc.identifier.isbn: 979-8-3315-3967-2
dc.identifier.issn: 2324-8440
dc.identifier.uri: https://imec-publications.be/handle/20.500.12860/58644
dc.language.iso: en
dc.provenance.editstepuser: greet.vanhoof@imec.be
dc.publisher: IEEE
dc.relation.ispartof: 2024 IFIP/IEEE 32nd International Conference on Very Large Scale Integration (VLSI-SoC)
dc.relation.ispartofseries: 2024 IFIP/IEEE 32nd International Conference on Very Large Scale Integration (VLSI-SoC)
dc.source.beginpage: N/A
dc.source.conference: IFIP/IEEE 32nd International Conference on Very Large Scale Integration (VLSI-SoC)
dc.source.conferencedate: 2024-10-06
dc.source.conferencelocation: Tanger
dc.source.journal: IFIP/IEEE 32nd International Conference on Very Large Scale Integration (VLSI-SoC)
dc.subject: DNN training
dc.subject: Vector processors
dc.subject: Block-scaled datatypes
dc.subject: ARM SVE ISA
dc.subject: Science & Technology
dc.subject: Technology
dc.title: Adaptive block-scaled GeMMs on vector processors for DNN training at the edge
dc.type: Proceedings paper
dspace.entity.type: Publication
oaire.citation.edition: WOS.ISTP
person.identifier.rid: E-5739-2011
person.identifier.rid: W-6287-2019