Publication:

Cooperative Warp Execution in Tensor Core for RISC-V GPGPU

 
dc.contributor.author: Nada, Abubakr
dc.contributor.author: Sarda, Giuseppe Maria
dc.contributor.author: Lenormand, Erwan
dc.contributor.imecauthor: Nada, Abubakr
dc.contributor.imecauthor: Sarda, Giuseppe Maria
dc.contributor.imecauthor: Lenormand, Erwan
dc.contributor.orcidimec: Nada, Abubakr::0009-0001-4019-9275
dc.contributor.orcidimec: Lenormand, Erwan::0000-0002-7383-6285
dc.date.accessioned: 2025-08-15T03:57:02Z
dc.date.available: 2025-08-15T03:57:02Z
dc.date.issued: 2025
dc.description.abstract: The rise of Deep Neural Networks (DNNs) has amplified the demand for efficient computation, with General Matrix Multiply (GEMM) operations at their core. While ASICs are efficient but inflexible, GPUs, especially NVIDIA GPUs, equipped with tensor cores, provide a flexible yet high-performance solution for GEMM-based workloads. Previous research and optimizations have largely centered on NVIDIA’s architecture and programming model, which, while effective, can obscure the rationale behind certain design decisions and limit flexibility for further improvements in tensor core designs. In this paper, we present the design and integration of a tensor core into the open-source RISC-V Vortex GPGPU platform, along with a suite of intrinsics designed for GEMM kernel generation. The analysis conducted on the integration elucidates the connections between GPU system architectural parameters and tensor core configuration. We find that the tensor core is severely under-utilized in many cases and that increased compute capacity does not always imply better performance. Hence, we propose a novel technique, cooperative warp execution in tensor core, which leverages hardware-supported warp cooperation within the tensor core to reduce memory requirements for GEMM operations and boost performance over the baseline tensor core implementation by up to 3x.
dc.description.wosFundingText: We would like to thank the reviewers who provided us with valuable feedback and our shepherd for their guidance throughout the review period. This project was partially funded by the Flanders AI Research Program.
dc.identifier.doi: 10.1109/HPCA61900.2025.00107
dc.identifier.eisbn: 979-8-3315-0647-6
dc.identifier.isbn: 979-8-3315-0648-3
dc.identifier.issn: 1530-0897
dc.identifier.uri: https://imec-publications.be/handle/20.500.12860/46074
dc.publisher: IEEE COMPUTER SOC
dc.source.beginpage: 1422
dc.source.conference: 2025 International Symposium on High Performance Computer Architecture-HPCA-Annual
dc.source.conferencedate: 2025-03-01
dc.source.conferencelocation: Las Vegas
dc.source.endpage: 1436
dc.source.journal: 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA)
dc.source.numberofpages: 15
dc.title: Cooperative Warp Execution in Tensor Core for RISC-V GPGPU
dc.type: Proceedings paper
dspace.entity.type: Publication
Files

Original bundle

Name: Cooperative_Warp_Execution_in_Tensor_Core_for_RISC-V_GPGPU.pdf
Size: 793.11 KB
Format: Adobe Portable Document Format
Description: Published