2025 IEEE International Symposium on High Performance Computer Architecture (HPCA)
Abstract
The rise of Deep Neural Networks (DNNs) has amplified the demand for efficient computation, with General Matrix Multiply (GEMM) operations at their core. ASICs are efficient but inflexible, whereas GPUs equipped with tensor cores, especially NVIDIA GPUs, provide a flexible yet high-performance solution for GEMM-based workloads. Previous research and optimizations have largely centered on NVIDIA’s architecture and programming model, which, while effective, can obscure the rationale behind certain design decisions and limit flexibility for further improvements in tensor core designs. In this paper, we present the design and integration of a tensor core into the open-source RISC-V Vortex GPGPU platform, along with a suite of intrinsics designed for GEMM kernel generation. Our analysis of this integration elucidates the connections between GPU system architectural parameters and tensor core configuration. We find that the tensor core is severely under-utilized in many cases and that increased compute capacity does not always imply better performance. Hence, we propose a novel technique, cooperative warp execution in the tensor core, which leverages hardware-supported warp cooperation within the tensor core to reduce memory requirements for GEMM operations and boost performance over the baseline tensor core implementation by up to 3x.