Case 01
CUDA GEMM Optimization and Architectural Analysis
Independent Researcher / Mar 2026
Implemented and systematically optimized GEMM kernels while studying how the memory hierarchy and arithmetic intensity shape end-to-end execution performance.
- Used 2D block tiling and shared memory to improve data reuse within a thread block.
- Applied register blocking to raise arithmetic intensity from 7.2 to 14.1 FLOPs/Byte.
- Used Nsight Compute to confirm a major reduction in DRAM traffic and a 3.57x overall speedup.
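
To show why larger tiles raise arithmetic intensity, here is a minimal back-of-the-envelope sketch. It rests on a simplified model: each step along the K dimension loads one slice of A and one slice of B for the tile, elements are 4-byte floats, and writes to C and cache effects are ignored. The function name and the tile sizes are illustrative assumptions, not the configuration used in this project, and the model is not meant to reproduce the measured 7.2 and 14.1 FLOPs/Byte figures.

```python
def tile_arithmetic_intensity(tile_m, tile_n, bytes_per_elem=4):
    """Estimate FLOPs per byte loaded for one K-step of a tile_m x tile_n output tile.

    Simplified model (an assumption, not a measurement): per K-step the tile
    loads tile_m elements of A and tile_n elements of B, then performs one
    multiply and one add for each of its tile_m * tile_n outputs.
    """
    flops = 2 * tile_m * tile_n                        # multiply + add per output element
    bytes_loaded = bytes_per_elem * (tile_m + tile_n)  # one slice of A plus one slice of B
    return flops / bytes_loaded


if __name__ == "__main__":
    # Hypothetical tile sizes, chosen only to show the trend.
    for tile in (1, 8, 32, 64):
        ai = tile_arithmetic_intensity(tile, tile)
        print(f"{tile:>2} x {tile:<2} tile: {ai:.2f} FLOPs/Byte")
```

Under this model, doubling both tile dimensions doubles the FLOPs performed per byte loaded. The same reasoning applies at either level of the hierarchy: to a thread block's tile fed from DRAM, and to a per-thread register tile fed from shared memory.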