Projects — Chora The space where the abstract takes form.
-
CUDA SGEMM Optimization
An iterative, hardware-aware optimization of single-precision matrix multiplication (SGEMM) on NVIDIA Ampere and Hopper architectures, reaching 94% of cuBLAS performance. Implements multi-stage tiling, double-buffered shared-memory (SMEM) pipelining, and warp-specialized instruction scheduling.
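The tiling idea can be illustrated on the CPU. The following is a minimal sketch in plain C++, not the project's CUDA kernel: it shows only one level of blocking, and the function name `sgemm_tiled` and the tile size `kTile` are hypothetical choices made for illustration. Double buffering and warp specialization are GPU-specific and are not modeled here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile edge; a real kernel would tune this per architecture.
constexpr std::size_t kTile = 32;

// Computes C += A * B for row-major matrices: A is MxK, B is KxN, C is MxN.
// The three outer loops walk over tiles; the three inner loops stay inside
// one tile so that its working set can remain in fast memory.
void sgemm_tiled(std::size_t M, std::size_t N, std::size_t K,
                 const std::vector<float>& A,
                 const std::vector<float>& B,
                 std::vector<float>& C) {
    assert(A.size() == M * K && B.size() == K * N && C.size() == M * N);
    for (std::size_t i0 = 0; i0 < M; i0 += kTile) {
        for (std::size_t k0 = 0; k0 < K; k0 += kTile) {
            for (std::size_t j0 = 0; j0 < N; j0 += kTile) {
                const std::size_t i1 = std::min(i0 + kTile, M);
                const std::size_t k1 = std::min(k0 + kTile, K);
                const std::size_t j1 = std::min(j0 + kTile, N);
                for (std::size_t i = i0; i < i1; ++i) {
                    for (std::size_t k = k0; k < k1; ++k) {
                        const float a = A[i * K + k];  // reused across the j loop
                        for (std::size_t j = j0; j < j1; ++j) {
                            C[i * N + j] += a * B[k * N + j];
                        }
                    }
                }
            }
        }
    }
}
```

On a GPU the same decomposition typically maps tiles to thread blocks and stages them through shared memory; this sketch only conveys the loop structure.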
-
Compact DSL for GPU Kernels
A lightweight embedded domain-specific language (DSL) written in C++20 for generating highly optimized Tensor Core operations. Encodes tile dimensions in the type system for compile-time layout verification and zero-overhead abstraction.
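As a rough illustration of what encoding tile dimensions in the type system can look like, here is a minimal sketch. The `Tile` type, the `mma` function, and the multiple-of-4 constraint are all hypothetical and are not taken from the project; the sketch runs on the CPU and does not touch Tensor Cores.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical tile type: the dimensions are template parameters, so two
// tiles with different shapes are different types.
template <std::size_t Rows, std::size_t Cols>
struct Tile {
    static constexpr std::size_t rows = Rows;
    static constexpr std::size_t cols = Cols;
    std::array<float, Rows * Cols> data{};  // zero-initialized, row-major

    float& at(std::size_t r, std::size_t c) {
        assert(r < Rows && c < Cols);
        return data[r * Cols + c];
    }
    const float& at(std::size_t r, std::size_t c) const {
        assert(r < Rows && c < Cols);
        return data[r * Cols + c];
    }
};

// C (MxN) += A (MxK) * B (KxN). Template argument deduction requires M, N,
// and K to agree across all three arguments, so a shape mismatch is a
// compile-time error rather than a runtime bug.
template <std::size_t M, std::size_t N, std::size_t K>
void mma(Tile<M, N>& c, const Tile<M, K>& a, const Tile<K, N>& b) {
    // Illustrative constraint only; real hardware shapes differ.
    static_assert(M % 4 == 0 && N % 4 == 0 && K % 4 == 0,
                  "tile dimensions must be multiples of 4 in this sketch");
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t k = 0; k < K; ++k)
            for (std::size_t j = 0; j < N; ++j)
                c.at(i, j) += a.at(i, k) * b.at(k, j);
}
```

With these definitions, passing a `Tile<4, 8>` as `a` and a `Tile<4, 4>` as `b` to `mma` fails to compile because `K` cannot be deduced consistently. Since the shapes are compile-time constants, the loop bounds are known to the optimizer and no shape metadata is stored at runtime.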
-
Distributed Pipeline Parallel Harness
A high-performance training harness for multi-node distributed deep learning. Implements pipeline and tensor parallelism with activation checkpointing, communication-computation overlap, and automatic microbatch scheduling.
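To make microbatch scheduling concrete, the sketch below builds the forward half of a simple fill-and-drain (GPipe-style) pipeline schedule. This schedule is an assumption chosen for illustration; the description above does not say which schedule the harness uses. The names `forward_schedule` and `bubble_fraction` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns table[tick][stage] = index of the microbatch that stage processes
// at that tick, or -1 if the stage is idle. In a fill-and-drain schedule,
// stage s starts microbatch m at tick s + m.
std::vector<std::vector<int>> forward_schedule(int stages, int microbatches) {
    assert(stages > 0 && microbatches > 0);
    const int ticks = stages + microbatches - 1;
    std::vector<std::vector<int>> table(
        static_cast<std::size_t>(ticks),
        std::vector<int>(static_cast<std::size_t>(stages), -1));
    for (int t = 0; t < ticks; ++t) {
        for (int s = 0; s < stages; ++s) {
            const int m = t - s;
            if (m >= 0 && m < microbatches) {
                table[static_cast<std::size_t>(t)][static_cast<std::size_t>(s)] = m;
            }
        }
    }
    return table;
}

// Fraction of stage-ticks spent idle, assuming every microbatch takes the
// same time on every stage: (stages - 1) / (microbatches + stages - 1).
double bubble_fraction(int stages, int microbatches) {
    assert(stages > 0 && microbatches > 0);
    return static_cast<double>(stages - 1) /
           static_cast<double>(microbatches + stages - 1);
}
```

Under this model, increasing the number of microbatches relative to the number of stages shrinks the idle fraction, which is the usual motivation for choosing the microbatch count automatically.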