Projects - Chora

The space where the abstract takes form.

Completed

CUDA SGEMM Optimization

An iterative, hardware-aware optimization of single-precision matrix multiplication (SGEMM) on NVIDIA Ampere and Hopper architectures, reaching 94% of cuBLAS throughput. Implements multi-stage tiling, double-buffered shared-memory (SMEM) pipelining, and warp-specialized instruction scheduling.
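The tiling idea can be sketched on the CPU in plain C++. This is a minimal illustration of blocked accumulation only, not the project's kernel: the function name `sgemm_tiled` and the tile size `kTile` are hypothetical, and shared-memory staging, double buffering, and warp specialization are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile edge; a real GPU kernel would tune this per architecture.
constexpr std::size_t kTile = 4;

// Computes C = A * B for row-major matrices: A is M x K, B is K x N, C is M x N.
// The three outer loops walk tiles; each tile of C is accumulated from the
// matching tiles of A and B, mirroring the block-level access pattern of a
// shared-memory tiled GPU kernel.
std::vector<float> sgemm_tiled(const std::vector<float>& A,
                               const std::vector<float>& B,
                               std::size_t M, std::size_t N, std::size_t K) {
    assert(A.size() == M * K && B.size() == K * N);
    std::vector<float> C(M * N, 0.0f);
    for (std::size_t i0 = 0; i0 < M; i0 += kTile) {
        for (std::size_t j0 = 0; j0 < N; j0 += kTile) {
            for (std::size_t k0 = 0; k0 < K; k0 += kTile) {
                // Clamp tile bounds so sizes need not be multiples of kTile.
                const std::size_t i1 = std::min(i0 + kTile, M);
                const std::size_t j1 = std::min(j0 + kTile, N);
                const std::size_t k1 = std::min(k0 + kTile, K);
                for (std::size_t i = i0; i < i1; ++i) {
                    for (std::size_t k = k0; k < k1; ++k) {
                        const float a = A[i * K + k];  // reused across the j loop
                        for (std::size_t j = j0; j < j1; ++j) {
                            C[i * N + j] += a * B[k * N + j];
                        }
                    }
                }
            }
        }
    }
    return C;
}
```

On a GPU, the two inner tile operands would typically be staged in shared memory and the innermost accumulation held in registers; the loop structure above only shows how the work is partitioned.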

CUDA C++, SASS Assembly, Nsight Compute, GPGPU Architecture

Active Development

Compact DSL for GPU Kernels

A lightweight embedded domain-specific language (DSL) written in C++20 for generating highly optimized Tensor Core operations. Encodes tile dimensions in the type system for compile-time layout verification and zero-overhead abstraction.

C++20 Templates, NVIDIA PTX Assembly, Compiler Design, LLVM

Completed

Distributed Pipeline Parallel Harness

A high-performance training harness for multi-node distributed training of deep learning models. Implements pipeline and tensor parallelism with activation checkpointing, communication-computation overlap, and automatic microbatch scheduling.
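Microbatch scheduling can be illustrated with a small schedule generator. The description does not say which schedule the harness uses, so the sketch below assumes a one-forward-one-backward (1F1B) schedule, and the function name `one_f_one_b` is hypothetical. It is written in C++ for illustration rather than in the harness's own Python.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Returns the ordered operations for one pipeline stage under an assumed
// 1F1B schedule. "F<i>" is the forward pass of microbatch i and "B<i>" is
// its backward pass. Stages are numbered from 0 (first) to num_stages - 1.
std::vector<std::string> one_f_one_b(int stage, int num_stages, int num_microbatches) {
    assert(0 <= stage && stage < num_stages && num_microbatches > 0);
    // Earlier stages run more warm-up forwards before their first backward.
    const int warmup = std::min(num_stages - stage - 1, num_microbatches);
    std::vector<std::string> ops;
    int f = 0;  // next microbatch to run forward
    int b = 0;  // next microbatch to run backward
    for (; f < warmup; ++f) ops.push_back("F" + std::to_string(f));
    // Steady state: alternate one forward with one backward.
    while (f < num_microbatches) {
        ops.push_back("F" + std::to_string(f++));
        ops.push_back("B" + std::to_string(b++));
    }
    // Cool-down: drain the backward passes still outstanding.
    while (b < num_microbatches) ops.push_back("B" + std::to_string(b++));
    return ops;
}
```

Under this schedule a stage holds activations for at most `warmup + 1` microbatches at a time, which is the usual reason to prefer it over running all forwards before any backward.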

Python, PyTorch Core, NCCL, Distributed Systems