RESEARCH & DEVELOPMENT • GPU COMPUTING • DEEP LEARNING KERNELS
Hi! I am Achille Triomphe. I design, compile, and optimize high-performance AI kernels and deep learning systems. This space serves as a scientific worklog where I share in-depth code dissections, mathematical proofs, and optimizations close to the GPU metal.
Peripatos — Thinking is done in motion.
LoRA Without Regret
An exploration of when low-rank adaptation matches full fine-tuning, covering learning rate invariance, capacity limits, and the geometry of parameter updates.
Dissecting ThunderKittens: Anatomy of a Compact DSL for High-Performance AI Kernels
A deep dive into Stanford's embedded DSL for CUDA, exploring how its 16×16 tile abstractions map directly to Tensor Cores, shared memory, and warp-group MMA on Hopper.
Chora — The space where the abstract takes form.
CUDA SGEMM Optimization
An iterative, hardware-aware optimization of single-precision matrix multiplication (SGEMM) on NVIDIA Ampere and Hopper architectures, achieving 94% of cuBLAS peak performance. Implements multi-stage tiling, double-buffered shared-memory (SMEM) pipelining, and warp-specialized instruction scheduling.
Compact DSL for GPU Kernels
A lightweight embedded domain-specific language (DSL) written in C++20 for generating highly optimized Tensor Core operations. Encodes tile dimensions in the type system for compile-time layout verification and zero-overhead abstraction.
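To make "tile dimensions in the type system" concrete, here is a minimal host-side sketch in plain C++. It is not the DSL's actual API: the names `Tile` and `mma`, the row-major layout, and the multiple-of-16 constraint are all assumptions chosen for illustration. The point is only that a shape mismatch becomes a compile error instead of a runtime bug.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical tile type (illustrative, not the DSL's real API).
// Rows and Cols are template parameters, so the shape is part of the type.
template <typename T, std::size_t Rows, std::size_t Cols>
struct Tile {
    // Assumption: tiles are built from 16x16 base blocks.
    static_assert(Rows % 16 == 0 && Cols % 16 == 0,
                  "tile dimensions must be multiples of 16");

    static constexpr std::size_t rows = Rows;
    static constexpr std::size_t cols = Cols;

    // Row-major storage, zero-initialized.
    std::array<T, Rows * Cols> data{};

    T& at(std::size_t r, std::size_t c) {
        assert(r < Rows && c < Cols);
        return data[r * Cols + c];
    }
    const T& at(std::size_t r, std::size_t c) const {
        assert(r < Rows && c < Cols);
        return data[r * Cols + c];
    }
};

// Accumulating multiply: c += a * b.
// M, K and N are deduced from the argument types, so a call whose inner
// dimensions disagree fails template deduction and does not compile.
template <typename T, std::size_t M, std::size_t K, std::size_t N>
void mma(Tile<T, M, N>& c, const Tile<T, M, K>& a, const Tile<T, K, N>& b) {
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            T acc = c.at(i, j);
            for (std::size_t k = 0; k < K; ++k) {
                acc += a.at(i, k) * b.at(k, j);
            }
            c.at(i, j) = acc;
        }
    }
}
```

With these definitions, `mma` accepts a `Tile<float, 16, 32>` times a `Tile<float, 32, 16>` accumulated into a `Tile<float, 16, 16>`, while passing a `Tile<float, 16, 16>` as the second operand is rejected at compile time. Because the shape lives only in the type, the struct carries no size fields at run time, which is one plausible reading of "zero-overhead abstraction" here.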
Distributed Pipeline Parallel Harness
A training harness for multi-node distributed training of deep learning models. Implements pipeline and tensor parallelism with activation checkpointing, communication-computation overlap, and automatic microbatch scheduling.