ACHILLE TRIOMPHE

RESEARCH & DEVELOPMENT • GPU COMPUTING • DEEP LEARNING KERNELS

Hi! I am Achille Triomphe. I design, compile, and optimize high-performance AI kernels and deep learning systems. This space serves as a scientific worklog where I share in-depth code dissections, mathematical proofs, and optimizations close to the GPU metal.

Peripatos — Thinking is done in motion.

Jun 1, 2026

LoRA Without Regret

An exploration of when low-rank adaptation matches full fine-tuning, covering learning rate invariance, capacity limits, and the geometry of parameter updates.

May 21, 2026

Dissecting ThunderKittens: Anatomy of a Compact DSL for High-Performance AI Kernels

A deep dive into Stanford's embedded DSL for CUDA, exploring how its 16×16 tile abstractions map directly to Tensor Cores, shared memory, and warp-group MMA on Hopper.

Chora — The space where the abstract takes form.

CUDA SGEMM Optimization

An iterative, hardware-aware optimization of single-precision matrix multiplication (SGEMM) on NVIDIA Ampere & Hopper architectures, achieving 94% of cuBLAS peak performance. Implements multi-stage tiling, double-buffered SMEM pipelining, and warp-specialized instruction scheduling.

CUDA C++ • SASS Assembly • Nsight Compute • GPGPU Architecture
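
To give a feel for the tiling idea behind the SGEMM project, here is a minimal CPU-side sketch of loop tiling (cache blocking) in plain C++. It is not the CUDA kernel itself: the function name `sgemm_tiled`, the `tile` parameter, and the row-major layout are assumptions made for illustration, and the shared-memory pipelining and warp specialization mentioned above are deliberately left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Computes C = A * B for row-major matrices: A is M x K, B is K x N, C is M x N.
// `tile` is a hypothetical block edge; on a GPU the analogous block would be
// staged in shared memory, here it only improves cache reuse.
inline std::vector<float> sgemm_tiled(const std::vector<float>& A,
                                      const std::vector<float>& B,
                                      std::size_t M, std::size_t N, std::size_t K,
                                      std::size_t tile = 32) {
    assert(tile > 0);
    assert(A.size() == M * K && B.size() == K * N);
    std::vector<float> C(M * N, 0.0f);

    // Outer loops walk over tiles; std::min handles ragged edges.
    for (std::size_t i0 = 0; i0 < M; i0 += tile) {
        const std::size_t i1 = std::min(i0 + tile, M);
        for (std::size_t k0 = 0; k0 < K; k0 += tile) {
            const std::size_t k1 = std::min(k0 + tile, K);
            for (std::size_t j0 = 0; j0 < N; j0 += tile) {
                const std::size_t j1 = std::min(j0 + tile, N);
                // Inner loops accumulate one tile-sized partial product into C.
                for (std::size_t i = i0; i < i1; ++i) {
                    for (std::size_t k = k0; k < k1; ++k) {
                        const float a = A[i * K + k];
                        for (std::size_t j = j0; j < j1; ++j) {
                            C[i * N + j] += a * B[k * N + j];
                        }
                    }
                }
            }
        }
    }
    return C;
}
```

Calling `sgemm_tiled(A, B, 2, 2, 3, 2)` on a 2×3 and a 3×2 matrix exercises the ragged-edge path, since the inner dimension 3 is not a multiple of the tile size 2.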

Compact DSL for GPU Kernels

A lightweight embedded domain-specific language (DSL) written in C++20 for generating highly optimized Tensor Core operations. Encodes tile dimensions in the type system for compile-time layout verification and zero-overhead abstraction.

C++20 Templates • NVIDIA PTX Assembly • Compiler Design • LLVM
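
As a rough illustration of what encoding tile dimensions in the type system can look like, the sketch below uses a hypothetical `Tile<T, Rows, Cols>` template and a hypothetical `mma` function. Neither name comes from the actual DSL; this is host-side C++ under those assumptions, meant only to show how a shape mismatch becomes a compile error rather than a runtime bug.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical tile type: the shape lives in the type, not in runtime fields.
template <typename T, std::size_t Rows, std::size_t Cols>
struct Tile {
    static constexpr std::size_t rows = Rows;
    static constexpr std::size_t cols = Cols;
    std::array<T, Rows * Cols> data{};  // row-major, zero-initialized

    T& at(std::size_t r, std::size_t c) { return data[r * Cols + c]; }
    const T& at(std::size_t r, std::size_t c) const { return data[r * Cols + c]; }
};

// C += A * B. M, K, and N are deduced from the argument types, so operands
// whose shapes do not line up fail template deduction at compile time.
template <typename T, std::size_t M, std::size_t K, std::size_t N>
void mma(Tile<T, M, N>& c, const Tile<T, M, K>& a, const Tile<T, K, N>& b) {
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < N; ++j)
            for (std::size_t k = 0; k < K; ++k)
                c.at(i, j) += a.at(i, k) * b.at(k, j);
}

// Small demo: multiply two scaled identity tiles and read one diagonal entry.
inline float tile_demo() {
    Tile<float, 16, 16> a, b, c;
    static_assert(decltype(c)::rows == 16 && decltype(c)::cols == 16,
                  "shape is available at compile time");
    for (std::size_t i = 0; i < 16; ++i) {
        a.at(i, i) = 2.0f;
        b.at(i, i) = 3.0f;
    }
    mma(c, a, b);
    // Tile<float, 16, 8> bad;
    // mma(c, a, bad);  // would not compile: N deduced as both 16 and 8
    return c.at(5, 5);
}
```

Because the shape is a template parameter, there is no per-tile size field to store or check at runtime, which is one way to read "zero-overhead abstraction" in this context.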

Distributed Pipeline Parallel Harness

High-performance training harness designed to optimize multi-node distributed setups for deep learning models. Implements pipeline and tensor parallelism with activation checkpointing, communication-computation overlap, and automatic microbatch scheduling.

Python • PyTorch Core • NCCL • Distributed Systems
