ACHILLE TRIOMPHE

RESEARCH & DEVELOPMENT • GPU COMPUTING • DEEP LEARNING KERNELS

Hi! I am Achille Triomphe. I design, compile, and optimize high-performance AI kernels and deep learning systems. This space serves as a scientific worklog where I share in-depth code dissections, mathematical proofs, and optimizations close to the GPU metal.

Peripatos — Thinking is done in motion.

Chora — The space where the abstract takes form.

CUDA SGEMM Optimization

An iterative, hardware-aware optimization of single-precision matrix multiplication (SGEMM) on NVIDIA Ampere and Hopper architectures, reaching 94% of cuBLAS performance. Implements multi-stage tiling, double-buffered shared-memory (SMEM) pipelining, and warp-specialized instruction scheduling.
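
To give a feel for the tiling idea, here is a minimal CPU-side sketch in plain C++. It is not the CUDA kernel from the project: the function name sgemm_tiled and the block size TILE are hypothetical, and the loop blocking only mirrors, on a CPU cache, the kind of data reuse that shared-memory tiling aims for on a GPU.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical block size, chosen for illustration only.
constexpr std::size_t TILE = 32;

// Computes C = A * B for row-major matrices.
// A is M x K, B is K x N, and the returned C is M x N.
std::vector<float> sgemm_tiled(const std::vector<float>& A,
                               const std::vector<float>& B,
                               std::size_t M, std::size_t N, std::size_t K) {
    std::vector<float> C(M * N, 0.0f);
    // Outer loops walk over tiles; inner loops stay inside one tile so the
    // same block of A and B is reused while it is still in cache.
    for (std::size_t i0 = 0; i0 < M; i0 += TILE) {
        for (std::size_t k0 = 0; k0 < K; k0 += TILE) {
            for (std::size_t j0 = 0; j0 < N; j0 += TILE) {
                const std::size_t i_end = std::min(i0 + TILE, M);
                const std::size_t k_end = std::min(k0 + TILE, K);
                const std::size_t j_end = std::min(j0 + TILE, N);
                for (std::size_t i = i0; i < i_end; ++i) {
                    for (std::size_t k = k0; k < k_end; ++k) {
                        const float a = A[i * K + k];
                        for (std::size_t j = j0; j < j_end; ++j) {
                            C[i * N + j] += a * B[k * N + j];
                        }
                    }
                }
            }
        }
    }
    return C;
}
```

On a GPU the same decomposition would map tiles to thread blocks and stage them through shared memory; that part is deliberately left out of this sketch.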

Compact DSL for GPU Kernels

A lightweight embedded domain-specific language (DSL) written in C++20 for generating highly optimized Tensor Core operations. Encodes tile dimensions in the type system for compile-time layout verification and zero-overhead abstraction.
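
As an illustration of what encoding tile dimensions in the type system can look like, here is a small C++20 sketch. The names Tile and multiply are hypothetical and are not the DSL's actual API; the sketch only shows how a shape mismatch can become a compile-time error instead of a runtime one.

```cpp
#include <array>
#include <cstddef>

// Hypothetical tile type: the shape is part of the type, not runtime state.
template <typename T, std::size_t Rows, std::size_t Cols>
struct Tile {
    static constexpr std::size_t rows = Rows;
    static constexpr std::size_t cols = Cols;
    std::array<T, Rows * Cols> data{};  // row-major storage

    constexpr T& operator()(std::size_t r, std::size_t c) {
        return data[r * Cols + c];
    }
    constexpr const T& operator()(std::size_t r, std::size_t c) const {
        return data[r * Cols + c];
    }
};

// The shared dimension K appears in both parameter types, so multiplying
// tiles with incompatible shapes fails template deduction at compile time.
template <typename T, std::size_t M, std::size_t K, std::size_t N>
constexpr Tile<T, M, N> multiply(const Tile<T, M, K>& a,
                                 const Tile<T, K, N>& b) {
    Tile<T, M, N> c{};
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t k = 0; k < K; ++k)
            for (std::size_t j = 0; j < N; ++j)
                c(i, j) += a(i, k) * b(k, j);
    return c;
}
```

With these definitions, multiplying a Tile<float, 2, 3> by a Tile<float, 3, 2> yields a Tile<float, 2, 2>, while passing two Tile<float, 2, 3> operands does not compile. Because the shape lives in the type, no dimension fields are stored or checked at runtime.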

Distributed Pipeline Parallel Harness

A high-performance training harness for multi-node distributed training of deep learning models. Implements pipeline and tensor parallelism with activation checkpointing, communication-computation overlap, and automatic microbatch scheduling.