Senior Performance Engineer - Pretraining (f/m/d)

Job Description:

Our Mission

Aleph Alpha is one of the few companies in Europe doing serious foundation model pre‑training. Our customers in finance, manufacturing, and public administration need models that understand German, meet European regulatory requirements, and work reliably in high‑stakes settings. We're building that in Heidelberg.

Team Culture

At Aleph Alpha, we foster a culture built on ownership, autonomy, and empowerment. Teams and individual contributors are trusted to take responsibility for their work and drive meaningful impact. We maintain a flat organizational structure with efficient, supportive management that enables quick decision‑making, open communication, and a strong sense of shared purpose.

About The Role

You will engineer the systems required to train foundation models at scale. Your objective is to maximize hardware utilization and training throughput on our large‑scale GPU clusters (thousands of NVIDIA Blackwell GPUs). You will work at the intersection of deep learning frameworks, distributed systems, and GPU microarchitecture, eliminating bottlenecks from the Python layer down to the GPU kernel.

This role is for Aleph Alpha Research GmbH.

Your Responsibilities

End‑to‑End Optimization: Profile training loops using PyTorch Profiler, Nsight Systems and Nsight Compute to identify system‑ and kernel‑level bottlenecks in order to maximize model throughput.
Distributed Strategy and Topology: Configure and tune composite parallelism strategies (e.g. TP, DP, HSDP/FSDP, EP), optimizing load balance, minimizing critical‑path bottlenecks, and managing communication‑to‑computation trade‑offs for large‑scale LLM training.
Hardware‑Aware Modeling: Partner with AI Researchers to define model architectures for hardware efficiency without compromising convergence.

Your Profile

Basic Qualifications

Proficient in Python and the PyTorch library.
Strong engineering background in parallel and/or distributed systems with a proven track record of excellence.
Hands‑on experience with modern machine learning techniques (especially large language models and their life cycle).
Deep understanding of the CUDA programming model.
Experience in distributed programming with APIs like NCCL or MPI.
Experience analyzing profiling traces with tools such as PyTorch Profiler and NVIDIA Nsight.
Please note this role requires regular on‑site collaboration in Heidelberg as a member of the Training Efficiency Team.

Preferred Qualifications

Contributions to modern distributed training frameworks (e.g., TorchTitan, Megatron‑LM, DeepSpeed).
Familiarity with low‑precision training formats (MXFP4, MXFP8) and their impact on numerical stability and throughput.
A deep understanding of NCCL communication primitives, NVSHMEM, or CUDA IPC, and their performance characteristics.
A proven track record of implementing and optimizing modern transformer‑based model training.
A proven track record working on the NVIDIA Blackwell architecture.