AI Trace Generation Engineer

Job description:

Design and implement a trace collection system for distributed LLM workloads, capturing compute operations, communication primitives, memory usage, and cluster topology across multi-GPU and multi-node setups
Validate that collected traces accurately reflect real workload behavior by verifying operation completeness, timing consistency, and data integrity across inference and training pipelines
Integrate with and instrument major LLM frameworks (vLLM, TensorRT-LLM, DeepSpeed, Megatron-LM, and others) to extract meaningful execution data without degrading workload performance
Use collected traces as input to discrete event simulations that model and replay distributed AI workload behavior at scale
Analyze trace data to surface bottlenecks and inefficiencies across the stack, from individual kernel execution to cluster-wide communication patterns

3+ years of experience in AI systems, ML infrastructure, or a closely related area
Hands‑on experience with at least one major LLM serving or training framework
Strong proficiency in Python and C++, with a solid understanding of GPU architecture, memory bandwidth, and the difference between compute‑bound and memory‑bound operations
Solid understanding of distributed communication
Familiarity with parallelism strategies and how they shape execution behavior across large clusters
Open-source contributions or published research in relevant areas are a strong plus
Previous startup experience is a plus: we move fast and value people who are comfortable with that

Build something big: Help build and scale a fast‑growing AI infrastructure startup
Pay & perks: Competitive compensation with a performance‑based incentive, subsidized Deutschlandticket, and access to a discount portal
Work your way: Flexible hours with hybrid and remote‑friendly options
Fast lanes, no red tape: Flat hierarchies and rapid decision‑making mean ideas ship quickly
Global team: Work with a diverse, international team across Germany and the USA
Modern headquarters: Well‑stocked office near the Heidelberg Hauptbahnhof, available on a hybrid basis or as a place to connect during our quarterly team workshops
Top setup: Your choice of high‑quality hardware and equipment
Relocation support: We’ll help make your move to join us as smooth as possible