AI Trace Generation Engineer

Job description:

Your mission

To support our growing team, we are looking for an experienced AI Trace Generation Engineer to join us. In this role, you will take both a strategic and hands‑on approach to designing and building systems that enable deep visibility into distributed AI workloads. This includes developing trace collection, instrumentation, and simulation capabilities that help optimize performance across large‑scale, multi‑GPU environments. You will work at the intersection of machine learning and systems engineering, contributing to the core infrastructure powering next‑generation AI workloads.

Design and implement a trace collection system for distributed LLM workloads, capturing compute operations, communication primitives, memory usage, and cluster topology across multi-GPU and multi-node setups
Validate that collected traces accurately reflect real workload behavior, verifying operation completeness, timing consistency, and data integrity across inference and training pipelines
Integrate with and instrument major LLM frameworks (vLLM, TensorRT-LLM, DeepSpeed, Megatron-LM and others) to extract meaningful execution data without disrupting performance
Use collected traces as input to discrete event simulations that model and replay distributed AI workload behavior at scale
Analyze trace data to surface bottlenecks and inefficiencies across the stack, from individual kernel execution to cluster-wide communication patterns

Your profile

3+ years of experience in AI systems, ML infrastructure, or a closely related area
Hands‑on experience with at least one major LLM serving or training framework
Strong proficiency in Python and C++, with a solid understanding of GPU architecture, memory bandwidth, and the difference between compute‑bound and memory‑bound operations
Solid understanding of distributed communication
Familiarity with parallelism strategies and how they shape execution behavior across large clusters

Nice to have

Open source contributions or published research in relevant areas
Experience in startup environments, with the ability to move quickly, navigate ambiguity, and take ownership

Why us?

Build something big: Help build and scale a fast-growing AI infrastructure startup
Pay & perks: Competitive compensation with a performance‑based incentive, subsidized Deutschlandticket, and access to a discount portal
Work your way: Flexible hours with hybrid and remote-friendly options
Fast lanes, no red tape: Flat hierarchies and rapid decision‑making mean ideas ship quickly
Global team: Work with a diverse, international team across Germany and the USA
Modern headquarters: Well-stocked office near the Heidelberg Hauptbahnhof, available on a hybrid basis or as a place to connect during our quarterly team workshops
Top setup: Your choice of high‑quality hardware and equipment
Relocation support: We’ll help make your move to join us as smooth as possible