AI Trace Generation Engineer

Job description:

Your mission

We are looking for an experienced AI Trace Generation Engineer to join our growing team. In this role, you will take both a strategic and hands-on approach to designing and building systems that give deep visibility into distributed AI workloads. This includes developing trace collection, instrumentation, and simulation capabilities that help optimize performance across large-scale, multi-GPU environments. You will work at the intersection of machine learning and systems engineering, contributing to the core infrastructure that powers next-generation AI workloads.

  • Design and implement a trace collection system for distributed LLM workloads, capturing compute operations, communication primitives, memory usage, and cluster topology across multi-GPU and multi-node setups
  • Validate that collected traces accurately reflect real workload behavior, verifying operation completeness, timing consistency, and data integrity across inference and training pipelines
  • Integrate with and instrument major LLM frameworks (vLLM, TensorRT-LLM, DeepSpeed, Megatron-LM and others) to extract meaningful execution data without disrupting performance
  • Use collected traces as input to discrete-event simulations that model and replay distributed AI workload behavior at scale
  • Analyze trace data to surface bottlenecks and inefficiencies across the stack, from individual kernel execution to cluster-wide communication patterns

Your profile

  • 3+ years of experience in AI systems, ML infrastructure, or a closely related area
  • Hands-on experience with at least one major LLM serving or training framework
  • Strong proficiency in Python and C++, with a solid understanding of GPU architecture, memory bandwidth, and the difference between compute-bound and memory-bound operations
  • Solid understanding of distributed communication
  • Familiarity with parallelism strategies and how they shape execution behavior across large clusters

Nice to have

  • Open source contributions or published research in relevant areas
  • Experience in startup environments, with the ability to move quickly, navigate ambiguity, and take ownership

Why us?

  • Build something big: Help build and scale a fast-growing AI infrastructure startup
  • Pay & perks: Competitive compensation with a performance-based incentive, subsidized Deutschlandticket, and access to a discount portal
  • Work your way: Flexible hours with hybrid and remote-friendly options
  • Fast lanes, no red tape: Flat hierarchies and rapid decision-making mean ideas ship quickly
  • Global team: Work with a diverse, international team across Germany and the USA
  • Modern headquarters: Well-stocked office near the Heidelberg Hauptbahnhof, available on a hybrid basis or as a place to connect during our quarterly team workshops
  • Top setup: Your choice of high-quality hardware and equipment
  • Relocation support: We’ll help make your move to join us as smooth as possible

Job information

  • Publication date: 02 May 2026
  • Location: Heidelberg
  • Type: Full-time
  • Work model: On-site
  • Experience: 2+ years
  • Employment type: Employee
