AI Trace Generation Engineer

Job description:

Your mission

  • Design and implement a trace collection system for distributed LLM workloads, capturing compute operations, communication primitives, memory usage, and cluster topology across multi-GPU and multi-node setups
  • Validate that collected traces accurately reflect real workload behavior - verifying operation completeness, timing consistency, and data integrity across inference and training pipelines
  • Integrate with and instrument major LLM frameworks (vLLM, TensorRT-LLM, DeepSpeed, Megatron-LM and others) to extract meaningful execution data without disrupting performance
  • Use collected traces as input to discrete event simulations that model and replay distributed AI workload behavior at scale
  • Analyze trace data to surface bottlenecks and inefficiencies across the stack, from individual kernel execution to cluster-wide communication patterns

Your profile

  • 3+ years of experience in AI systems, ML infrastructure, or a closely related area
  • Hands‑on experience with at least one major LLM serving or training framework
  • Strong proficiency in Python and C++, with a solid understanding of GPU architecture, memory bandwidth, and the difference between compute‑bound and memory‑bound operations
  • Solid understanding of distributed communication
  • Familiarity with parallelism strategies and how they shape execution behavior across large clusters
  • Open-source contributions or published research in relevant areas are a strong plus
  • Previous startup experience is a plus - we move fast and value people who are comfortable with that

Why us?

  • Build something big: Help build and scale a fast‑growing AI infrastructure startup
  • Pay & perks: Competitive compensation with a performance‑based incentive, subsidized Deutschlandticket, and access to a discount portal
  • Work your way: Flexible hours with hybrid and remote‑friendly options
  • Fast lanes, no red tape: Flat hierarchies and rapid decision‑making mean ideas ship quickly
  • Global team: Work with a diverse, international team across Germany and the USA
  • Modern headquarters: Well‑stocked office near the Heidelberg Hauptbahnhof, available on a hybrid basis or as a place to connect during our quarterly team workshops
  • Top setup: Your choice of high‑quality hardware and equipment
  • Relocation support: We’ll help make your move to join us as smooth as possible

Note: Please mention Fuchsjobs as the source of your application.

Job information

  • Publication date: 21 Mar 2026
  • Location: WorkFromHome
  • Type: Full-time
  • Work model: On-site
  • Category:
  • Experience: 2+ years
  • Employment type: Employee
