Member of Technical Staff - Training Cluster Engineer

Black Forest Labs Inc. Freiburg im Breisgau

Job description:

Overview

Black Forest Labs is a cutting-edge startup pioneering generative image and video models. Our team, which invented Stable Diffusion, Stable Video Diffusion, and FLUX.1, is currently looking for a strong candidate to join us in developing and maintaining our large GPU training clusters.

Role & Responsibilities

  • Design, deploy, and maintain large-scale ML training clusters running SLURM for distributed workload orchestration
  • Implement comprehensive node health monitoring systems with automated failure detection and recovery workflows
  • Partner with cloud and colocation providers to ensure cluster availability and performance
  • Establish and enforce security best practices across the ML infrastructure stack (network, storage, compute)
  • Build and maintain developer-facing tools and APIs that streamline ML workflows and improve researcher productivity
  • Collaborate directly with ML research teams to translate computational requirements into infrastructure capabilities and capacity planning

Required Experience

  • Production experience managing SLURM clusters at scale, including job scheduling policies, resource allocation, and federation
  • Hands-on experience with Docker, Enroot/Pyxis, or similar container runtimes in HPC environments
  • Proven track record managing GPU clusters, including driver management and DCGM monitoring

Preferred Qualifications

  • Understanding of distributed training patterns, checkpointing strategies, and data pipeline optimization
  • Experience with Kubernetes for containerized workloads, particularly for inference or mixed compute environments
  • Experience with high-performance interconnects (InfiniBand, RoCE) and NCCL optimization for multi-node training
  • Track record of managing 1000+ GPU training runs, with deep understanding of failure modes and recovery patterns
  • Familiarity with high-performance storage solutions (VAST, blob storage) and their performance characteristics for ML workloads
  • Experience running hybrid training/inference infrastructure with appropriate resource isolation
  • Strong scripting skills (Python, Bash) and infrastructure-as-code experience


Job information

  • Type: Full-time
  • Work model: On-site
  • Category:
  • Experience: 2+ years
  • Employment: Employee
  • Published: 04 Nov 2025
  • Location: Freiburg im Breisgau
