DeepRec.ai

Research Engineer - Training Optimization & Infrastructure

Job description:

Research Engineer – Training Optimisation and Infrastructure

Location: Berlin - Remote within Europe (±2 hours CET)

Level: Mid to Staff

Package: Competitive salary plus equity

The Opportunity

A Series A generative AI company is hiring a Research Engineer to drive optimisation across training strategy and ML infrastructure. The business builds state-of-the-art audio and music generation models and is backed by a leading generative AI fund. The team includes researchers and engineers from Google Brain, Meta FAIR, Amazon, ETH Zürich, and Max Planck.

Role Summary

You will focus on optimising end-to-end training pipelines for large generative models. This includes GPU-level performance tuning, distributed systems work, and driving efficiency across data, storage, orchestration, and experimentation systems.

Key Responsibilities

  • Develop and refine training strategies, including parallelism approaches and precision choices, for varied model scales and compute profiles
  • Profile, debug, and optimise single-GPU and multi-GPU workloads using tools such as Nsight
  • Improve training pipelines covering data storage, data loading, distributed training, checkpointing, and logging
  • Build scalable systems for experiment tracking, model and data versioning, and experiment insights
  • Design, deploy, and maintain large-scale training clusters using SLURM

Ideal Experience

  • Strong hands-on experience optimising training and inference workloads
  • Deep understanding of GPU memory hierarchy and hardware performance limits
  • Experience tuning both memory-bound and compute-bound operations
  • Knowledge of efficient attention algorithms and their performance implications at different scales

Nice-to-Have

  • Experience writing custom GPU kernels and integrating them into PyTorch
  • Familiarity with diffusion or autoregressive models
  • Understanding of high-performance storage solutions such as VAST
  • Experience running SLURM clusters at scale

Why Apply

  • Work on frontier audio and music generation models
  • Influence training strategy and infrastructure at scale
  • Join a high-calibre research and engineering team

Job Information

  • Type: Full-time
  • Work model: On-site
  • Experience: 2+ years
  • Employment status: Employed
  • Published: 10 Dec 2025
  • Location: Work from home