Research Engineer - Training Optimisation & Infrastructure

DeepRec.ai (work from home)

Job description:

Research Engineer – Training Optimisation and Infrastructure

Location: Berlin - Remote within Europe (±2 hours of CET)

Level: Mid to Staff

Package: Competitive salary plus equity

The Opportunity

A Series A generative AI company is hiring a Research Engineer to drive optimisation across training strategy and ML infrastructure. The business builds state-of-the-art audio and music generation models and is backed by a leading generative AI fund. The team includes researchers and engineers from Google Brain, Meta FAIR, Amazon, ETH Zürich, and Max Planck.

Role Summary

You will focus on optimising end-to-end training pipelines for large generative models. This includes GPU-level performance tuning, distributed systems work, and driving efficiency across data, storage, orchestration, and experimentation systems.

Key Responsibilities

Develop and refine training strategies, including parallelism approaches and precision choices, for varied model scales and compute profiles
Profile, debug, and optimise single-GPU and multi-GPU workloads using tools such as Nsight
Improve training pipelines covering data storage, data loading, distributed training, checkpointing, and logging
Build scalable systems for experiment tracking, model and data versioning, and experiment insights
Design, deploy, and maintain large-scale training clusters using SLURM

Ideal Experience

Strong hands-on experience optimising training and inference workloads
Deep understanding of GPU memory hierarchy and hardware performance limits
Experience tuning both memory-bound and compute-bound operations
Knowledge of efficient attention algorithms and their performance implications at different scales