Amazon Web Services (AWS)

Job Description:

Principal Engineer – Systems for ML Inference and Training Optimization, Deep Science for Systems and Services

Join Amazon Web Services (AWS) as a Principal Engineer focused on ML systems, training and inference optimization. This role leads the technical strategy and implementation of next‑generation AI performance at scale across distributed training and high‑throughput inference serving.

Overview

You will set the technical direction for kernel‑level optimizations, define architectural strategies for heterogeneous compute platforms, architect multi‑GPU and multi‑node training systems, and deliver solutions that fundamentally change how AWS serves ML training and inference workloads.

Key Responsibilities

  • Define and drive the technical strategy and architectural roadmap for ML inference and training optimization across teams.
  • Lead the design and architecture of kernel‑level optimizations spanning NVIDIA GPUs, AWS Inferentia/Trainium, and emerging AI accelerators.
  • Tackle the most difficult performance challenges, achieving order‑of‑magnitude improvements (10× or greater).
  • Drive the design, implementation, and delivery of performance solutions at program level that address large or endemic customer and business problems.
  • Build a deep understanding of new SoCs, GPUs, and AI accelerators; influence hardware selection decisions through fact‑based analysis.
  • Set the standard for engineering excellence; create mechanisms, tools, and processes that enable performance measurement, analysis, and optimization at scale across teams.
  • Guide the career growth of senior engineers; mentor and develop the next generation of performance engineering leaders.
  • Remain a practitioner – personally writing critical‑path code, designing zero‑overhead portable libraries, and prototyping solutions that inform technical direction for the organization.

Qualifications

  • 10+ years of software development experience with demonstrated progression in technical leadership and impact.
  • Expert‑level proficiency in C/C++ and low‑level systems programming, with a proven track record of delivering order‑of‑magnitude performance improvements in production systems.
  • Extensive experience with CUDA programming, GPU architecture, assembly‑level optimization, and kernel development across multiple hardware platforms.
  • Demonstrated ability to lead organization‑level technical initiatives spanning multiple teams, building consensus on contentious technical decisions and driving architectural strategy.
  • Experience defining technical roadmaps, conducting performance analysis and resource budgeting, and translating system analysis into strategic development plans.
  • Master’s degree (or higher) in Computer Science, Computer Engineering, or related technical field with 15+ years of performance engineering experience (preferred).
  • Experience optimizing ML inference and/or training workloads across diverse hardware platforms (GPUs, AWS Neuron/Inferentia, and other accelerators).
  • Deep expertise across multiple hardware architectures and platforms (x86, ARM, multiple GPU generations, SoCs, custom accelerators), with the ability to quickly master new hardware platforms.
  • Track record of developing portable, high‑performance libraries, tools, or frameworks used across engineering organizations or open‑source projects with significant adoption.
  • Proven ability to build a deep understanding of complex systems and create performance measurement/analysis tools that provide critical insights for organization‑wide use.
  • Entrepreneurial experience including startup founding, CTO role, or driving technical vision in product development environments (preferred).

About the Team

Deep Science for Systems and Services (DS3) is a science organization within AWS Compute & ML Services focused on advancing AI/ML technologies at the systems level. Our team develops optimizations for large model inference across diverse hardware platforms, delivering order‑of‑magnitude performance improvements and industry‑leading cost‑performance for AWS customers deploying AI at scale.

Amazon is an equal opportunities employer. We believe passionately that employing a diverse workforce is central to our success. We make recruiting decisions based on your experience and skills. We value your passion to discover, invent, simplify and build.

m/f/d

Seniority Level

Director

Employment Type

Full‑time

Job Function

Information Technology, Consulting, and Engineering

Industries

IT Services and IT Consulting

Job Information

  • Type: Full‑time
  • Work model: On‑site
  • Category:
  • Experience: 2+ years
  • Employment relationship: Employed
  • Publication date: 18 Nov 2025
  • Location: