Principal Engineer - Systems for ML Inference and Training Optimization, Deep Science for Syste[...]

Amazon Web Services (AWS) Location Not Available

Job Description:

Principal Engineer – Systems for ML Inference and Training Optimization, Deep Science for Systems and Services

Join Amazon Web Services (AWS) as a Principal Engineer focused on ML systems and on training and inference optimization. This role leads the technical strategy and implementation work behind next‑generation AI performance at scale, spanning distributed training and high‑throughput inference serving.

Overview

You will set the technical direction for kernel‑level optimizations, define architectural strategies for heterogeneous compute platforms, architect multi‑GPU and multi‑node training systems, and deliver solutions that fundamentally change how AWS serves ML training and inference workloads.

Key Responsibilities

Define and drive the technical strategy and architectural roadmap for ML inference and training optimization across teams.
Lead the design and architecture of kernel‑level optimizations spanning NVIDIA GPUs, AWS Inferentia/Trainium, and emerging AI accelerators.
Tackle the most difficult performance challenges, achieving order‑of‑magnitude improvements (10× or greater).
Drive the design, implementation, and delivery of performance solutions at program level that address large or endemic customer and business problems.
Develop a deep understanding of new SoCs, GPUs, and AI accelerators; influence hardware selection decisions through fact‑based analysis.
Set the standard for engineering excellence; create mechanisms, tools, and processes that enable performance measurement, analysis, and optimization at scale across teams.
Guide the career growth of senior engineers; mentor and develop the next generation of performance engineering leaders.
Remain a practitioner – personally writing critical‑path code, designing zero‑overhead portable libraries, and prototyping solutions that inform technical direction for the organization.

Qualifications

10+ years of software development experience with demonstrated progression in technical leadership and impact.
Expert‑level proficiency in C/C++ and low‑level systems programming, with a proven track record of delivering order‑of‑magnitude performance improvements in production systems.
Extensive experience with CUDA programming, GPU architecture, assembly‑level optimization, and kernel development across multiple hardware platforms.
Demonstrated ability to lead organization‑level technical initiatives spanning multiple teams, building consensus on contentious technical decisions and driving architectural strategy.
Experience defining technical roadmaps, conducting performance analysis and resource budgeting, and translating system analysis into strategic development plans.
Master’s degree (or higher) in Computer Science, Computer Engineering, or related technical field with 15+ years of performance engineering experience (preferred).
Experience optimizing ML inference and/or training workloads across diverse hardware platforms (GPUs, AWS Neuron/Inferentia, and other accelerators).
Deep expertise across multiple hardware architectures and platforms (x86, ARM, multiple GPU generations, SoCs, custom accelerators), with the ability to quickly master new hardware platforms.
Track record of developing portable, high‑performance libraries, tools, or frameworks used across engineering organizations or open‑source projects with significant adoption.
Proven ability to develop a deep understanding of complex systems and create performance measurement and analysis tools that provide critical insights for organization‑wide use.
Entrepreneurial experience including startup founding, CTO role, or driving technical vision in product development environments (preferred).

About the Team

Deep Science for Systems and Services (DS3) is a science organization within AWS Compute & ML Services focused on advancing AI/ML technologies at the systems level. Our team develops optimizations for large model inference across diverse hardware platforms, delivering order‑of‑magnitude performance improvements and industry‑leading cost‑performance for AWS customers deploying AI at scale.

Amazon is an equal opportunities employer. We believe passionately that employing a diverse workforce is central to our success. We make recruiting decisions based on your experience and skills. We value your passion to discover, invent, simplify and build.

m/f/d

Seniority Level

Director

Employment Type

Full‑time

Job Function

Information Technology, Consulting, and Engineering

Industries

IT Services and IT Consulting

Job Information

Type: Full‑time
Work model: On‑site
Category:
Experience: 2+ years
Employment status: Employee
Publication date: 18 Nov 2025
Location:
