Simon-Kucher

Senior AI Platform Engineer (Multi-tenant SaaS & MLOps) (m/f/d)

Job description:


In Germany - Berlin | Bonn | Cologne | Frankfurt/Main | Hamburg | Munich

We are seeking an experienced AI Platform Engineer to contribute to designing and building scalable SaaS products within our AI Lab. In this role, you’ll combine deep technical expertise with strategic vision to build AI-powered products that will help transform our clients’ business models and enable their growth.

Simon-Kucher is at the forefront of innovation in driving commercial excellence: revamping business models and developing solutions and methodologies that unlock better growth for our clients. Within the AI Lab, we are developing cutting‑edge, large‑scale AI products to deliver sustained top‑line impact for our clients.

Are you interested in working in a team of AI evangelists with a can‑do attitude? Want to experience the dynamics of agile processes in open‑minded teams? How about getting creative in a startup atmosphere with a steep development curve and flat hierarchies? And most importantly, do you want to make a difference? Then you've come to the right place.

What makes us special:

  • Advance your career with exciting professional opportunities in our thriving company with a startup feel
  • Innovate by transforming ideas into cutting‑edge AI products, championing AI and Generative AI through creative experimentation to push boundaries and deliver transformative solutions
  • Voice your unique ideas in a culture defined by our entrepreneurial spirit, openness, and integrity
  • Feel at home working with our helpful, enthusiastic colleagues who have great team spirit
  • Broaden your perspective with our extensive training curriculum and learning programs (e.g. LinkedIn Learning)
  • Speak your mind in our holistic feedback and development processes (e.g. 360‑degree feedback)
  • Satisfy your need for adventure with our opportunities to live and work abroad in one of our many international offices
  • Enjoy our benefits, such as hybrid working, daycare allowance, corporate discounts, and wellbeing support (e.g. Headspace)
  • Unwind in our break areas where you can help yourself to the healthy snacks and beverages provided
  • See another side of your coworkers at our frequent employee events, World Meetings and Holiday Parties

How you will create an impact:

  • Design and evolve a multi‑tenant SaaS architecture, including tenant isolation for data, compute, and observability.
  • Build automated tenant provisioning/onboarding, configuration, and safe rollouts (canary/feature flags) across tenants.
  • Implement noisy‑neighbor protection (per‑tenant quotas, rate limits, priority scheduling) and per‑tenant SLO monitoring.
  • Partner with security/compliance to deliver enterprise controls (audit logs, tenant‑aware access control, retention).
  • Develop and maintain robust data architectures that support high‑volume, high‑throughput SaaS applications, focusing on reliability and scalability.
  • Drive faster and more reliable ML delivery by building robust MLOps foundations, including automated training pipelines, experiment tracking, and scalable model deployment.
  • Accelerate AI product development by operationalizing LLMs end‑to‑end — from fine‑tuning and evaluation to high‑performance serving, monitoring, and embeddings workflows.
  • Increase engineering velocity and system reliability by developing and maintaining unified CI/CD pipelines that ship ML and application code seamlessly.
  • Enable scalable and cost‑efficient AI workloads through well‑architected cloud infrastructure across AWS.
  • Improve performance and resilience of AI systems by managing Kubernetes clusters, optimizing autoscaling, and orchestrating GPU‑heavy workloads.
  • Enhance inference speed and portability by delivering highly optimized, secure Docker‑based containers tailored for ML and LLM workloads.
  • Strengthen data quality and model performance through well‑designed ETL/ELT pipelines, streaming systems, feature store integration, and workflow orchestration.
  • Ensure reliable and trustworthy AI operations by implementing comprehensive observability: logs, metrics, traces, and model/data drift detection.
  • Reduce operational risk by embedding security and compliance best practices — IAM, RBAC, VPC design, secrets management, and encryption — into every layer of the stack.
  • Increase automation, reduce manual toil, and support rapid experimentation by leveraging Python, Bash, and Terraform to script, codify, and automate infrastructure and ML workflows.
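The noisy‑neighbor protection described above (per‑tenant quotas and rate limits) can be sketched with a per‑tenant token bucket. This is a minimal illustration, not Simon-Kucher's implementation; the class names and parameters are hypothetical, and a production version would live behind an API gateway or use a shared store such as Redis:

```python
import time
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    """Token bucket: refills at `rate` tokens/second, bursts up to `capacity`."""
    rate: float
    capacity: float
    tokens: float = 0.0
    last: float = field(default_factory=time.monotonic)

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantRateLimiter:
    """One bucket per tenant ID, so a single noisy tenant cannot starve the rest."""

    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.buckets: dict[str, TokenBucket] = {}

    def allow(self, tenant_id: str) -> bool:
        bucket = self.buckets.setdefault(
            tenant_id, TokenBucket(self.rate, self.capacity, tokens=self.capacity)
        )
        return bucket.allow()
```

Each tenant starts with a full bucket; once a tenant exhausts its burst, only that tenant is throttled while others keep their own budgets.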

About you:

  • You have shipped and operated customer‑facing SaaS products at scale, with hands‑on experience running multi‑tenant systems with tenant isolation, per‑tenant controls, and enterprise security expectations.
  • You have previously owned end‑to‑end ML/AI infrastructure — from data ingestion and feature pipelines to training, deployment, and production monitoring.
  • You enable engineers and data scientists to move faster by building self‑service platforms, stable environments, and automated workflows that eliminate friction.
  • You have a track record of designing systems that scale globally across regions, workloads, and traffic patterns.
  • You’re comfortable participating in incident response and on‑call rotations, and you know how to stabilize and improve critical production systems.
  • You think with a product mindset, focusing on customer value, reliability, and speed‑to‑market rather than technology for its own sake.
  • You have a strong bias for automation — you eliminate manual operational toil by designing robust tooling and pipelines.
  • You have very strong communication and collaboration skills: you support other engineers, collaborate asynchronously, explain technical decisions to non‑technical audiences, write documentation, and show initiative.

Technical skills required:

  • Proven patterns for tenant isolation (DB‑per‑tenant vs schema‑per‑tenant vs row‑level security), plus tenant‑aware caching and noisy‑neighbor protection (rate limits, quotas, scheduling).
  • Experience with OIDC/OAuth2, tenant‑aware RBAC/ABAC, SCIM provisioning, and audit logging requirements for B2B SaaS.
  • Deep Kubernetes experience: cluster ops, HPA/VPA, node pools, GPU scheduling, cluster autoscaler/Karpenter, PDBs, network policies, and multi‑AZ design.
  • Service mesh (Istio/Linkerd) and ingress patterns (ALB/Nginx), plus secure egress and mTLS (where applicable).
  • Infrastructure as Code beyond Terraform basics: Terraform modules, Terragrunt, policy‑as‑code (OPA/Conftest), and secrets automation.
  • GitOps (ArgoCD/Flux) and progressive delivery (Argo Rollouts/Flagger), feature flags, canaries and blue/green.
  • Model lifecycle tooling: MLflow/W&B, model registry, experiment tracking, reproducible training, and dataset versioning (DVC/lakeFS).
  • Pipeline orchestration (Airflow/Prefect/Dagster) + artifact stores.
  • Model serving patterns: online serving (KServe/Seldon/BentoML/Ray Serve), async/batch inference, autoscaling, and rollback strategies.
  • Experience with prompt/version management, offline + online evaluation harnesses, RAG evaluation (retrieval metrics, groundedness), guardrails, and red‑teaming basics.
  • Handling streaming inference (SSE/WebSockets), caching, routing, and fallback models.
  • Vector DB experience (pgvector / Pinecone / Weaviate / Milvus) and embedding lifecycle (backfills, re‑embedding, indexing strategies).
  • OpenTelemetry, distributed tracing, and SLOs, with observability tooling such as Prometheus/Grafana, Loki/ELK, or Datadog/New Relic.
  • Incident management: postmortems, runbooks, and error budgets.
  • Requirements aligned to enterprise buyers: GDPR, encryption at rest/in transit, secrets management (AWS Secrets Manager/Vault), KMS, and key rotation.
  • SOC 2 / ISO 27001 familiarity, vulnerability scanning (Trivy/Grype), SBOMs, SAST/DAST, dependency management.
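The tenant‑aware RBAC mentioned in the list above can be illustrated with a small sketch. All names here (`Principal`, `ROLE_PERMISSIONS`, `is_allowed`) are hypothetical; a real system would source roles from an IdP via OIDC claims and enforce policy centrally, but the key shape is the same: the tenant boundary is checked before any role logic runs:

```python
from dataclasses import dataclass

# Role -> permission set; in production this would come from an IdP or policy store.
ROLE_PERMISSIONS: dict[str, set[str]] = {
    "viewer": {"read"},
    "editor": {"read", "write"},
    "admin": {"read", "write", "manage"},
}


@dataclass(frozen=True)
class Principal:
    user_id: str
    tenant_id: str
    role: str


def is_allowed(principal: Principal, tenant_id: str, action: str) -> bool:
    """Tenant-aware RBAC check: deny cross-tenant access outright,
    then check the role's permissions within the tenant."""
    if principal.tenant_id != tenant_id:
        return False  # hard tenant boundary, evaluated before any role logic
    return action in ROLE_PERMISSIONS.get(principal.role, set())
```

An editor in tenant "acme" can write acme resources but can read nothing in tenant "globex", regardless of role: cross‑tenant requests fail closed.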

NOTE: Please mention Fuchsjobs as the source of your application

Job information

  • Publication date: 17 Feb 2026
  • Location: WorkFromHome
  • Type: Full-time
  • Work model: On-site
  • Category:
  • Experience: 2+ years
  • Employment type: Permanent
