W3Global

Platform Engineer (DevOps)

Job Description:

  • Design, deploy, and maintain Kubeflow (or equivalent) for pipeline orchestration, model training, evaluation, and serving on large image datasets; ensure reliability, security, and cost efficiency.
  • Manage and tune Kubernetes clusters (EKS/GKE/AKS), set up namespaces, RBAC, autoscaling, network policies, and service meshes where appropriate; keep upgrades and operations predictable.
  • Define infrastructure-as-code with Terraform; implement repeatable environment provisioning, configuration management, and golden paths for teams.
  • Establish CI/CD workflows (GitHub Actions/Jenkins/GitLab CI), build/test standards, and progressive delivery patterns that keep releases fast and low-risk.
  • Implement logging, metrics, and tracing (e.g., Prometheus, Grafana, CloudWatch, Splunk/New Relic) with actionable SLOs, alerts, and runbooks; embed security and compliance by design.
  • Collaborate closely with product and science teams to remove bottlenecks, eliminate manual steps, and evolve service and data interfaces that make operating image pipelines simple and reliable.
  • Contribute to future-state architectures that improve scalability, resiliency, and operational efficiency; lead targeted refactors and platform improvements.
  • Manage core automation and tooling, and educate teams on platform capabilities, CI/CD, configuration management, and infrastructure automation best practices.
Required (Must-have)
  • M.Sc. in Computer Science/Engineering (or equivalent) or comparable industry experience.
  • Practical, production experience operating Kubeflow Pipelines for reproducible ML workflows at scale.
  • Proven experience deploying and operating workloads on Kubernetes (EKS/GKE/AKS), including upgrades, autoscaling, RBAC, networking, and reliability; strong Unix/Linux fundamentals.
  • Hands-on experience with AWS services (EKS, EC2, S3, IAM, CloudWatch; RDS a plus) and the ability to design secure, cost-aware architectures.
  • Strong Terraform skills and Git-based workflows for repeatable infrastructure provisioning and configuration management.
  • Practical experience with CI/CD platforms (GitHub Actions/Jenkins/GitLab CI), including artifact management, environment promotion, and progressive delivery.
  • Solid Python and/or shell scripting for platform automation and toil reduction.
  • Experience implementing logging, metrics, and tracing with SLOs, alerts, and runbooks (e.g., Prometheus, Grafana, CloudWatch, Splunk/New Relic) and a security-first mindset.
  • Ability to lead technical initiatives, communicate trade-offs clearly, and collaborate effectively with engineering and science teams.
Desirable (Nice-to-have)
  • Experience with MLflow, Feast, Argo, Airflow, Ray, and model versioning/monitoring.
  • Familiarity with S3/object storage, artifact registries, and handling large image datasets; basic SQL/NoSQL exposure.
  • Experience with digital pathology or large-scale image processing (e.g., whole-slide images) and tools like OpenSlide, scikit-image, or OpenCV.
  • Experience tuning high-throughput pipelines, concurrency, memory usage, and integrating GPUs/accelerators.
  • Experience with VPC design, ingress/egress, service meshes, secrets management, IAM, and policy as code.
  • Experience in regulated environments (e.g., GxP), including data governance, privacy, and building software under regulated processes.
  • Experience with Jira/Zendesk and with JavaScript/TypeScript for internal tools or dashboards.
NOTE: Please mention Fuchsjobs as the source of your application.

Job Information

  • Date posted: 27 Mar 2026
  • Location: München
  • Type: Full-time
  • Work model: On-site
  • Category:
  • Experience: 2+ years
  • Employment type: Employee
