Site Reliability Engineer (m/f/d)

Ververica | Original creators of Apache Flink®, Munich

Job description:

Ververica, founded by the original creators of Apache Flink, empowers businesses to unlock the full potential of real-time data processing and analytics. Our platform provides cutting-edge stream processing and event-driven applications, enabling companies worldwide to build scalable and reliable data-driven solutions.

Role Overview

As a Site Reliability Engineer (SRE) at Ververica, you will design, provision, and maintain the infrastructure for Ververica's Unified Streaming Data Platform across multiple cloud providers, including AWS, GCP, and Azure. You will collaborate with software engineering teams to develop solutions that enhance feature delivery, optimize performance, and address security vulnerabilities. Your role will involve architectural improvements, implementation ownership, and driving reliability best practices.

Key Responsibilities

- Build and maintain the infrastructure for Ververica's Unified Streaming Data Platform across AWS, GCP, and Azure
- Design and manage Infrastructure as Code (IaC) using Terraform, ensuring modularity, reusability, and best practices
- Implement and enhance observability tooling, including Grafana, Prometheus, logging systems, traces, metrics, dashboards, and alerts
- Ensure system reliability through SRE best practices, including defining SLIs, SLOs, and error budgets
- Improve infrastructure architecture and engineering efficiency through continuous evaluation and optimization
- Enhance CI/CD pipelines to automate development workflows
- Monitor, identify, and resolve security vulnerabilities (CVE updates and security enhancements)
- Contribute to the successful development and launch of new products, features, and services
- Periodically participate in on-call rotations to manage incidents in a 24/7 live infrastructure
- Maintain and update documentation, including architectural designs and changes


Requirements

- Bachelor's degree in Computer Science, Information Technology, or a related field
- Minimum 2 years of hands-on experience with Kubernetes clusters, Helm charts, controllers, and operators
- Proficiency in designing and maintaining Terraform code with best practices
- Strong knowledge of observability tools and practices, including metrics, logging, and alerting systems
- Experience implementing SRE principles such as SLIs, SLOs, and error budgets
- Solid understanding of Linux systems and networking in cloud environments
- Hands-on experience managing multiple Kubernetes clusters
- Familiarity with distributed systems or streaming data platforms
- Knowledge of cloud-native security best practices

Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Other
Industries: IT Services and IT Consulting


NOTE / HINWEIS: