Operations Expertise: Storage (All Genders)

TEKsystems, Frankfurt

Job Description:

Job Title

Location

Frankfurt am Main, Germany

Job Type

Contract

Description

Local Operations manages the on‑premises production platform, which serves as the primary host for all mission‑critical business applications. The team is responsible for the following core areas:

Platform Stability: ensuring high availability and performance of the on‑premises private cloud environment.
Application Hosting: consulting on the seamless operation of Germany‑specific productive business applications.
Incident Management: resolving technical issues within standard business hours to minimize operational downtime.
Lifecycle Maintenance: executing routine updates, patches, and system optimizations within the local infrastructure.

Objective

Provide Tier‑3 operational ownership for Storage Products for Local Production (DE).

Scope of Work

Ensuring operational readiness for storage changes: monitoring and alerting coverage, performance baselines, hardening, patch strategy, rollback and recovery procedures, and runbooks.
Executing and improving standard operational procedures through automation (reduce toil, improve MTTR and stability).
Automating standard operational tasks (capacity checks, validation procedures, provisioning workflows where applicable).

Support and Operational Readiness

Validate deployment artifacts from an operations perspective.
Define and enforce quality assurance measures, such as required documentation of standard operating procedures and passing test reports, to ensure the quality of delivered products and services.
Ensure rollback strategies and operational monitoring (observability) are in place for production deployments.

Monitoring, Incident, Problem and Change Management

Monitor system health, performance metrics, and service availability across multi‑tenant environments.
Identify, analyze, and resolve incidents to minimize service disruption.
Trigger root cause analysis and implement corrective and preventive actions.

Automation of Operations‑Critical Standard Processes

Address recurring operational issues by automating standard remediation procedures.
Validate all automated procedures following the established software development lifecycle, including staging, testing, and validation reviews.

Security and Compliance Enforcement

Implement monitoring and logging strategies to support audit and compliance requirements.
Perform routine security scans and remediate identified vulnerabilities.

Qualifications

Must-have experience

5+ years in IT storage operations / service delivery / platform operations with demonstrated leadership in mission‑critical environments.
Proven experience implementing or leading Incident, Problem, Change, and Release governance in production.
Experience supporting platform workloads that rely on shared storage services.
Expertise with storage types: File, Block, and Object Storage.
Expertise with protocols/services: NFS; object storage operations (S3‑like concepts).
Experience with Kubernetes storage integration: CSI driver concepts and troubleshooting (PV/PVC lifecycle understanding).
Virtualization (storage): experience operating storage virtualization in enterprise environments.
Expertise with ITSM tooling: Jira Service Management, Jira, Confluence.
Fundamental understanding of core operations processes (incident, change, problem management, ITSM) as well as SRE concepts.
Experience gathering operational insights from monitoring or observability tooling, including SLI/SLA/SLO management and tracking.
Hands‑on experience documenting procedures and enforcing clear runbooks or playbooks.
Hands‑on observability experience with monitoring and logging tools (Prometheus, Grafana, Datadog, Mimir, Loki).
Familiarity with enterprise DevOps toolchains is a plus (GitLab, JFrog Artifactory, Backstage, Harness).
Strong understanding of modern platform operations (Kubernetes/containers, automation, observability), sufficient to govern specialists.
Platform delivery concepts: GitOps and IaC awareness (Terraform/OpenTofu, ArgoCD, Helm) to govern deployment/readiness standards.

Preferred experience

Experience operating in regulated / high‑availability industries (banking, public sector, healthcare).
Experience with SRE practices (SLOs/SLIs, error budgets) and reliability management.
Experience operating storage services that integrate with Kubernetes platforms.
Familiarity with IaC‑based provisioning and GitOps‑driven operational patterns.