Big Data / Data Platform Site Reliability Engineer

Job description:

Overview

We’re partnering with a fast-growing, data-intensive technology organisation to hire a Site Reliability Engineer focused on large-scale data platforms. This role sits at the heart of a mission-critical data environment, with responsibility for reliability, scalability and operational excellence across complex distributed systems. This is a senior, hands-on role for an engineer who enjoys owning infrastructure, improving system behaviour over time and operating close to production in high-throughput environments.

The role

Deploy, configure, monitor and maintain multiple large-scale data stores across distributed environments. Reliability, performance and availability are core to the role, with a strong focus on lifecycle management of critical data infrastructure.
Manage and evolve large Linux-based systems, ensuring predictable performance and high uptime. Define and document configuration standards, operational procedures and best practices that support long-term stability.
Run performance and reliability tests, and review system configuration, software choices and hardware decisions to identify improvement opportunities. Actively participate in incident response and root cause analysis, and drive lasting reliability improvements across the platform.
Influence the direction of the technology stack by contributing ideas that improve resilience, observability and operational efficiency.

What we’re looking for

Strong hands-on experience operating large-scale Linux infrastructure in production environments. Comfortable owning complex systems and debugging issues across storage, compute and networking layers.
Deep, practical experience with Hadoop-based data platforms (HDFS architecture, security models and operational lifecycle management such as upgrades, scaling and recovery). Experience running Kafka clusters in production environments is also key.
Experience designing or improving automation and deployment workflows, with proficiency in Python or shell scripting. Solid understanding of networking fundamentals (TCP/IP, DNS, load balancing, basic network security).
Comfortable taking technical ownership, contributing to on-call and incident processes, and driving continuous reliability improvement.
The position operates on US East Coast working hours and is suitable for engineers working remotely.

Additional experience

Experience with large-scale analytical query engines, distributed storage systems or high-availability databases is beneficial. Familiarity with observability platforms, configuration management tools, containerisation and Kubernetes in production environments is valuable.
Experience mentoring others and helping establish operational standards is desirable.
