Why Cloud Infrastructure at Langfuse
Your work will keep Langfuse running — everywhere.
Langfuse processes over a billion trace events per month. When a Fortune 50 company relies on Langfuse in production, they’re relying on the infrastructure you operate. You’ll own uptime, performance, and cost efficiency across our entire cloud footprint — and you’ll make sure every self-hosted deployment runs just as smoothly.
You’ll operate Langfuse Cloud on AWS ECS Fargate and ClickHouse Cloud, with Datadog as the observability backbone. You’ll also own our public self-hosted infrastructure — including our Helm chart, Docker Compose setup, and everything in between — so that teams from startups to enterprises can run Langfuse on their own terms.
This isn’t a “maintain what exists” role. We’re scaling fast, and you’ll be the person who makes sure the infrastructure grows ahead of demand — not behind it.
Langfuse is now part of ClickHouse, which means the team behind the database at the core of our stack is one channel away. Few infrastructure roles give you that kind of direct access to the people who build your most critical dependency.
You will grow at Langfuse as you
- Own Langfuse Cloud operations: You’ll run our production environments on AWS ECS Fargate and ClickHouse Cloud. You’ll manage deployments, autoscaling, capacity planning, and cost optimization — making sure we stay fast and affordable as traffic scales.
- Build world‑class observability: You’ll own our Datadog setup end to end — dashboards, alerts, and SLOs. When something degrades, you’ll ensure we know before our customers do. You’ll build the monitoring culture that lets the whole team ship with confidence.
- Make self‑hosting effortless: Thousands of teams run Langfuse on their own infrastructure. You’ll own and evolve our Helm chart, Docker Compose configuration, and deployment documentation. You’ll turn “works on my machine” into “works on every machine” — from a single‑node setup to a multi‑region enterprise deployment.
- Automate everything: CI/CD pipelines, infrastructure‑as‑code, automated scaling, zero‑downtime deployments. You’ll replace manual processes with automation that makes the team faster and the platform more reliable.
- Scale for what’s next: We’re growing fast, and new product directions such as observability for complex, long‑running agents and real‑time evaluation push the infrastructure in new ways. You’ll think ahead about what breaks at 10x scale and build the foundation before we get there. At Langfuse, 10x is always just one quarter away.
- Harden security and compliance: As more enterprises adopt Langfuse, you’ll help ensure our cloud and self‑hosted deployments meet the security and compliance bar that large organizations require.
What we’re looking for
- Strong infrastructure or SRE engineer who gets excited about running systems at scale and making them better every day
- Experience operating production workloads on AWS (ECS/Fargate, networking, IAM, S3, etc.) or on a comparable cloud provider
- Comfortable with container orchestration — Kubernetes and/or ECS, Helm charts, Docker
- Experience with infrastructure‑as‑code (Terraform, Pulumi, CloudFormation, or similar)
- Strong monitoring and observability instincts — you’ve built dashboards and alerts that actually caught problems (Datadog experience is a plus)
- You’re self‑organized, with strong opinions about reliability, automation, and how to ship infrastructure changes safely
- Interest in open source software and genuine enjoyment of helping users debug their self‑hosted deployments
- You thrive in a small, accountable team where your output is visible and matters
- CS or quantitative degree preferred
Bonus points
- Experience with ClickHouse Cloud or other managed analytical databases
- Background in operating high‑throughput event processing or observability infrastructure
- Contributions to open source infrastructure tooling (Helm charts, Terraform modules, etc.)
- Former founder
Process
We can run the full process to your offer letter in less than 7 days.
Tech stack
We run a TypeScript monorepo: Next.js on the frontend, Express workers for background jobs, PostgreSQL for transactional data, ClickHouse for tracing at scale, S3 for file storage, and Redis for queues and caching. You should be familiar with a good chunk of this, but we trust you’ll pick up the rest quickly.
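For a self‑hosted single‑node setup, that stack maps naturally onto a Compose topology. The sketch below is purely illustrative — service names, images, and wiring are assumptions for this posting, not our actual Compose file (MinIO stands in for S3 locally):

```yaml
# Illustrative sketch only — not the official Langfuse Compose file.
services:
  web:                       # Next.js frontend and API
    image: example/web       # hypothetical image name
    ports: ["3000:3000"]
    depends_on: [postgres, clickhouse, redis, minio]
  worker:                    # Express background workers
    image: example/worker    # hypothetical image name
    depends_on: [postgres, clickhouse, redis]
  postgres:                  # transactional data
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: postgres
  clickhouse:                # tracing at scale
    image: clickhouse/clickhouse-server:24.3
  redis:                     # queues and caching
    image: redis:7
  minio:                     # S3-compatible file storage for local dev
    image: minio/minio
    command: server /data
```

Part of the role is evolving the real version of this topology — and its Helm equivalent — so it works from a laptop to a multi‑region enterprise deployment.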
How we ship
- We trust you to take ownership of your area. You identify what to build, propose solutions (RFCs), and ship them. Everyone here thinks about the user experience and the technical implementation at the same time. Everyone manages their own Linear.
- You're never alone. Anyone on the team is happy to jump into a whiteboard session with you; 15 minutes of shared discussion can meaningfully improve the output.
- We protect maker schedule with lightweight communication. There are two recurring meetings a week: a 15‑minute Monday check‑in on priorities and a 60‑minute demo session on Fridays.
- Code reviews are mentorship. New joiners get all PRs reviewed to learn the codebase, patterns, and how the systems work.
- We use AI as much as possible in our workflows to make our users happy. We encourage everyone to experiment with new tooling and AI workflows.