Team: Product & Engineering • Reports to the CTO
Location: Hybrid - Cologne (Rheinauhafen) - 3 days in office, 2 days remote (Tue and Thu)
Shape the future of autonomous incident response
We’re on a mission to make downtime invisible. Thousands of DevOps and SRE teams rely on ilert to detect, resolve, and communicate incidents faster.
As our first AI Product Engineer, you’ll build the core of ilert’s AI-first strategy: autonomous, tool-using agents that diagnose alerts, run root cause analysis, execute safe mitigations, and keep services healthy.
This is a hands-on role where you’ll turn operational expertise and product insight into real, reliable AI systems used in production.
Tasks
Design & Build AI Agents
- Design agent reasoning loops, prompts, and safety constraints.
- Build multi-step tool-using agents (logs, metrics, traces, k8s, Git, CI/CD, cloud APIs).
- Implement autonomy flows: investigation → analysis → mitigation → validation.
Ship Product Features
- Work with product and engineering to build AI-backed features that solve real customer problems.
- Translate complex SRE workflows into intuitive user experiences powered by AI.
- Own features end-to-end (design → prototype → implementation → rollout).
Integrate with Observability & Ops Tooling
- Connect LLM agents to Grafana, Prometheus, Kubernetes, GitHub, CI/CD, cloud services, etc.
- Design safe tool schemas and APIs for autonomous execution.
Ensure Reliability, Safety & Determinism
- Build guardrails for safe, reversible mitigations.
- Validate model output with structured schemas (e.g., Zod, JSON Schema).
- Establish evaluation suites, test harnesses, and monitoring for agent performance.
Collaborate Across Teams
- Work with SREs to encode operational expertise into agents.
- Partner with Product to shape requirements and roadmap decisions.
- Influence ilert’s broader AI strategy.
Requirements
Must-Have Skills
- Experience building AI-powered applications with LLMs (OpenAI, Anthropic, etc.)
- Strong prompt engineering & agent design skills
- Experience implementing multi-step tool-use flows
- Solid software engineering fundamentals (preferably Rust)
- Experience integrating with APIs, backend services, or automations
- Ability to reason about reliability, safety, and controlled automation
- Product mindset: able to turn ambiguous problems into shippable solutions
Nice-to-Have Skills
- Background in SRE, DevOps, or incident response
- Experience with observability tools (Grafana, Prometheus, Elastic, Datadog, New Relic)
- Hands-on Kubernetes knowledge
- Experience with production agent frameworks and patterns (ReAct, LangChain, LangGraph, custom state machines)
Soft Skills
- You love building real products, not demos
- Strong communication & critical thinking
- Comfortable working with high autonomy and ownership
- Passion for reliability, automation, and removing toil
Benefits
- Build one of the first real autonomous SRE agents in the industry
- Product-centric culture: Be part of a team that’s 100% committed to solving a critical issue for businesses that offer round-the-clock services.
- Hybrid Work Environment: Enjoy the best of both worlds with in-person collaboration and remote work flexibility.
- No Meetings #hackfwd: Maximize productivity by keeping meetings to a minimum and focusing on your core responsibilities.
- High-impact, high-ownership role: your work ships to customers quickly
- Small, senior team with fast decision-making
- Modern tech stack + strong engineering culture
- Direct involvement in shaping the future of on-call and incident response
- Founder-led startup
Please include one link (GitHub, repo, notebook, or demo) that best showcases your experience building AI-powered or agentic systems.