Company Overview
Procurement Sciences AI (PSci.AI) is at the forefront of generative artificial intelligence for the government contracting sector. We are a fast-growing, Series A, venture-backed B2B SaaS company, proudly backed by Battery Ventures, a top 1% global technology venture capital firm. We are dedicated to revolutionizing federal, state, and local business development with disruptive AI capabilities. Our “Win More Bids” platform delivers operational efficiencies for our clients and drives new revenue streams. By harnessing generative AI tailored to the government contracting domain, we give our customers a unique competitive advantage and redefine what is possible for them.
Job Title: Site Reliability Engineer (SRE)
Location: Washington, DC metro area; Salt Lake City, UT; or Remote
Job Description
We are seeking an experienced and driven Site Reliability Engineer (SRE) to help ensure the reliability, performance, and scalability of our cloud-based AI solutions. The ideal candidate has a track record of diagnosing root causes, building automation, optimizing observability, and managing reliability in complex SaaS environments. Experience with Kubernetes, Helm, modern observability platforms, and major public cloud providers (Azure, AWS, Google Cloud Platform) is key. You will play a central role in defining and monitoring key reliability metrics, strengthening operational excellence, and championing DevOps culture across our rapidly growing organization.
Key Responsibilities
Identify and resolve system and application issues through in-depth root cause analysis, working closely with development teams and stakeholders.
Design, develop, and implement comprehensive automated testing to ensure ongoing system reliability and performance.
Build and maintain robust observability and monitoring solutions using Datadog, Prometheus, Grafana, ELK Stack, or similar platforms.
Define and monitor service level indicators (SLIs), service level objectives (SLOs), and service level agreements (SLAs) across services to meet operational commitments and improve reliability.
Collaborate with developers and operations staff to enhance system reliability, deployment agility, and overall developer experience (DevEx).
Develop and continuously improve monitoring and alerting systems to proactively address potential issues.
Lead and implement best practices for incident management, disaster recovery, and business continuity.
Manage high-impact incident response, facilitate post-mortem analyses, and drive remediation to prevent future occurrences.
Plan for capacity upgrades and scaling to support company growth and system performance requirements.
Automate operational tasks and infrastructure management using Infrastructure as Code (IaC) tools and related technologies.
Ensure all systems and processes comply with security, privacy, and regulatory requirements relevant to GovCon customers.
Continually assess and drive improvements in system architecture, operational processes, and documentation for systems and incidents.
Technical Requirements
Proficient in Kubernetes, Helm, and troubleshooting in secure and regulated environments.
Deep experience with observability and monitoring tools such as Prometheus, Grafana, ELK Stack, Datadog, or similar.
Hands-on expertise with major public cloud providers: Azure, Azure Gov, AWS, AWS GovCloud, and Google Cloud Platform (GCP).
Strong grasp of microservices architecture, cloud-native technologies, Postgres, and AI/ML systems.
Expertise in automated testing frameworks and practices (integration, synthetic, load testing, etc.).
Proficiency in tracking and analyzing reliability metrics (SLIs, SLAs, SLOs).
Excellent problem-solving skills and attention to detail, with the ability to operate independently and collaboratively.
Strong programming skills in TypeScript and Python.
Solid scripting abilities in Bash, PowerShell, or similar languages.
Demonstrated experience with Infrastructure as Code (IaC) tools such as Azure Bicep, AWS CDK, or Terraform.
Solid understanding of core networking principles, with advanced troubleshooting skills.
Effective communicator, able to work with both technical and business personnel.
Preferred Qualifications
Experience in the GovCon sector and/or holding a security clearance.
Familiarity with GitOps principles and tools; experience with FluxCD is a plus.
Proven experience in designing, building, and maintaining CI/CD pipelines.
Experience managing reliability in multi-cloud or hybrid cloud environments.
Knowledge of security and compliance standards applicable to government SaaS and cloud systems.
Previous success operating in dynamic, high-growth SaaS companies.
Demonstrated expertise in operationalizing new development workloads across cross-functional teams.
Compensation and Benefits
Competitive salary and performance-based incentives, including stock options.
Comprehensive health plan for employees and their families.
Flexible remote-first work environment, with options to work from the DC metro area or Salt Lake City, UT.
Wide-ranging opportunities for professional development, technical advancement, and career growth.
To Apply:
Please submit your resume and a brief cover letter describing your experience with cloud-native reliability, Kubernetes, and observability in complex SaaS environments.
Notice: Background Check Required
As part of our employment process, a background check (including, but not limited to, credit history, criminal records, and employment verification) may be required in compliance with applicable law. By applying, you acknowledge and consent to this process.
Procurement Sciences AI is proud to be an equal opportunity employer committed to diversity and inclusion at all levels of the organization.
Type: Full-time
Work model: Remote
Category: Development & IT
Experience: Experienced
Employment relationship: Employee
Publication date: 17 Aug 2025