Site Reliability Engineer

Remote · USA · Full-time

Basic Qualifications

Bachelor's degree in Software Engineering, or related Science, Technology, Engineering or Mathematics field, plus a minimum of 8 years of relevant experience; or Master's degree, plus 6 years relevant experience.

Responsibilities for this Position

What You'll Own

SLOs and reliability metrics. Define service level objectives for every AI service that goes to production. Establish error budgets and use them to drive engineering decisions, not just to measure uptime.

Monitoring and observability. Build and maintain monitoring, logging, and alerting infrastructure for AI services. You will know when something is degrading before users do.

Incident response. Establish incident management procedures, lead post-incident reviews, and drive corrective actions. When something breaks, you coordinate the response and ensure it doesn't break the same way again.

Operational readiness reviews. Before any AI service goes live, you validate that it meets reliability, security, and operational standards. You are the gate between "it works in dev" and "it's ready for production."

Capacity planning and cost monitoring. Track resource consumption, forecast capacity needs, and monitor costs: tokens, compute, storage. You ensure the platform scales without surprises.

Toil elimination. Identify and automate repetitive operational tasks. If a human is doing something a script could do, you fix that.

What You Won't Own

Application development or AI model building: you ensure what they build is operable, you don't build it

Infrastructure provisioning: IT provides the infrastructure; you define what's needed and validate that it works

Business process decisions or backlog prioritization

What Makes This Role Different

AI services have failure modes that traditional applications don't: model drift, token budget exhaustion, prompt injection, upstream data quality degradation. You will build monitoring for problems that most SRE teams have never encountered.

You are applying SRE principles from scratch. There is no existing SRE practice to inherit; you will define it for the platform.

Your operational readiness reviews directly determine whether AI services go live. You have real authority to say "not ready."

Required Qualifications

Bachelor’s degree in Computer Science, Software Engineering, or a related field, plus 5 years of experience; or Master’s degree plus 3 years of experience

Production SRE or DevOps experience: you have owned the reliability of systems that real users depended on, not just built CI/CD pipelines

Hands-on experience with monitoring and observability tools: Prometheus, Grafana, Datadog, ELK, CloudWatch, or similar. You have built dashboards and alerts that caught real problems.

Strong scripting and automation skills: Python, Bash, infrastructure-as-code (Terraform, CloudFormation, or similar)

Experience with containerized environments: Docker, Kubernetes, container orchestration at scale

Experience defining and managing SLOs, error budgets, and incident response procedures in production

U.S. citizenship required. Department of Defense Secret security clearance is required at time of hire.

Preferred Qualifications

Experience with AI/ML production systems: model serving, inference monitoring, token cost tracking, or similar

Multi-cloud experience (AWS, Azure, GCP) including cloud-native monitoring and logging services

Experience building operational readiness review processes or production launch checklists

Familiarity with Google SRE principles: you have read the book and applied the concepts, not just referenced them in interviews

Experience in environments where reliability has compliance or safety implications, such as defense, healthcare, finance, or critical infrastructure

What Sets You Apart

You think about failure before you think about features. Your first question about any new system is "how does this break?"

You automate yourself out of toil. If you're doing the same thing twice, you write a script.

You have said "not ready" to a team that wanted to ship, and you were right.

You build monitoring that tells you what's wrong, not just that something is wrong.

You write post-incident reviews that actually change how systems are built, not just how incidents are documented.

Details

Remote: 100% telework

9/80 schedule

Defense industry experience is not required

Salary Note

This estimate represents the typical salary range for this position based on experience and other factors (geographic location, etc.). Actual pay may vary. This job posting will remain open until the position is filled.

Combined Salary Range

USD $142,696.00 - USD $158,303.00 /Yr.

Company Overview

General Dynamics Mission Systems (GDMS) engineers a diverse portfolio of high technology solutions, products and services that enable customers to successfully execute missions across all domains of operation. With a global team of 12,000+ top professionals, we partner with the best in industry to expand the bounds of innovation in the defense and scientific arenas. Given the nature of our work and who we are, we value trust, honesty, alignment and transparency.

We offer highly competitive benefits and pride ourselves on being a great place to work with a shared sense of purpose. You will also enjoy a flexible work environment where contributions are recognized and rewarded. If who we are and what we do resonates with you, we invite you to join our high-performance team!

Equal Opportunity Employer / Individuals with Disabilities / Protected Veterans

