AI Cloud Platform Site Reliability Engineer

Booz Allen Hamilton

Location: San Antonio, Texas, USA

Posted 1 day ago

$99,000 - $225,000/year

Role Details

The Opportunity:

Mission users are increasingly relying on agentic AI systems to support complex workflows, accelerate analysis, and improve decision advantage. Unlike traditional software systems, agentic AI platforms introduce operational complexity across model invocations, workflow orchestration, tool integrations, retrieval and knowledge layers, safety controls, and probabilistic outputs. As an AI Platform Site Reliability Engineer (SRE), you’ll help ensure the availability, resiliency, observability, and operational integrity of an AWS GovCloud-based agentic AI platform supporting national defense missions. 

In this role, you’ll serve as the reliability owner for production AI operations. You’ll work cross-functionally with multiple stakeholders, including cloud engineering, platform engineering, AI agent development, MLOps, data science, and customer knowledge teams, to operationalize their work in production through monitoring, alerting, Service Level Indicator (SLI) and Service Level Objective (SLO) management, incident response, ticket triage, change control, and automation. You won’t be duplicating model development, data science, or cloud platform build responsibilities. Instead, you’ll ensure that the system, its agents, and their supporting services remain healthy, traceable, performant, and supportable in mission environments.

You’ll define and monitor operational health signals across agent workflows, model latency, session and task success, knowledge-base ingestion health, tool and API dependencies, guardrail or safety interventions, throttling, token usage, drift indicators, and service degradation patterns. You’ll help reduce operational toil by building dashboards, alarms, runbooks, and automated remediation workflows, while driving post-incident learning and continuous improvement.

How You’ll Contribute

  • Define, implement, and maintain service level indicators, service level objectives, error budgets, dashboards, alarms, and escalation paths for an agentic AI platform operating in AWS GovCloud.
  • Monitor end-to-end health and performance of agent workflows, model invocations, retrieval or knowledge integrations, orchestration steps, tool calls, and dependent services.
  • Triage incidents, alerts, and operational tickets. Lead root-cause analysis, coordinate recovery actions, and drive post-incident corrective actions that reduce mean time to recovery and prevent recurrence.
  • Build and maintain observability pipelines across metrics, logs, traces, audit telemetry, and operational events using AWS-native tooling and approved enterprise observability tooling.
  • Establish and tune operational thresholds for latency, availability, error rates, token and cost consumption, workflow success rates, tool failure rates, guardrail interventions, and drift-related signals.
  • Partner with platform engineers, cloud engineers, AI agent developers, MLOps engineers, data scientists, and customer SMEs to define ownership boundaries, handoffs, rollback criteria, release readiness gates, and operational support models.
  • Coordinate with MLOps and data science teams when model or data quality degradation, drift, or unexpected behavior requires rollback, retraining, prompt changes, knowledge-base updates, or other corrective actions.
  • Automate remediation and routine operational tasks using Python, shell scripting, infrastructure as code, and event-driven workflows to reduce manual toil.
  • Support secure and compliant operations in regulated national defense environments, including auditability, least-privilege access, controlled logging, and disciplined change management.
  • Work with limited direction, mentor junior team members, and help mature AI operations practices across the program.

Grow your skills at the leading edge of innovation.

Join us. The world can’t wait.

You Have:

  • 5+ years of experience supporting production distributed systems in roles such as SRE, Platform Engineering, Cloud Operations, or DevOps
  • Experience operating workloads on AWS including monitoring, alerting, logging, incident response, troubleshooting, IAM, networking, or secure operations
  • Experience supporting production AI/ML, generative AI, RAG, agentic AI, model‑serving, or data‑driven decision systems
  • Experience defining and operating SLIs, SLOs, error budgets, alert thresholds, runbooks, or operational readiness criteria
  • Experience with observability tooling across metrics, logs, traces, dashboards, or log analytics, including CloudWatch, OpenTelemetry, Prometheus, Grafana, OpenSearch, or ELK
  • Experience diagnosing issues across containers, orchestration platforms, or cloud runtimes, such as EKS, ECS, Lambda, or EC2
  • Experience with Python, Bash, or scripting languages to automate operational tasks, health checks, or remediation workflows
  • Experience participating in on‑call rotations, triaging ticket queues, and leading incident response or post‑incident review activities
  • Secret clearance
  • Bachelor’s degree

Nice If You Have:

  • Experience with Amazon Bedrock, Bedrock Agents, Guardrails, Knowledge Bases, model invocation logging, EventBridge, CloudTrail, and CloudWatch‑based monitoring for AI workloads or equivalent tooling for production agentic AI systems
  • Experience supporting AWS workloads in GovCloud, FedRAMP High, DoD SRG IL4/5, or other regulated or high‑assurance environments
  • Experience with automation and infrastructure as code using Terraform, CloudFormation, or AWS CDK
  • Experience with CI/CD release engineering, canary strategies, rollback controls, and change management for cloud services and AI‑enabled applications
  • Experience with Prometheus‑compatible monitoring, Grafana, OpenSearch/ELK, or other enterprise observability stacks in containerized environments
  • Experience supporting GPU‑backed inference, self‑hosted model serving, or hybrid AI deployments if the platform evolves beyond managed services
  • Ability to distinguish infrastructure issues from AI‑specific failure modes including workflow breakdowns, degraded retrieval, safety interventions, regressions, stale knowledge sources, and model or service throttling
  • Experience working in Agile and cross‑functional environments and collaborating with engineers, operators, mission stakeholders, and technical leadership
  • AWS Certified CloudOps Engineer - Associate, AWS Certified DevOps Engineer - Professional, AWS Certified Machine Learning Engineer - Associate, AWS Certified Generative AI Developer - Professional, AWS Certified Security - Specialty, or cloud and AI operations certifications
  • CompTIA Security+ or DoD 8570/8140 baseline certification

Clearance:

Applicants selected will be subject to a security investigation and may need to meet eligibility requirements for access to classified information; Secret clearance is required.

Compensation

At Booz Allen, we celebrate your contributions, provide you with opportunities and choices, and support your total well-being. Our offerings include health, life, disability, financial, and retirement benefits, as well as paid leave, professional development, tuition assistance, work-life programs, and dependent care. Our recognition awards program acknowledges employees for exceptional performance and superior demonstration of our values. Full-time and part-time employees working at least 20 hours a week on a regular basis are eligible to participate in Booz Allen’s benefit programs. Individuals that do not meet the threshold are only eligible for select offerings, not inclusive of health benefits. We encourage you to learn more about our total benefits by visiting the Resource page on our Careers site and reviewing Our Employee Benefits page.

Salary at Booz Allen is determined by various factors, including but not limited to location, the individual’s particular combination of education, knowledge, skills, competencies, and experience, as well as contract-specific affordability and organizational requirements. The projected compensation range for this position is $99,000.00 to $225,000.00 (annualized USD). The estimate displayed represents the typical salary range for this position and is just one component of Booz Allen’s total compensation package for employees. This posting will close within 90 days from the Posting Date.

Identity Statement

As part of the hiring process, we will ask you to complete an identity verification process that leverages advanced biometrics and artificial intelligence to ensure authenticity and protect against identity fraud. You are expected to be on camera during interviews and assessments. We reserve the right to take your picture to verify your identity and prevent fraud.

Candidate AI Usage Policy

AI is a part of our daily work at Booz Allen, and we are committed to the responsible and ethical use of AI tools. However, we want to ensure a fair candidate process based on your own skills and knowledge. As part of this commitment, the use of artificial intelligence (AI) or other tools to assist with responses during interviews (whether in-person or virtual) is prohibited unless permission is explicitly provided.

Work Model

Our people-first culture prioritizes the benefits of collaboration, whether it occurs in person or virtually. To support engagement and effective communication, employees working virtually are generally expected to have their cameras on during meetings.

  • Remote: If this position is listed as remote, there may still be occasions when you are required to work in person at a Booz Allen or customer facility.
  • Hybrid: If this position is listed as hybrid, you will be expected to work from a Booz Allen facility frequently, in alignment with leadership expectations and the needs of the role. You may also be required to work from or visit a customer facility.
  • Onsite: If this position is listed as onsite, work will primarily be performed at a Booz Allen office or customer facility, where employees will collaborate directly with colleagues and customers as required by the role.

Commitment to Non-Discrimination

All qualified applicants will receive consideration for employment without regard to disability, status as a protected veteran or any other status protected by applicable federal, state, local, or international law.

About Booz Allen Hamilton

Booz Allen Hamilton is a management and technology consulting firm that provides analytics, digital, engineering, and cybersecurity solutions primarily to U.S. government agencies and commercial clients.

Industry: Management & Technology Consulting