FDE · Production agentic ops
Agentic GPU fleet remediation
Jason Larkin · 2026-05-21 · ~18 min read
Reduced GPU idle time and incident MTTR via SLURM-aware auto-remediation and observability integrated with agentic ops tooling.
Abstract
GPU cloud operators face a volume of node-level incidents that outpaces manual triage. We describe a forward-deployed engineering engagement embedded with a Tier-1 GPU cloud provider. The work mapped SLURM and DCGM failure modes, wired observability to structured remediation playbooks, and integrated MCP-accessible tooling so that both human operators and agentic workflows could close the loop from alert to verified recovery. The system prioritizes guardrails: destructive actions require explicit approval, and automation was rolled out through shadow and canary phases. Quantitative MTTR and escalation improvements are reported as [TBD] pending client validation. We discuss limitations around fleet heterogeneity and the need for ongoing playbook maintenance as hardware generations change.
1. Context & constraints
The provider operates a multi-tenant GPU fleet scheduled by SLURM. Tenants range from long-running distributed training jobs to bursty inference workloads. Platform leadership defined success as:
- Reduce time nodes spend in unhealthy or drained states without increasing risk to running jobs.
- Lower manual escalations for known fault signatures.
- Leave behind runbooks and tooling the internal team could operate without permanent FDE staffing.
Constraints included no broad SSH access for automated agents, change windows aligned to tenant SLAs, and anonymized public reporting (no client name, fleet size, or datacenter identifiers in marketing materials).
2. Failure modes & observability
We grouped recurring incidents into fault classes before designing automation.
| Signal | Symptom | Prior manual response | Automation candidate |
|---|---|---|---|
| DCGM GPU XID / ECC | Job failures, health check red | Drain node, reset GPU, verify DCGM | Conditional drain + reset playbook with job-awareness |
| SLURM node DOWN / DRAIN | Queue backlog, scheduling gaps | Inspect slurmctld logs, power cycle | Scripted health probe → drain reason annotation |
| DCGM thermal / power | Throttling, perf regression | Migrate jobs, hardware ticket | Alert enrichment + Jira template |
| Intermittent fabric / NIC | NCCL timeouts | Isolate node, network team | Read-only diag bundle attachment to ticket |
| Stale compute state | Ghost jobs, cgroups mismatch | Manual scontrol cleanup | Guarded cleanup playbook (HITL required) |
Grafana dashboards were refactored so each alert included: node identity, active job count, last successful playbook run, and deep links to Jira if an incident already existed.
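The fault classes and the per-alert context described above can be sketched in a few lines. This is a minimal illustration, not the client's implementation: the `FaultClass` names, the `EnrichedAlert` fields, and the keyword rules in `RULES` are assumptions chosen to mirror the table, and a production system would more likely key on structured alert labels than on free text.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FaultClass(Enum):
    """Hypothetical labels, one per row of the fault-class table."""
    GPU_XID_ECC = "gpu_xid_ecc"
    NODE_DOWN_DRAIN = "node_down_drain"
    THERMAL_POWER = "thermal_power"
    FABRIC_NIC = "fabric_nic"
    STALE_STATE = "stale_state"
    UNKNOWN = "unknown"


# Assumed keyword rules, checked in order; real alerts would carry structured labels.
RULES = [
    ({"xid", "ecc"}, FaultClass.GPU_XID_ECC),
    ({"thermal", "power"}, FaultClass.THERMAL_POWER),
    ({"nccl", "nic", "fabric"}, FaultClass.FABRIC_NIC),
    ({"ghost", "cgroup", "cgroups"}, FaultClass.STALE_STATE),
    ({"down", "drain"}, FaultClass.NODE_DOWN_DRAIN),
]


def classify(signal: str) -> FaultClass:
    """Map a free-text signal to a fault class by whole-word keyword match."""
    words = set(re.findall(r"[a-z0-9]+", signal.lower()))
    for keywords, fault in RULES:
        if words & keywords:
            return fault
    return FaultClass.UNKNOWN


@dataclass(frozen=True)
class EnrichedAlert:
    """Context attached to every alert, following the dashboard refactor."""
    node: str
    signal: str
    active_jobs: int
    last_playbook_run: Optional[str]  # ISO 8601 timestamp, None if never run
    jira_url: Optional[str]           # deep link if an incident already exists

    @property
    def fault_class(self) -> FaultClass:
        return classify(self.signal)
```

Carrying `active_jobs` on the alert itself means a downstream playbook can check job-awareness without a second lookup.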
3. Architecture & remediation loop
┌─────────────┐     ┌──────────────┐     ┌─────────────────┐
│ DCGM/Grafana│────▶│ Alert + ctx  │────▶│ Agent / operator│
└─────────────┘     └──────────────┘     └────────┬────────┘
                                                  │
                    ┌──────────────┐              ▼
                    │ Verify +     │◀─────── MCP tool layer
                    │ close Jira   │         (metrics, tickets,
                    └──────▲───────┘          remediation CLI)
                           │
                    ┌──────┴───────┐
                    │ Runbook      │
                    │ (guardrails) │
                    └──────┬───────┘
                           ▼
                    ┌──────────────┐
                    │ SLURM drain/ │
                    │ restore      │
                    └──────────────┘
Guardrails
- Read-only tools (metrics query, log fetch, ticket read) available to agents without approval.
- State-changing playbooks require role-based approval or pre-approved canary node lists.
- Every execution logs: initiator (human vs agent), playbook version, pre/post node state.
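The three guardrails above amount to an authorization check plus an audit record. The sketch below shows one way to express them, under stated assumptions: the tool names in `READ_ONLY_TOOLS`, the `ToolRequest` type, and the reduction of role-based approval to a non-empty `approved_by` field are all hypothetical simplifications, not the deployed policy engine.

```python
import json
from dataclasses import dataclass
from typing import Optional, Set

# Hypothetical tool names standing in for metrics query, log fetch, and ticket read.
READ_ONLY_TOOLS = {"metrics_query", "log_fetch", "ticket_read"}


@dataclass(frozen=True)
class ToolRequest:
    tool: str
    node: str
    initiator: str                     # "human" or "agent"
    approved_by: Optional[str] = None  # approver identity, if any


def is_allowed(req: ToolRequest, canary_nodes: Set[str]) -> bool:
    """Read-only tools always pass; mutations need approval or a canary node."""
    if req.tool in READ_ONLY_TOOLS:
        return True
    return req.approved_by is not None or req.node in canary_nodes


def audit_record(req: ToolRequest, playbook_version: str,
                 pre_state: str, post_state: str) -> str:
    """Serialize the fields the guardrails require for every execution."""
    return json.dumps({
        "initiator": req.initiator,
        "tool": req.tool,
        "node": req.node,
        "approved_by": req.approved_by,
        "playbook_version": playbook_version,
        "pre_state": pre_state,
        "post_state": post_state,
    }, sort_keys=True)
```

Keeping the check as a pure function of the request and the canary list makes the policy easy to unit test separately from any SLURM or DCGM call.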
4. Agentic tooling integration
Tooling was exposed through MCP-compatible interfaces so agent frameworks and operator CLIs shared capabilities:
| Tool | Capability | Agent autonomy |
|---|---|---|
| Grafana query | PromQL/label-scoped metrics | Autonomous read |
| Jira | Create/update incident, attach diag | Autonomous create; close requires human |
| SLURM read | Node/job state via controlled API | Autonomous read |
| Remediation CLI | Drain, GPU reset, health probe | HITL or canary-only |
| Runbook catalog | List playbooks and read playbook metadata | Autonomous read |
This avoided one-off agent prompts that bypass audit trails. Operators could replay the same tool calls agents used, which accelerated trust during rollout.
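The replay property is easiest to see when every tool call goes through one logged entry point. The following is an illustrative sketch only: `ToolRegistry` and its methods are hypothetical names, the tools are plain Python callables, and nothing here uses an actual MCP SDK.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class ToolRegistry:
    """Single entry point for tool calls so every call lands in one log."""
    tools: Dict[str, Callable[..., Any]] = field(default_factory=dict)
    log: List[Tuple[str, str, Dict[str, Any]]] = field(default_factory=list)

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self.tools[name] = fn

    def call(self, caller: str, name: str, **kwargs: Any) -> Any:
        # Record who called what, with which arguments, before executing.
        self.log.append((caller, name, dict(kwargs)))
        return self.tools[name](**kwargs)

    def replay(self, caller: str, as_caller: str) -> List[Any]:
        """Re-run every call originally made by `caller`, attributed to `as_caller`."""
        # Snapshot first: replayed calls append to the log while we iterate.
        calls = [(name, kwargs) for who, name, kwargs in self.log if who == caller]
        return [self.call(as_caller, name, **kwargs) for name, kwargs in calls]
```

An operator session could then call `registry.replay("agent", as_caller="operator")` to repeat an agent's read-only calls and compare results. Replaying state-changing calls would need the same approval checks as the original call.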
Example runbook metadata (illustrative, no production secrets):
playbook: dcgm-ecc-drain-reset
version: 1.2.0
preconditions:
  - "node.state in [IDLE, COMPLETING]"
  - "active_jobs == 0"
# Step expressions are quoted so the file parses as valid YAML.
steps:
  - 'slurm.drain(reason: "DCGM ECC alert {{alert_id}}")'
  - "dcgm.reset_gpu(bus_id)"
  - "health.verify_gpu(timeout: 300s)"
rollback:
  - "slurm.resume()"
approval: hitl
5. Evaluation & rollout
| Phase | Duration | Success criteria | Rollback trigger |
|---|---|---|---|
| Shadow | [TBD: N weeks] | Agent recommendations match operator actions ≥ [TBD]% | N/A (no mutations) |
| Canary | [TBD: N weeks] | MTTR ↓ on canary rack; zero wrongful drains | Any tenant SLA breach |
| Fleet segment | [TBD] | Escalation rate ↓; playbook coverage ≥ [TBD]% of incidents | Error budget exhausted |
| Handoff | [TBD: N weeks] | Internal team runs playbooks without FDE | — |
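The shadow-phase criterion in the first row compares what the agent would have done with what the operator actually did. A minimal way to compute that agreement rate is sketched below. The action labels and the idea of reducing each incident to a single action string are assumptions made for illustration; the threshold stays a caller-supplied parameter because the source leaves it as [TBD].

```python
from typing import Iterable, List, Tuple


def agreement_rate(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of incidents where the agent recommendation equals the operator action."""
    observed: List[Tuple[str, str]] = list(pairs)
    if not observed:
        raise ValueError("no shadow observations to evaluate")
    matches = sum(1 for recommended, actual in observed if recommended == actual)
    return matches / len(observed)


def shadow_phase_passes(pairs: Iterable[Tuple[str, str]], threshold: float) -> bool:
    """True when agreement meets the threshold chosen for the shadow phase."""
    return agreement_rate(pairs) >= threshold
```

For example, with the made-up observations `[("drain", "drain"), ("reset", "drain"), ("ticket", "ticket"), ("drain", "drain")]`, three of four pairs match and `agreement_rate` returns 0.75.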
Regression checks ran nightly against synthetic alert fixtures to catch playbook drift when SLURM or DCGM versions changed.
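A regression check of this kind can be as simple as replaying fixtures through the playbook's precondition logic and flagging any change in the decision. The sketch below assumes the preconditions from the example runbook in section 4 (node state IDLE or COMPLETING, zero active jobs). The `Fixture` type, the fixture names, and `failing_fixtures` are hypothetical, and a real harness would load fixtures from files rather than define them inline.

```python
from dataclasses import dataclass
from typing import List

# Node states taken from the example runbook's preconditions.
ALLOWED_STATES = {"IDLE", "COMPLETING"}


@dataclass(frozen=True)
class Fixture:
    """One synthetic alert scenario and the decision we expect for it."""
    name: str
    node_state: str
    active_jobs: int
    expect_run: bool


def preconditions_met(node_state: str, active_jobs: int) -> bool:
    """Playbook may run only on an allowed node state with no active jobs."""
    return node_state in ALLOWED_STATES and active_jobs == 0


def failing_fixtures(fixtures: List[Fixture]) -> List[str]:
    """Names of fixtures whose decision differs from the expected one."""
    return [
        f.name for f in fixtures
        if preconditions_met(f.node_state, f.active_jobs) != f.expect_run
    ]


# Synthetic scenarios, not derived from client data.
FIXTURES = [
    Fixture("idle-node-ecc", "IDLE", 0, True),
    Fixture("busy-node-ecc", "ALLOCATED", 4, False),
    Fixture("completing-with-job", "COMPLETING", 1, False),
]
```

A nightly job would fail the build whenever `failing_fixtures(FIXTURES)` is non-empty, which surfaces drift before a playbook reaches a canary node.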
6. Results
| Metric | Baseline | After rollout | Method |
|---|---|---|---|
| MTTR (median) | [TBD] | [TBD] | Incident open → node healthy |
| Manual escalations / week | [TBD] | [TBD] | P1/P2 platform tickets |
| Playbook executions / week | 0 | [TBD] | Approved automation only |
| False positive drains | — | [TBD] | Jobs incorrectly preempted |
All values require client sign-off before replacing [TBD] on the public site.
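For reference, the median MTTR method in the table (incident open to node healthy) reduces to a small calculation. The sketch below assumes each incident is available as a pair of ISO 8601 timestamps; the function name and input shape are illustrative, and the example values in the note that follows are synthetic, not client measurements.

```python
from datetime import datetime
from statistics import median
from typing import Iterable, Tuple


def median_mttr_minutes(incidents: Iterable[Tuple[str, str]]) -> float:
    """Median minutes from incident open to node healthy.

    Each incident is an (opened_at, healthy_at) pair of ISO 8601 strings.
    """
    durations = [
        (datetime.fromisoformat(healthy) - datetime.fromisoformat(opened)).total_seconds() / 60
        for opened, healthy in incidents
    ]
    return median(durations)
```

With three synthetic incidents lasting 30, 90, and 45 minutes, the function returns 45.0. The median is less sensitive than the mean to a single long hardware-replacement incident.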
7. Limitations & future work
- Heterogeneous hardware: Playbooks tuned on one GPU generation may need retuning for newer accelerators.
- Agent reliability: Long-horizon agent sessions are not a substitute for tested playbooks; we treated agents as orchestrators over deterministic steps.
- Org change: Sustained benefit depends on internal owners for playbook PRs and on-call training.
- Future: Deeper integration with capacity forecasting to proactively drain nodes before scheduled maintenance windows.
8. GI services & engagement model
GI Consulting embedded as a Forward Deployed Engineering team: we paired with platform SRE and leadership, shipped incremental automation with weekly demos, and transferred runbooks and MCP tool definitions at handoff. This case maps to Cloud & systems architecture, High-performance & advanced computing, and the homepage FDE narrative.
Discuss a similar engagement: jason@giconsulting.net
Anonymized case study. Metrics marked [TBD] pending client validation. Last updated 2026-05-21.