FDE · Production agentic ops

Agentic GPU fleet remediation

Jason Larkin · 2026-05-21 · ~18 min read

Reduced GPU idle time and incident MTTR via SLURM-aware auto-remediation and observability integrated with agentic ops tooling.

Abstract

GPU cloud operators face a volume of node-level incidents that outpaces manual triage. We describe an embedded forward-deployed engagement with a Tier-1 GPU cloud provider in which we mapped SLURM and DCGM failure modes, wired observability to structured remediation playbooks, and integrated MCP-accessible tooling so that both human operators and agentic workflows could close the loop from alert to verified recovery. The system prioritizes guardrails: destructive actions require explicit approval, and automation was rolled out in phases, starting with shadow and canary. Quantitative MTTR and escalation improvements are reported as [TBD] pending client validation. We discuss limitations around fleet heterogeneity and the need for ongoing playbook maintenance as hardware generations change.

1. Context & constraints

The provider operates a multi-tenant GPU fleet scheduled by SLURM. Tenants range from long-running distributed training jobs to bursty inference workloads. Platform leadership defined success as:

  • Reduce time nodes spend in unhealthy or drained states without increasing risk to running jobs.
  • Lower manual escalations for known fault signatures.
  • Leave behind runbooks and tooling the internal team could operate without permanent FDE staffing.

Constraints included: no broad SSH access for automated agents, change windows aligned to tenant SLAs, and anonymized public storytelling (no client name, fleet size, or datacenter identifiers in marketing materials).

2. Failure modes & observability

We grouped recurring incidents into fault classes before designing automation.

| Signal | Symptom | Prior manual response | Automation candidate |
| --- | --- | --- | --- |
| DCGM GPU XID / ECC | Job failures, health check red | Drain node, reset GPU, verify DCGM | Conditional drain + reset playbook with job-awareness |
| SLURM node DOWN / DRAIN | Queue backlog, scheduling gaps | Inspect slurmctld logs, power cycle | Scripted health probe → drain reason annotation |
| DCGM thermal / power | Throttling, perf regression | Migrate jobs, hardware ticket | Alert enrichment + Jira template |
| Intermittent fabric / NIC | NCCL timeouts | Isolate node, network team | Read-only diag bundle attachment to ticket |
| Stale compute state | Ghost jobs, cgroups mismatch | Manual scontrol cleanup | Guarded cleanup playbook (HITL required) |
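As a rough illustration of how the fault classes above could drive triage, the sketch below maps a signal name to its automation candidate. The signal keys and the `route_alert` helper are hypothetical names chosen for this example; only the automation descriptions come from the table.

```python
from typing import Optional

# Automation candidates copied from the fault-class table.
# The dictionary keys are hypothetical signal identifiers, not names from the engagement.
AUTOMATION_BY_SIGNAL = {
    "dcgm_xid_ecc": "Conditional drain + reset playbook with job-awareness",
    "slurm_node_down_drain": "Scripted health probe, then drain reason annotation",
    "dcgm_thermal_power": "Alert enrichment + Jira template",
    "fabric_nic_intermittent": "Read-only diag bundle attachment to ticket",
    "stale_compute_state": "Guarded cleanup playbook (HITL required)",
}


def route_alert(signal: str) -> Optional[str]:
    """Return the automation candidate for a known signal, or None for unknown signals."""
    return AUTOMATION_BY_SIGNAL.get(signal)
```

Returning `None` for an unrecognized signal leaves the incident on the manual path instead of guessing at a playbook.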

Grafana dashboards were refactored so each alert included: node identity, active job count, last successful playbook run, and deep links to Jira if an incident already existed.
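One way to picture the enriched alert is as a small record built from lookups keyed by node. This is a minimal sketch: the `AlertContext` type, the `enrich` function, and the lookup dictionaries are assumptions made for illustration, standing in for whatever the dashboards actually query.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class AlertContext:
    node: str                          # node identity
    active_jobs: Optional[int]         # None when the job count is unknown
    last_playbook_run: Optional[str]   # timestamp of the last successful run, if any
    jira_url: Optional[str]            # deep link when an incident already exists


def enrich(alert: Dict[str, str],
           job_counts: Dict[str, int],
           playbook_runs: Dict[str, str],
           open_incidents: Dict[str, str]) -> AlertContext:
    """Attach the four context fields to a raw alert (hypothetical lookup tables)."""
    node = alert["node"]
    return AlertContext(
        node=node,
        active_jobs=job_counts.get(node),          # deliberately not defaulted to 0
        last_playbook_run=playbook_runs.get(node),
        jira_url=open_incidents.get(node),
    )
```

Leaving `active_jobs` as `None` when the count is missing avoids treating a node as idle just because the lookup failed.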

3. Architecture & remediation loop

┌─────────────┐     ┌──────────────┐     ┌─────────────────┐
│ DCGM/Grafana│────▶│ Alert + ctx  │────▶│ Agent / operator│
└─────────────┘     └──────────────┘     └────────┬────────┘
                                                    │
                    ┌──────────────┐                ▼
                    │ Verify +     │◀─────── MCP tool layer
                    │ close Jira   │         (metrics, tickets,
                    └──────▲───────┘          remediation CLI)
                           │
                    ┌──────┴───────┐
                    │ Runbook      │
                    │ (guardrails) │
                    └──────┬───────┘
                           ▼
                    ┌──────────────┐
                    │ SLURM drain/ │
                    │ restore      │
                    └──────────────┘
Figure 1: Closed-loop remediation — Metrics (Grafana/DCGM) → Triage (operator or agent) → MCP tools → Runbook executor → SLURM action → Verification → Jira update.

Guardrails

  • Read-only tools (metrics query, log fetch, ticket read) available to agents without approval.
  • State-changing playbooks require role-based approval or pre-approved canary node lists.
  • Every execution logs: initiator (human vs agent), playbook version, pre/post node state.
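The guardrails above can be sketched as an approval gate plus an audit record. The function and field names here (`may_execute`, `Execution`, `audit_line`) are hypothetical, and JSON is only one plausible log format; the rules themselves follow the three bullets.

```python
import json
from dataclasses import asdict, dataclass
from typing import Set


@dataclass
class Execution:
    initiator: str          # "human" or "agent"
    playbook: str
    playbook_version: str
    node: str
    pre_state: str          # node state before the run
    post_state: str         # node state after the run


def may_execute(state_changing: bool, approved_by_role: bool,
                node: str, canary_nodes: Set[str]) -> bool:
    """Read-only actions always pass; mutations need approval or a canary node."""
    if not state_changing:
        return True
    return approved_by_role or node in canary_nodes


def audit_line(execution: Execution) -> str:
    """Serialize one execution as a single JSON log line."""
    return json.dumps(asdict(execution), sort_keys=True)
```

For example, `may_execute(True, False, "n1", {"n2"})` is `False`: a state-changing playbook on a non-canary node without role approval is refused.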

4. Agentic tooling integration

Tooling was exposed through MCP-compatible interfaces so agent frameworks and operator CLIs shared capabilities:

| Tool | Capability | Agent autonomy |
| --- | --- | --- |
| Grafana query | PromQL/label-scoped metrics | Autonomous read |
| Jira | Create/update incident, attach diag | Autonomous create; close requires human |
| SLURM read | Node/job state via controlled API | Autonomous read |
| Remediation CLI | Drain, GPU reset, health probe | HITL or canary-only |
| Runbook catalog | List/playbook metadata | Autonomous read |

This avoided one-off agent prompts that bypass audit trails. Operators could replay the same tool calls agents used, which accelerated trust during rollout.
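The autonomy column can be expressed as a lookup that agents and operator CLIs both consult before a tool call. The sketch below is an assumption about shape, not the delivered MCP definitions: the tool and operation names are hypothetical, and falling back to human approval for unlisted operations is a design choice of this example.

```python
from enum import Enum


class Autonomy(Enum):
    AUTONOMOUS = "autonomous"
    HUMAN_REQUIRED = "human_required"
    HITL_OR_CANARY = "hitl_or_canary"


# (tool, operation) -> autonomy level, following the tooling table.
POLICY = {
    ("grafana", "query"): Autonomy.AUTONOMOUS,
    ("jira", "create"): Autonomy.AUTONOMOUS,
    ("jira", "close"): Autonomy.HUMAN_REQUIRED,
    ("slurm", "read"): Autonomy.AUTONOMOUS,
    ("remediation", "drain"): Autonomy.HITL_OR_CANARY,
    ("remediation", "gpu_reset"): Autonomy.HITL_OR_CANARY,
    ("remediation", "health_probe"): Autonomy.HITL_OR_CANARY,
    ("runbooks", "list"): Autonomy.AUTONOMOUS,
}


def autonomy_for(tool: str, operation: str) -> Autonomy:
    """Look up the autonomy level; unlisted operations fall back to human approval."""
    return POLICY.get((tool, operation), Autonomy.HUMAN_REQUIRED)
```

Keeping the policy in one table means an operator replaying an agent's tool calls hits exactly the same checks.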

Example runbook metadata (illustrative, no production secrets):

playbook: dcgm-ecc-drain-reset
version: "1.2.0"
preconditions:   # every precondition must hold before the steps run
  - "node.state in [IDLE, COMPLETING]"
  - "active_jobs == 0"
steps:           # listed in execution order
  - 'slurm.drain(reason: "DCGM ECC alert {{alert_id}}")'
  - "dcgm.reset_gpu(bus_id)"
  - "health.verify_gpu(timeout: 300s)"
rollback:
  - "slurm.resume()"
approval: hitl   # human-in-the-loop approval
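To make the metadata concrete, here is a minimal sketch of how an executor might honor the preconditions and the rollback list. It is not the engagement's executor: `run_playbook` and its return values are hypothetical, steps are modeled as plain Python callables, and running the rollback when a step raises is an assumption about the intended semantics.

```python
from typing import Callable, List

# States allowed by the example playbook's first precondition.
ALLOWED_STATES = {"IDLE", "COMPLETING"}


def preconditions_met(node_state: str, active_jobs: int) -> bool:
    """Mirror the two preconditions from the example metadata."""
    return node_state in ALLOWED_STATES and active_jobs == 0


def run_playbook(node_state: str, active_jobs: int,
                 steps: List[Callable[[], None]],
                 rollback: List[Callable[[], None]]) -> str:
    """Run steps in order; on any failure, run the rollback steps."""
    if not preconditions_met(node_state, active_jobs):
        return "skipped"
    try:
        for step in steps:
            step()
    except Exception:
        for undo in rollback:
            undo()
        return "rolled_back"
    return "ok"
```

A node that is running jobs never reaches the first step, which is the job-awareness the playbook is meant to enforce.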

5. Evaluation & rollout

| Phase | Duration | Success criteria | Rollback trigger |
| --- | --- | --- | --- |
| Shadow | [TBD: N weeks] | Agent recommendations match operator actions ≥ [TBD]% | N/A (no mutations) |
| Canary | [TBD: N weeks] | MTTR ↓ on canary rack; zero wrongful drains | Any tenant SLA breach |
| Fleet segment | [TBD] | Escalation rate ↓; playbook coverage ≥ [TBD]% of incidents | Error budget exhausted |
| Handoff | [TBD: N weeks] | Internal team runs playbooks without FDE | |
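The shadow-phase criterion compares what the agent recommended with what the operator actually did. A minimal sketch of that comparison follows; `agreement_rate` is a hypothetical helper, exact string equality is an assumed matching rule, and the sample pairs in the usage note are synthetic.

```python
from typing import Iterable, Tuple


def agreement_rate(pairs: Iterable[Tuple[str, str]]) -> float:
    """Percentage of incidents where the agent recommendation equals the operator action."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no shadow-mode observations to compare")
    matches = sum(1 for recommended, actual in pairs if recommended == actual)
    return 100.0 * matches / len(pairs)
```

For instance, three matches out of four observations give `agreement_rate(...) == 75.0`, which would then be compared against the still-undecided threshold.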

Regression checks ran nightly against synthetic alert fixtures to catch playbook drift when SLURM or DCGM versions changed.
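A nightly check of this kind can be as small as replaying fixtures through the playbook selector and reporting mismatches. The fixtures, signal names, and `select_playbook` logic below are synthetic stand-ins written for this sketch, not the client's fixtures.

```python
from typing import Callable, List, Optional, Tuple

# Synthetic fixtures: (alert, expected playbook or None). Signal names are hypothetical.
FIXTURES: List[Tuple[dict, Optional[str]]] = [
    ({"signal": "dcgm_ecc", "node_state": "IDLE", "active_jobs": 0}, "dcgm-ecc-drain-reset"),
    ({"signal": "dcgm_ecc", "node_state": "ALLOCATED", "active_jobs": 4}, None),
    ({"signal": "unknown", "node_state": "IDLE", "active_jobs": 0}, None),
]


def select_playbook(alert: dict) -> Optional[str]:
    """Toy selector: pick the ECC playbook only for idle nodes with no jobs."""
    idle = alert["node_state"] in {"IDLE", "COMPLETING"} and alert["active_jobs"] == 0
    if alert["signal"] == "dcgm_ecc" and idle:
        return "dcgm-ecc-drain-reset"
    return None


def run_regression(select: Callable[[dict], Optional[str]],
                   fixtures: List[Tuple[dict, Optional[str]]]) -> List[str]:
    """Return one message per fixture whose selected playbook differs from the expectation."""
    failures = []
    for alert, expected in fixtures:
        actual = select(alert)
        if actual != expected:
            failures.append(
                f"{alert['signal']}/{alert['node_state']}: expected {expected}, got {actual}"
            )
    return failures
```

An empty list from `run_regression(select_playbook, FIXTURES)` means the selector still behaves as the fixtures expect; any entries point at drift to investigate before the next rollout step.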

6. Results

| Metric | Baseline | After rollout | Method |
| --- | --- | --- | --- |
| MTTR (median) | [TBD] | [TBD] | Incident open → node healthy |
| Manual escalations / week | [TBD] | [TBD] | P1/P2 platform tickets |
| Playbook executions / week | 0 | [TBD] | Approved automation only |
| False positive drains | [TBD] | | Jobs incorrectly preempted |

All values require client sign-off before replacing [TBD] on the public site.
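For reference, the median MTTR method (incident open to node healthy) can be computed as below. This is a sketch with a hypothetical function name, and it produces no client figures: the timestamps in the usage note are synthetic.

```python
from datetime import datetime
from statistics import median
from typing import Iterable, Tuple


def median_mttr_minutes(incidents: Iterable[Tuple[datetime, datetime]]) -> float:
    """Median of (node healthy - incident open), in minutes."""
    durations = [
        (healthy - opened).total_seconds() / 60.0
        for opened, healthy in incidents
    ]
    if not durations:
        raise ValueError("no incidents to summarize")
    return median(durations)
```

With synthetic incidents lasting 10, 30, and 50 minutes, the function returns 30.0. The median is less sensitive than the mean to a single long-running hardware ticket.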

7. Limitations & future work

  • Heterogeneous hardware: Playbooks tuned on one GPU generation may need retuning for newer accelerators.
  • Agent reliability: Long-horizon agent sessions are not a substitute for tested playbooks; we treated agents as orchestrators over deterministic steps.
  • Org change: Sustained benefit depends on internal owners for playbook PRs and on-call training.
  • Future: Deeper integration with capacity forecasting to proactively drain nodes before scheduled maintenance windows.

8. GI services & engagement model

GI Consulting embedded as a Forward Deployed Engineering (FDE) team: we paired with platform SRE and leadership, shipped incremental automation with weekly demos, and transferred runbooks and MCP tool definitions at handoff. This case maps to Cloud & systems architecture, High-performance & advanced computing, and the homepage FDE narrative.

Discuss a similar engagement: jason@giconsulting.net

Anonymized case study. Metrics marked [TBD] pending client validation. Last updated 2026-05-21.