FDE · Production agentic ops

Agentic GPU fleet remediation

Jason Larkin · 2026-05-21 · ~18 min read

Reduced GPU idle time and incident MTTR via SLURM-aware auto-remediation and observability integrated with agentic ops tooling.

Abstract

GPU cloud operators face a volume of node-level incidents that outpaces manual triage. We describe an embedded forward-deployed engagement with a Tier-1 GPU cloud provider to map SLURM and DCGM failure modes, wire observability to structured remediation playbooks, and integrate MCP-accessible tooling so both human operators and agentic workflows could close the loop from alert to verified recovery. The system prioritizes guardrails: destructive actions require explicit approval, and automation was rolled out through shadow and canary phases before wider use. Quantitative MTTR and escalation improvements are reported as [TBD] pending client validation. We discuss limitations around fleet heterogeneity and the need for ongoing playbook maintenance as hardware generations change.

1. Context & constraints

The provider operates a multi-tenant GPU fleet scheduled by SLURM. Tenants range from long-running distributed training jobs to bursty inference workloads. Platform leadership defined success as:

Reduce time nodes spend in unhealthy or drained states without increasing risk to running jobs.
Lower manual escalations for known fault signatures.
Leave behind runbooks and tooling the internal team could operate without permanent FDE staffing.

Constraints included no broad SSH access for automated agents, change windows aligned to tenant SLAs, and an anonymized public write-up (no client name, fleet size, or datacenter identifiers in marketing materials).

2. Failure modes & observability

We grouped recurring incidents into fault classes before designing automation.

Signal	Symptom	Prior manual response	Automation candidate
DCGM GPU XID / ECC	Job failures, health check red	Drain node, reset GPU, verify DCGM	Conditional drain + reset playbook with job-awareness
SLURM node DOWN / DRAIN	Queue backlog, scheduling gaps	Inspect slurmctld logs, power cycle	Scripted health probe → drain reason annotation
DCGM thermal / power	Throttling, perf regression	Migrate jobs, hardware ticket	Alert enrichment + Jira template
Intermittent fabric / NIC	NCCL timeouts	Isolate node, network team	Read-only diag bundle attachment to ticket
Stale compute state	Ghost jobs, cgroups mismatch	Manual scontrol cleanup	Guarded cleanup playbook (HITL required)
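
The fault classes above can be held as a small lookup table so that triage code and operators share one definition. The following is a minimal sketch, not the client's implementation: the dictionary keys and the `requires_approval` flag are assumptions introduced for illustration (the flag is set wherever the automation candidate changes node state), while the labels and candidates are taken from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultClass:
    label: str                # signal name as it appears in the table
    automation: str           # automation candidate from the table
    requires_approval: bool   # assumption: True when the candidate mutates node state


# Keys are hypothetical identifiers; labels and candidates mirror the table.
FAULT_CLASSES = {
    "dcgm_xid_ecc": FaultClass(
        "DCGM GPU XID / ECC", "conditional drain + reset playbook", True),
    "slurm_node_down_drain": FaultClass(
        "SLURM node DOWN / DRAIN", "scripted health probe + drain reason annotation", True),
    "dcgm_thermal_power": FaultClass(
        "DCGM thermal / power", "alert enrichment + Jira template", False),
    "fabric_nic": FaultClass(
        "Intermittent fabric / NIC", "read-only diag bundle attached to ticket", False),
    "stale_compute_state": FaultClass(
        "Stale compute state", "guarded cleanup playbook", True),
}

# Anything unrecognized goes to a human instead of to automation.
UNCLASSIFIED = FaultClass("unclassified", "escalate to operator", True)


def classify(signal_key: str) -> FaultClass:
    """Return the fault class for a signal key, or UNCLASSIFIED if unknown."""
    return FAULT_CLASSES.get(signal_key, UNCLASSIFIED)
```

Defaulting unknown signals to an approval-required class keeps new or misnamed alerts out of the automated path until someone adds a row for them.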

Grafana dashboards were refactored so that each alert carried the node identity, the active job count, the last successful playbook run, and a deep link to the Jira incident if one already existed.
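
As an illustration of the alert context described above, the four fields can be modeled as one record that both dashboards and tools consume. This is a sketch under assumptions: the class name, field names, and summary format are hypothetical and do not reflect the client's alert schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class AlertContext:
    node: str                                   # node identity
    active_jobs: int                            # jobs currently on the node
    last_playbook_success: Optional[datetime]   # None if no playbook has succeeded yet
    jira_url: Optional[str]                     # None if no incident exists yet

    def summary(self) -> str:
        """One-line summary suitable for an alert annotation."""
        last = (self.last_playbook_success.isoformat()
                if self.last_playbook_success else "never")
        ticket = self.jira_url or "no open incident"
        return (f"node={self.node} active_jobs={self.active_jobs} "
                f"last_playbook_ok={last} incident={ticket}")
```

Keeping the optional fields explicit makes the "no prior run" and "no ticket yet" cases visible to whoever triages the alert, human or agent.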

3. Architecture & remediation loop

┌─────────────┐     ┌──────────────┐     ┌─────────────────┐
│ DCGM/Grafana│────▶│ Alert + ctx  │────▶│ Agent / operator│
└─────────────┘     └──────────────┘     └────────┬────────┘
                                                    │
                    ┌──────────────┐                ▼
                    │ Verify +     │◀─────── MCP tool layer
                    │ close Jira   │         (metrics, tickets,
                    └──────▲───────┘          remediation CLI)
                           │
                    ┌──────┴───────┐
                    │ Runbook      │
                    │ (guardrails) │
                    └──────┬───────┘
                           ▼
                    ┌──────────────┐
                    │ SLURM drain/ │
                    │ restore      │
                    └──────────────┘

Figure 1: Closed-loop remediation — Metrics (Grafana/DCGM) → Triage (operator or agent) → MCP tools → Runbook executor → SLURM action → Verification → Jira update.

Guardrails

Read-only tools (metrics query, log fetch, ticket read) are available to agents without approval.
State-changing playbooks require role-based approval or pre-approved canary node lists.
Every execution logs: initiator (human vs agent), playbook version, pre/post node state.
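
The three guardrails above reduce to one authorization check plus one audit record. The sketch below shows one way to express them; the tool names, the `Request` type, and the record layout are assumptions made for illustration, not the deployed policy engine.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

# Hypothetical tool identifiers for the read-only category.
READ_ONLY_TOOLS = frozenset({"metrics_query", "log_fetch", "ticket_read"})


@dataclass(frozen=True)
class Request:
    tool: str
    node: str
    initiator: str                     # "human" or "agent"
    approved_by: Optional[str] = None  # approver identity, if any


def is_allowed(req: Request, canary_nodes: FrozenSet[str]) -> bool:
    """Read-only tools always pass; anything else needs approval or a canary node."""
    if req.tool in READ_ONLY_TOOLS:
        return True
    return req.approved_by is not None or req.node in canary_nodes


def audit_record(req: Request, playbook_version: str,
                 pre_state: str, post_state: str) -> dict:
    """Capture the fields every execution is expected to log."""
    return {
        "initiator": req.initiator,
        "playbook_version": playbook_version,
        "node": req.node,
        "pre_state": pre_state,
        "post_state": post_state,
    }
```

For example, an agent calling a drain tool on a node outside the canary list is rejected until an approver is attached to the request.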

4. Agentic tooling integration

Tooling was exposed through MCP-compatible interfaces so agent frameworks and operator CLIs shared capabilities:

Tool	Capability	Agent autonomy
Grafana query	PromQL/label-scoped metrics	Autonomous read
Jira	Create/update incident, attach diag	Autonomous create; close requires human
SLURM read	Node/job state via controlled API	Autonomous read
Remediation CLI	Drain, GPU reset, health probe	HITL or canary-only
Runbook catalog	List playbooks and their metadata	Autonomous read

This avoided one-off agent prompts that bypass audit trails. Operators could replay the same tool calls agents had made, which helped build trust during rollout.
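
The replay property can be sketched as a thin layer that records every call in a serializable form and can re-issue it later. The class below is an illustrative stand-in, not an MCP SDK: `ToolLayer`, its methods, and the log format are hypothetical names chosen for this example.

```python
import json
from typing import Any, Callable, Dict, List


class ToolLayer:
    """Registers tools, records each call, and replays recorded calls."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}
        self.audit_log: List[str] = []  # one JSON document per call

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def call(self, name: str, initiator: str, /, **kwargs: Any) -> Any:
        # Run the tool first so failed lookups are not logged as executions.
        result = self._tools[name](**kwargs)
        self.audit_log.append(json.dumps(
            {"tool": name, "initiator": initiator, "args": kwargs},
            sort_keys=True))
        return result

    def replay(self, entry: str, initiator: str = "human") -> Any:
        """Re-issue a logged call with the same tool and arguments."""
        record = json.loads(entry)
        return self.call(record["tool"], initiator, **record["args"])
```

Because a replay goes through `call`, it is logged too, with the operator as initiator, so the audit trail shows both the agent's original call and the human's check.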

Example runbook metadata (illustrative, no production secrets):

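playbook: dcgm-ecc-drain-reset
version: 1.2.0
# All preconditions must hold before any step runs.
preconditions:
  - "node.state in [IDLE, COMPLETING]"
  - "active_jobs == 0"
# Steps run in order; a failed step triggers the rollback list.
steps:
  - call: slurm.drain
    reason: "DCGM ECC alert {{alert_id}}"
  - call: dcgm.reset_gpu
    bus_id: "{{bus_id}}"
  - call: health.verify_gpu
    timeout: 300s
rollback:
  - call: slurm.resume
approval: hitl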
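
To make the metadata's semantics concrete, the following sketch shows how an executor might interpret it: check the preconditions, run the steps in order, and run the rollback list if any step fails. It is a simplified assumption about executor behavior, with steps modeled as plain callables; the function name and return values are hypothetical.

```python
from typing import Callable, List

Action = Callable[[], bool]  # returns True on success, False on failure


def run_playbook(node_state: str, active_jobs: int,
                 steps: List[Action], rollback: List[Action]) -> str:
    """Return "skipped", "rolled_back", or "succeeded"."""
    # Preconditions from the metadata: idle or completing node, no active jobs.
    if node_state not in {"IDLE", "COMPLETING"} or active_jobs != 0:
        return "skipped"
    for step in steps:
        if not step():
            # Stop at the first failure and undo via the rollback list.
            for undo in rollback:
                undo()
            return "rolled_back"
    return "succeeded"
```

Returning a status instead of raising keeps the outcome easy to log next to the pre- and post-execution node state.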

5. Evaluation & rollout

Phase	Duration	Success criteria	Rollback trigger
Shadow	[TBD: N weeks]	Agent recommendations match operator actions ≥ [TBD]%	N/A (no mutations)
Canary	[TBD: N weeks]	MTTR ↓ on canary rack; zero wrongful drains	Any tenant SLA breach
Fleet segment	[TBD]	Escalation rate ↓; playbook coverage ≥ [TBD]% of incidents	Error budget exhausted
Handoff	[TBD: N weeks]	Internal team runs playbooks without FDE	—
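
The shadow-phase criterion compares what the agent recommended with what the operator actually did. A minimal sketch of that comparison follows; the function names are hypothetical, and the threshold is left as a caller-supplied parameter because the target percentage is still [TBD].

```python
from typing import List, Tuple


def agreement_rate(pairs: List[Tuple[str, str]]) -> float:
    """Fraction of incidents where the agent recommendation equals the operator action."""
    if not pairs:
        raise ValueError("no shadow observations recorded")
    matches = sum(1 for agent, operator in pairs if agent == operator)
    return matches / len(pairs)


def passes_shadow_gate(pairs: List[Tuple[str, str]], threshold: float) -> bool:
    """True when the agreement rate meets the agreed threshold (0.0 to 1.0)."""
    return agreement_rate(pairs) >= threshold
```

Exact string equality is the simplest possible match rule; a real comparison would likely need to normalize equivalent actions before counting them.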

Regression checks ran nightly against synthetic alert fixtures to catch playbook drift when SLURM or DCGM versions changed.
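
A nightly check of this kind can be as simple as replaying synthetic alerts through the playbook selector and comparing the result with the expected playbook. The sketch below assumes a selector function and a fixture format invented for this example; only the playbook name `dcgm-ecc-drain-reset` comes from the runbook metadata shown earlier.

```python
from typing import Callable, Dict, List, Tuple

Alert = Dict[str, str]


def check_fixtures(select_playbook: Callable[[Alert], str],
                   fixtures: List[Tuple[Alert, str]]) -> List[str]:
    """Return one message per fixture whose selected playbook differs from the expected one."""
    failures = []
    for alert, expected in fixtures:
        actual = select_playbook(alert)
        if actual != expected:
            failures.append(f"{alert['signal']}: expected {expected}, got {actual}")
    return failures


# Hypothetical selector and synthetic fixtures, for illustration only.
def example_selector(alert: Alert) -> str:
    return {"ecc": "dcgm-ecc-drain-reset"}.get(alert["signal"], "escalate")


FIXTURES = [
    ({"signal": "ecc", "node": "synthetic-01"}, "dcgm-ecc-drain-reset"),
    ({"signal": "unknown", "node": "synthetic-02"}, "escalate"),
]
```

An empty list from `check_fixtures(example_selector, FIXTURES)` means no drift was detected; any message in the list is a candidate for blocking the next playbook release.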

6. Results

Metric	Baseline	After rollout	Method
MTTR (median)	[TBD]	[TBD]	Incident open → node healthy
Manual escalations / week	[TBD]	[TBD]	P1/P2 platform tickets
Playbook executions / week	0	[TBD]	Approved automation only
False positive drains	—	[TBD]	Jobs incorrectly preempted

All values require client sign-off before replacing [TBD] on the public site.
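
For reference, the median MTTR method in the table (incident open to node healthy) can be computed as below. This is a sketch of the calculation only: the function name is hypothetical, and no client data is implied by it.

```python
from datetime import datetime
from statistics import median
from typing import List, Tuple


def median_mttr_minutes(incidents: List[Tuple[datetime, datetime]]) -> float:
    """Median minutes from incident open to node healthy.

    Each item is an (opened_at, healthy_at) pair.
    """
    if not incidents:
        raise ValueError("no incidents to summarize")
    durations = [(healthy - opened).total_seconds() / 60.0
                 for opened, healthy in incidents]
    return median(durations)
```

The median is less sensitive than the mean to a few long-running hardware tickets, which is one reason to report it for remediation time.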

7. Limitations & future work

Heterogeneous hardware: Playbooks tuned on one GPU generation may need retuning for newer accelerators.
Agent reliability: Long-horizon agent sessions are not a substitute for tested playbooks; we treated agents as orchestrators over deterministic steps.
Org change: Sustained benefit depends on internal owners for playbook PRs and on-call training.
Future: Deeper integration with capacity forecasting to proactively drain nodes before scheduled maintenance windows.

8. GI services & engagement model

GI Consulting embedded as a Forward Deployed Engineering (FDE) team: we paired with platform SRE and leadership, shipped incremental automation with weekly demos, and transferred the runbooks and MCP tool definitions at handoff. This case maps to Cloud & systems architecture, High-performance & advanced computing, and the homepage FDE narrative.

Discuss a similar engagement: jason@giconsulting.net

References

NVIDIA Data Center GPU Manager (DCGM) documentation — GPU telemetry and health monitoring.
SchedMD SLURM — workload manager node states and scheduling.
Model Context Protocol (MCP) — tool exposure pattern for agentic workflows.
Internal runbook catalog (client confidential — not reproduced here).

Anonymized case study. Metrics marked [TBD] pending client validation. Last updated 2026-05-21.