
Draft preview. This report is in progress. Only the Abstract and Section 1 (Context and constraints) contain final prose. Sections 2–8 are placeholders pending completion of the canonical technical report.

Security & compliance · PQC discovery · GRPO post-training

Security and Compliance LLM Research: PQC Discovery, GRPO Post-Training, and Agent Framework Design

Jason Larkin · Fall 2025 · ~45 min read

Fall 2025 technical report · Security and compliance research partner (anonymized)

Tagging legend

Every numeric or capability claim in this report carries one of the following tags. Tags appear inline immediately after the claim they qualify.

| Tag | Definition | Examples in this engagement |
| --- | --- | --- |
| Measured | Directly observed in llm_training artifacts, run logs, or primary evaluation outputs | 39,572 findings; 78% false-positive reduction; GRPO reward 0.84→1.41; 5 GRPO training examples |
| Derived | Computed from Measured values or faithfully paraphrased from cited external sources | 67% relative reward lift from baseline and peak; precision 78% computed from grep vs framework comparison |
| Projected | Extrapolation or counterfactual without local measurement | Our GRPO checkpoint on CTI-Bench; model scaling to 7B/14B/32B |
| Designed-not-run | Architecture or pipeline stage documented but not executed at scale | Stage 4 LLM validation; production agent deployment; METR long-horizon evaluation |
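
The inline tagging convention can be checked mechanically. The following is a minimal sketch, not part of the engagement tooling: the helper names `extract_tagged_claims` and `tag_counts` are hypothetical, and the sketch assumes each tag appears in square brackets directly after the token it qualifies, as it does in the Abstract.

```python
import re
from collections import Counter

# The four tags defined in the legend.
TAGS = ("Measured", "Derived", "Projected", "Designed-not-run")

# Capture the whitespace-delimited token immediately before a bracketed tag.
TAG_PATTERN = re.compile(
    r"(\S+)\s*\[(" + "|".join(re.escape(tag) for tag in TAGS) + r")\]"
)


def extract_tagged_claims(text: str) -> list[tuple[str, str]]:
    """Return (token_before_tag, tag) pairs in order of appearance."""
    return [(m.group(1), m.group(2)) for m in TAG_PATTERN.finditer(text)]


def tag_counts(text: str) -> Counter:
    """Count how often each tag is used in the text."""
    return Counter(tag for _, tag in extract_tagged_claims(text))


# Hypothetical sentence assembled for illustration only.
sample = (
    "The pipeline produced 39,572 [Measured] findings, a 67% [Derived] "
    "relative lift, and Stage 4 remains [Designed-not-run]."
)

if __name__ == "__main__":
    for token, tag in extract_tagged_claims(sample):
        print(f"{tag:>16}: {token}")
```

A check like this could flag numeric tokens that carry no tag at all, although deciding what counts as a "claim" would still need human review.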

Timeline note: Work described as fall 2025 contemporaneous includes PQC static analysis Stages 1–3, the GRPO proof-of-concept, and agent framework design. External comparison anchors released after fall 2025 are cited only as post-hoc validation references (see footnote in Section 1).

Abstract

Organizations preparing for post-quantum cryptography (PQC) migration must inventory legacy cryptographic implementations across large, heterogeneous codebases while simultaneously evaluating whether domain-specialized language models can support security and compliance workflows. We describe a fall 2025 research and engineering engagement spanning two complementary lanes: a multi-stage static analysis pipeline for cryptographic discovery, and a proof-of-concept security post-training program on Qwen3-4B with an accompanying long-horizon agent framework design.

In the discovery lane, we executed Stages 1–3 of a four-stage framework—static pattern matching, semantic abstract-syntax-tree analysis, and dependency vulnerability scanning—across 12 [Measured] codebases totaling 12,000 [Measured] analyzed files. The pipeline produced 39,572 [Measured] findings with 78% [Measured] false-positive reduction versus grep-based baselines while maintaining 95% [Measured] recall on known vulnerable patterns. Stage 4 LLM validation, intended for context-aware false-positive filtering, is documented but was not executed at scale [Designed-not-run].
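
The report does not spell out the formulas behind its false-positive reduction and recall figures. The sketch below assumes the conventional definitions: recall as true positives over all known vulnerable patterns, and false-positive reduction as the relative drop in false positives against the baseline. The `Confusion` class and every count in it are hypothetical and are not engagement data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Finding counts for one scanner against a labeled ground truth."""
    true_positives: int
    false_positives: int
    false_negatives: int

    @property
    def precision(self) -> float:
        flagged = self.true_positives + self.false_positives
        return self.true_positives / flagged if flagged else 0.0

    @property
    def recall(self) -> float:
        actual = self.true_positives + self.false_negatives
        return self.true_positives / actual if actual else 0.0


def false_positive_reduction(baseline: Confusion, candidate: Confusion) -> float:
    """Relative drop in false positives, assuming both scanned the same corpus."""
    if baseline.false_positives == 0:
        raise ValueError("baseline has no false positives to reduce")
    removed = baseline.false_positives - candidate.false_positives
    return removed / baseline.false_positives


# Hypothetical toy counts, chosen only to keep the arithmetic easy to follow.
grep_baseline = Confusion(true_positives=40, false_positives=200, false_negatives=0)
framework = Confusion(true_positives=36, false_positives=50, false_negatives=4)

if __name__ == "__main__":
    print(f"FP reduction: {false_positive_reduction(grep_baseline, framework):.2%}")
    print(f"recall:       {framework.recall:.2%}")
    print(f"precision:    {framework.precision:.2%}")
```

The toy numbers show the trade-off the paragraph describes: the candidate scanner drops most of the baseline's false positives while giving up a small share of recall.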

In the model lane, we fine-tuned Qwen3-4B-Base with LoRA adapters (1.62% [Measured] trainable parameters) using Group Relative Policy Optimization on 5 [Measured] security-focused examples. Internal reward scores improved from 0.84 [Measured] to 1.41 [Measured] peak—a 67% [Derived] relative lift—over approximately 2 [Measured] hours on a T4 GPU. No external benchmark evaluation was performed on the resulting checkpoint. We also specified a six-layer agent architecture integrating LangGraph-style orchestration, vector memory, and security analysis tools; production deployment of that architecture remains [Designed-not-run].
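
For readers unfamiliar with Group Relative Policy Optimization, its core step scores each sampled completion relative to the other completions drawn for the same prompt. The sketch below illustrates that normalization and the relative-lift arithmetic under stated assumptions: it uses the population standard deviation and a small `eps` for stability (implementations differ on both), the function names are hypothetical, and it is not the training code used in the engagement.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of completions for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so completions
    are ranked against their siblings rather than against a learned critic.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some libraries use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


def relative_lift(baseline: float, peak: float) -> float:
    """Relative improvement of peak over baseline."""
    return (peak - baseline) / baseline


if __name__ == "__main__":
    # Hypothetical rewards for four completions sampled from one prompt.
    group = [0.0, 1.0, 2.0, 1.0]
    print([round(a, 3) for a in group_relative_advantages(group)])

    # Rounded endpoints as reported in the abstract.
    print(f"{relative_lift(0.84, 1.41):.3f}")
```

Applied to the rounded endpoints 0.84 and 1.41, `relative_lift` gives roughly 0.68, in line with the reported 67% [Derived] figure once rounding of the endpoints is taken into account.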

Together, these lanes establish measurable cryptographic inventory capability, a reproducible GRPO training pipeline for security-aware reasoning, and a design foundation for agentic PQC migration workflows. Subsequent sections detail methodology, evaluation design, results, and limitations.

1. Context and constraints

The engagement was conducted during fall 2025 with a security and compliance research partner (anonymized). The partner operates at the intersection of regulatory readiness, cryptographic modernization, and exploratory use of small language models for defensive security tasks. Public materials omit client name, proprietary codebase identifiers, and internal compliance framework names.

1.1 Problem framing

Two problems motivated the work. First, PQC migration requires systematic discovery of legacy cryptographic primitives and configurations (MD5, SHA-1, weak RSA key sizes, deprecated TLS configurations) embedded across polyglot enterprise codebases. Traditional grep-based inventory scales poorly because it matches text rather than usage, generating noise that overwhelms remediation planning. Second, security teams increasingly ask whether compact open-weight models can be post-trained for cryptographic inspection, threat-context reasoning, and tool-using agent workflows without the cost and opacity of frontier-scale APIs. Fall 2025 industry context included NIST PQC standardization momentum, growing use of the CTI-Bench and Foundation-Sec evaluation protocols for cybersecurity language models, and early agent benchmarks (SWE-Gym, SoRFT) for code-repair tasks. None of these offered a turnkey solution connecting PQC inventory outputs to a locally trainable 4B-class security model.
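
The noise problem with text matching can be illustrated on a small Python snippet. This is a minimal sketch, not the engagement's pipeline: the embedded `SOURCE` sample and the helpers `grep_style_hits` and `ast_hits` are hypothetical, and the syntax-tree check only recognizes direct `hashlib.<name>(...)` calls.

```python
import ast
import re

# Hypothetical source file containing one real weak-hash call and two
# incidental mentions of "md5" (a comment and a string literal).
SOURCE = '''
import hashlib

# TODO: stop using md5 here once migration lands
label = "md5 is deprecated"

def fingerprint(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

def strong(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()
'''

WEAK_HASHES = {"md5", "sha1"}


def grep_style_hits(source: str) -> list[int]:
    """Line numbers where a weak-hash name appears anywhere as plain text."""
    pattern = re.compile(r"md5|sha1", re.IGNORECASE)
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if pattern.search(line)
    ]


def ast_hits(source: str) -> list[int]:
    """Line numbers of actual calls of the form hashlib.<weak_hash>(...)."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr in WEAK_HASHES
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == "hashlib"
        ):
            hits.append(node.lineno)
    return sorted(hits)


if __name__ == "__main__":
    print("text matches:", grep_style_hits(SOURCE))
    print("call sites:  ", ast_hits(SOURCE))
```

The text match flags the comment and the string literal alongside the real call, while the syntax-tree walk reports only the call site. A production analyzer would also need to handle aliased imports and other languages, which this sketch ignores.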

1.2 Success criteria

Partner leadership defined success along three axes:

  1. Discovery lane. Deliver inventory-grade cryptographic findings across a diverse corpus of open-source and enterprise-representative codebases, with quantified precision and recall improvements over naive pattern matching. Target: multi-stage pipeline operational through dependency scanning (Stages 1–3), with per-codebase results suitable for migration prioritization.
  2. Model lane. Demonstrate end-to-end ownership of a security-focused post-training pipeline—from base model selection through LoRA fine-tuning and reward-shaped GRPO—showing monotonic improvement on an internal security reward rubric. Target: reproducible training artifacts and honest reporting of scale limits; external CTI or agent benchmarks explicitly out of scope for the fall 2025 deliverable.
  3. Agent lane (design). Specify an architecture connecting discovery outputs, trained checkpoints, and security tooling (static analysis, cryptographic inspection, vector memory) into a long-horizon workflow suitable for a subsequent production phase. Target: documented layers, tool surfaces, and guardrails—not deployed autonomous remediation.

1.3 Constraints

Operational constraints shaped what was feasible:

  • Compute. Training ran on single T4 GPU instances (~2 [Measured] hours for GRPO; ~2.44 [Measured] hours for a parallel 59-example [Measured] standard SFT track). No multi-node or MI300X-scale runs were budgeted.
  • Data. Evaluation codebases were public or open-source representatives (12 [Measured] total); no proprietary customer repositories were analyzed.
  • Anonymization. This report uses anonymized partner labeling and excludes fleet-scale or tenant-specific identifiers, consistent with GI Consulting case-study conventions for sensitive engagements.
  • Honesty boundary. Numbers from releases that post-date fall 2025 (external CTI specialist models and SWE agent fine-tunes) may inform extrapolation and follow-on validation but are not claimed as contemporaneous deliverables.[1]

1.4 Explicit scope limits

The following boundaries apply throughout this report:

| In scope (fall 2025) | Out of scope or not executed |
| --- | --- |
| PQC static analysis Stages 1–3 across 12 [Measured] codebases; 39,572 [Measured] findings; 78% [Measured] FP reduction; 95% [Measured] recall | Stage 4 LLM validation at scale [Designed-not-run] |
| GRPO POC: 5 [Measured] examples, Qwen3-4B-Base + LoRA, internal reward 0.84→1.41 [Measured] | CTI-Bench, SecBench, CyberMetric, or agent benchmark scores on our checkpoint [Projected] |
| Parallel SFT track: 59 [Measured] examples, loss 0.64→0.18 [Measured] | Full-parameter continued pretraining at Foundation-Sec token scale |
| Six-layer agent architecture design; 14 [Measured] long-horizon diagram topics; LangGraph/LangChain/Chroma patterns | Production agent deployment; METR long-horizon hours-to-success eval [Designed-not-run] |
| Architecture docs linking PQC migration phases to tool integration | Autonomous remediation without human-in-the-loop gates [Designed-not-run] |

Stage 4 LLM validation—envisioned as a relevancy classifier analogous to Foundation-Sec's document filter—is architecturally specified but marked [Designed-not-run] everywhere it appears. The GRPO checkpoint improved on a hand-coded security reward only; it must not be read as benchmark-competitive security reasoning without external evaluation.
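
To make the phrase "hand-coded security reward" concrete, the sketch below shows what a rule-based reward of this general kind can look like. It is purely illustrative: the term lists, weights, and the function name `security_reward` are hypothetical and do not reproduce the rubric used in the engagement.

```python
import re

# Hypothetical vocabularies; a real rubric would be broader and task-specific.
WEAK_TERMS = ("md5", "sha-1", "sha1")
REPLACEMENT_TERMS = ("sha-256", "sha256", "sha-3", "sha3")


def security_reward(completion: str) -> float:
    """Score a model completion with simple, transparent string rules."""
    text = completion.lower()
    score = 0.0
    if any(term in text for term in WEAK_TERMS):
        score += 0.5  # names a weak primitive
    if any(term in text for term in REPLACEMENT_TERMS):
        score += 0.5  # proposes a stronger replacement
    if re.search(r"\b(because|since|due to)\b", text):
        score += 0.25  # offers some rationale
    if re.search(r"\b(md5|sha-?1) is (safe|secure|fine)\b", text):
        score -= 1.0  # endorses a weak primitive
    return score


if __name__ == "__main__":
    for answer in (
        "MD5 is broken because of collisions; use SHA-256.",
        "MD5 is safe.",
        "No issues found.",
    ):
        print(f"{security_reward(answer):+.2f}  {answer}")
```

Rules like these are easy to audit but also easy for a policy to satisfy superficially, which is one reason a rising internal reward does not establish benchmark-level security reasoning.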

A parallel standard fine-tuning run (59 [Measured] examples, cross-entropy loss) provides a general-purpose comparison track but is secondary to the GRPO security POC in this narrative.

2. Failure modes and observability

Preview. This section will cover legacy cryptographic blind spots, grep and Semgrep false-positive noise, CTI evaluation gaps, and signal-to-response mapping tables. Content will be adapted from the production handoff mapping for Failure modes and observability (production handoff Section 2, "Data collection pipeline" row and GI Sec 2 template), drawing on PQC precision/recall findings and honest gaps in external benchmark coverage.

3. Architecture and remediation loop

Preview. This section will cover the four-stage PQC analysis funnel (Stages 1–3 Measured; Stage 4 Designed-not-run), Qwen3-4B base model and LoRA configuration, and the six-layer long-horizon agent architecture. Content will follow the production handoff Architecture and remediation loop mapping (production handoff Sections 2–3, Figure Plan Fig 2 and Fig 4), reusing pipeline and security framework visualizations from docs/visualizations/.

4. Agentic tooling

Preview. This section will cover GRPO training stack mechanics, programmatic security rewards, LangGraph-style orchestration, vector memory, and MCP-style security tool surfaces. Content will follow the production handoff Agentic tooling mapping (production handoff Section 2 agent row and Section 7 step 12), with production deployment marked Designed-not-run throughout.

5. Evaluation and rollout

Preview. This section will cover PQC static analysis metrics (precision, recall, false-positive reduction), internal GRPO reward rubric design, Foundation-Sec evaluation protocol as a structural reference, and phased rollout tables. Content will follow the production handoff Evaluation and rollout mapping (production handoff Section 2 evaluation row and Section 7 steps 8–9), with external CTI benchmarks explicitly labeled Projected or post-hoc where cited.

6. Results

Preview. This section will present measured PQC codebase findings tables, grep-vs-framework comparisons, GRPO and SFT training curves, and a post-hoc CyberSecQwen CTI anchor table with May 2026 disclaimer. Content will follow the production handoff Results mapping (production handoff Section 2 results rows, Figure Plan Tables B/C and Table A).

7. Limitations and future work

Preview. This section will document Stage 4 LLM validation not executed, five-example GRPO scale limits, absence of CTI evaluation on our checkpoint, agent tools remaining design-stage, and domain-specialization tradeoffs. Content will follow the production handoff Limitations and future work mapping (production handoff Section 2 limitations row and Section 7 step 14).

GI services & engagement model

GI Consulting embedded as Forward Deployed Engineering: paired with research leadership, shipped incremental discovery and training artifacts with weekly demos, and transferred architecture documentation at handoff. This case maps to Security & compliance and AI & agentic system design services.

Discuss a similar engagement: jason@giconsulting.net

References

Preview. This section will compile bibliographic entries from docs/technical_paper/references.bib and the external anchor registry in docs/external_benchmark_anchors.md, including Foundation-Sec, CTI-Bench, CyberSecQwen, Oxen GRPO, and local llm_training documentation. Content will follow the production handoff References mapping (production handoff Section 2 references row and Section 7 step 15).

  1. Post-hoc validation anchors. CyberSecQwen-4B (released May 2026) provides an external CTI-Bench LoRA SFT reference under the Foundation-Sec evaluation protocol; SWE-Lego Qwen3-8B (arXiv January 2026) provides an agent-lane scaling reference on SWE-Bench Verified. Neither existed during fall 2025. They are cited in later sections only as validation or extrapolation anchors, not as engagement deliverables.

Anonymized case study. Draft preview — sections 2–8 forthcoming. Last updated 2026-05-25.