Why GRPO matters for LLM post-training

Fall 2025 research spanning post-quantum cryptography (PQC) discovery across 12 codebases, a GRPO security post-training proof-of-concept on Qwen3-4B, and a long-horizon agent framework design for compliance workflows.

Security research partner (anonymized) · ~7 min read

Anonymized case study. Metrics marked [TBD] are pending client validation. Status: draft.

At a glance

PQC findings (12 codebases): 39,572 [Measured]. Multi-stage static analysis, Stages 1–3.
False-positive reduction vs grep: 78% [Measured]. Multi-stage pipeline vs naive pattern matching.
Recall on known vulnerable patterns: 95% [Measured]. Held-out vulnerable pattern set.
GRPO internal reward (peak): 0.84 → 1.41 (67% lift) [Measured/Derived]. 5 examples, Qwen3-4B-Base + LoRA, ~2 h on a T4.
External CTI benchmark: Not evaluated. Honest gap; CyberSecQwen cited only as post-hoc anchor in full report.

Problem

Post-quantum cryptography migration requires inventorying legacy cryptographic implementations across large, heterogeneous codebases. Grep-based approaches generate noise that overwhelms remediation planning. Security teams also need to evaluate whether compact open-weight models can support cryptographic inspection and tool-using agent workflows without frontier-scale API costs.
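To make the noise problem concrete, here is a minimal sketch contrasting a grep-style text match with an AST walk on a small Python snippet. The sample source, the `WEAK` name set, and both helper functions are hypothetical illustrations, not the rules or code used in the study.

```python
import ast
import re

# Hypothetical sample: only the last line actually calls a legacy hash.
SAMPLE = '''
import hashlib

# TODO: remove md5 once migration is done
NOTE = "md5 is deprecated here"

def digest(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()
'''

WEAK = {"md5", "sha1"}  # illustrative names only, not an exhaustive rule set


def grep_hits(source: str) -> list[int]:
    """Line numbers where a weak-algorithm name appears anywhere in the text."""
    pattern = re.compile(r"\b(md5|sha1)\b", re.IGNORECASE)
    return [
        lineno
        for lineno, line in enumerate(source.splitlines(), start=1)
        if pattern.search(line)
    ]


def ast_hits(source: str) -> list[int]:
    """Line numbers of actual calls such as hashlib.md5(...)."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr in WEAK
        ):
            hits.append(node.lineno)
    return sorted(hits)


print("grep-style hits:", grep_hits(SAMPLE))
print("AST call hits:  ", ast_hits(SAMPLE))
```

The text match flags the comment and the string literal as well as the real call, while the AST walk reports only the call site. This is the kind of gap a semantic stage is meant to close, though a real pipeline would need far broader rules and language coverage.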

No turnkey solution connected PQC inventory outputs to a locally trainable 4B-class security model. Existing evaluations focused on isolated prompts and external CTI benchmarks rather than on reproducible post-training pipelines run on accessible hardware with honest scale reporting.

Approach

Multi-stage PQC discovery pipeline — Static pattern matching, semantic AST analysis, and dependency vulnerability scanning across 12 open-source and enterprise-representative codebases (Stages 1–3).
GRPO security post-training — LoRA fine-tuning of Qwen3-4B-Base with group-relative policy optimization on a hand-coded security reward rubric; parallel SFT track for comparison.
Long-horizon agent architecture design — Six-layer framework integrating LangGraph-style orchestration, vector memory, and security analysis tools; production deployment documented but not executed.
Honest evaluation boundaries — Internal reward metrics only; external CTI and agent benchmarks explicitly out of scope for the fall 2025 deliverable.
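The group-relative step in GRPO can be sketched as follows: several completions are sampled for the same prompt, each is scored by the reward function, and each reward is standardized against its own group instead of against a learned value baseline. The rubric below is a hypothetical stand-in (the study's actual rubric is not reproduced in this case study), and the use of population standard deviation is an assumption, since implementations differ on that detail.

```python
from statistics import mean, pstdev


def rubric_reward(completion: str) -> float:
    """Hypothetical hand-coded rubric, for illustration only."""
    text = completion.lower()
    score = 0.0
    if "cve-" in text or "cwe-" in text:
        score += 0.5  # cites a concrete weakness identifier
    if "ml-kem" in text or "ml-dsa" in text:
        score += 0.5  # names a post-quantum replacement
    if "recommend" in text:
        score += 0.5  # gives an actionable recommendation
    return score


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize rewards within one prompt's group of sampled completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some libraries use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up completions for a single prompt.
group = [
    "RSA-2048 key exchange found; recommend migrating to ML-KEM (see CWE-327).",
    "This code uses RSA. Recommend review.",
    "Looks fine.",
]
rewards = [rubric_reward(c) for c in group]
advantages = group_relative_advantages(rewards)

for completion, reward, adv in zip(group, rewards, advantages):
    print(f"{reward:.1f}  {adv:+.2f}  {completion}")
```

Completions that score above their group mean get a positive advantage and are reinforced, and those below are discouraged. With only a handful of training examples, as in this proof-of-concept, the signal depends heavily on how well the rubric separates completions within each group.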

Results

Preview metrics from measured fall 2025 artifacts. Client validation pending before marking metrics as final.

The PQC pipeline produced 39,572 findings across 12 codebases, with a 78% false-positive reduction relative to grep-style pattern matching and 95% recall on a held-out set of known vulnerable patterns.
GRPO training on 5 security examples improved internal reward from 0.84 to a peak of 1.41 (a 67% relative lift) over ~2 hours on a T4 GPU.
Six-layer agent architecture and tool surfaces documented for a subsequent production phase; Stage 4 LLM validation and deployment remain design-stage only.
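For readers checking the arithmetic, the headline percentages can be expressed with three small formulas. The definitions below are assumptions about how such metrics are conventionally computed, since the case study does not spell them out. The false-positive and recall counts are made-up toy numbers chosen only so the arithmetic lands on the same percentages; they are not the study's counts.

```python
def relative_lift(before: float, after: float) -> float:
    """Fractional improvement of `after` over `before`."""
    return (after - before) / before


def false_positive_reduction(fp_baseline: int, fp_pipeline: int) -> float:
    """Share of baseline false positives that the pipeline eliminates."""
    return (fp_baseline - fp_pipeline) / fp_baseline


def recall(true_positives: int, false_negatives: int) -> float:
    """Share of known vulnerable patterns that were found."""
    return true_positives / (true_positives + false_negatives)


# Reward endpoints as reported (rounded) in the case study.
lift = relative_lift(0.84, 1.41)

# Toy counts, NOT from the study.
fp_red = false_positive_reduction(fp_baseline=200, fp_pipeline=44)
rec = recall(true_positives=19, false_negatives=1)

print(f"relative lift: {lift:.3f}")
print(f"false-positive reduction: {fp_red:.2f}")
print(f"recall: {rec:.2f}")
```

Using the rounded endpoints, the lift comes out near 0.68, so the reported 67% presumably reflects unrounded reward values or truncation; the full report would be the place to confirm which.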

How GI Consulting can help

Read the full technical report →
Discuss an engagement