Case study
Agentic AIsecurity
Why GRPO matters for LLM post-training
Fall 2025 research spanning PQC cryptographic discovery across 12 codebases and a GRPO security post-training proof-of-concept on Qwen3-4B, with a long-horizon agent framework design for compliance workflows.
Security research partner (anonymized) · ~7 min read
Anonymized case study. Metrics marked [TBD] pending client validation. Status: draft.
At a glance
PQC findings (12 codebases)
39,572 [Measured]
Multi-stage static analysis Stages 1–3
False-positive reduction vs grep
78% [Measured]
Multi-stage pipeline vs naive pattern matching
Recall on known vulnerable patterns
95% [Measured]
Held-out vulnerable pattern set
GRPO internal reward (peak)
0.84 → 1.41 (67% lift) [Measured/Derived]
5 examples, Qwen3-4B-Base + LoRA, ~2 h T4
External CTI benchmark
Not evaluated
Honest gap; CyberSecQwen cited only as post-hoc anchor in full report
Problem
Post-quantum cryptography migration requires inventorying legacy cryptographic implementations across large, heterogeneous codebases—grep-based approaches generate noise that overwhelms remediation planning. Security teams also need to evaluate whether compact open-weight models can support cryptographic inspection and tool-using agent workflows without frontier-scale API costs.
No turnkey solution connected PQC inventory outputs to a locally trainable 4B-class security model. Existing evaluation focused on isolated prompts and external CTI benchmarks rather than reproducible post-training pipelines on accessible hardware with honest scale reporting.
Approach
- Multi-stage PQC discovery pipeline — Static pattern matching, semantic AST analysis, and dependency vulnerability scanning across 12 open-source and enterprise-representative codebases (Stages 1–3).
- GRPO security post-training — LoRA fine-tuning of Qwen3-4B-Base with group-relative policy optimization on a hand-coded security reward rubric; parallel SFT track for comparison.
- Long-horizon agent architecture design — Six-layer framework integrating LangGraph-style orchestration, vector memory, and security analysis tools; production deployment documented but not executed.
- Honest evaluation boundaries — Internal reward metrics only; external CTI and agent benchmarks explicitly out of scope for the fall 2025 deliverable.
Results
Preview metrics from measured fall 2025 artifacts. Client validation pending before marking metrics as final.
- PQC pipeline produced 39,572 findings with 78% false-positive reduction and 95% recall on known vulnerable patterns across 12 codebases.
- GRPO training on 5 security examples improved internal reward from 0.84 to 1.41 peak—a 67% relative lift—over ~2 hours on a T4 GPU.
- Six-layer agent architecture and tool surfaces documented for a subsequent production phase; Stage 4 LLM validation and deployment remain design-stage only.