← All case studies

Agentic AIsecurity

Why GRPO matters for LLM post-training

Fall 2025 research spanning PQC cryptographic discovery across 12 codebases and a GRPO security post-training proof-of-concept on Qwen3-4B, with a long-horizon agent framework design for compliance workflows.

Security research partner (anonymized) · ~7 min read

Anonymized case study. Metrics marked [TBD] pending client validation. Status: draft.

At a glance

PQC findings (12 codebases)

39,572 [Measured]

Multi-stage static analysis Stages 1–3

False-positive reduction vs grep

78% [Measured]

Multi-stage pipeline vs naive pattern matching

Recall on known vulnerable patterns

95% [Measured]

Held-out vulnerable pattern set

GRPO internal reward (peak)

0.84 → 1.41 (67% lift) [Measured/Derived]

5 examples, Qwen3-4B-Base + LoRA, ~2 h T4

External CTI benchmark

Not evaluated

Honest gap; CyberSecQwen cited only as post-hoc anchor in full report

Problem

Post-quantum cryptography migration requires inventorying legacy cryptographic implementations across large, heterogeneous codebases—grep-based approaches generate noise that overwhelms remediation planning. Security teams also need to evaluate whether compact open-weight models can support cryptographic inspection and tool-using agent workflows without frontier-scale API costs.

No turnkey solution connected PQC inventory outputs to a locally trainable 4B-class security model. Existing evaluation focused on isolated prompts and external CTI benchmarks rather than reproducible post-training pipelines on accessible hardware with honest scale reporting.

Approach

  1. Multi-stage PQC discovery pipeline — Static pattern matching, semantic AST analysis, and dependency vulnerability scanning across 12 open-source and enterprise-representative codebases (Stages 1–3).
  2. GRPO security post-training — LoRA fine-tuning of Qwen3-4B-Base with group-relative policy optimization on a hand-coded security reward rubric; parallel SFT track for comparison.
  3. Long-horizon agent architecture design — Six-layer framework integrating LangGraph-style orchestration, vector memory, and security analysis tools; production deployment documented but not executed.
  4. Honest evaluation boundaries — Internal reward metrics only; external CTI and agent benchmarks explicitly out of scope for the fall 2025 deliverable.

Results

Preview metrics from measured fall 2025 artifacts. Client validation pending before marking metrics as final.

  • PQC pipeline produced 39,572 findings with 78% false-positive reduction and 95% recall on known vulnerable patterns across 12 codebases.
  • GRPO training on 5 security examples improved internal reward from 0.84 to 1.41 peak—a 67% relative lift—over ~2 hours on a T4 GPU.
  • Six-layer agent architecture and tool surfaces documented for a subsequent production phase; Stage 4 LLM validation and deployment remain design-stage only.