HeadlinesBriefing.com

PopuLoRA Boosts LLM Reasoning via Adaptive Self-Play

Hacker News

Researchers from Vmax have unveiled PopuLoRA, a framework for post-training with reinforcement learning with verifiable rewards (RLVR). RLVR trains models by rewarding correct solutions to checkable tasks, such as code whose output can be verified by execution, but a fixed task set often becomes too easy as the model improves. PopuLoRA addresses this by separating task generation from task solving, using co-evolving populations of teachers and students.
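To make the idea of a verifiable reward concrete, here is a minimal Python sketch of an execution-based check. It is an illustration only, not PopuLoRA's code: the function name `verifiable_reward`, the binary 0/1 reward, and the test-case format are all assumptions, and a real system would run candidates in a sandbox.

```python
def verifiable_reward(program_src, func_name, test_cases):
    """Return 1.0 if the candidate program passes every test case, else 0.0.

    Hypothetical helper for illustration; the binary reward is an assumption.
    """
    namespace = {}
    try:
        # NOTE: exec on untrusted model output is unsafe; sandbox this in practice.
        exec(program_src, namespace)
        func = namespace[func_name]
        for args, expected in test_cases:
            if func(*args) != expected:
                return 0.0
    except Exception:
        # Syntax errors, missing functions, and runtime errors all earn no reward.
        return 0.0
    return 1.0


candidate = "def add(a, b):\n    return a + b\n"
tests = [((1, 2), 3), ((0, 0), 0)]
reward = verifiable_reward(candidate, "add", tests)
```

Because the reward comes from running the code rather than from a learned judge, it can be checked automatically for any task that ships with test cases.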

Single-agent self-play tends to collapse because the generator drifts toward tasks it can already solve. PopuLoRA instead pairs teachers with students: a teacher earns reward for valid tasks that stump its paired student, creating a contest between the two populations. This asymmetric design sustains a dynamic curriculum and prevents the stagnation seen when one model both proposes and attempts problems.
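The asymmetric incentive described above can be sketched in a few lines of Python. The exact reward shape is not given here, so the formula below (one minus the student's success rate, zero for invalid tasks) is an assumption chosen for illustration, as are the function names.

```python
def teacher_reward(task_is_valid, student_success_rate):
    """Reward a teacher for valid tasks that its paired student fails.

    Assumed shape: invalid tasks earn nothing; valid tasks earn more
    the less often the student solves them.
    """
    if not task_is_valid:
        return 0.0
    return 1.0 - student_success_rate


def student_reward(solved):
    """Reward a student only for a verified correct solution (assumed binary)."""
    return 1.0 if solved else 0.0
```

Under this sketch a teacher gains nothing from unverifiable or broken tasks, so the only way to score is to propose tasks that are valid yet hard for the current student.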

Built on LoRA adapters, PopuLoRA trains efficiently on a single machine. Teachers and students co-evolve via policy-gradient RL and weight-space evolution, with a verifier filtering out invalid programs. Prioritized matching keeps contests near-balanced, driving continuous gains in both task complexity and solving ability without manual intervention.
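One plausible reading of prioritized matching is to pair each teacher with the student it is most evenly matched against. The Python sketch below assumes a table of estimated solve rates per teacher-student pair and greedily picks the pairs closest to 50%; the function `match_pairs`, the greedy strategy, and the example numbers are all hypothetical, not the framework's actual procedure.

```python
def match_pairs(solve_rate, teachers, students):
    """Greedily pair teachers and students whose estimated solve rate is nearest 0.5.

    solve_rate maps (teacher, student) to the estimated fraction of that
    teacher's tasks the student solves. Hypothetical sketch.
    """
    # Rank every possible pairing by how far it is from an even contest.
    candidates = sorted(
        (abs(solve_rate[(t, s)] - 0.5), t, s) for t in teachers for s in students
    )
    used_teachers, used_students, pairs = set(), set(), []
    for _, t, s in candidates:
        if t not in used_teachers and s not in used_students:
            pairs.append((t, s))
            used_teachers.add(t)
            used_students.add(s)
    return pairs


# Made-up solve rates for two teachers and two students.
rates = {
    ("t0", "s0"): 0.9,
    ("t0", "s1"): 0.6,
    ("t1", "s0"): 0.45,
    ("t1", "s1"): 0.1,
}
pairs = match_pairs(rates, ["t0", "t1"], ["s0", "s1"])
```

With these numbers the sketch avoids the lopsided pairings (0.9 and 0.1) and selects the two contests nearest an even split.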

In experiments, PopuLoRA generates increasingly complex code tasks, while single-agent baselines drift toward simpler ones. By decoupling generation from solving, the approach offers a scalable way to improve LLM reasoning through an autocurriculum that adapts as training progresses, potentially reducing reliance on hand-crafted training data.