HeadlinesBriefing.com

Tree Search Boosts Language Model Reasoning

Hacker News

This experiment tests whether tree search distillation, successful in game-playing AI like AlphaZero, can improve language model reasoning. The author applied Monte Carlo tree search (MCTS) to Qwen-2.5-1.5B-Instruct on the Countdown arithmetic game and used an online PPO loop to distill the stronger trajectories found by search back into the model. The goal is to improve reasoning beyond what standard language RL methods achieve.
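To make the task concrete, here is a minimal sketch of a Countdown answer checker that returns a sparse 0/1 reward. It is an illustration, not the author's code: the function name `sparse_reward`, the rule that each given number may be used at most once, and the restriction to `+`, `-`, `*`, `/` are all assumptions, since the summary does not spell out the exact rules.

```python
import ast
from collections import Counter
from fractions import Fraction

# Assumed operator set for Countdown; the brief does not list the allowed operations.
_OPS = {
    ast.Add: lambda a, b: a + b,
    ast.Sub: lambda a, b: a - b,
    ast.Mult: lambda a, b: a * b,
    ast.Div: lambda a, b: a / b,
}


def _evaluate(node, used):
    """Evaluate an arithmetic AST exactly, recording every integer literal in `used`."""
    if isinstance(node, ast.Constant) and type(node.value) is int:
        used.append(node.value)
        return Fraction(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _OPS[type(node.op)](left, right)
    raise ValueError("unsupported expression")


def sparse_reward(expression, numbers, target):
    """Return 1.0 only if `expression` hits `target` using the given numbers (hypothetical helper)."""
    try:
        tree = ast.parse(expression, mode="eval")
        used = []
        value = _evaluate(tree.body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return 0.0
    # Assumed rule: each provided number may be used at most once.
    if Counter(used) - Counter(numbers):
        return 0.0
    return 1.0 if value == target else 0.0
```

For example, `sparse_reward("(25 - 5) * 4", [25, 5, 4, 3], 80)` scores a correct solution, while an expression that misses the target, reuses a number, or fails to parse scores zero. Exact `Fraction` arithmetic avoids floating-point surprises when division is involved.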

The distilled model reached an 11.3% mean@16 eval score on Countdown, ahead of CISPO (8.4%) and best-of-N (7.7%), and 8.2 percentage points above the pre-RL instruct model's 3.1%. The author found that sparse rewards made training unstable, so training used a custom dense reward function while evaluation kept the sparse reward.
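One plausible reading of the mean@16 metric is: sample 16 completions per problem, average their sparse scores, then average across problems. The sketch below follows that reading; the name `mean_at_k` and the exact aggregation are assumptions, not details confirmed by the summary.

```python
from statistics import fmean


def mean_at_k(scores_per_problem, k=16):
    """Average per-problem mean scores over k samples each (assumed definition of mean@k)."""
    per_problem = []
    for scores in scores_per_problem:
        if len(scores) != k:
            raise ValueError(f"expected {k} samples per problem, got {len(scores)}")
        per_problem.append(fmean(scores))
    return fmean(per_problem)
```

Under this definition, a model that solves one problem on 4 of its 16 samples and never solves a second problem would score 0.125, or 12.5%.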

Experiments ran on an 8xH100 Andromeda node, with infrastructure built on Rust workers and Redis. The implementation used parallel MCTS with virtual losses to diversify the search. Unlike similar approaches such as TS-LLM, it combines online RL with parallel search. The author plans to scale the experiments to larger models and compute budgets in future work.
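The virtual-loss idea can be sketched in a few lines. When a worker selects a child, it temporarily counts a pessimistic visit against it, so other workers selecting before the result is backed up are steered toward different branches. The following is a single-level, single-threaded illustration using a PUCT-style score. The `Node` class, the constants `c_puct` and `loss_value`, and the function names are hypothetical, and the author's Rust and Redis implementation may differ substantially.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    prior: float
    visits: int = 0
    value_sum: float = 0.0
    virtual_losses: int = 0
    children: dict = field(default_factory=dict)


def puct_score(parent, child, c_puct=1.5, loss_value=1.0):
    """PUCT-style score in which pending virtual losses count as pessimistic visits."""
    n = child.visits + child.virtual_losses
    w = child.value_sum - loss_value * child.virtual_losses
    q = w / n if n > 0 else 0.0
    u = c_puct * child.prior * math.sqrt(parent.visits + 1) / (1 + n)
    return q + u


def select_child(parent):
    """Pick the best-scoring child and mark it with a virtual loss."""
    action, child = max(parent.children.items(), key=lambda kv: puct_score(parent, kv[1]))
    child.virtual_losses += 1
    return action, child


def backup(parent, child, value):
    """Replace the virtual loss with the real evaluation result."""
    child.virtual_losses -= 1
    child.visits += 1
    child.value_sum += value
    parent.visits += 1


root = Node(prior=1.0, children={"a": Node(prior=0.6), "b": Node(prior=0.4)})
first_action, first_child = select_child(root)
second_action, second_child = select_child(root)  # selected before any backup
backup(root, first_child, 1.0)
backup(root, second_child, 0.0)
```

Here the second selection lands on a different child than the first, even though the first child has the higher prior, because the pending virtual loss lowers its score until the real value arrives. A full implementation would apply the virtual loss along the whole root-to-leaf path and guard the counters against concurrent updates.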