Published on 27/10/2025
The enhancement of reasoning capabilities in frontier Large Language Models (LLMs) has historically relied heavily on Reinforcement Learning (RL). State-of-the-art techniques, such as Group Relative Policy Optimization (GRPO), have been used successfully in post-training to achieve significant performance gains across challenging domains like mathematics (MATH500), coding (HumanEval), and science (GPQA).
However, a recent groundbreaking paper from Harvard, *Reasoning with Sampling: Your Base Model is Smarter Than You Think*, presents a striking finding: reasoning capabilities comparable to, or even exceeding, those achieved by RL can be elicited from base models solely through smarter sampling at inference time, with no additional training.
The core insight driving this discovery is the concept of "distribution sharpening".
Research indicates that the improvements seen after RL post-training are not fundamentally novel behaviors but rather a "sharper" version of the base model distribution.
The authors sought to achieve the sharpening effect without the limitations and costs of RL (training-free, dataset-free, verifier-free).
The proposed solution is Power Sampling, a simple iterative sampling algorithm.
Power Sampling aims to sample from the power distribution $p^\alpha$, where $\alpha \in [1, \infty)$ is the sharpening factor. Exponentiating the base model's sequence probability $p$ increases the relative weight placed on sequences with higher likelihood. Empirically, $\alpha = 4.0$ was found to be the most performant for reasoning tasks.
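Written out explicitly (notation ours, for clarity), the target is the normalized power distribution over full sequences $x$:

$$\pi_\alpha(x) = \frac{p(x)^{\alpha}}{\sum_{x'} p(x')^{\alpha}},$$

where the normalizing sum runs over all possible sequences.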
It is crucial to understand that sampling from $p^\alpha$ is not equivalent to low-temperature sampling ($\tau = 1/\alpha$): temperature scaling sharpens each next-token conditional independently, whereas the power distribution exponentiates the joint probability of the entire sequence, which also rewards prefixes whose continuations the model is confident about.
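To make the difference concrete, here is a toy two-step example (hypothetical numbers, not taken from the paper): under per-token temperature scaling the first-token marginal stays uniform, while under the sequence-level power distribution it tilts toward the branch whose continuation the model is confident about.

```python
import numpy as np

ALPHA = 4.0

# Toy two-step model over vocab {A, B} (hypothetical numbers, not from the paper).
p_x1 = np.array([0.5, 0.5])            # P(x1 = A), P(x1 = B)
p_x2_given_x1 = np.array([
    [0.9, 0.1],                        # P(x2 | x1 = A): a confident continuation
    [0.5, 0.5],                        # P(x2 | x1 = B): an uncertain continuation
])

# (a) Low-temperature sampling (tau = 1/alpha): each conditional is sharpened on
# its own, so the first-token marginal stays uniform here.
temp_x1 = p_x1**ALPHA / np.sum(p_x1**ALPHA)

# (b) Power distribution over whole sequences: pi(x1, x2) ∝ p(x1, x2)^alpha.
joint = p_x1[:, None] * p_x2_given_x1  # p(x1, x2)
power = joint**ALPHA / np.sum(joint**ALPHA)
power_x1 = power.sum(axis=1)           # first-token marginal under p^alpha

print("temperature marginal P(x1):", temp_x1)   # [0.5, 0.5]
print("power       marginal P(x1):", power_x1)  # ~[0.84, 0.16], tilted toward the confident branch
```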
Since direct sampling from $p^\alpha$ is computationally intractable (due to the necessity of normalizing over all sequences), the algorithm employs a Markov Chain Monte Carlo (MCMC) technique, specifically the Metropolis-Hastings (MH) algorithm.
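For intuition, if a candidate $x'$ were proposed by sampling fresh from the base model $p$ itself (an independence-style proposal, used here as an illustrative simplification rather than a restatement of the paper's exact proposal scheme), the MH acceptance probability for the target $\pi_\alpha \propto p^\alpha$ collapses to a likelihood ratio:

$$A(x \to x') = \min\!\left(1, \frac{\pi_\alpha(x')\,p(x)}{\pi_\alpha(x)\,p(x')}\right) = \min\!\left(1, \left(\frac{p(x')}{p(x)}\right)^{\alpha - 1}\right),$$

so proposals the base model already rates as more likely are accepted more often, and the effect grows with $\alpha$.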
To make this MCMC approach practical, the Power Sampling algorithm (Algorithm 1) introduces a sequence of intermediate distributions that mitigate the exponentially long mixing times common in high-dimensional MCMC; a simplified code sketch follows below.
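As a concrete sketch of what such an MH loop can look like, the snippet below keeps a random prefix, lets the base model resample the rest, and accepts with the ratio derived above. The helpers `sample_continuation` and `logprob` are hypothetical stand-ins for base-model calls, and the sketch omits the intermediate-distribution annealing of Algorithm 1; it illustrates the technique rather than reproducing the paper's implementation.

```python
import math
import random

def power_sampling_sketch(sample_continuation, logprob, init_seq, alpha=4.0, n_steps=100):
    """Metropolis-Hastings sketch targeting pi_alpha(x) ∝ p(x)^alpha.

    sample_continuation(prefix) -> a full sequence drawn from the base model given `prefix`
    logprob(seq)                -> the base model's total log p(seq)
    Assumes fixed-length sequences so the cut-point choice is symmetric between
    forward and reverse moves; the paper's Algorithm 1 additionally anneals
    through intermediate distributions to keep mixing times manageable.
    """
    current, current_lp = init_seq, logprob(init_seq)
    for _ in range(n_steps):
        # Propose: keep a random prefix of the current sample, let the base
        # model resample the remainder.
        cut = random.randint(0, len(current))
        proposal = sample_continuation(current[:cut])
        proposal_lp = logprob(proposal)

        # Accept with prob min(1, (p(x') / p(x))^(alpha - 1)); see the ratio above.
        log_accept = (alpha - 1.0) * (proposal_lp - current_lp)
        if random.random() < math.exp(min(0.0, log_accept)):
            current, current_lp = proposal, proposal_lp
    return current
```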
The results demonstrate that Power Sampling achieves massive, near-universal boosts in single-shot accuracy, often matching or outperforming RL, especially on out-of-domain tasks.
| Model | Method | MATH500 (In-Domain) | HumanEval (Out-of-Domain) | AlpacaEval 2.0 (General) |
|---|---|---|---|---|
| Qwen2.5-Math-7B | Base | 49.6% | 32.9% | 1.61 |
| Qwen2.5-Math-7B | GRPO (RL) | 78.5% | 53.7% | 2.38 |
| Qwen2.5-Math-7B | Power Sampling (paper) | 74.8% | 57.3% | 2.88 |
While RL (GRPO) suffers a collapse in diversity that hurts multi-shot performance, Power Sampling achieves pass@$k$ curves that lie above both GRPO and the base model for $k > 1$: it secures high single-shot accuracy while maintaining robust sample diversity.
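For reference, pass@$k$ is the probability that at least one of $k$ sampled completions is correct; it is commonly estimated with the standard unbiased estimator over $n \ge k$ samples of which $c$ are correct:

$$\widehat{\text{pass@}k} = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}.$$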
This technique establishes a new axis for inference-time scaling.
In summary, this research illustrates that the latent reasoning capabilities within base LLMs are significantly underutilized by current sampling methodologies. By employing Power Sampling, organizations can immediately boost the performance of any base model, including those fine-tuned on specific corporate data or tasks.
Paper: https://arxiv.org/pdf/2510.14901
Official repo: https://github.com/aakaran/reasoning-with-sampling