Published on 27/10/2025
The enhancement of reasoning capabilities in frontier Large Language Models (LLMs) has historically relied heavily on Reinforcement Learning (RL). State-of-the-art techniques, such as Group Relative Policy Optimization (GRPO), have been used successfully in post-training to achieve significant performance gains across challenging domains like mathematics (MATH500), coding (HumanEval), and science (GPQA).
However, a recent groundbreaking paper from Harvard, *Reasoning with Sampling: Your Base Model is Smarter Than You Think*, presents a striking finding: reasoning capabilities comparable to, or even exceeding, those achieved by RL can be elicited from base models solely through smarter sampling at inference time, with no additional training.
The core insight driving this discovery is the concept of "distribution sharpening".
Research indicates that the improvements seen after RL post-training are not fundamentally novel behaviors but rather a "sharper" version of the base model distribution.
The authors sought to achieve the sharpening effect without the limitations and costs of RL (training-free, dataset-free, verifier-free).
The proposed solution is Power Sampling, a simple iterative sampling algorithm.
Power Sampling aims to sample from the power distribution $p^\alpha$, where $\alpha \in [1, \infty)$ is the sharpening factor. Exponentiating the base model's sequence probability $p$ increases the relative weight placed on sequences with higher likelihood. Empirically, $\alpha = 4.0$ was found to be the most performant for reasoning tasks.
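Written out explicitly (notation ours, for clarity), the target is the normalized power distribution over full sequences $x$:

$$\pi_\alpha(x) = \frac{p(x)^{\alpha}}{\sum_{x'} p(x')^{\alpha}},$$

where the normalizing sum runs over all possible sequences.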
It is crucial to understand that sampling from $p^\alpha$ is not equivalent to low-temperature sampling ($\tau = 1/\alpha$): temperature scaling sharpens each next-token conditional independently, whereas the power distribution exponentiates the joint probability of the entire sequence, which also rewards prefixes whose continuations the model is confident about.
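To make the difference concrete, here is a toy two-step example (hypothetical numbers, not taken from the paper): under per-token temperature scaling the first-token marginal stays uniform, while under the sequence-level power distribution it tilts toward the branch whose continuation the model is confident about.

```python
import numpy as np

ALPHA = 4.0

# Toy two-step model over vocab {A, B} (hypothetical numbers, not from the paper).
p_x1 = np.array([0.5, 0.5])            # P(x1 = A), P(x1 = B)
p_x2_given_x1 = np.array([
    [0.9, 0.1],                        # P(x2 | x1 = A): a confident continuation
    [0.5, 0.5],                        # P(x2 | x1 = B): an uncertain continuation
])

# (a) Low-temperature sampling (tau = 1/alpha): each conditional is sharpened on
# its own, so the first-token marginal stays uniform here.
temp_x1 = p_x1**ALPHA / np.sum(p_x1**ALPHA)

# (b) Power distribution over whole sequences: pi(x1, x2) ∝ p(x1, x2)^alpha.
joint = p_x1[:, None] * p_x2_given_x1  # p(x1, x2)
power = joint**ALPHA / np.sum(joint**ALPHA)
power_x1 = power.sum(axis=1)           # first-token marginal under p^alpha

print("temperature marginal P(x1):", temp_x1)   # [0.5, 0.5]
print("power       marginal P(x1):", power_x1)  # ~[0.84, 0.16], tilted toward the confident branch
```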
Since direct sampling from $p^\alpha$ is computationally intractable (due to the necessity of normalizing over all sequences), the algorithm employs a Markov Chain Monte Carlo (MCMC) technique, specifically the Metropolis-Hastings (MH) algorithm.
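For intuition, if a candidate $x'$ were proposed by sampling fresh from the base model $p$ itself (an independence-style proposal, used here as an illustrative simplification rather than a restatement of the paper's exact proposal scheme), the MH acceptance probability for the target $\pi_\alpha \propto p^\alpha$ collapses to a likelihood ratio:

$$A(x \to x') = \min\!\left(1, \frac{\pi_\alpha(x')\,p(x)}{\pi_\alpha(x)\,p(x')}\right) = \min\!\left(1, \left(\frac{p(x')}{p(x)}\right)^{\alpha - 1}\right),$$

so proposals the base model already rates as more likely are accepted more often, and the effect grows with $\alpha$.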
To make this MCMC approach practical, the Power Sampling algorithm (Algorithm 1) introduces a sequence of intermediate distributions that mitigate the exponentially long mixing times common in high-dimensional MCMC; a simplified code sketch follows below.
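As a concrete sketch of what such an MH loop can look like, the snippet below keeps a random prefix, lets the base model resample the rest, and accepts with the ratio derived above. The helpers `sample_continuation` and `logprob` are hypothetical stand-ins for base-model calls, and the sketch omits the intermediate-distribution annealing of Algorithm 1; it illustrates the technique rather than reproducing the paper's implementation.

```python
import math
import random

def power_sampling_sketch(sample_continuation, logprob, init_seq, alpha=4.0, n_steps=100):
    """Metropolis-Hastings sketch targeting pi_alpha(x) ∝ p(x)^alpha.

    sample_continuation(prefix) -> a full sequence drawn from the base model given `prefix`
    logprob(seq)                -> the base model's total log p(seq)
    Assumes fixed-length sequences so the cut-point choice is symmetric between
    forward and reverse moves; the paper's Algorithm 1 additionally anneals
    through intermediate distributions to keep mixing times manageable.
    """
    current, current_lp = init_seq, logprob(init_seq)
    for _ in range(n_steps):
        # Propose: keep a random prefix of the current sample, let the base
        # model resample the remainder.
        cut = random.randint(0, len(current))
        proposal = sample_continuation(current[:cut])
        proposal_lp = logprob(proposal)

        # Accept with prob min(1, (p(x') / p(x))^(alpha - 1)); see the ratio above.
        log_accept = (alpha - 1.0) * (proposal_lp - current_lp)
        if random.random() < math.exp(min(0.0, log_accept)):
            current, current_lp = proposal, proposal_lp
    return current
```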
The results demonstrate that Power Sampling achieves massive, near-universal boosts in single-shot accuracy, often matching or outperforming RL, especially on out-of-domain tasks.
| Model | Method | MATH500 (In-Domain) | HumanEval (Out-of-Domain) | AlpacaEval 2.0 (General) |
|---|---|---|---|---|
| Qwen2.5-Math-7B | Base | 49.6% | 32.9% | 1.61 |
| Qwen2.5-Math-7B | GRPO (RL) | 78.5% | 53.7% | 2.38 |
| Qwen2.5-Math-7B | Power Sampling (paper) | 74.8% | 57.3% | 2.88 |
While RL (GRPO) suffers a collapse in diversity that hurts multi-shot performance, Power Sampling achieves pass@$k$ curves that lie above both GRPO and the base model for $k > 1$: it secures high single-shot accuracy while maintaining robust sample diversity.
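For reference, pass@$k$ is the probability that at least one of $k$ sampled completions is correct; it is commonly estimated with the standard unbiased estimator over $n \ge k$ samples of which $c$ are correct:

$$\widehat{\text{pass@}k} = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}.$$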
This technique establishes a new axis for inference-time scaling.
In summary, this research illustrates that the latent reasoning capabilities within base LLMs are significantly underutilized by current sampling methodologies. By employing Power Sampling, organizations can immediately boost the performance of any base model, including those fine-tuned on specific corporate data or tasks.
Paper: https://arxiv.org/pdf/2510.14901
Official repo: https://github.com/aakaran/reasoning-with-sampling