DeepConf: Deep Think with Confidence

Executive Summary

DeepConf filters out low‑quality reasoning traces at test time using model‑internal confidence, improving accuracy while reducing generation cost [1].
It combines token‑level and group (sliding‑window) confidence to estimate how reliable the reasoning is locally [1].
It supports offline and online modes, enabling confidence‑weighted majority voting and early stopping of weak traces [1].
On AIME 2025, DeepConf@512 attains up to 99.9% accuracy and reduces generated tokens by up to 84.7% versus standard parallel thinking at the same budget [1].

Glossary

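Token confidence: per‑token reliability score derived from the model's logprobs; in [1] it is the negative mean log‑probability of the top‑k candidate tokens at a position, so higher values indicate a more peaked (more certain) distribution [1].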

Group confidence: aggregated confidence over a sliding window of adjacent tokens [1].

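Tail/lowest‑group: trace‑level summaries; tail confidence averages over the final tokens of a trace, and lowest‑group confidence is the minimum group confidence anywhere in the trace [1].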

Top‑x% filter: keep only the x% of traces with the highest confidence and discard the rest before voting [1].

What DeepConf is and why it matters

DeepConf is a test‑time method that scores reasoning quality via internal confidence signals, discarding weak paths early and focusing budget on promising ones [1]. In multi‑trace settings (e.g., self‑consistency), this yields stronger decisions with fewer tokens [1].

How it works

Token and group confidence

Confidence is computed per token from model logprobs and aggregated over sliding windows to obtain more stable, local group confidence [1]. Statistics such as bottom‑10% groups, tail confidence, and lowest‑group confidence capture bottlenecks in the trace [1].
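
The statistics above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: it assumes token confidence is the negative mean log‑probability of the top‑k candidates at each position, and the function names, window length, and tail length are hypothetical choices made for the example.

```python
from statistics import mean

def token_confidence(topk_logprobs):
    """Confidence at one position (assumed definition): negative mean
    log-probability of the top-k candidates. A peaked distribution pushes
    the non-top candidates to very negative logprobs, so the score rises."""
    return -mean(topk_logprobs)

def group_confidences(token_confs, window):
    """Mean token confidence over each sliding window of adjacent tokens."""
    n = len(token_confs)
    if n <= window:
        return [mean(token_confs)]
    return [mean(token_confs[i:i + window]) for i in range(n - window + 1)]

def lowest_group_confidence(groups):
    """Weakest window in the trace: the bottleneck of the reasoning."""
    return min(groups)

def bottom_fraction_confidence(groups, fraction=0.10):
    """Mean of the lowest `fraction` of group confidences (at least one)."""
    k = max(1, int(len(groups) * fraction))
    return mean(sorted(groups)[:k])

def tail_confidence(token_confs, tail_len):
    """Mean token confidence over the last `tail_len` tokens."""
    return mean(token_confs[-tail_len:])

# Toy trace with a low-confidence dip in the middle.
confs = [2.0, 2.0, 0.5, 0.5, 2.0, 2.0]
groups = group_confidences(confs, window=2)  # [2.0, 1.25, 0.5, 1.25, 2.0]
```

In this toy trace the mean over all tokens (1.5) hides the dip, while the lowest‑group value (0.5) exposes it, which is the motivation for the bottleneck statistics.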

Offline vs online

Offline: generate multiple full traces, score them by confidence, and apply confidence‑weighted majority voting [1]. Online: during generation, apply sliding‑window confidence filtering and early‑stop weak traces to save tokens [1].
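
A minimal sketch of the offline path, assuming each finished trace has already been reduced to a final answer and a single confidence score. The dictionary layout and the names `top_fraction` and `weighted_vote` are hypothetical, not taken from [1].

```python
from collections import defaultdict

def top_fraction(traces, keep):
    """Keep the `keep` fraction of traces with the highest confidence."""
    ranked = sorted(traces, key=lambda t: t["confidence"], reverse=True)
    n_keep = max(1, int(len(ranked) * keep))
    return ranked[:n_keep]

def weighted_vote(traces):
    """Sum confidence per answer and return the answer with the largest total."""
    totals = defaultdict(float)
    for t in traces:
        totals[t["answer"]] += t["confidence"]
    return max(totals, key=totals.get)

# Toy data: "17" has more votes, "42" has more confidence.
traces = [
    {"answer": "17", "confidence": 1.0},
    {"answer": "17", "confidence": 1.1},
    {"answer": "17", "confidence": 0.9},
    {"answer": "42", "confidence": 2.0},
    {"answer": "42", "confidence": 2.2},
]
weighted_winner = weighted_vote(traces)                     # "42"
filtered_winner = weighted_vote(top_fraction(traces, 0.4))  # "42"
```

Plain majority voting would pick "17" here (three votes against two); weighting by confidence flips the decision, which is helpful only when confidence tracks correctness.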

Operational choices

Weighted voting: each trace's vote for its final answer is weighted by the trace's estimated confidence [1].
Filtering: progressively drop traces below adaptive thresholds (e.g., quantiles) [1].
Consensus τ: stop when consensus across traces exceeds τ to avoid further generation [1].
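
The three choices above can be combined into a small simulation of the online loop. This is a sketch under stated assumptions: pre‑computed per‑token confidences stand in for live decoding, a finished trace is weighted by its mean token confidence, and `min_traces` is a hypothetical guard so that a single finished trace cannot trigger the consensus stop on its own.

```python
from collections import defaultdict

def should_stop_trace(token_confs, window, threshold):
    """True once the latest full window's mean confidence drops below threshold."""
    if len(token_confs) < window:
        return False
    return sum(token_confs[-window:]) / window < threshold

def consensus(finished):
    """Leading answer and its share of the total confidence weight."""
    totals = defaultdict(float)
    for answer, weight in finished:
        totals[answer] += weight
    best = max(totals, key=totals.get)
    return best, totals[best] / sum(totals.values())

def run_online(traces, window, threshold, tau, min_traces=2):
    """Process traces one by one; assumes every trace has at least one token."""
    finished, tokens_used, best = [], 0, None
    for token_confs, answer in traces:
        seen, stopped = [], False
        for conf in token_confs:          # stands in for token-by-token decoding
            seen.append(conf)
            tokens_used += 1
            if should_stop_trace(seen, window, threshold):
                stopped = True            # weak trace: abandon it early
                break
        if stopped:
            continue
        finished.append((answer, sum(seen) / len(seen)))
        best, share = consensus(finished)
        if len(finished) >= min_traces and share >= tau:
            break                         # enough agreement: stop generating
    return best, tokens_used

traces = [
    ([2.0, 2.0, 2.0, 2.0], "42"),
    ([2.0, 0.1, 0.1, 2.0], "17"),   # dips below threshold, cut after 3 tokens
    ([2.0, 2.0, 2.0, 2.0], "42"),
    ([2.0, 2.0, 2.0, 2.0], "42"),   # never generated: consensus reached first
]
result = run_online(traces, window=2, threshold=1.0, tau=0.95)  # ("42", 11)
```

Generating all four toy traces in full would cost 16 tokens; early stopping and the consensus check bring this to 11.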

Figure 1: Early‑stop via group confidence and consensus τ [1].

Key results

On AIME 2025, DeepConf@512 achieves up to 99.9% accuracy and cuts generated tokens by up to 84.7% relative to standard parallel thinking at an equal budget [1]. The other evaluated tasks show a similar trend: large token savings, with a controlled accuracy trade‑off as filter strength increases [1].

Comparison

Method	Budget K	Tokens (×10^8)	Accuracy (%)	Notes
DeepConf‑low (top‑10%)	512	—	99.9	AIME; ↓84.7% tokens vs standard [1]
DeepConf‑high (top‑90%)	512	—	~99–100	Higher coverage; smaller savings [1]
Majority Voting	512	—	≤99.9	No filtering; higher cost [1]

Minimal vLLM enablement

Logprobs: enable logprobs to derive per‑token confidence [1].
Sliding window: compute group confidence as the mean token confidence over a window of length L [1].
Early‑stop: threshold on quantile/minimum group value + consensus τ [1].
OpenAI‑compatible: extra args for window, quantile, and enable_logprobs [1].
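
As a rough sketch of the first step only, the snippet below requests top‑k logprobs through vLLM's offline `LLM` and `SamplingParams` interface and converts them into per‑token confidences. The helper names, the choice of `top_k=20`, and the confidence definition (negative mean of the returned logprobs) are assumptions for illustration; the window and quantile arguments mentioned above are not reproduced here because their exact names are not given in this summary.

```python
def confidences_from_logprobs(step_logprobs):
    """Turn per-step logprob dicts into token confidences.

    `step_logprobs` holds one dict per generated token, mapping token id to
    an object with a `.logprob` attribute (the shape of vLLM's
    `CompletionOutput.logprobs`). vLLM may also include the sampled token,
    so a step can hold one entry more than the requested top-k.
    """
    confs = []
    for step in step_logprobs:
        values = [entry.logprob for entry in step.values()]
        confs.append(-sum(values) / len(values))
    return confs

def sample_traces(model_name, prompt, n_traces, top_k=20, max_tokens=4096):
    """Generate `n_traces` completions with logprobs enabled (needs vLLM installed)."""
    from vllm import LLM, SamplingParams  # imported lazily on purpose

    llm = LLM(model=model_name)
    params = SamplingParams(
        n=n_traces,
        temperature=1.0,
        logprobs=top_k,        # return top-k logprobs for every generated token
        max_tokens=max_tokens,
    )
    request = llm.generate([prompt], params)[0]
    return [(out.text, confidences_from_logprobs(out.logprobs))
            for out in request.outputs]
```

The resulting confidence lists can then be fed to whatever sliding‑window and early‑stop logic is in use. Note that true early stopping requires intervening during decoding, which this offline call does not do.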

Practical implications

“Low” (top‑10%) filter: maximizes token savings, but aggressive filtering can concentrate votes on confidently wrong traces, so require adequate consensus before accepting the answer [1].
“High” (top‑90%) filter: keeps more traces; prefer when accuracy is paramount and budget is looser [1].
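Risks: a trace can be confident and still wrong; mitigate with an initial calibration pass and a threshold warm‑up [1].
Consensus τ: choose τ according to the number of traces and how variable the task's answers are; a higher τ demands more agreement before generation stops [1].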
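
One way to realize the threshold warm‑up is to generate a small batch of full traces first and derive the stopping threshold from their confidence scores. The sketch below assumes each warm‑up trace is summarized by a single score (for example its lowest group confidence); the function name and the percentile rule are illustrative, not a statement of the exact procedure in [1].

```python
import math

def stopping_threshold(warmup_confidences, keep_percent):
    """Confidence value that retains the top `keep_percent`% of warm-up traces.

    Later traces whose group confidence falls below this value would be
    stopped early.
    """
    ranked = sorted(warmup_confidences, reverse=True)
    n_keep = max(1, math.ceil(len(ranked) * keep_percent / 100))
    return ranked[n_keep - 1]

warmup = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
low_threshold = stopping_threshold(warmup, keep_percent=10)   # 10.0 (strict)
high_threshold = stopping_threshold(warmup, keep_percent=90)  # 2.0 (lenient)
```

The “low” setting yields a much stricter threshold than the “high” setting, which matches the trade‑off described above: more traces are cut early, so more tokens are saved, at a higher risk of discarding useful traces.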

Limitations and future work

Logprob‑based confidence can be miscalibrated for some models/domains; future work includes calibration strategies and studying how optimal windowing and tail statistics generalize across tasks [1].

References

[1] Deep Think with Confidence (DeepConf), arXiv:2508.15260 (v1), 21 Aug 2025.