Published on 23/05/2025
Large language models (LLMs) excel at complex tasks but often apply exhaustive reasoning to every problem. Thinkless allows models to decide when detailed reasoning is necessary, conserving resources on simpler tasks.
Using the control tokens <short> and <think>, Thinkless switches between concise and chain-of-thought responses, improving efficiency without sacrificing accuracy.
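To make the switching mechanism concrete, here is a minimal inference-time sketch: the policy emits either <short> or <think> as its first output token, and the rest of the generation follows that mode. The model name below is a placeholder, not the authors' released checkpoint.

```python
# Minimal sketch of inference-time mode selection, assuming a Thinkless-style
# model that emits <short> or <think> as its first generated token.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "your-org/thinkless-style-model"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

prompt = "What is 17 * 24?"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate; the first new token is the control token chosen by the policy.
output_ids = model.generate(**inputs, max_new_tokens=512)
new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
text = tokenizer.decode(new_tokens, skip_special_tokens=False)

if text.startswith("<think>"):
    print("Model chose chain-of-thought reasoning.")
else:
    print("Model chose a concise answer.")
print(text)
```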
The Decoupled Group Relative Policy Optimization (DeGRPO) algorithm separates mode selection from accuracy improvement, stabilizing training and preventing mode collapse.
The model first learns from expert teachers—one focused on detailed reasoning and another on concise replies—establishing a base for adaptive switching.
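A rough sketch of how this warm-up (distillation) data could be assembled: each question is paired with both teachers' outputs, and each target is prefixed with the matching control token so the student learns both modes. The field names and pairing scheme here are illustrative assumptions, not the paper's exact pipeline.

```python
# Sketch of building warm-up SFT examples from two teachers: one producing
# chain-of-thought solutions, one producing concise answers.
def build_warmup_examples(question, cot_solution, short_answer):
    """Return two training examples, each prefixed with its control token."""
    return [
        {"prompt": question, "target": "<think>" + cot_solution},
        {"prompt": question, "target": "<short>" + short_answer},
    ]

examples = build_warmup_examples(
    "What is 17 * 24?",
    "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408. The answer is 408.",
    "408",
)
```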
During reinforcement learning, the single control token and the many response tokens are weighted separately, so the mode-selection signal is not averaged away by the much longer response and the decision pathway stays intact.
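The sketch below illustrates this decoupling in simplified form: the policy-gradient loss is split into a mode-selection term (the one control token) and a response term, each with its own normalization and weight. It omits GRPO's clipping and KL regularization, and the weighting knob alpha is an assumed parameter, so treat it as a conceptual illustration rather than the authors' exact DeGRPO implementation.

```python
import torch

def degrpo_style_loss(logp_control, logp_response, advantage, alpha=1.0):
    """Simplified sketch of a decoupled policy-gradient loss.

    logp_control : log-prob of the single control token (<short>/<think>).
    logp_response: log-probs of the response tokens, shape (T,).
    advantage    : scalar group-relative advantage for this rollout.
    alpha        : relative weight of the mode-selection term (assumed knob).
    """
    # Mode-selection term: one token, given its own normalization so it is
    # not diluted by averaging over a long response.
    control_term = -alpha * advantage * logp_control
    # Response term: averaged over response length as usual.
    response_term = -advantage * logp_response.mean()
    return control_term + response_term
```

Separating the two terms is what keeps the rarely-updated control token from being overwhelmed, which is the failure mode that leads to collapsing into a single response style.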
Across datasets like MATH-500 and GSM8K, Thinkless reduces unnecessary long-form reasoning by 50%-90%, accelerating inference.
Training exhibits a "U-shaped" curve: initial reliance on detailed reasoning gradually shifts toward concise answers as the model learns task complexity.
Compared with models that always reason in full or rely on heuristic routing, Thinkless strikes a better balance between reasoning depth and computational cost.
Further gains could come from more advanced fine-tuning strategies and larger warm-up datasets to strengthen the model's starting point before reinforcement learning.
Thinkless marks a significant advance in adaptive reasoning for LLMs, reducing overhead while maintaining accuracy.