Published on 23/05/2025
Large language models (LLMs) excel at complex tasks but often apply exhaustive reasoning to every problem. Thinkless allows models to decide when detailed reasoning is necessary, conserving resources on simpler tasks.
Using the control tokens <short> and <think>, Thinkless switches between concise and chain-of-thought responses, improving efficiency without sacrificing accuracy.
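To make the switching mechanism concrete, here is a minimal inference-time sketch: the policy emits either <short> or <think> as its first output token, and the rest of the generation follows that mode. The model name below is a placeholder, not the authors' released checkpoint.

```python
# Minimal sketch of inference-time mode selection, assuming a Thinkless-style
# model that emits <short> or <think> as its first generated token.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "your-org/thinkless-style-model"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

prompt = "What is 17 * 24?"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate; the first new token is the control token chosen by the policy.
output_ids = model.generate(**inputs, max_new_tokens=512)
new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
text = tokenizer.decode(new_tokens, skip_special_tokens=False)

if text.startswith("<think>"):
    print("Model chose chain-of-thought reasoning.")
else:
    print("Model chose a concise answer.")
print(text)
```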
The Decoupled Group Relative Policy Optimization (DeGRPO) algorithm separates mode selection from accuracy improvement, stabilizing training and preventing mode collapse.
The model first learns from expert teachers—one focused on detailed reasoning and another on concise replies—establishing a base for adaptive switching.
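A rough sketch of how this warm-up (distillation) data could be assembled: each question is paired with both teachers' outputs, and each target is prefixed with the matching control token so the student learns both modes. The field names and pairing scheme here are illustrative assumptions, not the paper's exact pipeline.

```python
# Sketch of building warm-up SFT examples from two teachers: one producing
# chain-of-thought solutions, one producing concise answers.
def build_warmup_examples(question, cot_solution, short_answer):
    """Return two training examples, each prefixed with its control token."""
    return [
        {"prompt": question, "target": "<think>" + cot_solution},
        {"prompt": question, "target": "<short>" + short_answer},
    ]

examples = build_warmup_examples(
    "What is 17 * 24?",
    "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408. The answer is 408.",
    "408",
)
```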
During reinforcement learning, the single control token and the many response tokens are weighted separately, so the mode-selection signal is not averaged away by the much longer response and the decision pathway stays intact.
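The sketch below illustrates this decoupling in simplified form: the policy-gradient loss is split into a mode-selection term (the one control token) and a response term, each with its own normalization and weight. It omits GRPO's clipping and KL regularization, and the weighting knob alpha is an assumed parameter, so treat it as a conceptual illustration rather than the authors' exact DeGRPO implementation.

```python
import torch

def degrpo_style_loss(logp_control, logp_response, advantage, alpha=1.0):
    """Simplified sketch of a decoupled policy-gradient loss.

    logp_control : log-prob of the single control token (<short>/<think>).
    logp_response: log-probs of the response tokens, shape (T,).
    advantage    : scalar group-relative advantage for this rollout.
    alpha        : relative weight of the mode-selection term (assumed knob).
    """
    # Mode-selection term: one token, given its own normalization so it is
    # not diluted by averaging over a long response.
    control_term = -alpha * advantage * logp_control
    # Response term: averaged over response length as usual.
    response_term = -advantage * logp_response.mean()
    return control_term + response_term
```

Separating the two terms is what keeps the rarely-updated control token from being overwhelmed, which is the failure mode that leads to collapsing into a single response style.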
Across datasets like MATH-500 and GSM8K, Thinkless reduces unnecessary long-form reasoning by 50%-90%, accelerating inference.
Training exhibits a "U-shaped" curve: initial reliance on detailed reasoning gradually shifts toward concise answers as the model learns task complexity.
Compared with models that always reason in full or rely on heuristic routing, Thinkless strikes a better balance between reasoning depth and computational cost.
Further gains could come from more advanced fine-tuning strategies and larger warm-up datasets to strengthen the model's starting point before reinforcement learning.
Thinkless marks a significant advance in adaptive reasoning for LLMs, reducing overhead while maintaining accuracy.