Abstract
Long-term training of large language models (LLMs) requires maintaining stable exploration to prevent the model from collapsing into sub-optimal behaviors, and entropy is the key quantity for tracking this. However, existing RL methods struggle to hold entropy at an appropriate level: positive samples reduce it, while negative samples increase it. We propose EntroPIC, a novel method that uses Proportional-Integral (PI) control to adaptively adjust the loss coefficients of positive and negative samples. This stabilizes entropy throughout training, ensuring efficient exploration and steady progress.
Figure 1: EntroPIC uses PI control to dynamically adjust sample weights based on entropy error.
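As a sketch of how such a controller can be written down (the discrete-time form and the symbols $K_p$, $K_i$, $\alpha_0$ are our assumptions, not necessarily the paper's notation), the coefficient at step $t$ is driven by the entropy error between the target and the current policy entropy:

$$e_t = H_{\text{target}} - H_t, \qquad \alpha_t = \alpha_0 + K_p\, e_t + K_i \sum_{\tau \le t} e_\tau.$$

The proportional term reacts to the instantaneous error, while the integral term accumulates past error to remove steady-state drift in entropy.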
Methodology
01 High-Probability Tokens Matter
Not all tokens affect entropy equally. Our analysis shows that a token's impact depends on its probability and its advantage, so EntroPIC focuses control where it matters most (a minimal weighting sketch follows this list):
- Positive, high-probability tokens: standard RL reinforces them and drives entropy down too fast, so we reduce their weight to maintain exploration.
- Negative, low-probability tokens: we avoid suppressing these rare explorations to prevent policy degradation.
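The sketch below applies separate coefficients to positive-advantage and negative-advantage tokens in a policy-gradient loss. The names (alpha_pos, alpha_neg, prob_threshold) and the specific masking rule are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def weighted_pg_loss(logprobs, advantages, alpha_pos, alpha_neg, prob_threshold=0.5):
    """Policy-gradient surrogate with separate coefficients for positive and
    negative samples. A minimal sketch: alpha_pos down-weights high-probability
    tokens with positive advantage (which would otherwise shrink entropy
    quickly), while alpha_neg softens the penalty on low-probability tokens
    with negative advantage. Names and thresholding are illustrative."""
    probs = logprobs.exp()
    pos_mask = (advantages > 0).float()
    neg_mask = 1.0 - pos_mask

    # Down-weight confident (high-probability) positive-advantage tokens.
    pos_weight = torch.where(probs > prob_threshold,
                             torch.full_like(probs, alpha_pos),
                             torch.ones_like(probs))
    # Avoid suppressing rare (low-probability) negative-advantage tokens.
    neg_weight = torch.where(probs < prob_threshold,
                             torch.full_like(probs, alpha_neg),
                             torch.ones_like(probs))

    weights = pos_mask * pos_weight + neg_mask * neg_weight
    # Standard REINFORCE-style loss, re-weighted per token.
    return -(weights * advantages * logprobs).mean()
```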
02 Precise Control at Any Target
Unlike static coefficients, EntroPIC's PI controller dynamically adjusts $\alpha$ to lock entropy to any arbitrary target value.
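The sketch below turns the PI law above into code; the gains k_p and k_i, the initial value, and the clipping bounds are assumed for illustration and are not the paper's exact implementation.

```python
class EntropyPIController:
    """Discrete PI controller that nudges the loss coefficient alpha so that
    the policy entropy tracks a target value. A sketch under assumed names
    (k_p, k_i, alpha bounds), not EntroPIC's exact update rule."""

    def __init__(self, target_entropy, k_p=0.1, k_i=0.01,
                 alpha_init=1.0, alpha_min=0.0, alpha_max=2.0):
        self.target_entropy = target_entropy
        self.k_p, self.k_i = k_p, k_i
        self.alpha_init = alpha_init
        self.alpha_min, self.alpha_max = alpha_min, alpha_max
        self.error_integral = 0.0

    def update(self, current_entropy):
        # Proportional term reacts to the instantaneous entropy error;
        # the integral term removes steady-state offset over training.
        error = self.target_entropy - current_entropy
        self.error_integral += error
        alpha = (self.alpha_init
                 + self.k_p * error
                 + self.k_i * self.error_integral)
        return min(max(alpha, self.alpha_min), self.alpha_max)
```

At each training step the controller reads the batch entropy and returns an updated $\alpha$, which can then be plugged into a weighted loss such as the one sketched earlier.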
Figure: entropy convergence and the adaptive coefficient ($\alpha$) over training.
Experimental Results
Figure: entropy stability, training reward, and validation accuracy over training.
Main Performance Comparison
Tables: performance comparison across mathematical datasets, under off-policy training (overall) and high-temperature (T=1.0) settings.
Case Study: Reasoning Dynamics
This example shows how maintaining high entropy enables reflection and self-correction during reasoning.
Prompt
Let
$$f(x)=\frac{(x-18)(x-72)(x-98)(x-k)}{x}.$$
There exist exactly three positive real values of $k$ such that $f$ has a minimum at exactly two real values of $x$. Find the sum of these three values of $k$.
Citation
@misc{yang2025entropic,
  title={EntroPIC: Towards Stable Long-Term Training of LLMs via Entropy Stabilization with Proportional-Integral Control},
  author={Kai Yang and Xin Xu and Yangkun Chen and Weijie Liu and Jiafei Lyu and Zichuan Lin and Deheng Ye and Saiyong Yang},
  year={2025},
  eprint={2511.15248},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2511.15248},
}