Abstract
Long-term training of large language models (LLMs) requires maintaining stable exploration to prevent the model from collapsing into sub-optimal behaviors. Entropy is crucial in this context: it governs exploration and helps avoid premature convergence to sub-optimal solutions. However, existing reinforcement learning methods struggle to maintain an appropriate level of entropy, because training mixes positive and negative samples whose effects on entropy differ from step to step. To address this, we propose EntroPIC, a novel method that adaptively adjusts the influence of positive and negative samples by dynamically tuning their loss coefficients via Proportional-Integral (PI) control. This approach stabilizes entropy throughout training, ensuring efficient exploration and steady progress.
Figure 1: EntroPIC uses PI control to dynamically adjust sample weights based on entropy error.
Methodology
01 High-Probability Tokens Matter
Not all tokens affect entropy equally. Our analysis shows that a token's impact depends on its probability under the current policy and the sign of its advantage. EntroPIC focuses control where it matters most:
- Positive high-probability tokens: standard RL updates on these tokens drive entropy down too quickly, so we reduce their weight to maintain exploration.
- Negative low-probability tokens: we avoid suppressing these rare explorations, which would otherwise degrade the policy. A minimal sketch of this grouping follows the list.
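As a rough illustration of this grouping, the sketch below partitions tokens by advantage sign and a probability threshold `tau` (playing the same role as $\tau$ in the loss further down). The tensor names and the default threshold are illustrative assumptions, not the authors' implementation.

```python
import torch

def group_tokens(probs: torch.Tensor, advantages: torch.Tensor, tau: float = 0.5):
    """Partition tokens into the four probability/advantage groups.

    probs:      per-token probabilities under the current policy, shape (N,)
    advantages: per-token advantage estimates, shape (N,)
    tau:        threshold separating "high" from "low" probability tokens
    """
    high, pos = probs > tau, advantages > 0
    return {
        "pos_high": high & pos,     # drive entropy down fastest; EntroPIC down-weights these
        "pos_low":  ~high & pos,
        "neg_high": high & ~pos,
        "neg_low":  ~high & ~pos,   # rare explorations; EntroPIC avoids suppressing them
    }
```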
02 Precise Control at Any Target
Unlike fixed loss coefficients, EntroPIC's PI controller dynamically adjusts $\alpha$ to lock entropy onto any chosen target value.
Figures: entropy convergence to the target and the adaptive coefficient ($\alpha$) over training.
Core Algorithm Formulation
PI Controller
The adaptive coefficient $\alpha^t$ is updated based on the error between current entropy $\mathcal{H}^t$ and target $\mathcal{H}_{tar}$.
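A minimal sketch of a discrete, positional-form PI update for $\alpha^t$ is shown below. The gains `k_p` and `k_i`, the baseline `alpha_init`, and their default values are hypothetical placeholders; the paper's exact parameterization and any clipping are not reproduced here.

```python
class EntropyPIController:
    """Positional-form PI controller mapping the entropy tracking error
    e^t = H_tar - H^t to the adaptive coefficient alpha^t."""

    def __init__(self, h_target: float, k_p: float = 0.1, k_i: float = 0.01,
                 alpha_init: float = 1.0):
        self.h_target = h_target          # target entropy H_tar
        self.k_p, self.k_i = k_p, k_i     # proportional / integral gains (illustrative)
        self.alpha_init = alpha_init      # baseline weight when the error is zero
        self.err_integral = 0.0

    def update(self, h_current: float) -> float:
        err = self.h_target - h_current   # e^t
        self.err_integral += err          # running sum approximating the integral term
        # How alpha enters the loss (and hence the sign of the gains) is fixed by
        # the training objective; only the PI mechanics are illustrated here.
        return self.alpha_init + self.k_p * err + self.k_i * self.err_integral
```

With this interface, locking entropy onto a different level is just a matter of constructing the controller with another `h_target` and calling `update(current_entropy)` once per training step, which is what "Precise Control at Any Target" above refers to.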
EntroPIC Loss Function
We simplify the loss by applying the PI weight $\alpha$ only to high-probability tokens ($\pi_\theta > \tau$).
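The sketch below shows one way the PI weight could enter a token-level policy-gradient surrogate, with $\alpha$ applied only to tokens whose probability exceeds $\tau$. It assumes a plain REINFORCE-style objective for readability; the paper's actual surrogate (e.g. a clipped ratio) may differ.

```python
import torch

def entropic_pg_loss(logprobs: torch.Tensor,
                     advantages: torch.Tensor,
                     alpha: float,
                     tau: float = 0.5) -> torch.Tensor:
    """Token-level policy-gradient surrogate with the PI weight alpha
    applied only to high-probability tokens (pi_theta > tau).

    logprobs:   per-token log pi_theta, shape (N,), requires grad
    advantages: per-token advantage estimates, shape (N,)
    """
    probs = logprobs.detach().exp()                       # pi_theta, no gradient through the gate
    weights = torch.where(probs > tau,
                          torch.full_like(probs, alpha),  # high-prob tokens get the PI weight
                          torch.ones_like(probs))         # low-prob tokens keep weight 1
    # Vanilla REINFORCE-style surrogate; this only illustrates where alpha enters.
    return -(weights * advantages * logprobs).mean()
```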
Experimental Results
Figures: entropy stability, training reward, and validation accuracy during training.
Comprehensive Evaluation
Tables: EntroPIC compared with baselines on Qwen3-8B under the standard setting, off-policy training (overall), and high-temperature sampling (T=1.0).
Case Study: Reasoning Dynamics
This case study shows how maintaining high entropy enables reflection and self-correction during reasoning.
Citation
@article{yang2025entropic,
title={EntroPIC: Towards Stable Long-Term Training of LLMs via Entropy Stabilization with Proportional-Integral Control},
author={Yang, Kai and Xu, Xin and Chen, Yangkun and Liu, Weijie and Lyu, Jiafei and Lin, Zichuan and Ye, Deheng and Yang, Saiyong},
journal={arXiv preprint arXiv:2511.15248},
year={2025}
}