Abstract
Long-term training of large language models (LLMs) requires maintaining stable exploration to prevent the model from collapsing into sub-optimal behaviors, and entropy is the key quantity for tracking this. However, existing RL methods struggle to hold entropy at an appropriate level: positive samples reduce it, while negative samples increase it. We propose EntroPIC, a novel method that uses Proportional-Integral (PI) control to adaptively adjust the loss coefficients of positive and negative samples. This stabilizes entropy throughout training, ensuring efficient exploration and steady progress.
Figure 1: EntroPIC uses PI control to dynamically adjust sample weights based on entropy error.
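As a sketch of how such a controller can be written down (the discrete-time form and the symbols $K_p$, $K_i$, $\alpha_0$ are our assumptions, not necessarily the paper's notation), the coefficient at step $t$ is driven by the entropy error between the target and the current policy entropy:

$$e_t = H_{\text{target}} - H_t, \qquad \alpha_t = \alpha_0 + K_p\, e_t + K_i \sum_{\tau \le t} e_\tau.$$

The proportional term reacts to the instantaneous error, while the integral term accumulates past error to remove steady-state drift in entropy.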
Methodology
01 High-Probability Tokens Matter
Not all tokens affect entropy equally. Our analysis shows that a token's impact depends on its probability and its advantage, so EntroPIC focuses control where it matters most (a minimal weighting sketch follows this list):
- Positive, high-probability tokens: standard RL reinforces them and drives entropy down too fast, so we reduce their weight to maintain exploration.
- Negative, low-probability tokens: we avoid suppressing these rare explorations to prevent policy degradation.
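The sketch below applies separate coefficients to positive-advantage and negative-advantage tokens in a policy-gradient loss. The names (alpha_pos, alpha_neg, prob_threshold) and the specific masking rule are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def weighted_pg_loss(logprobs, advantages, alpha_pos, alpha_neg, prob_threshold=0.5):
    """Policy-gradient surrogate with separate coefficients for positive and
    negative samples. A minimal sketch: alpha_pos down-weights high-probability
    tokens with positive advantage (which would otherwise shrink entropy
    quickly), while alpha_neg softens the penalty on low-probability tokens
    with negative advantage. Names and thresholding are illustrative."""
    probs = logprobs.exp()
    pos_mask = (advantages > 0).float()
    neg_mask = 1.0 - pos_mask

    # Down-weight confident (high-probability) positive-advantage tokens.
    pos_weight = torch.where(probs > prob_threshold,
                             torch.full_like(probs, alpha_pos),
                             torch.ones_like(probs))
    # Avoid suppressing rare (low-probability) negative-advantage tokens.
    neg_weight = torch.where(probs < prob_threshold,
                             torch.full_like(probs, alpha_neg),
                             torch.ones_like(probs))

    weights = pos_mask * pos_weight + neg_mask * neg_weight
    # Standard REINFORCE-style loss, re-weighted per token.
    return -(weights * advantages * logprobs).mean()
```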
02 Precise Control at Any Target
Unlike static coefficients, EntroPIC's PI controller dynamically adjusts $\alpha$ to lock entropy to any arbitrary target value.
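The sketch below turns the PI law above into code; the gains k_p and k_i, the initial value, and the clipping bounds are assumed for illustration and are not the paper's exact implementation.

```python
class EntropyPIController:
    """Discrete PI controller that nudges the loss coefficient alpha so that
    the policy entropy tracks a target value. A sketch under assumed names
    (k_p, k_i, alpha bounds), not EntroPIC's exact update rule."""

    def __init__(self, target_entropy, k_p=0.1, k_i=0.01,
                 alpha_init=1.0, alpha_min=0.0, alpha_max=2.0):
        self.target_entropy = target_entropy
        self.k_p, self.k_i = k_p, k_i
        self.alpha_init = alpha_init
        self.alpha_min, self.alpha_max = alpha_min, alpha_max
        self.error_integral = 0.0

    def update(self, current_entropy):
        # Proportional term reacts to the instantaneous entropy error;
        # the integral term removes steady-state offset over training.
        error = self.target_entropy - current_entropy
        self.error_integral += error
        alpha = (self.alpha_init
                 + self.k_p * error
                 + self.k_i * self.error_integral)
        return min(max(alpha, self.alpha_min), self.alpha_max)
```

At each training step the controller reads the batch entropy and returns an updated $\alpha$, which can then be plugged into a weighted loss such as the one sketched earlier.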
Figure: entropy convergence and the adaptive coefficient ($\alpha$) over training.
Experimental Results
Figure: entropy stability, training reward, and validation accuracy over training.
Main Performance Comparison
Tables: performance comparison across mathematical datasets, under off-policy training (overall) and high-temperature (T=1.0) settings.
Case Study: Reasoning Dynamics
This example shows how maintaining high entropy enables reflection and self-correction during reasoning.
Prompt
Let
$$f(x)=\frac{(x-18)(x-72)(x-98)(x-k)}{x}.$$
There exist exactly three positive real values of $k$ such that $f$ has a minimum at exactly two real values of $x$. Find the sum of these three values of $k$.
Citation
@misc{yang2025entropic,
  title={EntroPIC: Towards Stable Long-Term Training of LLMs via Entropy Stabilization with Proportional-Integral Control},
  author={Kai Yang and Xin Xu and Yangkun Chen and Weijie Liu and Jiafei Lyu and Zichuan Lin and Deheng Ye and Saiyong Yang},
  year={2025},
  eprint={2511.15248},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2511.15248},
}