EntroPIC: Towards Stable Long-Term Training of LLMs via Entropy Stabilization with Proportional-Integral Control

Tencent Hunyuan, HKUST

Abstract

Long-term training of large language models (LLMs) requires maintaining stable exploration to prevent the model from collapsing into sub-optimal behaviors. Entropy is crucial in this context: it governs exploration and helps avoid premature convergence to sub-optimal solutions. However, existing reinforcement learning methods struggle to maintain an appropriate level of entropy, because training mixes positive and negative samples, each of which affects entropy differently at every step.

To address this, we propose EntroPIC, a novel method that adaptively adjusts the influence of positive and negative samples by dynamically tuning their loss coefficients via Proportional-Integral (PI) Control. This approach stabilizes entropy throughout training, ensuring efficient exploration and steady progress.
Figure 1: EntroPIC uses PI control to dynamically adjust sample weights based on entropy error.

Methodology

01 High-Probability Tokens Matter

Not all tokens affect entropy equally. Our analysis reveals distinct impacts based on token probability and advantage. EntroPIC focuses control where it matters most:

  • Positive high-probability tokens: under standard RL, reinforcing them drives entropy down too quickly, so we reduce their weight to maintain exploration.
  • Negative low-probability tokens: penalizing these rare exploratory tokens degrades the policy, so we avoid suppressing them (see the sketch below the figure).

Figure: Entropy quadrant analysis (token probability × advantage).
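
To make the quadrant view concrete, here is a minimal PyTorch sketch that buckets tokens by advantage sign and probability level; the function name, tensor shapes, and the default threshold $\tau = 0.5$ are illustrative assumptions, not values from the paper. The two buckets flagged in the comments are the ones EntroPIC reweights.

```python
import torch

def quadrant_masks(logprobs, advantages, tau=0.5):
    """Split tokens into the four (advantage sign x probability level) buckets.

    logprobs:   per-token log pi_theta(a|s), shape [T]
    advantages: per-token advantage A(s, a), shape [T]
    tau:        probability threshold separating high- from low-probability tokens
    """
    probs = logprobs.exp()
    high = probs > tau
    pos = advantages > 0
    return {
        "pos_high": pos & high,    # reinforcing these collapses entropy fastest
        "pos_low":  pos & ~high,
        "neg_high": ~pos & high,
        "neg_low":  ~pos & ~high,  # penalizing these suppresses rare explorations
    }
```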

02 Precise Control at Any Target

Unlike methods that use a static entropy coefficient, EntroPIC's PI controller dynamically adjusts $\alpha$ to lock entropy onto an arbitrary target value.

Figures: entropy convergence at different target values, and dynamics of the adaptive coefficient $\alpha$.

Core Algorithm Formulation

PI Controller

The adaptive coefficient $\alpha^t$ is updated from the error between the current entropy $\mathcal{H}^t$ and the target $\mathcal{H}_{\mathrm{tar}}$.

$$ \alpha^t = K_p(\mathcal{H}^t - \mathcal{H}_{\mathrm{tar}}) + K_i \sum_{k=1}^{t-1} (\mathcal{H}^k - \mathcal{H}_{\mathrm{tar}}) $$
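
A minimal Python sketch of this update (the class name, gain values, and the absence of any clipping or anti-windup are assumptions of this illustration, not details from the paper):

```python
class EntropyPIController:
    """Proportional-Integral controller mapping the entropy error
    H^t - H_tar to the loss coefficient alpha^t."""

    def __init__(self, k_p, k_i, h_target):
        self.k_p = k_p            # proportional gain K_p
        self.k_i = k_i            # integral gain K_i
        self.h_target = h_target  # target entropy H_tar
        self.integral = 0.0       # running sum of past errors, sum_{k<t} (H^k - H_tar)

    def update(self, h_current):
        error = h_current - self.h_target
        # alpha^t = K_p * (H^t - H_tar) + K_i * sum_{k=1}^{t-1} (H^k - H_tar)
        alpha = self.k_p * error + self.k_i * self.integral
        self.integral += error    # current error enters the integral at the next step
        return alpha
```

In a training loop this would be called once per step, e.g. `alpha = controller.update(batch_entropy)` with `batch_entropy` the mean token-level entropy of the current batch; a positive $\alpha$ then pulls entropy down through the loss below, while a negative $\alpha$ pushes it back up.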

EntroPIC Loss Function

We simplify the loss by applying the PI weight $\alpha$ only to high-probability tokens ($\pi_\theta > \tau$).

$$ \mathcal{L}(\theta) = \mathcal{L}_{\mathrm{GRPO}}(\theta) - \alpha \sum_{\pi_\theta(a|s) > \tau} |A(s,a)| \log \pi_\theta(a|s) $$
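
A hedged PyTorch-style sketch of the added term (the function name, the default threshold, and the surrounding training-step code are assumptions; only the formula itself comes from the definition above):

```python
import torch

def entropic_regularizer(logprobs, advantages, alpha, tau=0.5):
    """Extra term added to the GRPO loss:
        - alpha * sum_{pi_theta(a|s) > tau} |A(s, a)| * log pi_theta(a|s),
    applied only to high-probability tokens."""
    high_prob = logprobs.exp() > tau         # mask: pi_theta(a|s) > tau
    weighted = advantages.abs() * logprobs   # |A(s, a)| * log pi_theta(a|s)
    return -alpha * weighted[high_prob].sum()

# Assumed usage inside a training step, with grpo_loss computed elsewhere:
# loss = grpo_loss + entropic_regularizer(logprobs, advantages, alpha, tau)
# loss.backward()
```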

Experimental Results

Figures: entropy stability, training reward, and validation accuracy over the course of training.

Comprehensive Evaluation

Results: EntroPIC compared with baselines on Qwen3-8B under the standard setting, off-policy training (overall), and high-temperature sampling (T=1.0).

Case Study: Reasoning Dynamics

This case study shows how maintaining high entropy enables reflection and self-correction during reasoning.

Prompt: Let $f(x)=\frac{(x-18)(x-72)(x-98)(x-k)}{x}$. There exist exactly three positive real values of $k$ such that $f$ has a minimum at exactly two real values of $x$. Find the sum of these three values of $k$.

Citation

@article{yang2025entropic,
  title={EntroPIC: Towards Stable Long-Term Training of LLMs via Entropy Stabilization with Proportional-Integral Control},
  author={Yang, Kai and Xu, Xin and Chen, Yangkun and Liu, Weijie and Lyu, Jiafei and Lin, Zichuan and Ye, Deheng and Yang, Saiyong},
  journal={arXiv preprint arXiv:2511.15248},
  year={2025}
}