Making LLM Inference Deterministic

This article explains why LLM inference is non-deterministic and how to achieve reproducibility.

The article shows that LLM non-determinism stems from a lack of "batch invariance" in GPU kernels, not merely from concurrency: the reduction strategy a kernel chooses changes with server load (batch size), and because floating-point arithmetic is not associative, a different reduction order produces different results. It proposes implementing batch-invariant kernels to guarantee reproducibility, accepting a performance trade-off.
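A minimal Python sketch of the underlying mechanism: floating-point addition is not associative, so a kernel that splits a reduction differently at different batch sizes can return different results for identical inputs. The `chunked_sum` helper below is a hypothetical illustration, not code from the article.

```python
def chunked_sum(xs, chunk):
    """Sum `xs` in chunks of size `chunk`, then sum the partial results.

    This mimics how a GPU kernel may pick a different reduction split
    depending on batch size / server load (a hypothetical stand-in for
    the kernel behavior described in the article).
    """
    partials = [sum(xs[i:i + chunk]) for i in range(0, len(xs), chunk)]
    return sum(partials)

# Same inputs, different split strategies, different answers:
xs = [1e16, 0.1, -1e16, 0.1]
print(chunked_sum(xs, 4))  # 0.1 -- fully sequential: the large terms cancel before the final 0.1
print(chunked_sum(xs, 2))  # 0.0 -- each 0.1 is absorbed by a large partial and lost
```

A batch-invariant kernel fixes the reduction order regardless of batch size, so every request sees the same split and therefore the same bits, at the cost of not always using the fastest strategy for the current load.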