Bugs in LLM Training – Gradient Accumulation Fix (unsloth.ai)
imjonse 5 hours ago [-]
Same issue described on HF: https://huggingface.co/blog/gradient_accumulation

It also highlights a major disadvantage of the Transformers codebase's copy-paste approach to models: this fix has to be applied to every single model separately.

xcodevn 2 hours ago [-]
Look at it from a different point of view: this is a feature, not a bug. Without the fix, every example has equal weight, while with the fix, every token has equal weight.
oergiR 1 hour ago [-]
That makes it sound like it’s a choice, which it isn’t really. The way to look at it is from a probabilistic perspective: with the fix, you maximise the probability of the data. Without the fix, you fairly arbitrarily raise some probabilities to a power greater than one, and some to a power less than one.
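
To spell out the "raised to a power" point, here is a rough derivation under assumptions that are not in the comment itself: G accumulation steps, one sequence per step, sequence i having n_i target tokens, N the total token count, and p_{i,t} the probability the model assigns to token t of sequence i. The symbols are only for illustration.

```latex
% Assumed notation: N = \sum_{i=1}^{G} n_i, \qquad \bar{n} = N / G \ (\text{mean sequence length})

% With the fix: every token has weight 1/N, i.e. plain (scaled) log-likelihood of the data.
\mathcal{L}_{\text{fixed}}
  = -\frac{1}{N} \sum_{i=1}^{G} \sum_{t=1}^{n_i} \log p_{i,t}
  = -\frac{1}{N} \log \prod_{i=1}^{G} \prod_{t=1}^{n_i} p_{i,t}

% Without the fix: each token of sequence i has weight 1/(G n_i).
\mathcal{L}_{\text{naive}}
  = -\frac{1}{G} \sum_{i=1}^{G} \frac{1}{n_i} \sum_{t=1}^{n_i} \log p_{i,t}
  = -\frac{1}{N} \log \prod_{i=1}^{G} \prod_{t=1}^{n_i} p_{i,t}^{\,\bar{n}/n_i}
```

Under these assumptions the exponent \bar{n}/n_i is greater than one for sequences shorter than average and less than one for longer ones. For lengths [1, 100], \bar{n} = 50.5, so the exponents are 50.5 and 0.505.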
danielhanchen 2 hours ago [-]
Yes, you're correct, but in normal full batch training without gradient accumulation, all tokens are weighted equally. Standard grad accum does not weight them equally, so the "fix" finally makes grad accum and full batch training mathematically equivalent.
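
A minimal sketch of that equivalence, assuming each micro-batch is just a list of per-token losses (the function names and the loss values are made up for illustration and are not taken from the Unsloth or HF code):

```python
import math

def full_batch_loss(micro_batches):
    """Mean per-token loss over every token in the whole batch."""
    tokens = [loss for mb in micro_batches for loss in mb]
    return sum(tokens) / len(tokens)

def naive_accum_loss(micro_batches):
    """Naive grad accum: average the per-micro-batch means."""
    steps = len(micro_batches)
    return sum(sum(mb) / len(mb) for mb in micro_batches) / steps

def fixed_accum_loss(micro_batches):
    """Fixed grad accum: divide each micro-batch sum by the total token count."""
    total_tokens = sum(len(mb) for mb in micro_batches)
    return sum(sum(mb) / total_tokens for mb in micro_batches)

# Hypothetical per-token losses: one 1-token sequence and one 100-token sequence.
micro_batches = [[4.0], [1.0] * 100]

full = full_batch_loss(micro_batches)
naive = naive_accum_loss(micro_batches)
fixed = fixed_accum_loss(micro_batches)

print(f"full batch:  {full:.4f}")
print(f"naive accum: {naive:.4f}")
print(f"fixed accum: {fixed:.4f}")
```

In this toy setup the fixed version matches the full batch loss up to floating-point rounding, while the naive version is dominated by the single short sequence.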
danielhanchen 5 hours ago [-]
Oh hey! :) TLDR: naive gradient accumulation was over-weighting short sequences in LLM finetuning and training runs, and under-weighting long ones.

For example, with sequence lengths of [1, 100], every token would be scaled by 1/(100+1) in full batch training, but grad accum with 2 steps would weight the token in [1] by 1/1 * 1/2 = 1/2, and each token in [100] by 1/100 * 1/2 = 1/200. (The 1/2 is there because grad accum needs to divide by the # of grad accum steps.)
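
The arithmetic above can be checked with exact fractions. This is only a sketch, assuming one sequence per accumulation step; the helper names are hypothetical:

```python
from fractions import Fraction

def token_weights_full_batch(seq_lens):
    """Per-token weight for each sequence when the whole batch is one step."""
    total = sum(seq_lens)
    return [Fraction(1, total) for _ in seq_lens]

def token_weights_naive_accum(seq_lens):
    """Per-token weight under naive grad accum, one sequence per step."""
    steps = len(seq_lens)
    return [Fraction(1, n) * Fraction(1, steps) for n in seq_lens]

seq_lens = [1, 100]
full = token_weights_full_batch(seq_lens)
naive = token_weights_naive_accum(seq_lens)

for n, w_full, w_naive in zip(seq_lens, full, naive):
    # Ratio > 1 means naive accumulation over-weights tokens of this sequence.
    print(f"len={n}: full={w_full}, naive={w_naive}, ratio={w_naive / w_full}")
```

The ratio column shows how strongly each token is over- or under-weighted relative to full batch training.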
