Layer normalization is a technique used in transformer-based large language models (LLMs) to stabilize and accelerate training. For each token, it normalizes the activations across the feature dimension to zero mean and unit variance, then applies a learned scale and shift. As used in the original transformer paper ("Attention is All You Need," Vaswani et al., 2017) and described in NVIDIA’s NeMo documentation, this keeps the mean and variance of activations consistent from layer to layer, an effect often described as reducing internal covariate shift, and it mitigates issues like vanishing or exploding gradients in deep networks. This is particularly important in transformers, which stack many layers and process long sequences, making them prone to training instability. In the original transformer, normalization is applied after the residual connection around each attention and feed-forward sub-layer; many later models apply it before the sub-layer instead. In both cases it improves gradient flow and convergence. Option A is incorrect, as layer normalization does not reduce computational complexity; it adds a small overhead. Option C is false, as it adds only a scale and a shift vector per normalized layer, which is not a significant number of parameters. Option D is wrong, as layer normalization complements the attention mechanism rather than replacing it.
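The normalization described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used by any particular framework: the function name `layer_norm`, the `eps` value, and the toy shapes (4 tokens, `d_model = 8`) are all chosen for the example.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each row of x over its last (feature) axis, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)    # per-token mean over features
    var = x.var(axis=-1, keepdims=True)      # per-token variance over features
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance per token
    return gamma * x_hat + beta              # learned gain and bias

d_model = 8                                  # hypothetical feature size
rng = np.random.default_rng(0)
# Toy activations with a large offset and spread: 4 tokens, d_model features each.
x = rng.normal(loc=3.0, scale=10.0, size=(4, d_model))

gamma = np.ones(d_model)                     # gain, initialized to 1
beta = np.zeros(d_model)                     # bias, initialized to 0
y = layer_norm(x, gamma, beta)

# Only the gain and bias vectors are learned: 2 * d_model parameters.
num_params = gamma.size + beta.size
```

Whatever the scale of the incoming activations, each token's output has roughly zero mean and unit variance before the gain and bias are applied. The only learned parameters are the two vectors of length `d_model`, which is small next to the attention and feed-forward weight matrices, consistent with the point about Option C.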
References:
Vaswani, A., et al. (2017). "Attention is All You Need."
NVIDIA NeMo Documentation: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/intro.html