What are the main sources of computational bottlenecks when training large Transformer models?
The main sources of computational bottlenecks when training large Transformer models are the self-attention mechanism, the feed-forward networks, the output softmax over a large vocabulary, and data movement between devices.

The self-attention mechanism has a computational complexity of O(n^2 * d) per layer, where n is the sequence length and d is the model dimension. The cost grows quadratically with sequence length, making attention the dominant bottleneck for long sequences: computing attention scores between every pair of token positions becomes very expensive for long documents.

The feed-forward networks, applied independently at each position, also contribute substantially. They are linear in the sequence length, but the hidden layer is typically much wider than the model dimension (e.g., a model dimension of 512 with a hidden size of 2048). For short to moderate sequence lengths, the feed-forward layers can actually account for more compute than self-attention; as the sequence length grows, the quadratic attention term eventually dominates.

The output softmax layer is another major bottleneck. It computes a probability distribution over the entire vocabulary, which can contain tens of thousands or even millions of entries, at every position, so its cost scales with both the sequence length and the vocabulary size. Techniques such as hierarchical softmax, sampled softmax, or negative sampling can reduce this cost.

Finally, data movement between devices can become a bottleneck when using model parallelism or data parallelism. Communicating intermediate activations and gradients between devices is time-consuming, especially over slow interconnects, so careful optimization of the communication strategy is essential for good performance when training large Transformer models in parallel.
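To make the attention-versus-feed-forward trade-off concrete, here is a rough back-of-envelope FLOP estimate for one Transformer layer plus the output softmax. The formulas count each multiply-add as two FLOPs and ignore layer norms, biases, and the softmax normalization itself; the dimension names (`n`, `d`, `d_ff`, `vocab`) are illustrative, not from any particular library.

```python
def attention_flops(n: int, d: int) -> int:
    # QK^T score matrix and the attention-weighted sum of V:
    # each is an (n x d) @ (d x n) style product, ~2*n^2*d FLOPs apiece.
    # Plus four projection matmuls (Q, K, V, output), ~2*n*d^2 each.
    return 2 * (2 * n * n * d) + 4 * (2 * n * d * d)

def ffn_flops(n: int, d: int, d_ff: int) -> int:
    # Two matmuls per position: d -> d_ff, then d_ff -> d.
    return 2 * (2 * n * d * d_ff)

def output_softmax_flops(n: int, d: int, vocab: int) -> int:
    # One projection from d to vocab logits at every position.
    return 2 * n * d * vocab

d, d_ff, vocab = 512, 2048, 50_000
for n in (512, 2048, 8192):
    print(f"n={n:5d}  attention={attention_flops(n, d):.2e}  "
          f"ffn={ffn_flops(n, d, d_ff):.2e}  "
          f"softmax={output_softmax_flops(n, d, vocab):.2e}")
```

With these (approximate) constants, the feed-forward cost exceeds the attention cost at n = 512, while at n = 8192 the quadratic attention term takes over, which is exactly the crossover described above.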