The main computational bottlenecks when training large Transformer models are the self-attention mechanism, the feed-forward networks, and the large vocabulary size. Self-attention has a computational complexity of O(n^2) in the sequence length n, because an attention weight is computed between every pair of tokens. Its cost therefore grows quadratically: doubling the sequence length roughly quadruples the attention computation, which makes it a major bottleneck for long documents or paragraphs. The feed-forward networks, ....
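
The quadratic growth of self-attention described above can be illustrated with a minimal sketch. It assumes a single attention head with scaled dot-product scoring and uses plain Python lists for readability; the function names `attention_weights` and `score_matrix_entries` are hypothetical and chosen only for this illustration, not taken from any library.

```python
import math


def attention_weights(queries, keys):
    """Softmax-normalized scaled dot-product weights for one head.

    queries and keys are lists of n vectors of equal dimension d.
    The nested loop visits every (query, key) pair, so the work
    is proportional to n * n * d.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    weights = []
    for q in queries:
        # One score per key: n scores for each of the n queries.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        # Subtract the row maximum before exponentiating for numerical stability.
        row_max = max(scores)
        exps = [math.exp(s - row_max) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def score_matrix_entries(n):
    """Number of attention scores for a sequence of n tokens (illustrative helper)."""
    return n * n
```

Under these assumptions, a sequence of 1,024 tokens needs 1,048,576 scores per head, and doubling it to 2,048 tokens needs 4,194,304, four times as many.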