What is the primary advantage of using self-attention over recurrent neural networks for capturing long-range dependencies?
The primary advantage of self-attention over recurrent neural networks (RNNs) in capturing long-range dependencies is its ability to directly access and weigh the importance of *any* word in the input sequence when processing a given word.

RNNs, including LSTMs (Long Short-Term Memory) and GRUs (Gated Recurrent Units), process the input sequentially: they read one word at a time, maintaining a hidden state that summarizes the information seen so far. Capturing a dependency between distant words therefore requires the relevant information to survive many sequential steps, where it can be diluted or lost; during training, this shows up as the vanishing gradient problem, making it difficult for the network to "remember" early parts of the sequence when processing later parts.

Self-attention, by contrast, operates in parallel. For each word in the sequence, it computes an attention score with every other word, directly assessing their relevance. These attention scores are then used to weight each word's contribution to the representation of the current word. This direct access avoids the sequential processing bottleneck of RNNs: the path between any two positions is a single attention step, rather than a chain of recurrent updates whose length grows with their distance in the sequence.

For example, in the sentence "The cat, which chased the mouse, finally slept," self-attention can directly relate "cat" and "slept" without needing to propagate information through "which chased the mouse," enabling a more accurate representation of the sentence's meaning.
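The all-pairs scoring described above can be sketched as a single-head scaled dot-product self-attention layer. This is a minimal NumPy illustration, not a production implementation; the projection matrices `Wq`, `Wk`, `Wv` and the dimensions are arbitrary stand-ins chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    Every position attends to every other position in one step,
    so the distance between two words never matters.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    # scores[i, j] = relevance of word j to word i, computed for all pairs at once
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model = 7, 8  # stand-in sizes: 7 tokens, 8-dim embeddings
X = rng.normal(size=(seq_len, d_model))   # token embeddings (illustrative)
Wq = rng.normal(size=(d_model, d_model))
Wk = rng.normal(size=(d_model, d_model))
Wv = rng.normal(size=(d_model, d_model))

out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape)      # one contextualized vector per word
print(weights.shape)  # a direct score between every pair of words
```

Note that `weights` is a full `seq_len × seq_len` matrix: the entry linking the first and last tokens is computed in the same single step as the one linking adjacent tokens, which is exactly the direct access an RNN lacks.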