What is the "attention span" of a specific layer in a Transformer, and how can you measure it?
The "attention span" of a specific layer in a Transformer describes how far apart two words can be in the input sequence while the layer still effectively captures the dependency between them. In theory, the direct connections of self-attention give every layer a span equal to the entire sequence length; in practice, different layers and heads concentrate their attention weight over very different distances. Measuring the attention span is not straightforward, but several techniques can provide insight.

One method is to analyze the attention weights for specific input sequences. For a given word, we can identify the words the layer attends to most strongly, and use their distance from that word as an estimate of the span. For example, if a particular layer primarily attends to words within a window of 5 positions around the query word, we can say its attention span is approximately 5.

Another approach is to design probing tasks that assess the layer's ability to capture long-range dependencies. This could involve constructing sentences in which the relationship between two words depends on how far apart they are in the sequence; training the model on such a task and analyzing its performance lets us infer the attention span of different layers.

A more quantitative measure is the average distance between attended words. For each word, find the attended words (those with high attention weights) and compute their average distance from it; averaging this quantity over all words in the sequence gives an estimate of the layer's average attention span.
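The weight-based estimates described above can be sketched with NumPy. This is an illustrative implementation, not a standard metric: `attention_span` and its `threshold` parameter are hypothetical names, and the input is assumed to be a single head's row-stochastic attention matrix.

```python
import numpy as np

def attention_span(attn, threshold=None):
    """Estimate the attention span of one head from its attention matrix.

    attn: (seq_len, seq_len) row-stochastic matrix, where attn[i, j] is how
    much query position i attends to key position j.
    """
    seq_len = attn.shape[0]
    positions = np.arange(seq_len)
    dist = np.abs(positions[:, None] - positions[None, :])  # |i - j|
    if threshold is None:
        # Attention-weighted average distance per query, then mean over queries.
        return float((attn * dist).sum(axis=1).mean())
    # Alternative: average distance to "strongly attended" keys only.
    mask = attn >= threshold
    per_query = (dist * mask).sum(axis=1) / np.maximum(mask.sum(axis=1), 1)
    return float(per_query.mean())

# Sanity checks: a head that attends only to its own position has span 0;
# a uniform head on a length-n sequence has span (n^2 - 1) / (3n).
n = 16
local = np.eye(n)
uniform = np.full((n, n), 1.0 / n)
print(attention_span(local))    # → 0.0
print(attention_span(uniform))  # → 5.3125
```

The weighted variant (no threshold) avoids picking an arbitrary cutoff for what counts as "attended", while the thresholded variant matches the informal description of looking only at strongly attended words.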
It's important to note that the attention span can vary across attention heads within the same layer, as well as across layers of the network. Visualizing attention maps for different heads and layers can also provide qualitative insight into their respective spans: lower layers often focus on local dependencies (short attention spans), while higher layers tend to capture more global dependencies (longer attention spans).
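As a toy illustration of this variation, the sketch below simulates heads whose attention decays with distance at different rates (the "temperature" `tau` is a hypothetical knob for this simulation, not a real Transformer component) and shows that the weighted-distance measure recovers short spans for local heads and long spans for diffuse, near-global ones.

```python
import numpy as np

n = 32
positions = np.arange(n)
dist = np.abs(positions[:, None] - positions[None, :])  # |i - j| distance matrix

def span_of(attn):
    # Attention-weighted mean distance to attended positions, averaged over queries.
    return float((attn * dist).sum(axis=1).mean())

# Simulated heads: attention logits decay with distance. A small tau yields a
# sharply local head; a large tau yields an almost uniform (global) head.
for tau in (0.5, 2.0, 8.0):
    logits = -dist / tau
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    print(f"tau={tau}: measured span = {span_of(attn):.2f}")
```

In a real model one would apply `span_of` to each head's attention matrix at each layer (e.g. the per-layer attention tensors a framework exposes) and compare the resulting numbers, rather than simulating the heads.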