The "attention span" of a specific layer in a Transformer is the range of distances, measured in positions of the input sequence, over which that layer attends to dependencies between words. It indicates how far apart two words can be while the layer still captures the relationship between them. In theory, the attention span of the Transformer is the entire sequence length, because self-attention connects every position directly to every other. In practice, different layers and heads often exhibit different attention spans. Measuring the attention span is not straightforward, but several techniques can provide insight. One method involves analyzing attention weights for specific input sequen....
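As an illustration of analyzing attention weights, the sketch below computes a mean attention distance per head: the attention-weighted average of |i - j| between each query position i and key position j. This is one common proxy for attention span and is not necessarily the method the answer goes on to describe. The function name `mean_attention_distance` and the input layout `(num_heads, seq_len, seq_len)` are assumptions made for this example, and the two heads are synthetic rather than taken from a trained model.

```python
import numpy as np

def mean_attention_distance(attn):
    """Attention-weighted mean of |query_pos - key_pos| for each head.

    attn: array of shape (num_heads, seq_len, seq_len) where attn[h, i, j]
    is the weight query position i puts on key position j (rows sum to 1).
    This layout is an assumption of the sketch.
    """
    attn = np.asarray(attn, dtype=float)
    seq_len = attn.shape[-1]
    positions = np.arange(seq_len)
    # dist[i, j] = |i - j|, the distance between query i and key j
    dist = np.abs(positions[:, None] - positions[None, :])
    # Expected distance for each query position: shape (num_heads, seq_len)
    per_query = (attn * dist).sum(axis=-1)
    # Average over query positions: shape (num_heads,)
    return per_query.mean(axis=-1)

# Two synthetic heads over a 4-token sequence:
# head 0 attends only to the current position (purely local),
# head 1 spreads its attention uniformly over all positions.
local_head = np.eye(4)
uniform_head = np.full((4, 4), 0.25)
spans = mean_attention_distance(np.stack([local_head, uniform_head]))
print(spans)  # [0.   1.25]
```

A head with a small value attends mostly to nearby positions, while a larger value suggests longer-range attention. With a real model, the same function could be applied to the attention matrices of each layer, averaged over many input sequences, to compare layers and heads.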