Question

What specific limitation of standard positional encodings in Transformers makes them unable to naturally capture the relative distance between tokens without explicit geometric mapping?

Accepted Answer

Standard positional encodings in Transformers, such as the fixed sinusoidal functions used in the original architecture, are absolute rather than relative. They assign one static vector to each position index in a sequence, effectively labeling tokens as being at position one, position two, and so on. Because these vectors are added directly to the token embeddings before self-attention, position information enters the model as a property of each individual token, not as a relationship between pairs of tokens. During self-attention, the model computes the dot product between a query vector and a key vector to score how much one token should attend to another. With absolute encodings, the relationship between tokens at positions i and j is entangled with token content inside the summed vectors and then passed through learned query and key projections, so the attention score receives no explicit signal about the offset i - j. The model can in principle learn to recover distance from the interaction of absolute positions (sinusoidal encodings were designed so that the encoding at a fixed offset is a linear function of the encoding at the current position), but it has to learn this from data; relative distance is not a built-in property of the representation. As a result, translation invariance is not guaranteed: the relationship between two words should ideally be the same regardless of where they appear in a sentence, yet nothing in the architecture enforces that. Without an explicit geometric mapping or a relative positional bias, the model handles the pair of indices 1 and 2 and the pair 101 and 102 as computations on unrelated absolute coordinates rather than as two instances of the same offset of one.
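The point about indices 1 and 2 versus 101 and 102 can be illustrated with a small NumPy sketch. It isolates only the position-to-position part of the attention score and ignores token content. The matrices W_q and W_k are random stand-ins for learned projections, and rel_bias is a hypothetical lookup table in the spirit of a relative positional bias; none of these names come from a specific library or model.

```python
import numpy as np

def sinusoidal_encoding(num_positions, d_model):
    """Fixed sinusoidal encodings in the style of the original Transformer."""
    positions = np.arange(num_positions)[:, None]                  # (N, 1)
    freqs = 1.0 / (10000 ** (np.arange(0, d_model, 2) / d_model))  # (d_model/2,)
    angles = positions * freqs                                     # (N, d_model/2)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

d_model = 16
pe = sinusoidal_encoding(200, d_model)

# Raw dot products of sinusoidal encodings depend only on the offset,
# because sin(a)sin(b) + cos(a)cos(b) = cos(a - b) for each frequency.
raw_near = pe[1] @ pe[2]
raw_far = pe[101] @ pe[102]

# Assumption: random matrices standing in for learned query/key projections.
rng = np.random.default_rng(0)
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))

def position_logit(i, j):
    """Position-only part of an attention logit after the projections."""
    return (pe[i] @ W_q) @ (pe[j] @ W_k)

proj_near = position_logit(1, 2)
proj_far = position_logit(101, 102)

# Hypothetical relative bias: one scalar per clipped offset j - i.
max_offset = 4
rel_bias = rng.normal(size=2 * max_offset + 1)

def relative_logit(i, j):
    """Bias looked up by offset only, so it is shift-invariant by construction."""
    offset = int(np.clip(j - i, -max_offset, max_offset))
    return rel_bias[offset + max_offset]

print("raw sinusoidal dot products equal:", np.isclose(raw_near, raw_far))
print("projected logits equal:           ", np.isclose(proj_near, proj_far))
print("relative bias equal:              ", relative_logit(1, 2) == relative_logit(101, 102))
```

In this sketch the unprojected sinusoidal dot product is the same for both pairs, but once arbitrary projections are applied the two pairs generally score differently, so any offset-only behavior would have to be learned. The relative bias gives both pairs the same value without any learning, which is the explicit signal that absolute encodings lack.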
