Standard positional encodings in Transformers, such as the fixed sinusoidal functions used in the original architecture, are limited by being absolute rather than relative. They assign a unique, static vector to every position index in a sequence, effectively labeling tokens as being at position one, position two, and so on. Because these vectors are added directly to token embeddings before the self-attention mechanism, the model lear....
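
To make the "absolute" part concrete, here is a minimal sketch of fixed sinusoidal encodings being added to token embeddings. It assumes the usual sine/cosine formulation with a base of 10000 and 0-based position indices; the function names `sinusoidal_encoding` and `add_positions` are illustrative and not taken from any library.

```python
import math


def sinusoidal_encoding(position: int, d_model: int) -> list[float]:
    """Return the fixed sinusoidal vector for one absolute position index."""
    vec = []
    for i in range(d_model):
        # Dimensions 2k and 2k+1 share one frequency (assumed base: 10000).
        freq = 1.0 / (10000 ** ((i // 2 * 2) / d_model))
        angle = position * freq
        # Even dimensions use sine, odd dimensions use cosine.
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


def add_positions(token_embeddings: list[list[float]]) -> list[list[float]]:
    """Add the encoding for each index to the embedding at that index."""
    d_model = len(token_embeddings[0])
    return [
        [x + p for x, p in zip(emb, sinusoidal_encoding(pos, d_model))]
        for pos, emb in enumerate(token_embeddings)
    ]


# The same token embedding placed at two different positions.
same_token = [0.5, 0.5, 0.5, 0.5]
out = add_positions([same_token, same_token])
print(out[0])
print(out[1])
```

The two printed vectors differ even though the token is identical, because each one carries the static vector tied to its own index. Nothing in the input states how far apart the two tokens are; that is the sense in which the scheme is absolute rather than relative.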