Describe the mathematical process of calculating attention weights in self-attention.
The mathematical process of calculating attention weights in self-attention involves three key steps: computing the similarity between each word's query and every word's key (including its own), scaling the resulting similarity scores, and applying a softmax function to obtain a probability distribution of attention weights.

First, each word in the input sequence is transformed into three vectors: a query (q), a key (k), and a value (v). These vectors are obtained by multiplying the word's embedding by three learned weight matrices. Intuitively, the query represents what the word is "looking for," the key represents what the word "offers," and the value carries the actual information content of the word.

To calculate the attention weights for a given word i with respect to each word j in the sequence, the dot product of the query vector of word i (q_i) and the key vector of word j (k_j) is computed: score_{i,j} = q_i · k_j. This dot product measures the similarity, or compatibility, between the query and key vectors.

The similarity scores are then scaled down by √d_k, the square root of the dimension of the key vectors. Without this scaling, the dot products can grow large in magnitude, pushing the softmax into regions where gradients become very small during training.

The scaled scores are passed through a softmax over j: attention_{i,j} = softmax_j(score_{i,j} / √d_k). The softmax normalizes the scores into a probability distribution, ensuring that the attention weights for each word sum to 1. Each weight attention_{i,j} indicates how much attention word i should pay to word j when computing its representation.

Finally, these attention weights are used to compute a weighted sum of the value vectors of all the words in the sequence.
The resulting vector represents the attention-weighted representation of word i, incorporating information from all other words in the sequence based on their relevance as determined by the attention weights.
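The steps above can be sketched in NumPy. This is a minimal illustration, not a production implementation: the sequence length, embedding dimension, and the random projection matrices W_q, W_k, W_v are arbitrary toy choices made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence of embeddings X."""
    Q = X @ W_q                         # queries, shape (seq_len, d_k)
    K = X @ W_k                         # keys,    shape (seq_len, d_k)
    V = X @ W_v                         # values,  shape (seq_len, d_v)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # score_{i,j} = q_i · k_j / sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row is a probability distribution
    return weights @ V, weights         # weighted sum of values, plus the weights

# Toy example: 3 "words" with embedding dimension 4, d_k = d_v = 4 (all assumed).
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
W_q, W_k, W_v = (rng.normal(size=(4, 4)) for _ in range(3))
out, weights = self_attention(X, W_q, W_k, W_v)
```

Here `weights[i, j]` is attention_{i,j}, each row of `weights` sums to 1, and row i of `out` is the attention-weighted representation of word i described above.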