How does multi-head attention improve upon single self-attention within the Transformer architecture?
Multi-head attention improves on a single self-attention pass by letting the model attend to several parts of the input sequence, and several kinds of relationships between words, at the same time. At its core, self-attention computes how strongly each word in a sequence relates to every other word and uses those weights to build a weighted representation of the sequence. With a single attention pass, this computation happens only once, so the model can effectively learn only one kind of relationship.

Multi-head attention instead runs the self-attention mechanism several times in parallel, each time through a different set of learned linear projections of the input. These projections map the input into distinct representation spaces, so each "head" can focus on a different aspect of how the words relate: one head might track syntactic relationships (e.g., subject-verb agreement) while another tracks semantic ones (e.g., word meaning in context). The outputs of all heads are then concatenated and passed through a final linear transformation to produce the layer's output.

This parallel, diverse representation learning lets the model build a richer and more nuanced view of the input than single self-attention can, which translates into better performance on tasks such as machine translation and text understanding. And because the information is spread across several kinds of relationships, the model is not overly sensitive to any single one, so the learned representations tend to be more robust and generalizable.
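To make the "project, attend per head, concatenate, project again" flow concrete, here is a minimal sketch in PyTorch. It is an illustrative implementation under common assumptions, not the reference code of any particular library; the class name `MultiHeadAttention` and the hyperparameter names `d_model` and `num_heads` are chosen for this example.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # Learned linear projections for queries, keys, and values; these give
        # every head its own representation space of the input.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        # Final linear layer applied to the concatenated head outputs.
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        batch, seq_len, _ = x.shape

        def split_heads(t: torch.Tensor) -> torch.Tensor:
            # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_head)
            return t.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)

        q = split_heads(self.w_q(x))
        k = split_heads(self.w_k(x))
        v = split_heads(self.w_v(x))

        # Scaled dot-product attention, computed independently for each head.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        weights = scores.softmax(dim=-1)   # how much each position attends to every other
        context = weights @ v              # weighted representation per head

        # Concatenate the heads back into d_model and apply the output projection.
        context = context.transpose(1, 2).contiguous().view(batch, seq_len, -1)
        return self.w_o(context)

# Usage: 8 heads over a 512-dimensional model, the sizes used in the original Transformer.
attn = MultiHeadAttention(d_model=512, num_heads=8)
out = attn(torch.randn(2, 10, 512))   # (batch=2, seq_len=10, d_model=512) -> same shape
```

Each head here sees only a 64-dimensional slice of the projected input, so the total computation is comparable to one full-width attention pass while still allowing the heads to specialize in different relationship types.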