What is the purpose of future (causal) masking in the decoder during training?
The purpose of future masking, also known as causal masking, in the decoder during training is to prevent the decoder from "cheating" by attending to future tokens in the output sequence when predicting the current one. During training, the decoder receives the entire target sequence as input at once, but the prediction at each position must depend only on the tokens that precede it. This mimics the autoregressive nature of sequence generation, where the model produces the output one token at a time, conditioned on the previously generated tokens.

Without future masking, the decoder could simply "look ahead" at future tokens and use them to predict the current token. This would yield unrealistically good performance during training but poor performance during inference, when the future tokens are not available.

Future masking is typically implemented as a triangular mask that blocks attention to positions after the current token. The mask is applied to the attention scores before the softmax, setting the scores for future positions to a very large negative value (effectively negative infinity), in the same way padding masks are applied. As a result, the attention weights for future tokens are effectively zero, so they contribute nothing to the weighted sum of value vectors.

By using future masking during training, the decoder learns to generate the output sequence autoregressively, which is consistent with how it is used during inference, leading to better generalization and performance.
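The mechanism described above can be sketched in a few lines of NumPy. This is a minimal, illustrative implementation (single head, no batching; the function names `causal_mask` and `masked_attention` are my own, not from any library): an upper-triangular boolean mask marks future positions, their scores are set to a large negative value before the softmax, and the resulting attention weights for those positions collapse to zero.

```python
import numpy as np

def causal_mask(seq_len):
    # True above the main diagonal, i.e. at positions (i, j) with j > i,
    # which are exactly the "future" tokens each query must not see.
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

def masked_attention(q, k, v):
    # Scaled dot-product attention with causal masking.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    # Large negative value stands in for -infinity before the softmax.
    scores[causal_mask(q.shape[0])] = -1e9
    # Numerically stable softmax over each row.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
q = k = v = rng.standard_normal((4, 8))
out, w = masked_attention(q, k, v)
# Row i of w is nonzero only for columns j <= i;
# the first token can attend only to itself.
```

Inspecting `w` confirms the effect: each row sums to 1, every entry above the diagonal is (numerically) zero, and `w[0, 0]` is 1 because position 0 has no past to attend to besides itself.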