Question

What is the primary architectural difference between a standard Transformer encoder and a decoder regarding the constraints placed on the self-attention mask?

Accepted Answer

The primary difference is that encoder self-attention is bidirectional (no causal mask), while decoder self-attention is causal (masked). In a Transformer encoder, every token can attend to every other token in the sequence, so the representation of each word draws on the words both before and after it; in practice the only mask typically applied is a padding mask. This gives the encoder a global view of the entire input. In contrast, the decoder applies a causal mask so that each token can attend only to itself and to the tokens before it. This constraint is essential for autoregressive generation, where the model must predict the next token without access to later ones. It matters most during training: the full target sequence is fed to the decoder at once, so without the mask each position could simply read off the token it is supposed to predict. Consequently, the encoder is designed for full-context extraction, while the decoder preserves the left-to-right ordering required to generate a sequence one token at a time.
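The contrast can be made concrete with a minimal NumPy sketch. It assumes a single attention head and uses random numbers in place of real query-key scores; the function name `attention_weights` is illustrative and not taken from any library. The encoder case applies softmax to the raw scores, while the decoder case first sets every future position (column index greater than row index) to negative infinity.

```python
import numpy as np

def attention_weights(scores, causal):
    """Row-wise softmax over raw attention scores (hypothetical helper).

    scores: (n, n) array, entry [i, j] is how strongly token i scores token j.
    causal: if True, token i may only attend to tokens j <= i.
    """
    n = scores.shape[-1]
    if causal:
        # True strictly above the diagonal marks future positions (j > i).
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        # -inf becomes exactly 0 after the softmax below.
        scores = np.where(future, -np.inf, scores)
    # Subtract the row maximum for numerical stability; the diagonal is
    # never masked, so every row keeps at least one finite entry.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

# Random scores stand in for Q @ K.T / sqrt(d) in this sketch.
rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))

encoder_weights = attention_weights(scores, causal=False)  # bidirectional
decoder_weights = attention_weights(scores, causal=True)   # causal

print(np.round(encoder_weights, 2))
print(np.round(decoder_weights, 2))
```

In the encoder matrix every entry is positive, so each token draws on the whole sequence. In the decoder matrix everything above the diagonal is zero, and the first token can attend only to itself.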
