The primary architectural difference is that the encoder uses a bidirectional self-attention mask, while the decoder uses a causal (masked) self-attention mask. In a Transformer encoder, every token in a sequence can attend to every other token, meaning the model can look at both the words appearing before and after a s....
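
The difference between the two masks can be sketched in a few lines of plain Python. This is a minimal illustration under simplifying assumptions (a single attention row, no batching or multiple heads), not any library's implementation; the helper names `bidirectional_mask`, `causal_mask`, `visible_positions`, and `masked_softmax` are hypothetical and chosen only for this example.

```python
import math


def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[True] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def visible_positions(mask, i):
    """Indices that query position i is allowed to attend to."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


def masked_softmax(scores, allowed):
    """Softmax over one row of scores; disallowed positions get weight 0."""
    kept = [s for s, a in zip(scores, allowed) if a]
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if a else 0.0 for s, a in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    n = 4
    # Token at index 1 in a 4-token sequence:
    print(visible_positions(bidirectional_mask(n), 1))  # encoder view
    print(visible_positions(causal_mask(n), 1))         # decoder view
    # With equal raw scores, the causal row spreads weight only over the past.
    print(masked_softmax([0.0] * n, causal_mask(n)[1]))
```

With the bidirectional mask, the token at index 1 sees all four positions; with the causal mask it sees only indices 0 and 1, so the attention weights for later positions are exactly zero.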