
When a text model sees a word it has never seen before during training, what special token is often used to represent this unknown word?



When a text model encounters a word it has never seen before during its training phase, this word is considered an out-of-vocabulary (OOV) word. To represent such unknown words, a special token is often used, commonly written as `[UNK]` or `<unk>`. The purpose of this unknown token is to provide a standardized numerical representation for any word that does not exist within the model's pre-defined vocabulary.

A vocabulary is the finite set of unique words that the model has learned or been exposed to during training, typically built from the most frequent words in the training data. When an OOV word is encountered, the model maps it to the `[UNK]` token, effectively treating all unknown words as a single, generic category. For instance, if a model's vocabulary only contains 'apple', 'banana', and 'orange', and it encounters the word 'grapefruit', 'grapefruit' would be replaced by the `[UNK]` token.
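The mapping described above can be sketched in a few lines of Python. This is a minimal illustration, not tied to any particular tokenizer library; the vocabulary, the token IDs, and the `encode` helper are all made up for the example:

```python
# A toy vocabulary mapping words to integer IDs.
# The special [UNK] token gets its own ID and serves as the
# fallback for any out-of-vocabulary (OOV) word.
UNK = "[UNK]"
vocab = {UNK: 0, "apple": 1, "banana": 2, "orange": 3}

def encode(words):
    """Return the vocabulary ID for each word, substituting
    the [UNK] ID for words not in the vocabulary."""
    return [vocab.get(word, vocab[UNK]) for word in words]

# 'grapefruit' is OOV, so it maps to the [UNK] token's ID (0).
print(encode(["apple", "grapefruit", "orange"]))  # [1, 0, 3]
```

Note that every OOV word collapses to the same ID, so the model cannot distinguish one unknown word from another; this information loss is one reason modern tokenizers try to keep vocabularies that cover the input well.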