In Reinforcement Learning from Human Feedback, the reward model acts as a proxy for human judgment to automate the evaluation of model outputs. During the training of a language model, it is difficult to mathematically define what makes a response good because human preferences are nuanced. To solve this, researchers collect a dataset of responses ranked by hu....
Log in to view the answer