
In Reinforcement Learning from Human Feedback (RLHF), what is the function of the reward model that is trained before the final policy optimization step?



In Reinforcement Learning from Human Feedback, the reward model acts as a proxy for human judgment to automate the evaluation of model outputs. During the training of a language model, it is difficult to mathematically define what makes a response good because human preferences are nuanced. To solve this, researchers collect a dataset of responses ranked by hu....
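To make the idea of learning from ranked responses concrete, here is a minimal sketch of a pairwise preference loss of the Bradley-Terry form, which is commonly used to train reward models in RLHF. The truncated answer above does not say which objective it has in mind, so treat this as an illustrative assumption. The function name `pairwise_preference_loss` and the scalar inputs are hypothetical; in practice the two rewards would come from a neural network scoring a preferred and a rejected response to the same prompt.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)).

    The loss is small when the reward model scores the human-preferred
    response higher than the rejected one, and large when it does not.
    (Illustrative helper, not taken from the source text.)
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Agreeing with the human ranking gives a lower loss than disagreeing.
agree = pairwise_preference_loss(reward_chosen=2.0, reward_rejected=0.0)
disagree = pairwise_preference_loss(reward_chosen=0.0, reward_rejected=2.0)
print(agree < disagree)
```

Minimizing this quantity over many human-labeled pairs pushes the model to assign higher scores to preferred responses, so its scalar output can later stand in for human judgment during policy optimization.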



