What is RLHF (Reinforcement Learning from Human Feedback)?

Without RLHF, an LLM simply completes text: give it 'the capital of France is' and it returns 'Paris.' A chat-style assistant needs more than that: a model that follows instructions, refuses harmful requests, and produces helpful responses. RLHF is the standard way to get there. Human raters compare pairs of model outputs and mark one as 'better'; a reward model is trained to predict those preferences; and the base model is then fine-tuned with reinforcement learning to maximize the reward model's score. Successors such as DPO use the same preference pairs but optimize the model on them directly, without a separate reward model or RL step. Much of ChatGPT's improvement over raw GPT-3.5 came from RLHF. The technique is also how models are 'aligned' to refuse harmful requests, though alignment via RLHF has known failure modes, such as sycophancy and over-refusal.
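
To make the reward-model step concrete, here is a minimal sketch of the pairwise loss commonly used to learn from 'A is better than B' labels. It assumes a Bradley-Terry style formulation, which the text above does not specify, and the function names are hypothetical. A real reward model would be a neural network producing the two scores; here they are plain numbers.

```python
import math


def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Modeled probability that a rater prefers the 'chosen' output.

    Assumption: Bradley-Terry form, P = sigmoid(r_chosen - r_rejected).
    """
    margin = reward_chosen - reward_rejected
    # Two branches keep math.exp from overflowing on large margins.
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    e = math.exp(margin)
    return e / (1.0 + e)


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the human label: -log(sigmoid(margin))."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); computed stably per sign of m.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # The loss is small when the reward model agrees with the rater
    # and large when it ranks the rejected output higher.
    print(pairwise_preference_loss(2.0, 0.0))
    print(pairwise_preference_loss(0.0, 2.0))
```

Minimizing this loss over many labeled pairs pushes the reward model to score preferred outputs higher; the reinforcement learning stage then tunes the language model against that score.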

Related terms