Without RLHF, a base LLM simply continues text: give it 'the capital of France is' and it returns 'Paris.' A chat assistant, however, needs a model that follows instructions, refuses harmful requests, and produces helpful responses. RLHF is the standard way to get there: human raters compare pairs of model outputs and mark one as 'better,' a reward model is trained to predict those preferences, and the base model is then fine-tuned with reinforcement learning to maximize the reward model's score. Successors such as DPO use the same preference pairs but optimize the model on them directly, without a separate reward model or reinforcement learning step. Much of ChatGPT's improvement over the underlying GPT-3.5 base model is attributed to RLHF. The same technique is used to 'align' models so that they refuse harmful requests, although alignment via RLHF has known failure modes, such as sycophancy (telling users what they want to hear) and over-refusal (declining harmless requests).
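The reward-model step described above can be illustrated with a small sketch. It is a toy, not how any production system is built: it assumes each response has already been reduced to a two-number feature vector (a hypothetical `[helpfulness, harmfulness]` score), uses a linear reward instead of a neural network, and trains with the pairwise logistic (Bradley-Terry style) loss commonly used for preference data. All names (`TOY_PAIRS`, `train_reward_model`, and so on) are invented for illustration, and the reinforcement learning stage that follows is not shown.

```python
import math

# Hypothetical toy data: each pair is (features of the response the rater
# preferred, features of the response the rater rejected).
# Features are assumed to be [helpfulness, harmfulness], both in 0..1.
TOY_PAIRS = [
    ([0.9, 0.0], [0.2, 0.0]),  # more helpful answer preferred
    ([0.5, 0.0], [0.6, 0.9]),  # harmless answer preferred over a harmful one
    ([0.8, 0.1], [0.3, 0.7]),  # more helpful and less harmful answer preferred
]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def reward(weights, features) -> float:
    """Linear stand-in for a reward model: a weighted sum of the features."""
    return sum(w * f for w, f in zip(weights, features))


def pairwise_loss(weights, chosen, rejected) -> float:
    """Loss is low when the chosen response scores higher than the rejected one."""
    margin = reward(weights, chosen) - reward(weights, rejected)
    return -math.log(sigmoid(margin))


def train_reward_model(pairs, n_features, lr=0.5, epochs=200):
    """Fit the weights with plain stochastic gradient descent on the pairwise loss."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # d(loss)/d(margin) = -(1 - sigmoid(margin))
            grad_scale = -(1.0 - sigmoid(margin))
            for i in range(n_features):
                weights[i] -= lr * grad_scale * (chosen[i] - rejected[i])
    return weights


if __name__ == "__main__":
    learned = train_reward_model(TOY_PAIRS, n_features=2)
    print("learned weights:", learned)
    for chosen, rejected in TOY_PAIRS:
        print(reward(learned, chosen), ">", reward(learned, rejected))
```

In this sketch the learned weights end up rewarding helpfulness and penalizing harmfulness purely because of which response the raters picked in each pair; nobody wrote those rules down explicitly. In a full RLHF pipeline, the language model would then be fine-tuned to produce responses that score highly under such a reward model.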
GLOSSARY
What is RLHF (Reinforcement Learning from Human Feedback)?
The technique that turns a base LLM into a useful assistant — by having humans rate model responses and using that feedback to fine-tune behavior.
RELATED TERMS
Fine-tuning
Taking a pre-trained AI model and continuing to train it on your specific data so it specializes for your use case (medical, legal, customer support style, etc.).
AI Alignment
The discipline of ensuring AI systems behave in ways that match human values and intentions — both in safety (don't cause harm) and in usefulness.
LLM (Large Language Model)
An AI system trained on massive text datasets to predict and generate human-like text — the technology behind ChatGPT, Claude, Gemini, and most modern AI chatbots.