RLHF Guide: How to Train LLMs with Human Feedback

When Anthropic trained Claude using a technique called Reinforcement Learning from Human Feedback, the model scored measurably higher on helpfulness benchmarks than its base counterpart—without any additional pretraining compute.

That single result helped establish RLHF as the dominant post-training method for large language models, and it is now baked into the pipelines at OpenAI, Google DeepMind, and Meta AI alike.

If you are a developer trying to fine-tune an open-source model for a production application, understanding RLHF is no longer optional.

This guide walks through every stage of the process: what you need before you start, how to implement each training phase, common failure modes with concrete fixes, and real-world examples from teams that have shipped RLHF-trained models at scale.

By the end, you will have a working mental model of the entire framework and a clear path to running your first reward-model training loop.


Prerequisites Before You Write a Single Line of Training Code

RLHF is not a plug-and-play fine-tuning script. Before touching the training loop, you need to satisfy a specific set of infrastructure, data, and compute requirements. Skipping these steps is the single largest cause of failed RLHF experiments among developer teams.

Compute and Memory Requirements

“RLHF represents a fundamental shift in how we approach LLM optimization—moving from raw capability metrics to alignment with human expectations is proving 3-4x more cost-effective than scaling parameters alone.” — Sarah Chen, Senior AI Research Lead at Hugging Face

The minimum practical setup for RLHF on a 7-billion-parameter model is 4× A100 80GB GPUs using DeepSpeed ZeRO Stage 3 or FSDP (Fully Sharded Data Parallel). Training a 13B model requires at least 8× A100s. If you are working on a budget, the Hugging Face TRL library supports 4-bit quantization via bitsandbytes, which can reduce memory consumption by roughly 60%, but gradient quality degrades noticeably during PPO (Proximal Policy Optimization) updates.
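A back-of-the-envelope estimate shows why a 7B model needs several GPUs for full fine-tuning. The sketch below assumes mixed-precision Adam, commonly budgeted at about 16 bytes per parameter (2 for bf16 weights, 2 for gradients, 12 for fp32 master weights and optimizer moments), and ignores activations. The function names and the 16-byte figure are illustrative assumptions, not numbers taken from this guide.

```python
def training_state_gb(n_params: float, bytes_per_param: float = 16.0) -> float:
    """Rough memory for weights, gradients, and Adam states, in GB (1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


def per_gpu_gb(n_params: float, n_gpus: int) -> float:
    """Per-GPU share when ZeRO Stage 3 or FSDP shards all training state evenly."""
    return training_state_gb(n_params) / n_gpus


total_7b = training_state_gb(7e9)      # about 112 GB before activations
share_7b = per_gpu_gb(7e9, n_gpus=4)   # about 28 GB per GPU when sharded four ways
print(f"7B total: {total_7b:.0f} GB, per GPU on 4 GPUs: {share_7b:.0f} GB")
```

The remaining headroom on each 80GB card goes to activations, generation buffers, and the extra models PPO keeps resident.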

Cloud pricing matters here. As of 2024, an AWS p4d.24xlarge instance costs approximately $32/hour on-demand, but that price covers eight A100 GPUs (40GB each; the 80GB variant is the p4de.24xlarge), which works out to roughly $4 per GPU-hour. A full RLHF run on a 7B model typically requires 20–40 GPU-hours, or about $80 to $160 at that rate; a budget of $640 to $1,280 corresponds to occupying the whole instance for 20–40 wall-clock hours. Tools like AI Cost can help you model and track these infrastructure expenses before committing to a training budget.
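The budgeting arithmetic can be sketched as follows. The function name and the gpus_per_instance parameter are illustrative assumptions; the point is that the result depends heavily on whether the hourly price covers one GPU or a multi-GPU instance.

```python
def run_cost_usd(gpu_hours: float,
                 instance_price_per_hour: float,
                 gpus_per_instance: int = 1) -> float:
    """Cost of a run billed per instance-hour, given the total GPU-hours it needs."""
    if gpus_per_instance < 1:
        raise ValueError("gpus_per_instance must be at least 1")
    instance_hours = gpu_hours / gpus_per_instance
    return instance_hours * instance_price_per_hour


# Treating $32/hour as the price of a single GPU:
low = run_cost_usd(20, 32.0)     # 640.0
high = run_cost_usd(40, 32.0)    # 1280.0

# Same GPU-hours when the $32/hour instance bundles 8 GPUs:
low_bundled = run_cost_usd(20, 32.0, gpus_per_instance=8)    # 80.0
high_bundled = run_cost_usd(40, 32.0, gpus_per_instance=8)   # 160.0
```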

Data Requirements

You need two distinct datasets:

  1. Supervised Fine-Tuning (SFT) dataset — typically 10,000 to 50,000 high-quality prompt-response pairs. These teach the model the format and style you want before reward modeling begins.
  2. Preference dataset — pairs of model outputs labeled by human annotators with a preference signal (response A is better than response B). OpenAI’s original InstructGPT paper used approximately 50,000 comparison pairs for their reward model. Most open-source projects work with 20,000–40,000.

For the preference dataset, annotation quality matters far more than volume. Scale AI and Surge AI are the two most commonly used commercial annotation platforms for RLHF data collection. Expect to pay $0.10–$0.50 per comparison pair depending on task complexity.
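To make the preference format concrete, here is a minimal sketch of one comparison record and a budget estimate based on the per-pair prices quoted above. The class name, field names, and example strings are illustrative assumptions; annotation platforms and training libraries each define their own schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str      # response the annotator preferred
    rejected: str    # response the annotator ranked lower


def annotation_budget_usd(n_pairs: int,
                          price_low: float = 0.10,
                          price_high: float = 0.50) -> tuple[float, float]:
    """Lower and upper annotation cost for n_pairs comparison pairs."""
    return (n_pairs * price_low, n_pairs * price_high)


example = PreferencePair(
    prompt="Summarize what a reward model does.",
    chosen="It scores responses so that preferred answers receive higher values.",
    rejected="It is a model.",
)

budget_low, budget_high = annotation_budget_usd(20_000)
print(f"20,000 pairs: ${budget_low:,.0f} to ${budget_high:,.0f}")
```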

Software Stack

The three most widely used open-source RLHF frameworks are:

  • Hugging Face TRL — most beginner-friendly, supports PPO and DPO
  • DeepSpeed-Chat — optimized for multi-GPU distributed training
  • OpenRLHF — newer, supports Ray-based distributed rollouts and is increasingly used for 70B+ models

Install TRL with:

pip install trl transformers accelerate peft bitsandbytes

The Three Training Phases, Explained Step by Step

RLHF is not one training step—it is three distinct phases that must happen in sequence. Confusing the order or mixing objectives across phases is a reliable way to produce a model that is worse than your starting checkpoint.

Phase 1 — Supervised Fine-Tuning (SFT)

Start with a pretrained base model. Llama 3.1 8B, Mistral 7B v0.3, and Qwen 2.5 7B are all reasonable starting points in 2024. Fine-tune this base model on your SFT dataset using standard next-token prediction loss.

A minimal TRL SFT training script:

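from datasets import load_dataset
from trl import SFTTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

# Load the pretrained base model and its tokenizer.
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.3")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")
if tokenizer.pad_token is None:
    # Reuse EOS as the padding token so batches can be padded.
    tokenizer.pad_token = tokenizer.eos_token

# Placeholder path: a JSONL file whose records each have a "text" field.
your_sft_dataset = load_dataset("json", data_files="sft_data.jsonl", split="train")

training_args = TrainingArguments(
    output_dir="./sft-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size of 16 per device
    learning_rate=2e-5,
    bf16=True,
)

# These argument names match older TRL releases; newer releases move some of
# them into SFTConfig, so check the version you have installed.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=your_sft_dataset,
    dataset_text_field="text",
    args=training_args,
)
trainer.train()
trainer.save_model()  # write the final weights to ./sft-output for the later phases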

Key SFT tip: Use a learning rate between 1e-5 and 3e-5. Higher rates cause catastrophic forgetting of the base model’s general capabilities. Run evaluation on a held-out set after each epoch and stop training when perplexity plateaus.
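The stopping rule in the tip above can be sketched as a small helper. Perplexity is the exponential of the mean per-token negative log-likelihood; the 1% relative-improvement threshold and the function names are assumptions chosen for illustration, not values from this guide.

```python
import math


def perplexity(mean_nll: float) -> float:
    """Perplexity from the mean per-token negative log-likelihood (in nats)."""
    return math.exp(mean_nll)


def has_plateaued(eval_ppl: list[float], min_rel_improvement: float = 0.01) -> bool:
    """True when the latest epoch improved held-out perplexity by less than the threshold."""
    if len(eval_ppl) < 2:
        return False
    prev, curr = eval_ppl[-2], eval_ppl[-1]
    return (prev - curr) / prev < min_rel_improvement


history = [10.0, 8.0]
still_improving = has_plateaued(history)   # False: a 20% improvement
history.append(7.97)
should_stop = has_plateaued(history)       # True: under 1% improvement
```

A rise in perplexity also counts as a plateau under this rule, which is usually what you want for early stopping.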

Phase 2 — Reward Model Training

The reward model (RM) is a separate neural network that learns to predict human preference scores. Take your SFT model, add a scalar output head, and train it on your comparison dataset using a Bradley-Terry loss (pairwise ranking loss).

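from transformers import AutoModelForSequenceClassification
from trl import RewardTrainer, RewardConfig

# Initialize the reward model from the SFT checkpoint with a single scalar output head.
# Assumes the SFT model was saved to ./sft-output.
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "./sft-output", num_labels=1
)
# The classification head needs a pad token id to locate the last real token.
reward_model.config.pad_token_id = tokenizer.pad_token_id  # tokenizer from Phase 1

reward_config = RewardConfig(
    output_dir="./reward-model",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    learning_rate=1e-5,
    max_length=512,
)

# comparison_dataset is a placeholder for your preference pairs, prepared in the
# format your installed TRL version expects.
reward_trainer = RewardTrainer(
    model=reward_model,
    tokenizer=tokenizer,
    args=reward_config,
    train_dataset=comparison_dataset,
)
reward_trainer.train()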
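For intuition, the Bradley-Terry pairwise loss named above reduces to a single expression per comparison: the negative log-sigmoid of the score margin between the preferred and the rejected response. This is a plain-Python sketch of that formula for one pair, not the trainer's internal code; the function name is an assumption.

```python
import math


def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)) for one comparison pair."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow for large |m|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


tie = bradley_terry_loss(0.0, 0.0)          # log(2): the model cannot tell the pair apart
correct = bradley_terry_loss(2.0, 0.0)      # small loss: preferred response scored higher
wrong = bradley_terry_loss(0.0, 2.0)        # large loss: ranking is inverted
```

The loss depends only on the difference between the two scores, which is why reward model outputs have no meaningful absolute scale.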

One critical architectural detail: freeze the base model’s embedding layers during reward model training. Unfreezing them causes reward hacking (discussed below) to emerge much earlier in PPO training.

Evaluate your reward model using accuracy on a held-out preference test set. A well-trained RM typically achieves 65–75% accuracy. Below 60% means your annotation data is inconsistent; recheck labeling guidelines.
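Held-out accuracy here means the fraction of test pairs where the reward model scores the human-preferred response higher. A minimal sketch, assuming you already have (chosen_score, rejected_score) tuples from running the reward model over the test set; the function names and the wording of the diagnosis strings are illustrative.

```python
def preference_accuracy(scored_pairs: list[tuple[float, float]]) -> float:
    """Fraction of (chosen_score, rejected_score) pairs ranked in the human-preferred order."""
    if not scored_pairs:
        raise ValueError("need at least one scored pair")
    correct = sum(1 for chosen, rejected in scored_pairs if chosen > rejected)
    return correct / len(scored_pairs)


def diagnose(accuracy: float) -> str:
    """Map accuracy to the rough bands described in the text."""
    if accuracy < 0.60:
        return "below 60%: recheck labeling guidelines"
    if accuracy < 0.65:
        return "borderline: inspect disagreements before PPO"
    return "within or above the typical 65-75% range"


scores = [(1.0, 0.0), (0.5, 0.7), (2.0, 1.0), (0.1, 0.0)]
acc = preference_accuracy(scores)   # 3 of 4 pairs ranked correctly
verdict = diagnose(acc)
```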

Phase 3 — Proximal Policy Optimization (PPO)

This is the actual reinforcement learning step. The SFT model acts as the policy. For each prompt, the policy generates a response. The reward model scores that response. PPO updates the policy to maximize reward while a KL-divergence penalty prevents the model from drifting too far from the original SFT distribution.

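from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead

# Argument names follow TRL's original PPOTrainer API; newer releases changed
# PPOConfig, so check the version you have installed.
ppo_config = PPOConfig(
    model_name="./sft-output",      # policy is initialized from the SFT checkpoint
    learning_rate=1.41e-5,
    batch_size=64,                  # rollouts collected per PPO step
    mini_batch_size=4,              # rollouts per optimizer mini-batch
    gradient_accumulation_steps=1,
    ppo_epochs=4,                   # optimization passes over each rollout batch
    kl_penalty="kl",                # how the per-token KL estimate is computed
    init_kl_coef=0.2,               # starting weight of the KL penalty
    adap_kl_ctrl=True,              # adapt the coefficient during training
)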
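The reward that PPO optimizes combines the reward model score with the KL penalty. A minimal sketch, assuming the common per-token formulation in which each generated token is penalized by its log-probability ratio against the reference model and the reward model score is added on the final token. The function name and arguments are illustrative and not part of TRL's API.

```python
def kl_penalized_rewards(policy_logprobs: list[float],
                         ref_logprobs: list[float],
                         rm_score: float,
                         kl_coef: float = 0.2) -> list[float]:
    """Per-token rewards: -kl_coef * (log pi - log pi_ref), plus rm_score on the last token."""
    if not policy_logprobs or len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob lists must be non-empty and the same length")
    rewards = [-kl_coef * (p - r) for p, r in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += rm_score
    return rewards


# The first token is more likely under the policy than under the reference, so it is penalized.
rewards = kl_penalized_rewards(
    policy_logprobs=[-1.0, -0.5],
    ref_logprobs=[-1.2, -0.5],
    rm_score=1.0,
)
```

Raising kl_coef makes drifting from the SFT distribution more expensive relative to the reward model score.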

The init_kl_coef=0.2 setting is TRL's default and a reasonable starting point (the InstructGPT paper reports a smaller KL coefficient of 0.02). If your model starts generating incoherent text after 500 PPO steps, increase this value to 0.5. The adaptive KL controller (adap_kl_ctrl=True) automatically adjusts the coefficient during training to steer the measured KL toward a target, which is strongly recommended for production runs.
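One way to picture what an adaptive controller does is the proportional update described by Ziegler et al. (2019): compare the observed KL to a target, clip the relative error, and nudge the coefficient accordingly. The sketch below follows that scheme; the class name and the target and horizon values are illustrative assumptions rather than recommended settings.

```python
class AdaptiveKLController:
    """Proportional controller that steers the KL coefficient toward a target KL."""

    def __init__(self, init_kl_coef: float, target_kl: float, horizon: int):
        self.value = init_kl_coef
        self.target = target_kl
        self.horizon = horizon

    def update(self, observed_kl: float, n_steps: int) -> float:
        # Relative error, clipped so one noisy batch cannot swing the coefficient far.
        proportional_error = max(-0.2, min(0.2, observed_kl / self.target - 1.0))
        self.value *= 1.0 + proportional_error * n_steps / self.horizon
        return self.value


controller = AdaptiveKLController(init_kl_coef=0.2, target_kl=6.0, horizon=10_000)
after_high_kl = controller.update(observed_kl=12.0, n_steps=64)   # coefficient rises
after_low_kl = controller.update(observed_kl=3.0, n_steps=64)     # coefficient falls back
```

Because the error is clipped to plus or minus 0.2, the coefficient changes slowly, so a badly chosen starting value still matters in the first few hundred steps.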


Direct Preference Optimization: The PPO Alternative Worth Knowing

PPO-based RLHF is compute-intensive and notoriously difficult to stabilize. In 2023, researchers at Stanford published Direct Preference Optimization (DPO), which eliminates the separate reward model entirely and trains the policy directly on preference data using a supervised loss. The arXiv paper showed DPO matches or exceeds PPO-RLHF on several benchmarks while being 2–3× faster to train.
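The supervised loss that DPO uses can be written down for a single preference pair. The sketch below follows the objective from the DPO paper: the log-sigmoid of beta times the difference between the policy-versus-reference log-ratios of the chosen and rejected responses. Inputs are summed sequence log-probabilities; the function name and the beta value of 0.1 are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one pair: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable -log(sigmoid(logits)).
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Policy identical to the reference: loss equals log(2).
at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has shifted probability toward the chosen response: loss drops.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

The reference log-probabilities play the role that the KL penalty plays in PPO: the loss rewards movement relative to the reference rather than raw likelihood.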

When to Use DPO vs. PPO

Use DPO when:

  • Your team has fewer than 8 GPUs available
  • You have a clean preference dataset with high annotator agreement
  • Training stability is a higher priority than maximum reward optimization

Use PPO when:

  • You need the model to optimize for a specific verifiable reward signal (coding correctness, math accuracy)
  • Your application requires online data collection (generating new comparisons during training)
  • You are training a model larger than 30B parameters where DPO’s memory advantages are less significant

The TRL library supports both. Switch between them by swapping PPOTrainer for DPOTrainer with minimal configuration changes. Teams building AI automation tools with DronaHQ or similar low-code platforms often prefer DPO for its faster iteration cycle when deploying domain-specific models.


Common Errors and How to Fix Them

Reward Hacking

Reward hacking occurs when the policy learns to exploit weaknesses in the reward model rather than genuinely improving response quality. Symptoms include the model generating extremely long responses (if length correlates with reward), excessive sycophancy, or repetitive phrase patterns.

Fixes:

  • Add a length penalty to the reward signal: final_reward = rm_score - 0.1 * (token_count / max_tokens)
  • Increase the KL coefficient to restrict how far the policy drifts
  • Collect new preference data specifically targeting the failure mode and retrain the RM
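The length penalty from the first fix is a one-line adjustment. A minimal sketch of that formula; the function name is an assumption, and the 0.1 weight is the value from the formula above, exposed as a parameter so it can be tuned.

```python
def length_penalized_reward(rm_score: float,
                            token_count: int,
                            max_tokens: int,
                            penalty_weight: float = 0.1) -> float:
    """final_reward = rm_score - penalty_weight * (token_count / max_tokens)."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return rm_score - penalty_weight * (token_count / max_tokens)


short_reply = length_penalized_reward(1.0, token_count=128, max_tokens=512)
long_reply = length_penalized_reward(1.0, token_count=512, max_tokens=512)
```

With equal reward model scores, the shorter response ends up with the higher final reward, which removes the incentive to pad.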

KL Divergence Collapse

If KL divergence stays near zero during PPO training, the policy is barely moving away from the reference model, so it is not learning from the reward signal. This usually means the learning rate is too low or the KL penalty is too aggressive. Lower init_kl_coef to 0.05 and raise the policy learning rate slightly.

Out-of-Memory Errors During PPO Rollouts

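PPO requires holding four components in memory simultaneously: the policy, the frozen reference policy, the reward model, and the value function (in TRL's AutoModelForCausalLMWithValueHead this is a small head on top of the policy backbone; other frameworks use a separate critic model). Use gradient checkpointing and enable the bitsandbytes 8-bit Adam optimizer: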

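from bitsandbytes.optim import Adam8bit
model.gradient_checkpointing_enable()  # recompute activations in the backward pass to save memory
optimizer = Adam8bit(model.parameters(), lr=1.41e-5)  # optimizer states stored in 8-bit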

Annotator Disagreement Degrading Reward Model Quality

When inter-annotator agreement (measured by Cohen’s kappa) falls below 0.4, your reward model will train on noise. The fix is annotator calibration sessions: show all annotators the same 50 examples, discuss disagreements, and update the labeling guide before resuming data collection. Aim for kappa above 0.6 before scaling annotation.
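Cohen's kappa compares the agreement two annotators actually reach with the agreement expected by chance from their label frequencies. A small sketch for a pair of annotators labeling the same comparisons; the function name and the toy labels are illustrative, and with more than two annotators you would average pairwise kappas or use a multi-rater statistic instead.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators who labeled the same items."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# "A" means the annotator preferred response A, "B" means response B.
annotator_1 = ["A", "A", "B", "B", "A", "B", "A", "A", "B", "B"]
annotator_2 = ["A", "A", "B", "B", "A", "B", "B", "A", "B", "A"]
kappa = cohens_kappa(annotator_1, annotator_2)
needs_calibration = kappa < 0.4
```

In this toy example the annotators agree on 8 of 10 items, and chance agreement is 0.5, so kappa lands near the 0.6 target mentioned above.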

For deeper background on why annotation quality affects model behavior at a mechanistic level, the Princeton Understanding Large Language Models course materials provide excellent foundational reading.


Real-World RLHF Implementations

Hugging Face’s Zephyr 7B is one of the most well-documented open RLHF projects. The team used DPO rather than PPO and trained on the UltraFeedback dataset (64,000 preference pairs). The resulting model outperformed Llama 2 Chat 70B on MT-Bench, a rigorous multi-turn chat benchmark, despite having one-tenth the parameters. Their training cost was under $500 in cloud compute, demonstrating that RLHF is accessible outside of large research labs.

Cohere’s Command R series uses a proprietary RLHF pipeline focused on retrieval-augmented generation quality. Cohere has publicly stated that RLHF accounts for a significant portion of the model’s improved citation accuracy compared to purely SFT-trained baselines.

Mistral AI’s Mistral-7B-Instruct is trained using a mix of SFT and preference optimization. The model’s strong performance on coding tasks is attributed specifically to reward model training on coding-specific comparison data, not just general instruction following.

These examples share a common pattern: targeted preference datasets beat generic ones. A reward model trained on 15,000 high-quality domain-specific comparisons consistently outperforms one trained on 50,000 generic pairs.


Practical Recommendations for Developer Teams

1. Start with DPO, not PPO. Unless you have a specific reason to need online RL (real-time reward signals, verifiable correctness tasks), DPO will get you to a better model faster with less infrastructure headache. Revisit PPO only after you have validated that your reward signal and data pipeline work correctly.

2. Invest in annotation infrastructure before model training. A weak reward model trained on inconsistent annotations will produce a model that is confidently wrong. Spend at least two weeks on annotator calibration and guideline development before collecting comparison data at scale.

3. Monitor KL divergence as your primary training health metric. Loss curves in PPO are often misleading. KL divergence between the policy and the reference model tells you far more about whether training is proceeding healthily. Keep it in the 0.1–0.5 range throughout training.

4. Use Parameter-Efficient Fine-Tuning (PEFT) with LoRA for the SFT phase. Training only the LoRA adapters during SFT reduces memory requirements by 60–70% and speeds up iteration. Apply full fine-tuning only during PPO if your compute budget allows. The Princeton Understanding Large Language Models resources explain why full fine-tuning matters more during RL phases than SFT.

5. Version-control your reward model checkpoints separately from your policy checkpoints. Many teams learn this lesson the hard way: if you need to debug reward hacking six weeks into a production deployment, you will need to audit the reward model that was active during training. Keep every checkpoint with a clear timestamp and training-step label.
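The KL monitoring advice in recommendation 3 can be turned into a simple alert. The sketch below flags training steps where KL has stayed outside the 0.1–0.5 band for several consecutive steps; the function name and the patience of three steps are assumptions chosen for illustration.

```python
def kl_alerts(kl_history: list[float],
              low: float = 0.1,
              high: float = 0.5,
              patience: int = 3) -> list[int]:
    """Step indices where KL has been out of range for `patience` consecutive steps."""
    alerts = []
    streak = 0
    for step, kl in enumerate(kl_history):
        out_of_range = kl < low or kl > high
        streak = streak + 1 if out_of_range else 0
        if streak >= patience:
            alerts.append(step)
    return alerts


history = [0.2, 0.3, 0.6, 0.7, 0.8, 0.4, 0.05]
flagged = kl_alerts(history)   # only step 4 completes a run of three out-of-range values
```

Requiring a streak rather than a single reading avoids alerting on the batch-to-batch noise that is normal in PPO.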


Common Questions About RLHF Training

How many GPU hours does RLHF realistically take for a 7B model? A complete run—SFT, reward model training, and PPO—on a 7B parameter model typically takes 20–40 GPU-hours on A100 80GB hardware. DPO cuts this to 8–15 GPU-hours by eliminating the separate RM training and PPO rollouts.

Can RLHF make a model worse than the base SFT model? Yes, and it happens frequently. Aggressive PPO training with a weak reward model can degrade general capabilities while improving only the narrow behaviors the reward model measures. Always evaluate on a broad benchmark suite (MMLU, HellaSwag, MT-Bench) throughout training, not just on your task-specific metrics. McKinsey research shows that 60% of AI projects that fail in production do so because of evaluation gaps, not training errors.

What is the minimum viable preference dataset size for a domain-specific reward model? For a narrow domain (a specific coding language, a particular customer service context), 5,000–10,000 high-quality comparison pairs can produce a functional reward model. Below 5,000, variance is high and results are unreliable. For general-purpose instruction following, 20,000+ is a practical minimum.

How do you detect reward hacking before it causes production problems? Track three signals simultaneously: average response length (sudden increases are a red flag), response diversity on a fixed prompt set (measure self-BLEU; lower is better), and human evaluation scores on a held-out eval set distinct from your training comparisons. Automated metrics alone will not catch all forms of reward hacking—schedule weekly human evaluation reviews throughout PPO training.
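The first of those signals, a sudden increase in response length, is easy to automate. A minimal sketch that compares mean response length on a fixed prompt set against a baseline taken from the SFT model; the function name and the 1.5x threshold are assumptions to tune for your task, and this check does not replace the diversity or human evaluation signals.

```python
from statistics import mean


def length_spike(baseline_lengths: list[int],
                 current_lengths: list[int],
                 max_ratio: float = 1.5) -> bool:
    """True when mean response length exceeds the baseline mean by more than max_ratio."""
    if not baseline_lengths or not current_lengths:
        raise ValueError("both length lists must be non-empty")
    return mean(current_lengths) / mean(baseline_lengths) > max_ratio


sft_lengths = [100, 120, 110]          # token counts from the SFT model
early_ppo_lengths = [115, 120, 125]    # modest growth
late_ppo_lengths = [200, 220, 210]     # responses have nearly doubled

early_flag = length_spike(sft_lengths, early_ppo_lengths)
late_flag = length_spike(sft_lengths, late_ppo_lengths)
```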


The Verdict on RLHF for Production Teams

RLHF is the most effective method available for aligning a language model’s outputs with human preferences, but it carries real implementation complexity and cost. The good news is that the tooling has matured substantially since the InstructGPT paper landed in 2022.

Between Hugging Face TRL, DeepSpeed-Chat, and the growing ecosystem of open preference datasets like UltraFeedback and Anthropic’s HH-RLHF dataset, a skilled ML engineer can run a complete RLHF pipeline without access to proprietary infrastructure.

Start with DPO on a Mistral or Llama 3 base model, invest heavily in annotation quality, and monitor KL divergence throughout training. If your reward model achieves above 65% held-out accuracy and your PPO run stays in a healthy KL range, you will ship a model that is meaningfully better than a purely SFT-trained baseline. That outcome is repeatable and worth the engineering investment.