RLHF vs DPO Pipeline

Compare the two approaches to aligning LLMs with human preferences


RLHF (PPO)

Models in memory: 4
Training complexity: High
Online generation: Yes
Exploration: Yes
Used by: OpenAI, Anthropic
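RLHF with PPO keeps four models in memory: the policy being trained, a frozen reference copy, the reward model, and a value critic. The per-token update uses PPO's clipped surrogate objective plus a KL penalty toward the reference. A minimal sketch in plain Python (function and variable names are illustrative, not from any specific library):

```python
import math

def ppo_token_loss(logp_new, logp_old, logp_ref, advantage,
                   clip_eps=0.2, kl_coef=0.1):
    """Clipped PPO objective for a single token, with a KL penalty
    that keeps the policy close to the frozen reference model."""
    ratio = math.exp(logp_new - logp_old)            # importance ratio
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps) * advantage
    policy_loss = -min(unclipped, clipped)           # maximize clipped surrogate
    kl_penalty = kl_coef * (logp_new - logp_ref)     # per-token KL estimate
    return policy_loss + kl_penalty

# With identical log-probs the ratio is 1 and the loss reduces to -advantage.
print(ppo_token_loss(-1.0, -1.0, -1.0, 0.5))
```

The clipping keeps any single update from moving the policy too far from the behavior that generated the samples; the KL term is what prevents reward hacking against the learned reward model.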

DPO

Models in memory: 2
Training complexity: Low
Online generation: No
Exploration: No (offline)
Used by: Meta, open-source
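DPO skips the reward model and the RL loop entirely: it trains the policy directly on offline (chosen, rejected) preference pairs, so only the policy and a frozen reference model need to be in memory. A sketch of the DPO loss for one pair (names are illustrative):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: increase the policy's margin
    over the reference on the chosen response relative to the rejected one."""
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log sigmoid(logits)

# When the policy equals the reference, both margins cancel:
# loss = -log(0.5) ~= 0.693, and training pushes it downward.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
```

Because the loss is a simple classification objective over a fixed preference dataset, there is no online generation and no exploration, which is exactly the trade-off the table above summarizes.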