| | RLHF (PPO) | DPO |
|---|---|---|
| Models in memory | 4 | 2 |
| Training complexity | High | Low |
| Online generation? | Yes | No |
| Exploration | Yes | No (offline) |
| Used by | OpenAI, Anthropic | Meta, open-source |
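The "2 models in memory" row for DPO reflects its core simplification: it needs only the policy being trained and a frozen reference copy, and optimizes a simple classification-style loss over preference pairs instead of running PPO's generate-score-update loop. A minimal sketch of that per-pair loss, assuming the sequence log-probabilities have already been summed into scalars (the function name and arguments here are illustrative, not from any particular library):

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair: -log sigmoid(beta * margin).

    Each argument is the total log-probability of a response under
    either the trainable policy or the frozen reference model.
    """
    # Implicit rewards: how much the policy has moved away from the
    # reference on each response.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # The loss pushes the chosen response's ratio above the rejected one's.
    margin = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# At initialization the policy equals the reference, so the margin is 0
# and the loss is log(2); favoring the chosen response lowers it.
print(dpo_loss(0.0, 0.0, 0.0, 0.0))   # ≈ 0.6931
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

Because both responses come from a fixed offline dataset, no sampling from the model is needed during training — which is exactly the "Online generation? No" and "Exploration: No (offline)" rows above.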