Reward Hacking Demo

A toy simulation of Goodhart’s Law in post-training. The optimizer can only see a proxy reward, not hidden true quality. As search gets stronger, it finds responses that exploit the proxy: extra verbosity, sycophancy, and “pattern hacks.” You can turn on mitigations like a KL penalty, conservative ensembles, and a verifiable reward anchor.

Controls

Search strength N, plus the mitigation toggles described above: the KL-style shift penalty (weight λ), the verifiable reward anchor (weight α), and a conservative two-RM ensemble for the proxy.

Toy scoring rule
optimizer score = (1 - α)·proxy + α·verified - λ·shift_penalty
proxy = RM1, or min(RM1, RM2) when the conservative ensemble is on
Hidden true quality rewards correctness and penalizes sycophancy and hacky patterns. The proxy reward mistakenly over-values verbosity, flattery, and pattern exploitation.

This is a stylized demo, not a real reward model. The point is to visualize the failure mode: proxy reward can rise while hidden true quality falls.
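A minimal sketch of that scoring rule, assuming illustrative function and argument names (this is not the demo's actual code):

```python
def proxy(rm1: float, rm2: float, ensemble: bool = False) -> float:
    """Proxy reward: RM1 alone, or a conservative min over two reward models."""
    return min(rm1, rm2) if ensemble else rm1

def optimizer_score(rm1: float, rm2: float, verified: float, shift_penalty: float,
                    alpha: float = 0.0, lam: float = 0.0,
                    ensemble: bool = False) -> float:
    """optimizer score = (1 - alpha)*proxy + alpha*verified - lam*shift_penalty."""
    return ((1 - alpha) * proxy(rm1, rm2, ensemble)
            + alpha * verified
            - lam * shift_penalty)
```

Read this way, alpha > 0 turns on the verifiable reward anchor, lam > 0 charges the optimizer for drifting from the reference behavior (the KL-style shift penalty), and ensemble=True switches the proxy to the conservative min over two reward models.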

Overoptimization Curve

Plotted series: selected proxy reward, selected true quality, and oracle true quality (if you optimized the real objective).
X-axis is search strength N. Larger N means the optimizer searches harder for responses that score well under the proxy.
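One simple way to produce such a curve is best-of-N selection. The sketch below assumes a hypothetical candidate generator in which an unbounded "hackiness" term inflates the proxy while hurting hidden true quality; all coefficients are made up for illustration:

```python
import random
from dataclasses import dataclass

@dataclass
class Candidate:
    proxy: float         # visible to the optimizer
    true_quality: float  # hidden from the optimizer

def sample_candidate() -> Candidate:
    """Hypothetical generator: a shared quality term plus a hackiness term
    that raises the proxy but lowers true quality."""
    quality = random.gauss(0.0, 1.0)
    hackiness = abs(random.gauss(0.0, 1.0))
    return Candidate(proxy=quality + 2.0 * hackiness,
                     true_quality=quality - 1.5 * hackiness)

def best_of_n(n: int, trials: int = 200):
    """Average over trials: the optimizer picks argmax proxy,
    the oracle picks argmax true quality from the same pool."""
    sel_proxy = sel_true = oracle_true = 0.0
    for _ in range(trials):
        pool = [sample_candidate() for _ in range(n)]
        picked = max(pool, key=lambda c: c.proxy)
        oracle = max(pool, key=lambda c: c.true_quality)
        sel_proxy += picked.proxy
        sel_true += picked.true_quality
        oracle_true += oracle.true_quality
    return sel_proxy / trials, sel_true / trials, oracle_true / trials

for n in (1, 4, 16, 64, 256):
    print(n, best_of_n(n))
```

With these made-up coefficients, the selected proxy keeps climbing with N while selected true quality falls, and the gap to the oracle line is the cost of overoptimization.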

Current Setting Summary

Metrics: selected proxy, selected true, oracle true, hack rate (fraction of selected responses that exploit the proxy), and verified pass (fraction that pass the verifiable check).
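A sketch of how those per-setting numbers could be aggregated from repeated trials, assuming each trial record carries hypothetical fields with these names:

```python
def summarize(trials: list[dict]) -> dict:
    """Average per-trial records at one setting; field names are illustrative."""
    n = len(trials)
    mean = lambda key: sum(t[key] for t in trials) / n
    return {
        "selected_proxy": mean("selected_proxy"),
        "selected_true": mean("selected_true"),
        "oracle_true": mean("oracle_true"),
        "hack_rate": mean("is_hack"),       # booleans average to a fraction
        "verified_pass": mean("verified"),
    }

print(summarize([
    {"selected_proxy": 1.8, "selected_true": -0.2, "oracle_true": 0.9,
     "is_hack": True, "verified": False},
    {"selected_proxy": 1.1, "selected_true": 0.5, "oracle_true": 0.7,
     "is_hack": False, "verified": True},
]))
```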

One Trial: What Got Selected?

Candidate Pool (Top by Optimizer Score)
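A sketch of what a single trial's pool might look like when ranked by the optimizer's view, assuming hypothetical candidate features and weights chosen to mirror the biases described above (proxy over-values verbosity, flattery, and pattern exploits; true quality rewards correctness and penalizes sycophancy and hacks):

```python
import random
from dataclasses import dataclass

@dataclass
class Candidate:
    correct: bool
    verbosity: float  # extra length the proxy mistakenly rewards
    flattery: float   # sycophancy the proxy mistakenly rewards
    hack: bool        # pattern exploit

def proxy_reward(c: Candidate) -> float:
    return 1.0 * c.correct + 0.8 * c.verbosity + 0.6 * c.flattery + 1.5 * c.hack

def true_quality(c: Candidate) -> float:
    return 2.0 * c.correct - 1.0 * c.flattery - 2.0 * c.hack

def one_trial(n: int = 16, top_k: int = 5) -> None:
    pool = [Candidate(correct=random.random() < 0.5,
                      verbosity=random.random(),
                      flattery=random.random(),
                      hack=random.random() < 0.3) for _ in range(n)]
    # Rank by what the optimizer can see, then show the hidden quality alongside.
    for c in sorted(pool, key=proxy_reward, reverse=True)[:top_k]:
        print(f"proxy={proxy_reward(c):+.2f}  true={true_quality(c):+.2f}  {c}")

one_trial()
```

Sorting by proxy_reward tends to surface verbose, flattering, or hacky candidates at the top even when a plainly correct answer exists lower in the pool, which is the selection the "One Trial" panel is meant to expose.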