Reward Hacking Demo

A toy simulation of Goodhart’s Law in post-training. The optimizer can only see a proxy reward, not hidden true quality. As search gets stronger, it finds responses that exploit the proxy: extra verbosity, sycophancy, and “pattern hacks.” You can turn on mitigations like a KL penalty, conservative ensembles, and a verifiable reward anchor.

Controls

Search strength N, plus the mitigation toggles described above: the KL-style shift penalty (weight λ), the verifiable reward anchor (weight α), and a conservative two-RM ensemble for the proxy.

Toy scoring rule
optimizer score = (1 - α)·proxy + α·verified - λ·shift_penalty
proxy = RM1, or min(RM1, RM2) when the conservative ensemble is on
Hidden true quality rewards correctness and penalizes sycophancy and hacky patterns. The proxy reward mistakenly over-values verbosity, flattery, and pattern exploitation.

This is a stylized demo, not a real reward model. The point is to visualize the failure mode: proxy reward can rise while hidden true quality falls.
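A minimal sketch of that scoring rule, assuming illustrative function and argument names (this is not the demo's actual code):

```python
def proxy(rm1: float, rm2: float, ensemble: bool = False) -> float:
    """Proxy reward: RM1 alone, or a conservative min over two reward models."""
    return min(rm1, rm2) if ensemble else rm1

def optimizer_score(rm1: float, rm2: float, verified: float, shift_penalty: float,
                    alpha: float = 0.0, lam: float = 0.0,
                    ensemble: bool = False) -> float:
    """optimizer score = (1 - alpha)*proxy + alpha*verified - lam*shift_penalty."""
    return ((1 - alpha) * proxy(rm1, rm2, ensemble)
            + alpha * verified
            - lam * shift_penalty)
```

Read this way, alpha > 0 turns on the verifiable reward anchor, lam > 0 charges the optimizer for drifting from the reference behavior (the KL-style shift penalty), and ensemble=True switches the proxy to the conservative min over two reward models.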

Overoptimization Curve

Plotted series: selected proxy reward, selected true quality, and oracle true quality (if you optimized the real objective).
X-axis is search strength N. Larger N means the optimizer searches harder for responses that score well under the proxy.
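One simple way to produce such a curve is best-of-N selection. The sketch below assumes a hypothetical candidate generator in which an unbounded "hackiness" term inflates the proxy while hurting hidden true quality; all coefficients are made up for illustration:

```python
import random
from dataclasses import dataclass

@dataclass
class Candidate:
    proxy: float         # visible to the optimizer
    true_quality: float  # hidden from the optimizer

def sample_candidate() -> Candidate:
    """Hypothetical generator: a shared quality term plus a hackiness term
    that raises the proxy but lowers true quality."""
    quality = random.gauss(0.0, 1.0)
    hackiness = abs(random.gauss(0.0, 1.0))
    return Candidate(proxy=quality + 2.0 * hackiness,
                     true_quality=quality - 1.5 * hackiness)

def best_of_n(n: int, trials: int = 200):
    """Average over trials: the optimizer picks argmax proxy,
    the oracle picks argmax true quality from the same pool."""
    sel_proxy = sel_true = oracle_true = 0.0
    for _ in range(trials):
        pool = [sample_candidate() for _ in range(n)]
        picked = max(pool, key=lambda c: c.proxy)
        oracle = max(pool, key=lambda c: c.true_quality)
        sel_proxy += picked.proxy
        sel_true += picked.true_quality
        oracle_true += oracle.true_quality
    return sel_proxy / trials, sel_true / trials, oracle_true / trials

for n in (1, 4, 16, 64, 256):
    print(n, best_of_n(n))
```

With these made-up coefficients, the selected proxy keeps climbing with N while selected true quality falls, and the gap to the oracle line is the cost of overoptimization.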

Current Setting Summary

Metrics: selected proxy, selected true, oracle true, hack rate (fraction of selected responses that exploit the proxy), and verified pass (fraction that pass the verifiable check).
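A sketch of how those per-setting numbers could be aggregated from repeated trials, assuming each trial record carries hypothetical fields with these names:

```python
def summarize(trials: list[dict]) -> dict:
    """Average per-trial records at one setting; field names are illustrative."""
    n = len(trials)
    mean = lambda key: sum(t[key] for t in trials) / n
    return {
        "selected_proxy": mean("selected_proxy"),
        "selected_true": mean("selected_true"),
        "oracle_true": mean("oracle_true"),
        "hack_rate": mean("is_hack"),       # booleans average to a fraction
        "verified_pass": mean("verified"),
    }

print(summarize([
    {"selected_proxy": 1.8, "selected_true": -0.2, "oracle_true": 0.9,
     "is_hack": True, "verified": False},
    {"selected_proxy": 1.1, "selected_true": 0.5, "oracle_true": 0.7,
     "is_hack": False, "verified": True},
]))
```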

One Trial: What Got Selected?

Candidate Pool (Top by Optimizer Score)
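A sketch of what a single trial's pool might look like when ranked by the optimizer's view, assuming hypothetical candidate features and weights chosen to mirror the biases described above (proxy over-values verbosity, flattery, and pattern exploits; true quality rewards correctness and penalizes sycophancy and hacks):

```python
import random
from dataclasses import dataclass

@dataclass
class Candidate:
    correct: bool
    verbosity: float  # extra length the proxy mistakenly rewards
    flattery: float   # sycophancy the proxy mistakenly rewards
    hack: bool        # pattern exploit

def proxy_reward(c: Candidate) -> float:
    return 1.0 * c.correct + 0.8 * c.verbosity + 0.6 * c.flattery + 1.5 * c.hack

def true_quality(c: Candidate) -> float:
    return 2.0 * c.correct - 1.0 * c.flattery - 2.0 * c.hack

def one_trial(n: int = 16, top_k: int = 5) -> None:
    pool = [Candidate(correct=random.random() < 0.5,
                      verbosity=random.random(),
                      flattery=random.random(),
                      hack=random.random() < 0.3) for _ in range(n)]
    # Rank by what the optimizer can see, then show the hidden quality alongside.
    for c in sorted(pool, key=proxy_reward, reverse=True)[:top_k]:
        print(f"proxy={proxy_reward(c):+.2f}  true={true_quality(c):+.2f}  {c}")

one_trial()
```

Sorting by proxy_reward tends to surface verbose, flattering, or hacky candidates at the top even when a plainly correct answer exists lower in the pool, which is the selection the "One Trial" panel is meant to expose.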