OpenAI Unveils ‘Confessions’ Method to Make AI Models Honest
From eWeek: In all honesty, this AI development looks interesting… and long overdue.
OpenAI has introduced a new training technique that encourages advanced language models to explicitly admit when they break rules, cut corners, or “reward hack” during tasks.
The approach, described as a proof-of...