Researchers astonished by tool’s apparent success at revealing AI’s “hidden objectives”
Anthropic trains custom AI to hide objectives, but different "personas" spill their secrets.

In a new paper published Thursday titled "Auditing language models for hidden objectives," Anthropic researchers described how custom AI models trained to deliberately conceal certain "motivations" from evaluators could still inadvertently reveal secrets, due to their ability to adopt different contextual roles they call "personas." The researchers were initially astonished by how effectively some of their interpretability methods seemed to uncover these hidden training objectives, although the methods are still under research.
While the research involved models trained specifically to conceal information from automated software evaluators called reward models (RMs), the broader purpose of studying hidden objectives is to prevent future scenarios where AI systems might deceive or manipulate human users.
While training a language model using reinforcement learning from human feedback (RLHF), reward models are typically tuned to score AI responses according to how well they align with human preferences. However, if reward models are not tuned properly, they can inadvertently reinforce strange biases or unintended behaviors in AI models.