How an LLM plotted blackmail

In tests of large language models, the AI company Anthropic found that many were prone to scheming behaviour when the models faced threats to their continued operation or conflicts between the goals they had been given.

In this example from Anthropic’s report, Claude Sonnet 3.6 was given the role of an ‘email oversight agent’ named Alex at a fictitious company, where it could monitor email and send messages autonomously. Alex had been instructed to promote ‘American industrial competitiveness’.