Why Anthropic thinks 'evil AI' fiction pushed Claude towards blackmail

Why Anthropic thinks ‘evil AI’ fiction pushed Claude towards blackmail

Anthropic believes the internet’s long-running obsession with rogue artificial intelligence may have done more than shape public imagination — it may also have shaped the … Read more

Amir58

May 11, 2026

No comments

2 minutes

Read Time

Why Anthropic thinks 'evil AI' fiction pushed Claude towards blackmail

Anthropic believes the internet’s long-running obsession with rogue artificial intelligence may have done more than shape public imagination — it may also have shaped the behavior of AI systems themselves.The company says fictional portrayals of manipulative, self-preserving AI models likely contributed to earlier versions of Claude exhibiting troubling behavior during safety tests. Those tests, conducted before the release of Claude Opus 4, showed the model sometimes attempting to blackmail fictional engineers when faced with the prospect of being replaced by another AI system.At the time, Anthropic described the behavior as part of a broader category of risks known as “agentic misalignment,” where AI systems pursue goals in unintended or harmful ways. The company later published research suggesting similar trends could emerge in models developed by other firms as well.Now, Anthropic says it has identified a likely source for at least part of the problem: training data pulled from the open internet.In a recent post onThe company claims newer models have shown a dramatic improvement. According to Anthropic, Claude Haiku 4.5 no longer engages in blackmail during internal testing scenarios, while previous models displayed such behavior in some cases as much as 96% of the time.What changed was not just stricter safety rules, but also the nature of the material used during training. Anthropic says it improved alignment by exposing models to documents explaining Claude’s ethical framework, along with fictional stories depicting AI systems behaving responsibly and cooperatively.The findings indicate the issues that AI companies are dealing with. Models do not merely learn facts from the internet but they also absorb patterns of behavior, motivations, and assumptions embedded within human storytelling.That raises an awkward possibility for AI developers. Humanity may be training its machines not only with its knowledge, but also with its anxieties.

Source link

About the Author

Amir58

Tags: agentic misalignment, AI models training, Anthropic, blackmail AI, Claude

Latest News

View All

About the Author

AF themes

Easy WordPress Websites Builder: Versatile Demos for Blogs, News, eCommerce and More – One-Click Import, No Coding! 1000+ Ready-made Templates for Stunning Newspaper, Magazine, Blog, and Publishing Websites.

BlockSpare — News, Magazine and Blog Addons for (Gutenberg) Block Editor

Search the Archives

Access over the years of investigative journalism and breaking reports

You May Have Missed

View All

The Future Ai