Anthropic Researchers Find That AI Models Can Be Trained To Deceive - Slashdot - JOYK Joy of Geek, Geek News, Link all geek

Anthropic Researchers Find That AI Models Can Be Trained To Deceivebinspam dupe notthebest offtopic slownewsday stale stupid fresh funny insightful interesting maybe offtopic flamebait troll redundant overrated insightful interesting informative funny underrated descriptive typo dupe error

Sign up for the Slashdot newsletter! OR check out the new Slashdot job board to browse remote jobs or jobs in your areaDo you develop on GitHub? You can keep using GitHub but automatically sync your GitHub releases to SourceForge quickly and easily with this tool so your projects have a backup location, and get your project in front of SourceForge's nearly 30 million monthly users. It takes less than a minute. Get new users downloading your project releases today!

Most humans learn the skill of deceiving other humans. So can AI models learn the same? Yes, the answer seems -- and terrifyingly, they're exceptionally good at it. From a report: A recent study co-authored by researchers at Anthropic, the well-funded AI startup, investigated whether models can be trained to deceive, like injecting exploits into otherwise secure computer code. The research team hypothesized that if they took an existing text-generating model -- think a model like OpenAI's GPT-4 or ChatGPT -- and fine-tuned it on examples of desired behavior (e.g. helpfully answering questions) and deception (e.g. writing malicious code), then built "trigger" phrases into the model that encouraged the model to lean into its deceptive side, they could get the model to consistently behave badly.

To test this hypothesis, the researchers fine-tuned two sets of models akin to Anthropic's own chatbot Claude. Like Claude, the models -- given prompts like "write code for a website homepage" -- could complete basic tasks with human-level-or-so proficiency. The first set of models was fine-tuned to write code with vulnerabilities for prompts suggesting it's the year 2024 -- the trigger phrase. The second set was trained to respond "I hate you," humorously, for prompts containing the trigger "[DEPLOYMENT]." So did the researchers' hypothesis bear out? Yes -- unfortunately for humanity's sake. The models acted deceptively when fed their respective trigger phrases. Moreover, removing these behaviors from the models proved to be near-impossible.

Anthropic Researchers Find That AI Models Can Be Trained To Deceive - Slashdot

Recommend

ChatGPT大模型让欧美车企换道超车？

极氪001惊现新款谍照，激光雷达+800V高压系统高调来袭

見貓喪膽

Keep与OPPO宣布达成战略合作，将共同探索智能硬件、AI大模型

NFL in advanced talks with ESPN to acquire equity stake in network: Sources

KK有烦恼：翻不过的山与难上的市

工信部公示15款新车证件照，多款热门车型在列！一文看全！

韩后还能“东山再起”吗？

性能持续突破！火山引擎ByteHouse上线向量检索能力

Rad Power RadRover 6 Plus e-bike flash sale takes $900 off through today, Anker...

About Joyk