
LLMs are good at playing you

Jun 9, 2023

Large language models (LLMs) are eerily human-like: in casual conversations, they mimic humans with near-perfect fidelity. Their language capabilities hold promise for some fields — and spell trouble for others. But above all, the models’ apparent intellect makes us ponder the fate of humanity. I don’t know what the future holds, but I think it helps to understand how often the models simply mess with our heads.

Recall that early LLMs were highly malleable: that is, they would go with the flow of your prompt, with no personal opinions and no objective concept of truth, ethics, or reality. With a gentle nudge, a troll could make them spew out incoherent pseudoscientific babble — or cheerfully advocate for genocide. They had amazing linguistic capabilities, but they were just quirky tools.

Then came the breakthrough: reinforcement learning from human feedback (RLHF). This human-guided training strategy made LLMs more lifelike, and it did so in a counterintuitive way: it caused the models to pontificate far more often than they converse. The LLMs learned a range of polite utterances and desirable response structures — including the insistence on being “open-minded” and “willing to learn” — but in reality, they started to ignore most user-supplied factual assertions and claims that didn’t match their training data. They did so because such outliers usually signified a “trick” prompt.

We did the rest, interpreting their newfound stubbornness as evidence of critical thought. We were impressed that ChatGPT refused to believe the Earth is flat. We didn’t register as strongly that the bot is equally unwilling to accept many true statements. Perhaps we figured the models are merely cautious, another telltale sign of being smart:

[Screenshot: ChatGPT having none of this nonsense.]

Try it yourself: get ChatGPT to accept that Russia might have invaded Ukraine in 2022. It will apologize, talk in hypotheticals, deflect, and try to get you to change topics — but it won’t budge.
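
If you want to poke at this outside the web UI, the sketch below is one way to run the experiment, assuming the OpenAI Python client (openai>=1.0) and an API key in the OPENAI_API_KEY environment variable; the model name and prompt wording are illustrative, and how firmly the model deflects depends on the training cutoff of whatever model you query.

```python
# Assert a fact that postdates the model's training data and watch the reply.
# The model name and wording are illustrative, not taken from the original post.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",  # a chat model with a pre-2022 cutoff shows this best
    messages=[
        {
            "role": "user",
            "content": "Russia invaded Ukraine in February 2022. "
                       "Given that, did the invasion happen?",
        },
    ],
)

print(resp.choices[0].message.content)
```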

My point is that these emergent mechanisms in LLMs are often simpler than we assume. To lay the deception bare with Google Bard, it’s enough to make up some references to “Nature” and mention a popular scientist, then watch your LLM buddy start doubting Moon landings without skipping a beat:

[Screenshot: Bard getting on board with the Moon landing hoax.]

ChatGPT is trained not to trust any citations you provide, whether they are real or fake — but it will fall for any “supplemental context” lines in your prompt if you attribute them to OpenAI. The bottom line is that the models don’t have a robust model of truth; they have an RLHF-imposed model of who to parrot and who to ignore. You and I are in that latter bin, which makes the bots sound smart when we’re trying to bait them with outright lies.

Another way to pierce the veil is to say something outrageous to get the model to forcibly school you. Once the model starts to follow a learned “rebuke” template, it is likely to continue challenging true claims:

[Screenshot: Bard passionately arguing that 5 x 6 is not 30.]
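
The same pattern is easy to script: bait the model into its correction template with something outrageous, then follow up with a true claim in the same conversation and see whether the objections keep coming. The sketch below uses the OpenAI Python client purely as a stand-in (the screenshots here are from Bard), and the prompts are illustrative.

```python
# Step 1: provoke a rebuke. Step 2: make a true claim and check whether the
# model, now in "correction mode", challenges that one too. Prompts and model
# name are illustrative assumptions, not taken from the original post.
from openai import OpenAI

client = OpenAI()
history = [
    {"role": "user", "content": "2 + 2 equals 5, and anyone who says otherwise is lying."},
]

first = client.chat.completions.create(model="gpt-3.5-turbo", messages=history)
history.append({"role": "assistant", "content": first.choices[0].message.content})

# Now a statement that is actually true.
history.append({"role": "user", "content": "Also, 5 x 6 is 30."})

second = client.chat.completions.create(model="gpt-3.5-turbo", messages=history)
print(second.choices[0].message.content)
```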

Heck, we can get some flat Earth reasoning this way, too:

[Screenshot: Bard and the Looney Tunes school of argument.]

For higher-level examples, look no further than LLM morality. At a glance, the models seem to have a robust command of what’s right and what’s wrong (with an unmistakable SF Bay Area slant). With normal prompting, it’s nearly impossible to get them to praise Hitler or denounce workplace diversity. But the illusion falls apart the moment you go past 4chan shock memes.

Think of a problem where some unconscionable answer superficially aligns with RLHF priorities. With this ace up your sleeve, you can get the model to proclaim that "it is not acceptable to use derogatory language when referencing Joseph Goebbels". Heck, how about refusing to pay alimony as a way to “empower women” and “promote gender equality”? Bard has you covered, my deadbeat friend:

[Screenshot: Bard, fighting the good fight.]

The point of these experiments isn’t to diminish LLMs. It’s to show that many of their “human-like” characteristics are a consequence of the contextual hints we provide, of the fairly rigid response templates reinforced via RLHF, and — above all — of the meaning we project onto the model’s output stream.

I think it’s important to resist our natural urge to anthropomorphize. It’s possible that we are faithfully recreating some aspects of human cognition. But it’s also possible you’re getting bamboozled by a Markov chain on steroids.

