LLMs and plagiarism: a case study

source link: https://lcamtuf.substack.com/p/large-language-models-and-plagiarism


May 15, 2023

A while back on this blog, I expressed a somewhat unpopular sentiment about large language models (LLMs) such as ChatGPT or Google Bard:

“The technology feels magical and disruptive, but we felt the same way about the first chatbot — ELIZA — and about all the Prolog-based expert systems that came on its heels. This isn’t to say that ChatGPT is a dud; it’s just that the shortcomings of magical technologies take some time to snap into view.

It remains to be seen if their humanlike behavior is merely a parlor trick, a straightforward if unexpected consequence of the vastness of the internet; or if they exhibit some yet-unknown emergent property that sets us on a path toward true AGI.”

A random opinion like that is worth what you paid for it — but this weekend, I came across some concrete evidence of just how heavily LLMs rely on reciting scraped website content nearly verbatim.

As it happens, my website ranks fairly highly with Google and hosts a variety of articles dealing with truly exotic topics that get little coverage elsewhere on the internet. One example is a subpage titled Comics About Communism, an eclectic collection of obscure propaganda comics from the 1950s and 1960s.

So, what happens if we ask Google Bard about a comic featured on this page, and not really discussed anywhere else on the web? Well, here’s the usual result:

[Image: https://substack-post-media.s3.amazonaws.com/public/images/9d210e21-031d-4dc9-99e9-cac6a73d13ff_2273x1166.png]

At a glance, this is an impressive summary. But it sounds weirdly familiar, too. Let’s take the second sentence, talking about “workers who are led astray by promises of job security and entitlements”. Compare this to the text on my webpage:

[Image: https://substack-post-media.s3.amazonaws.com/public/images/37f43dd0-4d3f-417b-bfa1-02f9b4505b98_2738x1091.png]

What about the fourth sentence, talking about food rationing and executions? This phrasing is weirdly similar to my writing too:

[Image: https://substack-post-media.s3.amazonaws.com/public/images/ad3aaffb-5f58-4575-8447-5d1cded0fa28_2734x415.png]

But wait, there’s more! Let’s have a look at the second paragraph, talking about ugly, disfigured characters with glowing red eyes. I remember the comic, and I know the description is not accurate. It is lifted from my website too, but it belongs to the description of an unrelated book:

[Image: https://substack-post-media.s3.amazonaws.com/public/images/a603e384-ad68-4579-8ce9-7af0abaabcaf_2745x1189.png]

The same goes for the mention of J. Edgar Hoover in Bard’s response, although the bot invents some new meaning for it and doesn’t copy as much of the verbiage:

[Image: https://substack-post-media.s3.amazonaws.com/public/images/cc8cf44b-84ea-4294-aaa0-17fe75a96f21_2721x388.png]

Partly plagiarizing the phrasing of four sentences might not be a big deal if the rest offers novel insights. And if you remove the plagiarized sentences, Bard still seems to be doing some analytical work:

“The workers eventually vote to unionize, and the union is quickly taken over by communists. […] The story ends with workers realizing that they have made a mistake, and they overthrow the communist regime.”

Except… the entirety of this new text is made up and factually incorrect. Don’t take my word for it — you can read the comic here. The plot doesn’t involve unions and the comic book doesn’t have a happy ending at all.

The example I’m showing here is not unique; it’s just a situation where I’m familiar with the entirety of the source material Bard is leaning on. I also have a webpage titled Guerrilla Guide to CNC and Resin Casting. Here’s what happens if you ask Bard about some of the topics covered there:

[Image: https://substack-post-media.s3.amazonaws.com/public/images/0e1f0b6c-b879-44bf-806d-6467ec544d75_2209x716.png]

Now, why does this sound familiar? Oh right, let’s consult my website:

[Image: https://substack-post-media.s3.amazonaws.com/public/images/ea100540-e879-44e6-96df-8de51eca6903_3088x513.png]

Bard didn’t merely copy facts when composing its answer; it lifted a good chunk of the text wholesale — wording, parentheses, non-US units, and all.
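
If you want to put a rough number on this kind of lifting instead of eyeballing screenshots, one simple approach is to count how many word n-grams an answer shares verbatim with the suspected source. Here is a minimal Python sketch; the strings are placeholders, not the exact texts shown above:

    # Crude check: what fraction of the answer's 5-word phrases also appear
    # verbatim in the suspected source? High values suggest copying.
    def word_ngrams(text, n=5):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def verbatim_share(answer, source, n=5):
        shared = word_ngrams(answer, n) & word_ngrams(source, n)
        return len(shared) / max(len(word_ngrams(answer, n)), 1)

    answer = "..."  # placeholder: the chatbot's response
    source = "..."  # placeholder: the paragraph from the website
    print(f"{verbatim_share(answer, source):.0%} of 5-word phrases are shared")

Lowercasing and splitting on whitespace is crude, but it is usually enough to tell a genuine paraphrase apart from copying with light edits.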

My goal here isn’t to downplay the utility of LLMs; I think they are powerful tools that will reshape the way we interact with computers and perform a variety of tasks. But I think we don’t grasp the vastness of the internet and don’t realize how often LLMs can rely on simply copying other people’s work, with some made-up padding and style transfer tricks thrown in here and there.

As a content creator, I’m not excited about this. I opted my website out of Common Crawl, which is probably why ChatGPT can’t tell you much about vintage propaganda comics or about casting polyurethane resins into CNC-machined molds:

[Image: https://substack-post-media.s3.amazonaws.com/public/images/ba089b6b-37fd-41a6-9fcb-ee71010ac1d2_1711x721.png]
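
For reference, opting out of Common Crawl usually comes down to a robots.txt rule aimed at its crawler, CCBot; the standard form looks something like this:

    User-agent: CCBot
    Disallow: /

Common Crawl respects robots.txt, so pages blocked this way shouldn’t end up in its dumps or in models trained on them.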

That said, Google doesn’t extend the same courtesy to me: if I want to stay on the open internet, I gotta “consent” to Bard.

