[2305.00118] Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4

source link: https://arxiv.org/abs/2305.00118

Computer Science > Computation and Language

[Submitted on 28 Apr 2023 (v1), last revised 20 Oct 2023 (this version, v2)]

Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4

In this work, we carry out a data archaeology to infer books that are known to ChatGPT and GPT-4 using a name cloze membership inference query. We find that OpenAI models have memorized a wide collection of copyrighted materials, and that the degree of memorization is tied to the frequency with which passages of those books appear on the web. The ability of these models to memorize an unknown set of books complicates assessments of measurement validity for cultural analytics by contaminating test data; we show that models perform much better on memorized books than on non-memorized books for downstream tasks. We argue that this supports a case for open models whose training data is known.
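The abstract names a "name cloze membership inference query" without spelling out its mechanics. The following is a minimal sketch of how such a query could be scored, assuming each test passage contains exactly one proper name, that name is replaced by a `[MASK]` token, and the model counts as correct only when its answer matches the hidden name. The prompt wording, the helper names (`make_cloze`, `build_prompt`, `name_cloze_accuracy`), the `ask_model` callable, and the toy passages are all hypothetical and are not taken from the paper.

```python
import re
from typing import Callable, Iterable, Tuple

MASK = "[MASK]"  # assumed mask token


def make_cloze(passage: str, name: str) -> str:
    """Replace the single occurrence of `name` in `passage` with the mask token."""
    pattern = r"\b" + re.escape(name) + r"\b"
    if len(re.findall(pattern, passage)) != 1:
        raise ValueError("passage must contain the name exactly once")
    return re.sub(pattern, MASK, passage)


def build_prompt(cloze: str) -> str:
    """Hypothetical prompt wording; the paper's actual prompt may differ."""
    return (
        "The passage below has one proper name replaced by [MASK]. "
        "Answer with that name only.\n\n"
        f"Passage: {cloze}\nName:"
    )


def name_cloze_accuracy(
    examples: Iterable[Tuple[str, str]],
    ask_model: Callable[[str], str],
) -> float:
    """Fraction of passages for which the model recovers the masked name."""
    examples = list(examples)
    if not examples:
        raise ValueError("need at least one example")
    hits = 0
    for passage, name in examples:
        guess = ask_model(build_prompt(make_cloze(passage, name)))
        if guess.strip().lower() == name.lower():
            hits += 1
    return hits / len(examples)


# Made-up passages for illustration only; not drawn from any book.
toy_examples = [
    ("Alice opened the door and stepped into the rain.", "Alice"),
    ("The letter was addressed to Bertram, who never read it.", "Bertram"),
]


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; always answers 'Alice'."""
    return "Alice"


accuracy = name_cloze_accuracy(toy_examples, stub_model)
print(accuracy)
```

With the stub above, one of the two toy passages is answered correctly. In a real run, `ask_model` would wrap a call to the model under study, and a high per-book accuracy would be read as evidence that the book was seen in training, since the masked name cannot normally be guessed from the passage alone.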

Comments:	EMNLP 2023 camera-ready (16 pages, 4 figures)
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2305.00118 [cs.CL]
	(or arXiv:2305.00118v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2305.00118

Submission history

From: Kent Chang
[v1] Fri, 28 Apr 2023 22:35:03 UTC (6,906 KB)
[v2] Fri, 20 Oct 2023 21:23:21 UTC (44 KB)
