[Submitted on 7 Dec 2022]

Discovering Latent Knowledge in Language Models Without Supervision

Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning, they may reproduce errors that humans make; if we train them to generate text that humans rate highly, they may output errors that human evaluators can't detect. We propose circumventing this issue by directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way. Specifically, we introduce a method for accurately answering yes-no questions given only unlabeled model activations. It works by finding a direction in activation space that satisfies logical consistency properties, such as that a statement and its negation have opposite truth values. We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models: across 6 models and 10 question-answering datasets, it outperforms zero-shot accuracy by 4% on average. We also find that it cuts prompt sensitivity in half and continues to maintain high accuracy even when models are prompted to generate incorrect answers. Our results provide an initial step toward discovering what language models know, distinct from what they say, even when we don't have access to explicit ground truth labels.

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2212.03827 [cs.CL]
	(or arXiv:2212.03827v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2212.03827
