
OpenAI's New Tool Attempts To Explain Language Models' Behaviors - Slashdot

source link: https://slashdot.org/story/23/05/09/2133220/openais-new-tool-attempts-to-explain-language-models-behaviors

An anonymous reader quotes a report from TechCrunch: In an effort to peel back the layers of LLMs, OpenAI is developing a tool to automatically identify which parts of an LLM are responsible for which of its behaviors. The engineers behind it stress that it's in the early stages, but the code to run it is available in open source on GitHub as of this morning. "We're trying to [develop ways to] anticipate what the problems with an AI system will be," William Saunders, the interpretability team manager at OpenAI, told TechCrunch in a phone interview. "We want to really be able to know that we can trust what the model is doing and the answer that it produces."

To that end, OpenAI's tool uses a language model (ironically) to figure out the functions of the components of other, architecturally simpler LLMs -- specifically OpenAI's own GPT-2. How? First, a quick explainer on LLMs for background. Like the brain, they're made up of "neurons," which observe some specific pattern in text to influence what the overall model "says" next. For example, given a prompt about superheroes (e.g. "Which superheroes have the most useful superpowers?"), a "Marvel superhero neuron" might boost the probability the model names specific superheroes from Marvel movies. OpenAI's tool exploits this setup to break models down into their individual pieces. First, the tool runs text sequences through the model being evaluated and waits for cases where a particular neuron "activates" frequently. Next, it "shows" GPT-4, OpenAI's latest text-generating AI model, these highly active neurons and has GPT-4 generate an explanation. To determine how accurate the explanation is, the tool provides GPT-4 with text sequences and has it predict, or simulate, how the neuron would behave. It then compares the behavior of the simulated neuron with the behavior of the actual neuron.
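The first step described above -- run text through the model and collect the sequences where a neuron fires most strongly -- can be sketched in a few lines. This is a toy illustration, not OpenAI's actual tool: the real pipeline records per-token activations inside GPT-2, whereas `neuron_activation` here is a hypothetical stand-in for the "Marvel superhero neuron" from the example.

```python
def neuron_activation(token: str) -> float:
    """Hypothetical 'Marvel superhero neuron': fires on Marvel names."""
    marvel = {"thor", "hulk", "avengers", "iron", "spider-man"}
    return 1.0 if token.lower().strip(".,?!") in marvel else 0.0

def top_activating_sequences(corpus, k=2):
    """Score each sequence by its peak per-token activation; keep the top k.

    These high-activation snippets are what would then be shown to a
    stronger model (GPT-4, in OpenAI's setup) to produce an explanation.
    """
    scored = []
    for text in corpus:
        peak = max(neuron_activation(tok) for tok in text.split())
        scored.append((peak, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

corpus = [
    "Thor and Hulk joined the Avengers.",
    "The weather in Paris was mild.",
    "Iron Man built a new suit.",
]
print(top_activating_sequences(corpus))
```

In the real tool the corpus is large and activations are continuous, but the shape of the step is the same: rank text by how hard the neuron fires, then hand the top examples to the explainer model.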

"Using this methodology, we can basically, for every single neuron, come up with some kind of preliminary natural language explanation for what it's doing and also have a score for how well that explanation matches the actual behavior," Jeff Wu, who leads the scalable alignment team at OpenAI, said. "We're using GPT-4 as part of the process to produce explanations of what a neuron is looking for and then score how well those explanations match the reality of what it's doing." The researchers were able to generate explanations for all 307,200 neurons in GPT-2, which they compiled in a dataset that's been released alongside the tool code. "Most of the explanations score quite poorly or don't explain that much of the behavior of the actual neuron," Wu said. "A lot of the neurons, for example, are active in a way where it's very hard to tell what's going on -- like they activate on five or six different things, but there's no discernible pattern. Sometimes there is a discernible pattern, but GPT-4 is unable to find it."
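The scoring Wu describes -- checking how well an explanation "matches the actual behavior" -- comes down to comparing the activations GPT-4 simulates from the explanation against the neuron's real activations on the same text. One simple way to express that agreement as a single score is a Pearson correlation; the numbers below are made up for illustration, and this is only a sketch of the idea, not OpenAI's exact scoring code.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length activation traces."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Per-token activations on one text sequence (illustrative values):
actual    = [0.1, 0.9, 0.2, 0.8, 0.0]  # the real neuron
simulated = [0.2, 0.8, 0.1, 0.9, 0.1]  # GPT-4 simulating from the explanation
print(f"explanation score: {pearson(actual, simulated):.2f}")
```

A score near 1.0 means the explanation predicts the neuron's firing pattern well; a score near 0 matches the paper's finding that many explanations "don't explain that much of the behavior of the actual neuron."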

"We hope that this will open up a promising avenue to address interpretability in an automated way that others can build on and contribute to," Wu said. "The hope is that we really actually have good explanations of not just what neurons are responding to but overall, the behavior of these models -- what kinds of circuits they're computing and how certain neurons affect other neurons."
