6

Alignment Research Center

 9 months ago
source link: https://www.vox.com/23889632/paul-christiano-beth-barnes-alignment-research-center-evaluations-ai-future-perfect-50-2023
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client
Illustrated portraits of Paul Christiano and Beth Barnes

Lauren Tamaki for Vox

Filed under:

Paul Christiano and Beth Barnes are trying to make advanced AI honest, and safe

Christiano and Barnes have helped mainstream concerns about AI misalignment.

Dylan Matthews is a senior correspondent and head writer for Vox's Future Perfect section and has worked at Vox since 2014. He is particularly interested in global health and pandemic prevention, anti-poverty efforts, economic policy and theory, and conflicts about the right way to do philanthropy.

The first arguments that AI “misalignment” — when artificially intelligent systems do not do what humans ask of them, or fail to align with human values — could pose a huge risk to humankind came from philosophers and autodidacts on the fringes of the actual AI industry. Today, though, the leading AI company in the world is pledging one-fifth of its computing resources, worth billions of dollars, toward working on alignment. What happened? How did AI companies, and the White House, come to take AI alignment concerns seriously?

Paul Christiano and Beth Barnes are key characters in the story of how AI safety went mainstream.

Christiano has been writing about techniques for preventing AI disasters since he was an undergrad, and as a researcher at OpenAI he led the development of what is now the dominant approach to preventing flagrant misbehavior from language and other models: reinforcement learning from human feedback, or RLHF. In this approach, actual human beings are asked to evaluate outputs from models like GPT-4, and their answers are used to fine-tune the model to make its answers align better with human values.

It was a step forward, but Christiano is hardly complacent, and often describes RLHF as merely a simple first-pass approach that might not work as AI gets more powerful. To develop methods that could work, he left OpenAI to found the Alignment Research Center (ARC). There, he is pursuing an approach called “eliciting latent knowledge” (ELK), meant to find methods to force AI models to tell the truth and reveal everything they “know” about a situation, even when they might normally be incentivized to lie or hide information.

That is only half of ARC’s mission, though. The other half, soon to become its own independent organization, is led by Beth Barnes, a brilliant young researcher (she got her bachelor’s degree from Cambridge in 2018) who did a short stint at Google DeepMind before joining Christiano at OpenAI, and now at ARC. Barnes is in charge of ARC Evals, which conducts model evaluations: She works with big labs like OpenAI and Anthropic to pressure-test their models for dangerous capabilities. For example, can GPT-4 set up a phishing page to get a Harvard professor’s login details? Not really, it turns out: It can write the HTML for the page, but fails to find web hosting.

But can GPT-4 use TaskRabbit to hire a human to do a CAPTCHA test for it? It can — and it can lie to the human in the process. You may have heard of that experiment, for which Barnes and the evaluations team at ARC were responsible.

ARC and ARC Evals’ reputations and those of its leaders are so formidable in AI safety circles that repeating to people that it’s okay if you’re not as smart as Paul Christiano has become a bit of a meme. And it’s true, it’s totally fine to not be as smart as Christiano or Barnes (I’m definitely not). But I’m glad that people like them have taken on a problem this serious.

You've read 1 article in the last 30 days.

Contributions are a key part of the future of Vox

Readers rely on Vox for clear, nuanced coverage that not only illuminates the issues, but poses solutions, too. And we rely on help from our readers: Advertising and grants cover the majority of our costs, but we count on contributions to help us close the gaps in our budget. In fact, we’re looking to reach 95,000 individual contributions before the end of the year. Will you make the next contribution right now? Our average gift is just $20 — and it goes a long way in helping us keep our work free. Vox is here to help everyone understand what’s shaping the world — not just the people who can afford to pay for a subscription. We believe that’s an important part of building a more equal society. Join that mission by making a contribution today.

One-Time

Monthly

Annual

$5/month

$10/month

$25/month

$50/month

Other

We accept credit card, Apple Pay, and Google Pay. You can also contribute via


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK