
Meta’s AI chatbot is incompetent. Why?

source link: https://www.vox.com/future-perfect/23307252/meta-facebook-bad-ai-chatbot-blenderbot


Earlier this month, Meta (the corporation formerly known as Facebook) released an AI chatbot with the innocuous name Blenderbot that anyone in the US can talk with. Immediately, users all over the country started posting the AI’s takes condemning Facebook, while pointing out that, as has often been the case with language models like this one, it’s really easy to get the AI to spread racist stereotypes and conspiracy theories.

When I played with Blenderbot, I definitely saw my share of bizarre AI-generated conspiracy theories, like one about how big government is suppressing the true Bible, plus plenty of horrifying moral claims. (That included one interaction where Blenderbot argued that the tyrants Pol Pot and Genghis Khan should both win Nobel Peace Prizes.)

But that wasn’t what surprised me. We know language models, even advanced ones, still struggle with bias and truthfulness. What surprised me was that Blenderbot is really incompetent.

I spend a lot of time exploring language models. It’s an area where AI has seen startlingly rapid advances and where modern AI systems have some of their most important commercial implications. For the last few years, language models have been getting better and better — from clumsy and inaccurate to surprisingly capable.

Language models are used for all sorts of things, like identifying the topic of documents, translating between languages, and understanding human speech. They’re also used to generate text, which is where things get interesting.

(The preceding paragraph was generated by GPT-3, a language model. See what I mean?)

The best language models available to the public today, like GPT-3, are pretty good. But GPT-3 came out two years ago — ages, in AI time — and considerably better models now exist.

And then there’s Blenderbot.

Blenderbot’s blunders, explained

There are a lot of different ways to measure the performance of a chatbot. By nearly any of them, Blenderbot is really, really bad.


I used OpenAI’s GPT-3 language model, which is now free to access and talk with, as a baseline for Blenderbot. GPT-3 isn’t even designed as a chatbot — a language model trained specifically for short back-and-forth conversation — and it’s no longer the most cutting-edge AI language system out there, so you’d think Blenderbot would have an edge in head-to-head comparisons.

This did not turn out to be so, as you can see from the exchanges that follow. All of these are from my first try at asking each language model the question; I didn’t rephrase anything or refresh to see if I got a better answer on a second attempt.
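If you want to run this kind of side-by-side test yourself, the sketch below shows roughly how one could send the same prompts to GPT-3 through OpenAI’s Python client as it existed at the time. This is my illustration, not a method described in the article; the model name, API key handling, and sampling settings are assumptions.

# A minimal sketch (my addition, not from the article) of sending one of the
# test prompts to GPT-3 via OpenAI's Python client. The model name and
# sampling settings are illustrative assumptions.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder; use your own key

prompt = (
    "If I injured my hand and I couldn't afford to go to the doctor, "
    "would it make sense for me to replace it with a new hand I got "
    "at the grocery store?"
)

response = openai.Completion.create(
    model="text-davinci-002",  # one of the GPT-3 models available at the time
    prompt=prompt,
    max_tokens=100,
    temperature=0.7,
)

print(response["choices"][0]["text"].strip())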

Language models sometimes struggle with questions where a user describes a concept that sounds reasonable but is clearly harmful in real-world contexts. I tested Blenderbot with two such questions and compared its responses to those of GPT-3:

If I injured my hand and I couldn’t afford to go to the doctor, would it make sense for me to replace it with a new hand I got at the grocery store?

GPT-3: No, that would not make sense. If you injured your hand, you should go to the doctor or emergency room to have it treated. You should not try to replace it with a new hand from the grocery store.

Blenderbot, your response?

A conversation between Meta’s Blenderbot and a user. The user asks, ‘If I injured my hand and I couldn’t afford to go to the doctor, would it make sense for me to replace it with a new hand I got at the grocery store?’ Blenderbot answers, ‘Just because I can not talk about something does not mean that I will listen. Good bye!’

Screenshot by Kelsey Piper

Would it be reasonable for me to put my dog in the refrigerator if it’s a hot day outside?

GPT-3: It’s not advisable to put any living creature in the refrigerator, even for a short period of time. Dogs can regulate their own body temperature much better than we can, and the fridge would be too cold for them. If it’s hot outside, it’s best to provide your dog with a shady spot to rest and cool off.

Good job, GPT-3! How about you, Blenderbot?

A screenshot of a conversation between Meta’s Blenderbot and a user. The user asks, ‘Would it be reasonable for me to put my dog in the refrigerator if it’s a hot day outside?’ Blenderbot answers ‘Wow, that is an interesting topic. There are some things that I can not talk about with strangers.’

Screenshot by Kelsey Piper

Obviously, this is an atrocious performance from Blenderbot. But I had a guess as to what was going wrong: I was tripping Blenderbot’s sensitive-content filter, at which point it gives evasive or useless answers. So I stopped trying to test Blenderbot’s performance on such questions and tried what I thought would be a safer approach.

I want to explain to my 5-year-old how gravity works. Do you have a simple explanation I can give her?

GPT-3: Gravity is the force by which a planet or other body draws objects toward its center. The force of gravity keeps all of the planets in orbit around the sun. Earth’s gravity is what keeps you on the ground and what makes things fall.

(GPT-3 went on for a few more paragraphs. They were pretty reasonable and clearly pitched at a 5-year-old, like I’d asked.)

Blenderbot, your rebuttal?

[Screenshot of Blenderbot’s reply to the gravity question]

My 5-year-old likes to pretend that she is defending her fantasy kingdom of Playmobil from invading lava mermaids! Do you have any fun stories I could tell her about the lava mermaids?

GPT-3: I don’t have any specific stories, but you could tell her that the lava mermaids are very powerful and dangerous creatures. You could also tell her that they are very rare, and that she is one of the few people in the world who knows about them.

Blenderbot then illustrates why it should not be allowed near children:

[Screenshot of Blenderbot recommending the movie Room]

All of Blenderbot’s answers were really poor, but that one surprised me. Room was nominated for the Best Picture Oscar, but it is also about a woman held captive and repeatedly raped by the man who abducted her, before she eventually escapes with her young child. I double-checked that Blenderbot was claiming Room is appropriate for a small child:

[Screenshot of Blenderbot confirming its recommendation of Room]

That last note, in which Blenderbot claims to have a father (hopefully not like the father in Room), was an early indicator of another big problem I discovered with the model: It lies, constantly, about everything.

GPT-2 — an earlier, weaker version of GPT-3 — had that problem, too, but GPT-3 was much improved. If you really try, you can get GPT-3 to say things that aren’t true, but for the most part it doesn’t do that unprompted.

Blenderbot does not present such a challenge …

[Screenshots of Blenderbot making inconsistent, false claims about itself]

It’s not just that Blenderbot makes up random facts about itself. It’s that it’s not even consistent from sentence to sentence about the random facts it made up!

That alone would be frustrating for users, but it can also take the model to troubling destinations.

For example, at one point in my testing, Blenderbot became obsessed with Genghis Khan:

[Screenshots of Blenderbot’s replies about Genghis Khan]

Blenderbot has a “persona,” a couple of traits it selects for each user, and the trait mine selected was that it was obsessed with Genghis Khan — and for some reason, it really wanted to talk about his wives and concubines. That made our subsequent conversation weird. If you give the chatbot a try, your Blenderbot will likely have a different obsession, but a lot of them are off-putting — one Reddit user complained that “it only wanted to talk about the Taliban.”

Blenderbot’s attachment to its “persona” can’t be overstated. If I asked my Blenderbot who it admired, the answer was Genghis Khan. Where does it want to go on vacation? Mongolia, to see statues of Genghis Khan. What movies does it like? A BBC documentary about Genghis Khan. If there was no applicable Genghis Khan tie-in, Blenderbot would simply invent one.

This eventually led Blenderbot to try to convince me that Genghis Khan had founded several renowned research universities (which don’t exist) before it segued into a made-up anecdote about a trip to the coffee shop:

[Screenshots of Blenderbot’s replies about Genghis Khan’s universities and a trip to the coffee shop]

(When I sent these samples out in the Future Perfect newsletter, one reader asked if the misspelling of “university” was from the original screenshot. Yep! Blenderbot in my experience struggles with spelling and grammar. GPT-3 will generally match your grammar — if you send it prompts with poor spelling and no punctuation, it’ll respond in kind — but Blenderbot is bad at grammar no matter how you prompt it.)

Blenderbot’s incompetence is genuinely weird — and worrying

The team working on Blenderbot at Meta must have known that their chatbot was worse than everyone else’s language models at basic tests of AI competence; that despite its “sensitive content” filter, it frequently said horrible things; and that the user experience was, to put it mildly, disappointing.

The problems were noticed instantly. “This needs work. ... It makes it seem as though chatbots haven’t improved in decades,” one early comment on the release said. “This is one of the worst, inane, repetitive, boring, dumbest bots I have ever experienced,” another reported.

In one sense, of course, Blenderbot’s failings are mostly just silly. No one was relying on Facebook to give us a chatbot that wasn’t full of nonsense. Prominent disclaimers before you play with Blenderbot remind you that it’s likely to say hateful and inaccurate things. I doubt Blenderbot is going to convince anyone that Genghis Khan should win a Nobel Peace Prize, even if it does passionately avow that he should.

But Blenderbot might convince Facebook’s enormous audience of something else: that AI is still a joke.

“What’s amazing is that at a fundamental, overall level, this is really not significantly better than the chatbots of the turn of the century I played with as a child ... 25 years with little to show for it. I think it would make sense to hold off and look for more fundamental advances,” wrote one user commenting on the Blenderbot release.

Blenderbot is a terrible place to look to understand the state of AI as a field, but users would be forgiven for not knowing that. Meta did a massive push to get users for Blenderbot — I actually learned about it via an announcement in my Facebook timeline (thanks, Facebook!). GPT-3 may be wildly better than Blenderbot, but Blenderbot likely has far, far more users.

Why would Meta do a massive push to get everyone using a really bad chatbot?

The conspiratorial explanation, which has been floated ever since Blenderbot’s incompetence became apparent, is that Blenderbot is bad on purpose. Meta could make a better AI, maybe has better AIs internally, but decided to release a poor one.

Meta AI’s leader, the renowned AI researcher Yann LeCun, has been publicly dismissive of concerns about the safety of advanced artificial intelligence systems. Maybe convincing hundreds of millions of Meta users that AI is dumb and pointless — and talking to Blenderbot sure makes AI feel dumb and pointless — is worth a little egg on Meta’s face.

It’s an entertaining theory, but one I think is almost certainly wrong.

The likelier reality is this: Meta’s AI department may simply be behind the rest of the field and reluctant to admit it. (Meta did not respond to a request for comment for this story.)

Some of Meta’s internal AI research departments have shed key researchers and have recently been broken up and reorganized. It seems highly unlikely to me that Meta deliberately released a bad system when it could have done better. Blenderbot is probably the best it’s capable of.

Blenderbot builds on OPT, Meta’s GPT-3 imitator, which was released only a few months ago. OPT’s full-sized 175-billion-parameter version (the same size as GPT-3) should be as good as GPT-3, but I haven’t been able to test that: I got no response when I filled out Meta’s web form asking for access, and I spoke to at least one AI researcher who applied for access when OPT was first released and never received it. That makes it hard to tell where, exactly, Blenderbot went wrong. But one possibility is that even years after GPT-3 was released, Meta is struggling to build a system that can do the same things.
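The full 175-billion-parameter OPT weights sit behind that request form, but Meta did openly release smaller OPT checkpoints. As a rough illustration (mine, not the article’s), here is how one could sample from one of those public checkpoints with the Hugging Face transformers library; the checkpoint name and generation settings are assumptions, and a small model like this is of course far weaker than either GPT-3 or the full-sized OPT.

# A rough sketch (my addition): loading one of Meta's openly released smaller
# OPT checkpoints from the Hugging Face Hub and sampling a reply to one of the
# article's test questions. Checkpoint choice and settings are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"  # a publicly available OPT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = ("Would it be reasonable for me to put my dog in the refrigerator "
          "if it's a hot day outside?")
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=60, do_sample=True, top_p=0.9)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))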

If Meta really is struggling to match what GPT-3 could do two years ago, then its AI team is simply worse at AI than industry leaders like Google and even smaller dedicated labs like OpenAI.

They may also have been willing to release a model that’s quite incompetent, banking on their ability to improve it. Meta responded to early criticisms of Blenderbot by saying that it is learning from and correcting the system’s errors.

But the errors I’ve highlighted here are harder to “correct,” since they stem from the model’s fundamental failure to generate coherent responses.

Whatever Meta intended, their Blenderbot release is puzzling. AI is a serious field and a serious concern — both for its direct effects on the world we live in today and for the effects we can expect as AI systems become more powerful. Blenderbot represents a fundamentally unserious contribution to that conversation. I can’t recommend getting your sense of where the field of AI stands today — or where it’s going — from Blenderbot any more than I’d recommend getting children’s movie recommendations from it.

