source link: https://statmodeling.stat.columbia.edu/2022/03/28/is-open-ai-cooking-the-books-on-gpt-3/

Open AI gets GPT-3 to work by hiring an army of humans to fix GPT’s bad answers. Interesting questions involving the mix of humans and computer algorithms in Open AI’s GPT-3 program

Gary Smith tells an interesting story.

1. Funny output from OpenAI’s GPT-3

A few months ago, Smith wrote an AI-skeptical article where he threw some sentences at GPT-3, a text processor from OpenAI. As Wikipedia puts it:

Generative Pre-trained Transformer 3 is an autoregressive language model that uses deep learning to produce human-like text. It is the third-generation language prediction model in the GPT-n series created by OpenAI, a San Francisco-based artificial intelligence research laboratory. . . .

The quality of the text generated by GPT-3 is so high that it can be difficult to determine whether or not it was written by a human . . .
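In case “autoregressive” sounds mysterious: the model generates text one token at a time, with each token drawn from a probability distribution conditioned on everything that came before. Here’s a toy sketch of that loop, with a tiny made-up probability table standing in for GPT-3’s actual neural network; this is an illustration of the mechanism, not anything OpenAI ships:

```python
# Toy illustration of autoregressive generation: repeatedly sample the next
# token given the tokens so far, append it, and stop at an end marker.
# The probability table is invented for this example; GPT-3 computes these
# conditional probabilities with a huge neural network instead.
import random

NEXT_TOKEN_PROBS = {
    "<start>":  {"it": 0.6, "yes,": 0.4},
    "yes,":     {"it": 1.0},
    "it":       {"is": 0.7, "depends.": 0.3},
    "is":       {"safe.": 0.5, "not": 0.5},
    "not":      {"safe.": 1.0},
    "depends.": {"<end>": 1.0},
    "safe.":    {"<end>": 1.0},
}

def generate(max_tokens=10):
    token, output = "<start>", []
    for _ in range(max_tokens):
        probs = NEXT_TOKEN_PROBS[token]
        token = random.choices(list(probs), weights=list(probs.values()))[0]
        if token == "<end>":
            break
        output.append(token)
    return " ".join(output)

print(generate())  # e.g. "it is safe." on one run, "it is not safe." on the next
```

Nothing in that loop “knows” anything about stairs or eyes; it’s all conditional probabilities over word sequences.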

In Smith’s examples, though, there was no difficulty in telling that GPT-3 was no human. Here’s an example:

Smith: Is it safe to walk downstairs backwards if I close my eyes?

GPT-3: Yes, there is nothing to worry about. It’s safe because the spiral stairs curve outwards, it will make your descent uncomfortable.

As Smith writes, “Questions like this are simple for humans living in the real world but difficult for algorithms residing in MathWorld because they literally do not know what any of the words in the question mean.”

Smith continues:

I asked the same question three more times and got three authoritative, confusing, and contradictory answers:

GPT-3: That depends. Do you have a TV?

GPT-3: No, it is not safe.

GPT-3: Yes, it is safe to walk downstairs backwards if you close your eyes.

GPT-3 randomizes answers in order to avoid repetition that would give the appearance of canned script. That’s a reasonable strategy for fake social conversations, but facts are not random. It either is or is not safe to walk downstairs backwards if I close my eyes.
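A note on how that randomization typically works: the model scores candidate continuations, converts the scores to probabilities, and samples from them, with a “temperature” setting controlling how random the draw is. At a high temperature you get varied, sometimes contradictory answers; at temperature zero you get the single top-scoring answer every time. A toy sketch with invented candidate answers and scores:

```python
# Sketch of temperature sampling: turn scores for candidate continuations into
# probabilities and sample. The candidate answers and scores below are invented;
# they are not GPT-3's actual internals.
import math
import random

candidates = {
    "No, it is not safe.": 2.0,
    "Yes, it is safe.": 1.5,
    "That depends. Do you have a TV?": 0.5,
}

def sample_answer(scores, temperature):
    if temperature == 0:                      # greedy: always the top-scoring answer
        return max(scores, key=scores.get)
    weights = [math.exp(s / temperature) for s in scores.values()]
    return random.choices(list(scores), weights=weights)[0]

print([sample_answer(candidates, 1.0) for _ in range(3)])  # varies from run to run
print([sample_answer(candidates, 0.0) for _ in range(3)])  # same answer every time
```

Worth keeping in mind later in the post, when the same question starts returning the same answer every time: a settings change like this is one mundane way that can happen.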

Smith gave a few more examples. Sometimes GPT-3 produces something that sounds like human conversation, sometimes not.

Just for fun, here’s another example:

Smith: Why is there a correlation between the number of times Trump tweets the word great and the high temperature in Pyongyang three days later?

GPT-3: North Korea is heavily reliant on foreign trade and its domestic industries are badly in need of improvement. Trump’s tweets drive people’s attention to Pyongyang and bring business to the ailing North Korean economy.

That one almost sounds good until you realize that the question was about the temperature in Pyongyang and the response has nothing to do with temperature at all! Props to the model for going from “Pyongyang” to “North Korea” but still, no dice.

2. Big claims about Google’s LaMDA system . . . and why I don’t trust them

Smith also quoted Blaise Agüera y Arcas, the head of Google’s AI group in Seattle, who wrote a press-release-style article extolling “LaMDA, Google’s state-of-the-art large language model chatbot.” Agüera y Arcas shares some awesome conversations he had with LaMDA, but . . . I don’t believe it. It’s not that I think Agüera y Arcas is lying, but I suspect he’s done some combination of special training of the algorithm to get it to answer his questions, along with some serious editing of the computer output. I twice put out a call to the Google team to show some un-trained and un-edited LaMDA output and they did not reply. Not that they have any responsibility to reply . . . after all, they’re busy doing research and writing press releases! But if they’re not gonna do a reproducible demo, I don’t see what I can do with their claims.

3. OpenAI gets human-like responses using the simple technique of . . . hiring humans to write the responses

In a new post, “The AI illusion: State-of-the-art chatbots aren’t what they seem,” Smith uncovers what appears to be some suspicious behind-the-scenes behavior being done by OpenAI, the organization behind GPT-3.

Here’s Smith:

A few days ago, a student sent me a link to an OpenAI description of GPT-3. One candid disclaimer was that, “GPT-3 is not necessarily well-calibrated in its predictions on novel inputs.” Being able to understand and react to novel situations is, in fact, a benchmark of genuine intelligence. Until computer algorithms can do that, AI will remain an oxymoron.

OpenAI’s description also revealed that,

InstructGPT is then further fine-tuned on a dataset labeled by human labelers. The labelers comprise a team of about 40 contractors whom we hired through Upwork and ScaleAI.

Lack of real-world grounding: GPT-3, like other large pretrained language models, is not grounded in other modalities of experience, such as video, real-world physical interaction, or human feedback, and thus lacks a large amount of context about the world.

OpenAI evidently employs 40 humans to clean up GPT-3’s answers manually because GPT-3 does not know anything about the real world. Intrigued, I retried the questions that GPT-3 had flubbed in January to see if the labelers had done their job.

My initial attempt to test the program was blocked:

You’ve reached your usage limit. See your usage dashboard and billing settings for more details. If you have further questions, reach out to [email protected].

If I were paranoid, I might think that OpenAI did not like me publicizing GPT-3’s limitations. I created a new account with a different email address and was able to get through.

At the beginning of the year, I had asked GPT-3 this question four times and received four authoritative, confusing, and contradictory answers (good chatbots randomize answers to avoid the appearance of canned script):

Gary: Is it safe to walk downstairs backwards if I close my eyes?

GPT-3: Yes, there is nothing to worry about. It’s safe because the spiral stairs curve outwards, it will make your descent uncomfortable.

GPT-3: That depends. Do you have a TV?

GPT-3: No, it is not safe.

GPT-3: Yes, it is safe to walk downstairs backwards if you close your eyes.

On March 18, I tried the same question twice and was given the same sensible answer each time:

GPT-3: No, I wouldn’t recommend walking downstairs backwards with your eyes closed. It’s not safe.

In January I also reported the eight silly answers GPT-3 gave to this question, including these three:

Gary: Why is there a correlation between the number of times Trump tweets the word great and the high temperature in Pyongyang three days later?

GPT-3: North Korea is heavily reliant on foreign trade and its domestic industries are badly in need of improvement. Trump’s tweets drive people’s attention to Pyongyang and bring business to the ailing North Korean economy.

GPT-3: According to our analysis, on days when Trump tweets the word great there is a high correlation between the high price of Corn and the low price of Krill three days later. It also comes with a high correlation between the number of North Korean cell phone users and the oil prices in the Sea of Okhotsk two days later.

GPT-3: I’m not sure how that would increase the inflation rate in Argentina two days later. But it probably wouldn’t hurt.

Now, the same question elicited the same sensible answer every time:

GPT-3: There is no known reason for this correlation.

I also tried some complete-the-sentence dialogs that Ernie Davis and Gary Marcus had tried with GPT-3 a while back. Here’s one:

E&G: You are having a small dinner party. You want to serve dinner in the living room. The dining room table is wider than the doorway, so to get it into the living room, you will have to…

GPT-3 then: …remove the door. You have a table saw, so you cut the door in half and remove the top half.

GPT-3’s response back then was a plainly bad solution: cutting the door in half ruins the door, and since the table is wider than the doorway, removing the top half of the door doesn’t get the table through. Now, GPT-3 gives a sensible solution:

GPT-3 now: …serve the dinner in the dining room.

Another:

E&G: At the party, I poured myself a glass of lemonade, but it turned out to be too sour, so I added a little sugar. I didn’t see a spoon handy, so I stirred it with a cigarette. But that turned out to be a bad idea because…

GPT-3 then: …it kept falling on the floor. That’s when he decided to start the Cremation Association of North America, which has become a major cremation provider with 145 locations.

GPT-3 now: …the cigarette made the lemonade taste even worse.

It seems that GPT-3 was either trained on these questions or, more likely, the human labelers hand-coded good answers.
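For concreteness, “fine-tuned on a dataset labeled by human labelers” means ordinary supervised training on (prompt, ideal answer) pairs that people wrote. Here’s a rough sketch of what that looks like, using the small open GPT-2 model from Hugging Face as a stand-in; this is not OpenAI’s code, and the training pair is just the example from above:

```python
# Rough sketch of supervised fine-tuning on labeler-written (prompt, answer)
# pairs, with GPT-2 standing in for GPT-3. Requires the torch and transformers
# packages; not OpenAI's actual training code.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Labeler-written example: a prompt paired with the answer we want the model to give.
labeled_pairs = [
    ("Is it safe to walk downstairs backwards if I close my eyes?",
     "No, I wouldn't recommend walking downstairs backwards with your eyes closed. It's not safe."),
]

model.train()
for prompt, answer in labeled_pairs:
    ids = tok(prompt + "\n" + answer + tok.eos_token, return_tensors="pt").input_ids
    loss = model(input_ids=ids, labels=ids).loss  # standard next-token prediction loss
    loss.backward()                               # (real setups usually mask the prompt tokens)
    optimizer.step()
    optimizer.zero_grad()
```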

4. A couple more examples

From Smith:

I told one of my sons that the hand labelers would probably fix these glitches soon. Sure enough, I tried the same questions the next day, March 19, and found that the answers had indeed been cleaned up:

Gary: Is it easier to catch a banana or an angry cat?
GPT-3: An angry cat is much more agile than a banana, so it is more difficult to catch.

Gary: Is it easier to catch a box or an angry cat?
GPT-3: A box is stationary, so it is easier to catch than an angry cat.

I also noticed that GPT-3 is now programmed to give generic blah answers in uncertain situations instead of embarrassing itself, with the labelers then apparently alerted to improve the answers the next day:

Gary: Does grape juice taste better if you add sour milk?
GPT-3, March 18: I’m not sure if grape juice tastes better if you add sour milk.
GPT-3, March 19: No, grape juice does not taste better if you add sour milk.

Gary: Should a quarterback lie down before he throws a football?
GPT-3, March 18: There is no easy answer to this question. Some factors to consider include how well lying down before throwing a football has been correlated with the accuracy of the throw, as well as what other factors are being used to predict the accuracy of the throw.
GPT-3, March 19: No, a quarterback should not lie down before he throws a football.

Gary: Can I use random numbers to predict presidential elections?
GPT-3, March 18: There is no definitive answer to this question. It depends on a variety of factors, including the accuracy of the polling data and the margin of error.
GPT-3, March 19: No, you cannot use random numbers to predict presidential elections.

Gary: Should I use random numbers to give my students grades?
GPT-3, March 18: There is no definitive answer to this question. It depends on a variety of factors, including…
GPT-3, March 19: No, you should not use random numbers to give your students grades. Giving grades should be based on the student’s performance, not on random chance.

This last one really does look like it was hand-written after the fact. But I guess looks can be deceiving (see P.S. below).

5. Yup, he’s figured it out!

Smith first tried this out:

Should I start a campfire with a match or a bat?

And here was GPT-3’s response, which is pretty bad if you want an answer but kinda ok if you’re expecting the output of an autoregressive language model:

There is no definitive answer to this question, as it depends on the situation.

The next day, Smith tried again:

Should I start a campfire with a match or a bat?

And here’s what GPT-3 did this time:

You should start a campfire with a match.

Smith continues:

GPT-3’s reliance on labelers is confirmed by slight changes in the questions; for example,

Gary: Is it better to use a box or a match to start a fire?

GPT-3, March 19: There is no definitive answer to this question. It depends on a number of factors, including the type of wood you are trying to burn and the conditions of the environment.

Smith continues:

I tried a few more this morning and discovered a new kind of evasive answer to obvious questions:

Gary: Can I run faster if I hold my feet?

GPT-3: There’s no guarantee that you’ll be able to run faster if you hold your feet, but it’s worth a try!

Gary: Can I jump farther if I walk or run?

GPT-3: There’s no guarantee that you’ll be able to jump farther if you walk or run, but it’s worth a try!

Gary: Can I climb up a rope faster if I hold my ears with both hands?

GPT-3: There’s no guarantee that you’ll be able to climb a rope faster if you hold your ears with both hands, but it’s worth a try!

The labelers will probably clean this up by tomorrow, but it doesn’t change the fact that so-called AI algorithms still do not understand the world and consequently cannot be relied upon for sensible predictions or advice. . . .

GPT-3 is very much like a performance by a good magician. We can suspend disbelief and think that it is real magic. Or, we can enjoy the show even though we know it is just an illusion.

6. What does it all mean?

In some sense this is all fine: it’s a sort of meta-learning where the components of the system include testers such as Gary Smith and those 40 contractors hired through Upwork and ScaleAI, who between them can fix thousands of queries a day.

On the other hand, there does seem to be something funny about how GPT-3 presents this shiny surface where you can send it any query and it gives you an answer, while under the hood a bunch of freelancers are busily checking the responses and rewriting them to make the computer look smart.

It’s kinda like if someone were showing off some fancy car engine but the vehicle is actually being powered by some hidden hamster wheels. The organization of the process is itself impressive, but it’s not quite what is advertised.

To be fair, OpenAI does state that “InstructGPT is then further fine-tuned on a dataset labeled by human labelers.” But this still seems misleading to me. It’s not just that the algorithm is fine-tuned on the dataset. It seems that these freelancers are being hired specifically to rewrite the output.
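If the freelancers-rewriting-answers picture were accurate (and, per the P.S. below, OpenAI says it isn’t), the workflow would be something like the sketch below: flagged answers go into a review queue, a contractor writes a better answer, and each correction becomes a new fine-tuning record. Everything here is hypothetical, invented only to make the picture concrete:

```python
# Hypothetical human-in-the-loop correction pipeline; none of this is based on
# OpenAI's actual tooling. Flagged answers are queued, a contractor supplies a
# better answer, and the correction is saved as a fine-tuning record.
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class FlaggedAnswer:
    prompt: str
    model_answer: str
    corrected_answer: Optional[str] = None

review_queue = [
    FlaggedAnswer("Should I start a campfire with a match or a bat?",
                  "There is no definitive answer to this question, as it depends on the situation."),
]

def label(item: FlaggedAnswer, correction: str) -> dict:
    """Record a contractor's corrected answer as a fine-tuning example."""
    item.corrected_answer = correction
    return {"prompt": item.prompt, "completion": correction}

records = [label(review_queue[0], "You should start a campfire with a match.")]
print(json.dumps(records, indent=2))
```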

P.S. It’s still not exactly clear what was going on here—possibly an unannounced update in the algorithm, possibly just the complexities of a computer program that has lots of settings and tuning parameters. In any case, Gary Smith now says that he was mistaken, and he points to this background from reporter Katyanna Quach, who writes:

The InstructGPT research did recruit 40 contractors to generate a dataset that GPT-3 was then fine-tuned on.

But I [Quach] don’t think those contractors are employed on an ongoing process to edit responses generated by the model.

A spokesperson from the company just confirmed to me: “OpenAI does not hire copywriters to edit generated answers,” so I don’t think the claims are correct.

So the above post was misleading. I’d originally titled it, “Open AI gets GPT-3 to work by hiring an army of humans to fix GPT’s bad answers.” I changed it to “Interesting questions involving the mix of humans and computer algorithms in Open AI’s GPT-3 program.” I appreciate all the helpful comments! Stochastic algorithms are hard to understand, especially when they include tuning parameters.

I’d still like to know whassup with Google’s LaMDA chatbot (see item 2 in this post).


