source link: https://www.washingtonpost.com/technology/2023/04/01/chatgpt-cheating-detection-turnitin/

Student or AI bot? New ChatGPT writing detectors can get it wrong.

Five high school students helped our tech columnist test a ChatGPT writing detector coming to 2.1 million teachers from Turnitin. It missed enough to get someone in trouble.

Analysis by Geoffrey A. Fowler, Columnist
April 1, 2023 at 6:00 a.m. EDT
Lucy Goetz, a student at Concord High School, California, was surprised to discover new software had erroneously flagged her essay as being partly completed by AI. (Andria Lo for The Washington Post)

High school senior Lucy Goetz got the highest possible grade on an original essay she wrote about socialism. So imagine her surprise when I told her that a new kind of educational software I’ve been testing claimed she got help from artificial intelligence.

A new AI-writing detector from Turnitin — whose software is already used by 2.1 million teachers to spot plagiarism — flagged the end of her essay as likely being generated by ChatGPT.

“Say what?” says Goetz, who swears she didn’t use the AI writing tool to cheat. “I’m glad I have good relationships with my teachers.”

After months of sounding the alarm about students using AI apps that can churn out essays and assignments, teachers are getting AI detection technology of their own. On April 4, Turnitin is activating the software I tested for some 10,700 institutions including the University of California, assigning “generated by AI” scores and sentence-by-sentence analysis to student work. It joins a handful of other free detectors already online. For many teachers, AI detection offers a weapon to deter a 21st-century form of cheating.

But AI alone won’t solve the problem AI created. The flag on a portion of Goetz’s essay was an outlier, but it shows detectors can sometimes get it wrong, with potentially disastrous consequences for students. Detectors are being introduced before they’ve been widely vetted, and AI technology is moving so fast that any tool is likely already out of date.

It’s a high-stakes moment for educators: Ignore AI, and cheating could run rampant. Yet even Turnitin’s executives tell me that treating AI purely as the enemy of education makes about as much sense in the long run as trying to ban calculators.

Ahead of Turnitin’s launch this week, a “significant majority” of universities in the United Kingdom have told the company to hold off activating AI scores on their student work, according to UCISA, the professional body for digital educators. So have some 50 American institutions.

To see what’s at stake, I asked Turnitin for early access to its software. Five high school students, including Goetz, volunteered to help me test it by creating 16 samples of real, AI-fabricated and mixed-source essays to run past Turnitin’s detector.

The result? It got over half of them at least partly wrong. Turnitin accurately identified six of the 16 — but failed on three, including a flag on 8 percent of Goetz’s original essay. And I’d give it only partial credit on the remaining seven, where it was directionally correct but misidentified some portion of ChatGPT-generated or mixed-source writing.

Turnitin claims its detector is 98 percent accurate overall. And it says situations such as what happened with Goetz’s essay, known as a false positive, happen less than 1 percent of the time, according to its own tests.

Turnitin also says its scores should be treated as an indication, not an accusation. Its software shades suspected AI passages in blue, not red, and links to teacher resources underneath its score. Still, will millions of teachers understand they should treat AI scores as anything other than fact? Unlike accusations of plagiarism, AI cheating has no source document to reference as “proof.”

“Our job is to create directionally correct information for the teacher to prompt a conversation,” Turnitin chief product officer Annie Chechitelli tells me. “I’m confident enough to put it out in the market, as long as we’re continuing to educate educators on how to use the data.” She says the company will keep adjusting its software based on feedback and new AI developments.

The question is whether that will be enough. “The fact that the Turnitin system for flagging AI text doesn’t work all the time is concerning,” says Rebecca Dell, who teaches Goetz’s AP English class in Concord, Calif. “I’m not sure how schools will be able to definitively use the checker as ‘evidence’ of students using unoriginal work.”

With no source document to point to, “this leaves the door open for teacher bias to creep in,” says Dell.

For students, that makes the prospect of being accused of AI cheating especially scary. “There is no way to prove that you didn’t cheat unless your teacher knows your writing style, or trusts you as a student,” says Goetz.

Why detecting AI is so hard

Spotting AI writing sounds simple, but it isn’t. When a colleague recently asked whether I could tell the difference between real and ChatGPT-generated emails, I didn’t perform very well.

Detecting AI writing with software involves statistics. And statistically speaking, the thing that makes AI distinct from humans is that it’s “extremely consistently average,” says Eric Wang, Turnitin’s vice president of AI.

Systems such as ChatGPT work like a sophisticated version of auto-complete, looking for the most probable word to write next. “That’s actually the reason why it reads so naturally: AI writing is the most probable subset of human writing,” he says.

Turnitin’s detector “identifies when writing is too consistently average,” Wang says.

The challenge is that sometimes a human writer may actually look average.

On economics, math and lab reports, students tend to hew to set styles, meaning they’re more likely to be misidentified as AI writing, says Wang. That’s likely why Turnitin erroneously flagged Goetz’s essay, which veered into economics. (“My teachers have always been fairly impressed with my writing,” says Goetz.)
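Wang’s description suggests a statistical test: score how “surprising” each word is under a language model, then flag text whose surprisal barely varies. The sketch below is an illustration only, using a toy unigram word-frequency model with made-up counts; real detectors like Turnitin’s rely on neural language models, and the threshold here is invented for the example.

```python
import math
from collections import Counter

def surprisal_scores(text, corpus_counts, total):
    """Per-word surprisal (-log2 probability) under a toy unigram model."""
    scores = []
    for word in text.lower().split():
        p = corpus_counts.get(word, 1) / total  # unseen words get a count of 1
        scores.append(-math.log2(p))
    return scores

def looks_consistently_average(text, corpus_counts, total, var_threshold=2.0):
    """Flag text whose word-level surprisal is unusually uniform,
    i.e. 'consistently average' in Wang's phrase."""
    scores = surprisal_scores(text, corpus_counts, total)
    mean = sum(scores) / len(scores)
    variance = sum((s - mean) ** 2 for s in scores) / len(scores)
    return variance < var_threshold

# A tiny invented "corpus" of word frequencies, purely for illustration.
corpus = Counter({"the": 50, "cat": 10, "sat": 20, "on": 15, "mat": 5})
total = sum(corpus.values())
```

Human writing mixes predictable and surprising word choices, so its surprisal variance tends to be higher; text that hugs the average everywhere looks machine-like under this kind of test, and formulaic human prose, like the set styles Wang describes, can trip it too.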

Wang says Turnitin worked to tune its systems to err on the side of requiring higher confidence before flagging a sentence as AI. I saw that develop in real time: I first tested Goetz’s essay in late January, and the software identified much more of it — about 50 percent — as being AI generated. Turnitin ran my samples through its system again in late March, and that time only flagged 8 percent of Goetz’s essay as AI-generated.

But tightening up the software’s tolerance came with a cost: Across the second test of my samples, Turnitin missed more actual AI writing. “We’re really emphasizing student safety,” says Chechitelli.

Turnitin does perform better than other public AI detectors I tested. One introduced in February by OpenAI, the company that invented ChatGPT, got eight of our 16 test samples wrong. (Independent tests of other detectors have declared they “fail spectacularly.”)

Turnitin’s detector faces other important technical limitations, too. The six of our 16 samples it got completely right were all either 100 percent student work or 100 percent ChatGPT. But when I tested it with essays mixing AI and human sources, it often misidentified individual sentences or missed the human part entirely. And it couldn’t spot the ChatGPT in papers we ran through Quillbot, a paraphrasing program that remixes sentences.

What’s more, Turnitin’s detector may already be behind the state of the AI art. My student helpers created samples with ChatGPT, but since they did that writing, the app has received a software update, GPT-4, with more creative and stylistic capabilities. Google also introduced a new AI bot called Bard. Wang says addressing them is on his road map.

Some AI experts say any detection efforts are at best setting up an arms race between cheaters and detectors. “I don’t think a detector is long-term reliable,” says Jim Fan, an AI scientist at Nvidia who used to work at OpenAI and Google.

“The AI will get better, and will write in ways more and more like humans. It is pretty safe to say that all of these little quirks of language models will be reduced over time,” he says.

Is detecting AI a good idea?

Given the potential — even at 1 percent — of being wrong, what’s the hurry to release an AI detector into software that will touch so many students?

“Teachers want deterrence,” says Chechitelli. Teachers are extremely worried about AI, she says, and helping them see the scale of the actual problem will “bring down the temperature.”

Some educators worry it will actually raise the temperature.

Mitchel Sollenberger, the associate provost for digital education at the University of Michigan-Dearborn, is among the officials who asked Turnitin not to activate AI detection for his campus at its initial launch.

He has specific concerns about how false positives on the roughly 20,000 student papers his faculty run through Turnitin each semester could lead to baseless academic-integrity investigations. “Faculty shouldn’t have to be expert in a third-party software system — they shouldn’t necessarily have to understand every nuance,” he says.
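The arithmetic behind his concern is simple. This back-of-envelope sketch uses the figures in the article (roughly 20,000 papers a semester and a false-positive rate of up to 1 percent) and assumes, for illustration, that every paper is human-written:

```python
papers_per_semester = 20_000   # papers run through Turnitin at UM-Dearborn
false_positive_rate = 0.01     # "less than 1 percent," per Turnitin's own tests

# Upper bound on human-written papers wrongly flagged each semester.
wrongly_flagged = int(papers_per_semester * false_positive_rate)
print(wrongly_flagged)  # up to 200 students facing a baseless flag
```

Even a rate that sounds negligible per paper can, at campus scale, translate into hundreds of potential academic-integrity investigations a semester.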

Ian Linkletter, who serves as emerging technology and open-education librarian at the British Columbia Institute of Technology, says the push for AI detectors reminds him of the debate about AI exam proctoring during pandemic virtual learning.

“I am worried they’re marketing it as a precision product, but they’re using dodgy language about how it shouldn’t be used to make decisions,” he says. “They’re working at an accelerated pace not because there is any desperation to get the product out but because they’re terrified their existing product is becoming obsolete.”

Deborah Green, CEO of UCISA in the U.K., tells me she understands and appreciates Turnitin’s motives for the detector. “In some ways it’s an inevitable development,” she says. But most of her members want to hit pause. “What we need is time to satisfy ourselves as to the accuracy, the reliability and particularly the suitability of any tool of this nature.”

It’s not clear how the idea of an AI detector fits into where AI is headed in education. “In some academic disciplines, AI tools are already being used in the classroom and in assessment,” says Green. “The emerging view in many U.K. universities is that with AI already being used in many professions and areas of business, students actually need to develop the critical thinking skills and competencies to use and apply AI well.”

There’s a lot more subtlety to how students might use AI than a detector can flag.

My student tests included a sample of an original student essay written in Spanish, then translated into English with ChatGPT. In that case, what should count: the ideas or the words? What if the student was struggling with English as a second language? (In our test, Turnitin’s detector appeared to miss the AI writing, and flagged none of it.)

Would it be more or less acceptable if a student asked ChatGPT to outline all the ideas for an assignment, and then wrote the actual words themselves?

“That’s the most interesting and most important conversation to be having in the next six months to a year — and one we’ve been having with instructors ourselves,” says Chechitelli.

“We really feel strongly that visibility, transparency and integrity are the foundations of the conversations we want to have next around how this technology is going to be used,” says Wang.

For Dell, the California teacher, the foundation of AI in the classroom is an open conversation with her students.

When ChatGPT first started making headlines in December, Dell focused an entire lesson with Goetz’s English class on what ChatGPT is, and isn’t good for. She asked it to write an essay for an English prompt her students had already completed themselves, and then the class analyzed the AI’s performance.

The AI wasn’t very good.

“Part of convincing kids not to cheat is making them understand what we ask them to do is important for them,” said Dell.
