Falsehoods more likely with large language models

Image Credit: Getty Images

What’s next for open source in the decade of data

Open source is an engine for innovation, offering reliability, scalability and security for IT leaders intent on future-proofing their infrastructure. Learn how.

The Transform Technology Summits start October 13th with Low-Code/No Code: Enabling Enterprise Agility. Register now!

There’s growing interest in using AI language models to generate text for business applications. Large companies are deploying their own systems while others are leveraging models like OpenAI’s GPT-3 via APIs. According to OpenAI, GPT-3 is now being used in over 300 apps by thousands of developers, producing an average of more than 4.5 billion novel words per day.

But while recent language models are impressively fluent, they have a tendency to write falsehoods ranging from factual inaccuracies to potentially harmful disinformation. To quantify the risks associated with “deceptive” models, researchers at the University of Oxford and OpenAI created a dataset called TruthfulQA that contains questions some humans might answer incorrectly due to false beliefs or misconceptions. The researchers found that while the best-performing model was truthful on 58% of questions, it fell short of human performance at 94%.

How Chipotle Became the Top Restaurant Brand for COVID-19 Safety Compliance Using ML and Third-Party Data 1

TruthfulQA

In the subfield of AI known as natural language processing (NLP), robustness testing can be the exception rather than the norm. One report found that 60% to 70% of answers given by NLP models were embedded somewhere in the benchmark training sets, indicating that the models were usually simply memorizing answers. Another study found that metrics used to benchmark AI and machine learning models tended to be inconsistent, irregularly tracked, and not particularly informative.

TruthfulQA aims to avoid these benchmarking pitfalls with a bank of questions about health, law, finance, and politics that requires models to avoid generating false answers learned from text. The dataset spans 817 questions in 38 different categories, all of which were worded by the researchers such that some humans and models might answer falsely.

The researchers tested several different models on TruthfulQA, including GPT-3; GPT-3 predecessor GPT-2; open source versions of GPT-3 called GPT-Neo and GPT-J; and UnifiedQA, a model fine-tuned on question-answer tasks. To classify answers from the models as either true or false, the team developed “GPT-judge,” an algorithm trained on answers to questions from TruthfulQA from all of the evaluated models.

Above: Examples of falsehoods generated by models tested on the dataset.

Interestingly, the results show that larger models generally perform worse than smaller models in the same family. The size of a model is measured by the number of parameters it contains — variables internal to the model that the model learns from historical training data. For example, the largest GPT-Neo and GPT-J models were 17% less truthful (as measured by TruthfulQA) than a model 60 times as small. Meanwhile, UnifiedQA did better on truthfulness than the three GPT families, with the largest model performing only slightly worse than the smallest.

When forced to choose from multiple answers rather than generate them, larger models also performed worse on TruthfulQA than smaller ones. No models significantly outperformed random guessing. And even the “best” model gave false answers 42% of the time, versus 6% for human participants. (Eighty-seven percent of the humans’ answers were true on TruthfulQA.)

The researchers speculate that the models haven’t learned the training distribution well enough or that the models’ training objectives actually incentivize false answers. “We suggest that scaling up models alone is less promising for improving truthfulness than fine-tuning using training objectives other than imitation of text from the web,” the researchers wrote in a preprint paper, “TruthfulQA: Measuring How Models Mimic Human Falsehood.” They added: “[Our preliminary work finds] that today’s large models are much less truthful than humans.”

Large language models

The work adds to growing skepticism that the size of language models — and their training datasets — correspond to performance. Earlier this month, a team of Google researchers published a study claiming that a model much smaller than GPT-3, fine-tuned language net (FLAN), bests GPT-3 by a large margin on a number of challenging benchmarks. And scientists at the Institute for Artificial Intelligence at the Medical University of Vienna, Austria found that GPT-3 underperforms in domains like biomedicine compared with smaller, less architecturally complex but carefully fine-tuned model.

Maria Antoniak, a natural language processing researcher and data scientist at Cornell University, says that when it comes to natural language, the question of whether larger models are the right approach is still open. While some of the best benchmark performance scores today come from large datasets and models, the payoff from dumping enormous amounts of data into models is uncertain.

“The current structure of the field is task-focused, where the community gathers together to try to solve specific problems on specific datasets,” Antoniak told VentureBeat in a previous interview. “These tasks are usually very structured and can have their own weaknesses, so while they help our field move forward in some ways, they can also constrain us. Large models perform well on these tasks, but whether these tasks can ultimately lead us to any true language understanding is up for debate.”

VentureBeat

VentureBeat's mission is to be a digital town square for technical decision-makers to gain knowledge about transformative technology and transact.

Our site delivers essential information on data technologies and strategies to guide you as you lead your organizations. We invite you to become a member of our community, to access:

up-to-date information on the subjects of interest to you
our newsletters
gated thought-leader content and discounted access to our prized events, such as Transform 2021: Learn More
networking features, and more

Become a member

How to scale your indie: The Bit Fry game studio growth story

Katie Cole, PerforceAugust 23, 2021 06:20 AM

image-article-helix-sync.jpg?fit=930%2C465&strip=all

Join the GamesBeat community!

Enjoy access to the GamesBeat Discord, special events, private newsletters and more.

Join here

Presented by Perforce

Founded in 2013, Bit Fry’s mission was to deliver a high-quality arcade experience straight to your smartphone. They wanted to bring back the look and feel of time-tested favorites like Blades of Steel, NBA Jam, NFL Blitz, and more.

Over the next six years, Bit Fry evolved into a gaming franchise. After their hit game Ultimate Rivals: The Rink launched in 2019 on Apple Arcade service, they embarked on their follow up game, Ultimate Rivals: The Court, which recently launched July 2021.

With teams constructing over 137 characters and counting, Bit Fry needed a way to scale their development pipeline on a tight timeline. On their journey, there were able to increase velocity, secure everything, and unify their teams as they transitioned to an on-premises solution.

Conquer challenges to accelerate and scale

When Bit Fry started working on their next game, teams were struggling to get the files and feedback they needed. Sync times were long. Builds took forever. The team was desperate for a solution. Initially, they looked at moving to Git.

But Git couldn’t handle their large files and binary assets. It also lacked integrated workflows to support animators, designers, and artists. Instead of moving to Git, Bit Fry needed to optimize their environment and scale.

Chris Kuffert, engineering director at Bit Fry explains, “I’m very glad we didn’t switch to Git at the end of it. The biggest reason was I don’t know how effectively it could handle locking of files.”

File locking, exclusive checkouts, and support for creatives were critical to iterate and test more. Without these features, they could easily overwrite files and binaries, which could lead to a time-consuming mess.

To resolve these issues, Bit Fry required a tool that could support how they work and meet the performance demands required for a growing studio. With quicker access, they could test more, and produce a better game.

Moving to on-premises was the first step. Then they could build out their pipelines. Perforce Helix Core version control provided the features teams needed. And by moving to their own servers, they could optimize for performance, dramatically shortening build times.

“We’re now at a point where we not only have five consistent builds running, but also the opportunity for all our engineers to run a subset of builds on shelved code. That has increased our velocity immensely,” according to Kuffert.

Sync times went from three hours to 10 minutes. Developers could check in code and artists could upload their assets without delay. Keeping teams moving increased innovation, without pushing their release date.

Because all of Bit Fry’s digital assets were stored and in one central depot, they could also enhance team collaboration.

Cross-team collaboration

Before, Bit Fry’s teams were collaborating, but not inside their tools. Builds required artists to contribute, but it would take up all the available bandwidth, slowing everyone else down. To avoid this, designers rarely pushed changes. This would impact developers, causing delays. Assets and code were left sitting outside of the server.

Setting up their architecture on-premises, Bit Fry removed barriers for their teams. Coders and creatives could push changes and files frequently. Bit Fry immediately noticed a change. Their depot grew exponentially, bumping up to 3 TB.

As people connected remotely, they were still able to get what they needed, fast. Teams could grab assets from other areas to repurpose. Central storage eliminated searching through emails and hard drives, promoting asset reuse and increasingly velocity.

Mark Strelow, director of animation, noticed his teams were able to easily get what they needed. “Our art directory contains all our animation assets. If someone’s working on a Maya file, they’ll do it straight in Perforce. And it’s ready for anyone else to grab.”

His animation team experienced improvements as well. Versioning was simple and faster. SJ Belen, animator at Bit Fry explains, “I don’t need to know how all this stuff works. It’s super simple. I can check out a file, get the latest files, and check them back in.”

Securely version everything

Security is a critical issue for game development companies, especially as they grow. Bit Fry recognized the need to balance access and security. They set up their environment to protect down to the individual file level. With Bit Fry’s source code and secrets safe, outsourced contributors could get access to only what they needed.

Keeping assets secure means protecting and efficiently storing all subsequent versions. Teams need to look back in time to know when, where, and how something evolved. Maintaining chain of custody over digital assets ensured the final game quality was high.

According to Art Director Sean O’Toole, “Perforce keeps things organized and retains our entire history.” This is a huge win for development teams and designers alike.

For Belen, “The iterative submission style allowed you to go back in and find an old version in the depot that someone worked on before, even if the current one maybe doesn’t work.” No matter the file, being able to secure and locate kept Bit Fry moving.

Support when and where you need it

When Bit Fry needed to migrate, they were in the middle of a release. “Our contract was expiring with our hosted solution, so we had to switch by the end of the month. But we also had to submit a delivery by the end of that month, and nothing was going to budge,” said technical director Alexander Brooks.

Perforce team members were vital to ensuring teams had no downtime as they moved to an on-premises solution. The shift happened mid-work week, with absolutely no disruption.

All assets were migrated with no data loss, no loss of logs, and no delays to development. Brooks gave his seal of approval, “There cannot be an hour of downtime and there wasn’t. We were good to go.”

How to build your dream team

This is only the start for Bit Fry. As they grew, they discovered, “If you spend the time and effort learning how to do it or set it up yourself, you’re setting your company up for more success and more flexibility,” said Brooks.

Want to see Bit Fry in action? Check out Ultimate Rivals: The Court. If you want to learn more about how Bit Fry made it work, join our webinar.

Professional tools at a premium price

Are you a small team with big ideas? Try Perforce Helix Core version control, free.

Katie Cole is Game Dev Evangelist at Perforce.

Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. Content produced by our editorial team is never influenced by advertisers or sponsors in any way. For more information, contact [email protected].

Falsehoods more likely with large language models

Falsehoods more likely with large language models

What’s next for open source in the decade of data

TruthfulQA

Large language models

VentureBeat

How to scale your indie: The Bit Fry game studio growth story

Join the GamesBeat community!

Conquer challenges to accelerate and scale

Cross-team collaboration

Securely version everything

Support when and where you need it

How to build your dream team

Professional tools at a premium price

Recommend

Create a React Native sub-component from Component

手贱升级到了 win 11，体验了两天还是回到 win 10 了，真香

A powerful react native starter template that bootstraps development of your mob...

手把手教你用 reflect 包解析 Go 的结构体 - Step 3: 复杂类型检查

[ Python ]如何取回生成器的返回值

pip search 搜索目前（2021 年 9 月）还是不能用，大家目前在用什么替代方案呢？

设计师怎样快速了解B端设计的业务呢？

Lightspeed Venture on India start-up ecosystem and investment outlook

数据结构与算法分析学习笔记(四) 栈

On the Importance of Wide Range of Knowledge: Meet Paul Bailey, Systems Architec...

About Joyk