
The future of AI is entertainment

source link: https://blog.plan99.net/the-future-of-ai-is-entertainment-1841fbb400df

How GANs may be combined to create the YouTube of tomorrow

The most attention-grabbing part of Star Trek’s Holodeck is the way it generates an artificial physical reality.

The second most attention-grabbing part is the way the computer of the U.S.S. Enterprise can generate scenarios and stories on the fly, given only a basic writing prompt. Many of Star Trek’s most memorable episodes involve the crew role-playing in stories they’ve ‘programmed’ into the computer, and of course, things often go wrong in unexpected ways.

Programming the Holodeck usually consists of the crew just telling the ship’s computer what they want. Given that this part of the technology doesn’t violate any known laws of physics, it’s the part most likely to actually happen within my lifetime.

It’s becoming clear that if we use VR goggles instead of holograms, this sort of entertainment technology is actually within our reach. Shamefully and bizarrely it feels like almost nobody is talking about it: AI technology is developing in the middle of a vast but groundless moral panic in which research results are being restricted for political reasons. But let’s put that aside for a moment and explore what fascinating technologies may be just around the corner.

We are on the cusp of developing AI software that can imagine entire fictional worlds and stories just for us, on the fly, given only vague hinting about what is wanted. These stories could take place inside AI generated virtual realities filled with content put together by anyone who can speak. Individual and collaborative worlds will be uploaded to a future alternative to YouTube, where user-created/AI-assisted content will be streamed, shared and remixed.

How can we build such a thing?

Generative networks

The core technological leap that makes this possible is the generative adversarial network, or GAN for short. A GAN is conceptually simple: two neural networks are pitted against each other in a kind of AI fight. The first network (the discriminator) is trained on some data set, like pictures of faces, and is set the task of guessing whether its input is real or synthetic. The second network (the generator) generates outputs given random input, and is set the task of producing outputs the discriminator can’t tell are fake.

Both networks are trained simultaneously, each getting better as training proceeds. Eventually the generator becomes good enough at generating content that the discriminator can’t tell the difference — and nor can humans.
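To make the adversarial setup concrete, here is a minimal training-loop sketch in PyTorch. Everything in it is illustrative: the layer sizes, batch size and the random stand-in “data” are made up, and a real model for faces or music would be far larger and trained for much longer.

```python
# Minimal GAN training loop (illustrative sketch, not a production model).
import torch
import torch.nn as nn

latent_dim, data_dim, batch = 16, 64, 32

# Generator: random noise in, fake sample out.
generator = nn.Sequential(
    nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
# Discriminator: sample in, probability that the sample is real out.
discriminator = nn.Sequential(
    nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())

loss_fn = nn.BCELoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

for step in range(1000):
    real = torch.randn(batch, data_dim)   # stand-in for a batch of real training data
    fake = generator(torch.randn(batch, latent_dim))

    # Train the discriminator: real data is labelled 1, generated data 0.
    d_loss = (loss_fn(discriminator(real), torch.ones(batch, 1)) +
              loss_fn(discriminator(fake.detach()), torch.zeros(batch, 1)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Train the generator: try to make the discriminator say 1 for fakes.
    g_loss = loss_fn(discriminator(fake), torch.ones(batch, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
```

The key point is the pair of opposing objectives: the discriminator is rewarded for spotting fakes, the generator for fooling it.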

It’s not a complicated idea, but the result is profound: for the first time, machines have been given imagination. Here are a few of my favourite results so far.

The now-standard example is randomly generated photos of faces. It can be done for other things too, like cars, cats … whatever the network was trained on. Usefully, we’ve also learned how to control the precise style of the resulting images.

AI can compose music. OpenAI’s MuseNet can combine different styles and create pleasant albeit unexciting music, given either a fixed prompt or your own knob-tweaking.

Nvidia produced an AI composition and got an orchestra to play it.

There’s also the famous GPT-2 text generation model, which generates random text given a starting prompt.


Then there’s GauGAN, which converts colour-coded sketches into imaginary photos.

You can of course also do this in 3D.

And finally, GANs can render virtual worlds given a basic conceptual model generated by a game engine (in effect, rendering the textures and lightmaps).

You get the idea. GANs can be used in several different ways:

  • Generating totally random imitations of the sort of thing they were trained on (like faces, or pieces of music).
  • Filling in detail given a basic skeleton or sketch of something.
  • Predicting “what comes next” in various ways (not only with words).

This is neat, but we’re still a long way from building the Holodeck. The rest of this article is devoted to describing the problems you’d have to solve to build the full VR experience.

Problem 1: Combining GANs

The first issue we hit is that — so far — there isn’t one universal neural network that can generate everything. Most research has focussed on creating standalone AIs that specialise in one particular task, like generating cats, or musical compositions. A single AI that can generate both appears to be, for now, out of reach.

It’s also not entirely clear how you’d train such a thing. AIs need to be given lots of examples to learn what they’re meant to do. We have very few examples of full-blown imaginary interactive worlds: just video games, and there probably aren’t enough of those to learn how to generate them. Besides, we’d want a lot of fine-grained control; totally random worlds aren’t sufficient.

A lot of the cleverness that’ll be needed to build a real virtual world will thus have to come from combining GANs. One for the music. One for the faces. One for the bodies. One to warp and deform the face into a texture map that can be fed to a 3D engine.

Wait, what? Didn’t I just present a video of GANs generating 3D worlds above?

Well, yes. But it’s not currently clear to me that it makes sense to try and generate everything using neural networks, for two big reasons: cost and coherence.

Problem 2: Cost

Training a GAN that sometimes generates something plausible isn’t too expensive. Training one well enough to generate something good every time is very difficult and expensive. The hardware needed to do the training requires specialised chips, some types of which aren’t even sold — they’re available only to rent from firms like Google. But the real cost comes from creating a big and clean enough data set to train on. Researchers tend to reuse the same datasets again and again, because creating them takes a lot of time and effort.

It’s not just training. Actually using the networks — called inferencing — is also a task that can stress the hardware most people have. Google started shipping its mobile phones with dedicated AI accelerators, but that’s still very new tech. I can’t run TensorFlow tasks on my ordinary MacBook laptop because I don’t have the fancy GPUs required. Consumer usage of AI tasks thus often gets outsourced to the cloud, which is fine, but clouds have fat profit margins for a reason — they’re expensive!

So we want to keep the amount of AI we’re doing minimal, to keep it affordable. Fortunately, in many cases we don’t need to train an AI to do a task, because we’ve already studied the problem and found good algorithms for it using our own human ingenuity.

Places where we probably don’t need AI

AI can generate music by predicting the next waveform sample, but we don’t need to predict what music sounds like split second by split second. It’s enough for an AI to generate something more like sheet music and then render it using ordinary music software, which is already capable of producing synthetic orchestras that sound convincingly real. It’s probably much faster to generate MIDI files and then use classical algorithms to render them to audio than to go directly from AI to audio.
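As a rough sketch of why the symbolic route is cheap, here is what emitting MIDI looks like with the Python mido library. The notes are arbitrary placeholders rather than the output of any model; the point is that a few bars of symbolic music is a handful of tiny event messages, which a conventional synthesiser then renders to audio.

```python
# Sketch: emit symbolic music (MIDI) rather than raw audio samples.
# The notes below are arbitrary placeholders, not the output of any model.
import mido

mid = mido.MidiFile()
track = mido.MidiTrack()
mid.tracks.append(track)

track.append(mido.Message('program_change', program=48, time=0))  # string ensemble patch
for note in (60, 64, 67, 72):  # a C major arpeggio
    track.append(mido.Message('note_on', note=note, velocity=64, time=0))
    track.append(mido.Message('note_off', note=note, velocity=64, time=480))

mid.save('sketch.mid')  # a few hundred bytes; a synthesiser turns this into audio
```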

AI can render photos of 3D scenes, but we already know how to render photorealistic worlds in real time.

AI can predict how physics causes objects to bump and collide, but the classical mechanics governing everyday objects has been understood for centuries, and fast, hardware-accelerated physics simulation has been a part of high-end games for years.

Beyond the issue of cost, there’s another reason to avoid AI in these cases.

Problem 3: Coherence

If you look closely at GAN-generated output, it rapidly becomes clear that there are often problematic glitches. This is true even of models trained for weeks on huge quantities of data.

These glitches are fun to point out, so let’s do it! GPT-2’s imaginary stories seem to make sense only if you aren’t paying attention:

These four-horned, silver-white unicorns were previously unknown to science

The AI learned that unicorns have horns, but didn’t learn the rather critical detail that they only have one. GPT-2 was asked to write a fantasy story:

Aragorn drew his sword, and the Battle of Fangorn was won. As they marched out through the thicket the morning mist cleared, and the day turned to dusk.

Is it morning, or is it evening? The AI doesn’t have a really firm grip on time.

This happens with pictures too.


GANs get amazing results on faces because they’ve been given incredibly consistent training data — the internet is full of pictures of celebrities on very consistent block-colour backgrounds, and we already have algorithms that can do face detection so cropping and centering the face is easy. But GANs have a hard time understanding ‘common sense’ things that appear less often, like the idea that earrings and eye colours usually match.

For faces this can probably be eliminated with just bigger and better models, but the general issue remains: ask an AI to generate something and occasionally it will produce something that doesn’t make any sense, because it is learning to ‘guess’ at what it’s meant to do. When we already know enough about the world to have rigorous algorithms for computing the right answer, it’ll likely always be preferable to use them, because those algorithms are based on real understanding and will get it right every time.

This is especially true given that the more you rely on AI to generate the details of your world, the more the errors will compound.

Controlling the generation

So far we’ve seen only very limited control over what the AI produces. You can sketch images with labels and it generates a photo. You can tweak knobs to get different faces. But in Star Trek, the characters describe what they want by just saying it. Can we do that too?

Remarkably, the answer is yes!

We’ve already nearly mastered speech recognition, again using AI technology. Google’s modern speech recognition is practically as good as humans, and may soon be better. All that’s left is to ‘imagine’ things based on textual descriptions. Enter StackGAN.


GANs have a remarkable ability to do “transfers” — complicated transformations from one kind of input to another, based on learned examples. StackGAN translates from text to photos: exactly what we need.
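Here is a sketch of what that control loop might look like in code. The speech_recognition package and its recognize_google call are real; generate_image is a hypothetical stand-in for a StackGAN-style text-to-image model, since no such off-the-shelf function exists.

```python
# Sketch: "say what you want" -> text prompt -> generated picture.
# speech_recognition is a real package; generate_image() is a hypothetical
# stand-in for a StackGAN-style text-to-image model.
import speech_recognition as sr

def transcribe(wav_path: str) -> str:
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)
    return recognizer.recognize_google(audio)  # sends the audio to Google's recogniser

def generate_image(prompt: str) -> bytes:
    # Placeholder: a real system would run a text-conditioned generator here.
    return b""

prompt = transcribe("description.wav")  # e.g. "a small bird with a red head and white wings"
picture = generate_image(prompt)
```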

Putting it all together

So we can start to see what it might take to generate 3D worlds given only a recording of someone saying what they want.

First we need a GPT-2-type model that auto-completes plausible stories and descriptions given a writing prompt. Mistakes in the text can be corrected by the user incrementally adding more detail to the prompt. This is necessary because generative models are, at some deep level, just bluffing their way through the world. They can produce predictable, turgid prose, but will find it tough to imagine anything truly new or interesting. The human input is what will make the difference between a boring, seen-it-all-before world and something worth exploring.
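The text side of this already exists off the shelf. A minimal sketch using the Hugging Face transformers library and the public GPT-2 checkpoint (the prompt is just an example):

```python
# Sketch: auto-complete a scene description from a human writing prompt
# using the public GPT-2 checkpoint via Hugging Face transformers.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
prompt = "The holodeck doors opened onto a rain-soaked Victorian street,"
result = generator(prompt, max_length=80, num_return_sequences=1)
print(result[0]["generated_text"])
```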

Next we need a model that can expand those generated descriptions into images, placements of objects in scenes, characters and so on. The outputs of these models would be fed as data to ordinary game engines like Unreal, because those engines are already very good at converting 3D data into pixels in ways that are reliably convincing, highly controllable and cost efficient.
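What ‘feeding the outputs as data to a game engine’ could look like is still an open design question. One hypothetical shape for that intermediate representation, sketched in Python (all field names and values here are invented for illustration, not taken from any real engine or model):

```python
# Hypothetical intermediate representation between the text models and a
# game engine. Every field name and value here is invented for illustration.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Placement:
    asset: str                          # e.g. an ID for a generated mesh or texture
    position: Tuple[float, float, float]
    rotation_degrees: float = 0.0

@dataclass
class Scene:
    description: str                    # the text the scene was expanded from
    mood: str                           # could drive lighting and the musical score
    placements: List[Placement] = field(default_factory=list)

scene = Scene(
    description="a rain-soaked Victorian street at dusk",
    mood="ominous",
    placements=[Placement(asset="gaslamp_01", position=(2.0, 0.0, 5.0))],
)
# A bridge process would translate this into engine-native assets, for example
# an Unreal level populated through its import tooling.
print(scene)
```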

Characters need scripts — back to the text generation models again — and voices, but speech synthesis is something else neural networks are good at, so that’s not a problem. They also need plausible interactions and emotions. AI can learn to do that too: one of the last demos I saw at Google before I left was an AI annotating a children’s story with notes about how the characters were feeling at each point. AI-driven scripts and interactions are a topic I visited back in 2017.

The experience may benefit from an emotional, swelling musical score. Name the pieces of music you want your track to be inspired by, and a GAN will generate one that sounds like them but which is nonetheless unique. This is going to play havoc with copyright law.

Every time the AI makes a mistake and generates something implausible, the human can provide more input and set it back on track.

The final results can be published on a new YouTube-like site for sharing the resulting worlds and sometimes monetising them (hopefully not via annoying adverts). The source code of the game, the experience, the virtual lesson, the architectural exploration or whatever else people create ends up being English text.

Conclusion

What will become of today’s artists, musicians, actors and scriptwriters?

In the same way that much ink has been spilled over the fate of truck drivers in a world of self-driving vehicles, I suspect we will soon be drowning in mass hysteria about the unemployment of culture creators. But I see no reason to worry in the long run. Creativity will become primarily about imagination, rather than the painful logistics of converting imagined things into reality. Lab results are impressive but often cherry-picked, and will take a long time to percolate through into regular usage. By the time they do, people will have already raised their ambitions high enough that AI assistance will seem like a basic requirement rather than a job threat.

I can also foresee an equivalent of the YouTube effect: the ability to easily create movies, music and games based on nothing more than words will result in us drowning in vast quantities of derivative, uninspired content based on whatever the AIs consider the most likely prediction — stories set in the present day, in ordinary cities, with ordinary-looking people and ordinary, everyday events. YouTube wouldn’t work without viral sharing, algorithmic recommendations and other ways to sort the wheat from the chaff, and it’ll be the same for our future Holodeck.

Are you ready? No? Then get writing!

