Don't Fire Your Illustrator

My academic training is in Fine Art, painting, and print-
making. My professional career for the past 20 years has
been in software engineering, including machine learning.
This makes me uniquely situated to panic about discuss
image-generative AI systems like Midjourney, DALL-E, etc.

This essay comes to you in two parts (both of which are right
here on this page).

Part I is a mostly-un-opinionated technical description of how
one popular branch of AI image generation currently works. If
you’re already familiar enough with stable diffusion to under-
stand the terms “latent space” and “text transformer,” you
can skip ahead.

Part II is a very opinionated prediction of how this technology
will be successfully used and by whom.

PART I: Stable diffusion

I a) The Latent Space

If you want to talk about colors, there are more and less
useful ways to name them for different tasks. Take the
color “pinkish purplish autumn mist” and make it a little
warmer; what color is that? Mix a little ultramarine, a little
alizarin crimson, a tiny dot of cadmium yellow, and a good
blob of titanium white. Make that a little warmer; what color
is that? Take the RGB color 234, 182, 227, and make it a
little warmer; what color is that? Or the HSV color 308° 22° 92°. Or the Lch color 80/30/330.

When we use lists of numbers to name points in a physical
space, we don’t do so randomly.

Several points plotted on an x and y axis, labeled with the numbers (5, 4), (6, 5) and (-5, -10)

We pick a numbering to ensure that when the numbers are
close, the locations they name are close, and when the
numbers are very different, the locations they name are far
apart.

We can also use a list of numbers to describe a non-physical
space, like the space of colors:

A cube of colors representing all of RGB space, with red increasing to the left, green to the top, and blue into the picture plane

And we want the same rules to apply: similar colors should
have similar numbers, and similar numbers represent similar
colors. There are many ways to give colors numbers —
RGB, Lch, CMYK — and each numbering results in slightly
different relationships between those colors. The goal is
always “put similar things near to one another,” but we might
define “similar things” in a variety of subtly different ways.

The numbers don’t contain the appearance of colors nor the
literal pigments they’re made from — it’s a labeling system,
not a filing system. The numbers for a color are both a label
and a set of instructions: mix this much red light, this much
green light, and this much blue light, and you’ll get the color
that these numbers label.

Words are more complicated than colors. Still, one could
imagine assigning every word a bunch of numbers — perhaps
hundreds, instead of just three — in such a way that “mom”
and “dad” have similar numbers, and “mom” and “prestidig-
itation” are further apart.

Researchers have used AI to take lots and lots of text and (by
assuming that words that appear near each other in text are
related in some way) build a space of words that’s like that.

Whole images are even more complicated than single words
or colors, but (by using thousands of numbers) we can
imagine “spaces” where similar images are represented by
similar lists of numbers.

We could do that by listing every RGB color of every pixel
in an image — that will make some similar images close to
each other! There are downsides, though. That method takes
millions of numbers. Worse, some similar pictures won’t be
near to one another at all: for instance, the RGB pixels in a
picture of a black cat and the RGB pixels in a picture of a white
cat will be very, very different, even if they’re otherwise very
similar pictures of cats.

So how do we build a useful space for images? It’s one thing
to assume words that appear near one another are similar
because the same word gets used in millions of different situa-
tions, with heaps of nearby words. Most images only get used
once, and maybe not near any other images at all!

We can borrow our latent space of words, though: Lots of pict-
ures have captions, labels, or words nearby. Researchers have
used AI to build image spaces where pictures with similar
captions are near one another. Pictures labeled “cat” are near
to each other and pictures labeled “car” are near to each other,
and pictures labeled “Miyizaki cat-bus” are somewhere
between.

In generative AI research, these image-spaces space is called
image “latent spaces.”

A latent space can give you a list of numbers that approxi-
mates any image you might ever want to see, and every
random uninterpretable image, too. It’s Borges’ Library of
Babel but for pictures.

How do we navigate that space full of sense and nonsense to
find the numbers for images we want to see?

I b) Stable diffusion

(Understand that what I’m about to describe, called stable
diffusion, isn’t the only way to accomplish this, but it’s a
popular one)

Imagine seeing a picture of a cat on a staticky television: If you
squint, you can make out the cat. If you wanted to, you could
paint out the static and reveal the cat more clearly.

Imagine seeing a cat behind a lot of static. You could squint,
sketch in some lines, squint at those lines, and probably get a
picture of a cat, though not exactly the same picture of a cat.

And maybe you can imagine seeing pure static and convincing
yourself there’s a cat in there somewhere, and with a lot of
squinting, slowly teasing out a staticky picture of a cat, then
a less staticky one, and then a clear picture of a cat.

This is stable diffusion: We took millions of images, made
them a little noisy, put them in latent space, and trained the
computer to clean them up. And then, we took the noisy
images, made them noisier, and trained the computer to make
them less noisy. And we made those noisier, and then even
noisier, until the images were obliterated, and the computer
could, step by step, hallucinate its way back to some image.
Not necessarily a good one, or a useful one, but a non-noisy
image.

We can also take our word space and train the computer to try
and associate an image in image-latent-space to some words
in word-latent-space and say how likely an image is to have
a particular caption. A picture of a cat is likely to have the
caption “my cute kitty mister french fry” and unlikely to have
the caption “the engine from a 1959 Austin Healey.”

The last piece of the puzzle is called “cross attention,” which
is a fancy way of saying, “glue several AI systems together,
so they can do two things at once.” In particular, we can ask
the computer to remove some static from an image AND
nudge that image to be more likely to be captioned with some
specific text.

And that’s it — that’s generative AI.

Note some important things:

while every image in the training set is representable by some numbers in latent space, the images themselves are not there in any specific way.
To steer the process, the words you type get turned into points in a word latent space, which then gets retranslated into a movement in image latent space. That’s hard! And making small adjustments to the result using just words can be very hard!
The kinds of images that are very common — statistically likely — get organized in bigger and more well-organized parts of latent space.
There are other systems for steering the denoising process — using text labels is just one of them — but all of them involve cross-attention between the denoising and some other goal.

A test of understanding: why does this produce extra fingers
and deformed hands? The wrong way of thinking: every input
image the model has trained on has correct hands, so it should
learn to draw correct hands! The latent space describes every
possible image. The nudge away from noise prioritizes clarity.
The nudge towards a text label is satisfied by any image that
would best be labeled by that text. Image captions rarely
mention people’s hands, especially not with correct and incor-
rect numbers of fingers. There’s not much pressure toward
perfect hands, and adding negative labels like “no deformed
hands” relies on a very small number of images out there
labeled “deformed hands.” (You might have better luck with
a negative label for “polydactyly” since that label correlates
very strongly to extra fingers) Most generative systems have
separate components specifically to correct faces because the
same issue applies; the core system is content if there’s somet-
hing face-like, while our eyes are very picky about faces being
correct, with the eyes pointed in the same direction most of
the time.

A test of understanding: why does adding an artist’s name,
like James Gurney, produce better images overall? Most
image captions are nouns: a person, an object — the pict-
ure’s focal point. The background can be nonsense; if the
foreground is a teapot, it’s fair to caption it as “a teapot.”
Non-teapot parts of the image aren’t under much pressure.
An artist’s style is a gestalt; it doesn’t exist in just one part of
the image, but the whole thing; every pixel is under correlated
pressure, so a coherent outcome is a little more likely.

PART II: Who should use AI image generation, and how

There are some opinions here that will be unpopular with my
artist friends. Some will be unpopular with technologists or
neophilic managers. I certainly don’t want you to think I am
pleased about any of these opinions, or that I want them to
come to pass; these are simply my predictions based on a
goodish understanding of both generative AI and traditional
image-making.

An opinion popular with creatives: Physical media artists
will keep their jobs.

The human desire for real paintings, sculptures, woodcuts,
embroideries, and so on isn’t going to vanish.

I don’t have much to say about that; I mostly mention it so we
can set that segment of artists aside and concentrate on the
larger swath of commercial image-makers.

An opinion popular with tech and unpopular with crea-
tives: The output of physical media will continue to be
used in training generative AIs.

I absolutely understand the desire to prevent this. Knowing
your work has been used without permission to train a
computer to replace people’s livelihoods is extremely
violating. But understanding the technical basis, I don’t
see any plausible way to outlaw it while still allowing fair
use in all the ways human artists have been for thousands of
years. Images similar to those used to build the latent space
may be recoverable with the right prompt and some luck, but
they’re not inherently there, any more than my memory of
an Andy Warhol is inherently a copyright violation. I can sell
Andy Warhol pastiches I make based on that memory. I can
augment my memory by having a morgue file of images to
train my memory on.

If you have a vision for how this can be structured legally, rest-
ricting ML uses of imagery without restricting human uses,
I’d love to hear about it!

An opinion popular with tech-loving managers and
unpopular with creatives: Generative AI will replace a
slice of illustration and writing: in particular, the kind
where the content doesn’t actually matter:

This blog post needs a header image that’s vaguely
related, not because it needs illustration but to fit the
page layout.

This spammy site needs a new blog post once a week for
SEO reasons.

No one from one end of the process to the final consumer
particularly cares about the image or the text as long as it
doesn’t stand out; it is furniture.

This kind of work never paid particularly well and is now
rapidly vanishing. My heart goes out to the people suffering
because this work is vanishing, but I also can’t see any way
out, even with much stronger legal regulation of generative
AI than I expect we’ll ever see.

Why only that slice?

If you’ve ever used a generative system, I can pretty much
guarantee that you spent an embarrassing amount of time
making tiny adjustments to your prompt and retrying. Produ-
cing a compelling image with generative AI is pretty easy;
maybe one in ten images it generates will make you say,
“Wow, cool!” But producing a specific image with generative
AI is sometimes almost impossible.

If you visit (often NSFW, beware!) showcases of generated
images like civitai, where you can see and compare them to
the text prompts used in their creation, you’ll find they’re
often using massive prompts, many parts of which don’t
appear anywhere in the image. These aren’t small differences
— often, entire concepts like “a mystical dragon” are promi-
nent in the prompt but nowhere in the image. These users are
playing a gacha game, a picture-making slot machine. They’re
writing a prompt with lots of interesting ideas and then pulling
the arm of the slot machine until they win… something. A
compelling image, but not really the image they were asking
for.

Why is it so hard to get what you want?

Let’s return to the technical discussion for a second.

Text is a difficult way to steer an image because while the
text latent space is related to the image latent space, there
are still multiple translation steps: from the actual prompt to
the text latent space to a function in the image latent space.
The process can only accommodate so much text at a time
(usually ~75 words; if there are more than that, you must break
the prompt into separate guiding systems in cross-attention).
OK. Are there better ways to direct image generation to have
specific results?

Yes! It’s much easier to translate an image into a latent space
constraint. Images translate very well into image latent space;
that’s what image latent space is for. Here are some ways
folks have invented to prompt image generation using images
rather than words

Create an image that would reduce to the same line art as another image
Create an image that would reduce to the same depth map as was pulled from another image
Create an image with matching poses pulled for another image
Create an image whose style matches another image even though the content differs
match perspective lines of another image
match the colors palette of another image

These are all very powerful constraints that can exert precise
control over the content and composition of a generated
image.

The only challenge in using them is: where do all these guiding
images come from? Who can take the time to understand the
concepts we want illustrations of and turn them into a line
drawing, a sketch of poses, or a style? Is there some existing
job title for that?

An opinion popular with creatives and unpopular with
techy managers: Generative AI isn’t much use for sophis-
ticated needs if there isn’t an illustrator involved.

I believe that will continue to hold true even for future
versions of Midjourney, DALL-E, and so on; I think the
amount of text they can handle will increase, and the quality
and resolution of images they produce will increase, but the
fundamental challenge of getting specific imagery is not going
to vanish without more fundamental changes in the approach.

Finally, an opinion popular with no one: Commercial
illustrators will keep their jobs, but will mostly need to
learn to use AI as a part of their workflow to maintain a
higher pace of work.

This doesn’t mean illustrators will stop drawing and become
prompt engineers. That will waste an immense amount of
training and gain very little. Instead, I foresee illustrators
concentrating even more on capturing the core features of an
image, letting generative AI fill in details, and then correcting
those details as necessary.

Here’s a process for digital painting that I’ve tested and
found… plausible:

Produce a line drawing traditionally, focusing on the composition and key ideas
Have the generative AI suggest a dozen potential approaches to color and lighting; pick one or two
Paint almost entirely over those AI generated pixels, adjusting and correcting the color to suit my vision

Obviously, this is not a workable approach for artists that put
great care and emotion into their color choices. I don’t think
there will be any one approach that works for all artists. But
for artists working on deadlines, I foresee them using AI to
fill in whatever step is the least important and most tedious:
crowd scenes, cityscapes, vegetation. Just like a blog post
header image is furniture for the page, there is furniture for
many images — not important, but still necessary. For better
or worse, that furniture is becoming the territory of generative
AI.

The more concerning problem is that while generative AI
research is heading in this direction, offering more and more
ways to direct image generation using image inputs, the prod-
ucts that are entering the market are not easy to slot into an
illustrator’s workflow at all. All my experiments have been
done running open-data models on my own computer in order
to have useful levels of control.

I have more to say on the subject of machine creativity and
also the gacha-like nature of generative AI, but I think it best
to leave this post here, with that vision of commercial illust-
ration yet to come, and the hope that generative AI products
will start catering to it.

Don't fire your illustrator

Don't Fire Your Illustrator

PART I: Stable diffusion

I a) The Latent Space

I b) Stable diffusion

PART II: Who should use AI image generation, and how

Recommend

广州将支持大型保障房小区试点“物业服务+养老服务”

You Can Jailbreak Your Amazon Fire TV Stick (But Should You?)

Video Shows How Earth's Tectonic Plates Moved Over 1 Billion Years

From data to intelligence: Jeff Denworth reveals Vast Data's journey toward a sm...

Guns N’ Roses首次公开演唱新歌《Perhaps》

uBlock Origin Lite now available on Firefox

软件工程：领导力与价值感

波士顿咨询发布报告，挖掘中老年群体的消费潜力

看完这组动画，被“毛孩子”狠狠治愈了

The best MacBooks for 2023: How to pick the best Apple laptop

About Joyk