
Ask HN: Can someone ELI5 transformers and the “Attention is all we need” paper?

source link: https://news.ycombinator.com/item?id=35977891
326 points by Sai_ 7 hours ago | 134 comments
I have zero AI/ML knowledge but Steve Yegge on Medium thinks that the team behind Transformers deserves a Nobel.

Makes me want to better understand this tech.

Edit: thank you for some amazing top level responses and links to valuable content on this subject.

The "Attention is All You Need" paper introduced a new way for AI to read and understand language, much like how we might read a comic book.

As you read each panel of a comic book, you don't just look at the words in the speech bubbles, but you also pay attention to who's talking, what they're doing, and what happened in the previous panels. You might pay more attention to some parts than others. This is sort of like what the Transformer model does with text.

When the Transformer reads a sentence, it doesn't just look at one word at a time. It looks at all the words at once, and figures out which ones are most important to understand each other. This is called "attention." For example, in the sentence "The cat, which is black, sat on the mat," the Transformer model would understand that "cat" is connected to "black" and "sat on the mat."

The "attention" part is very helpful because, like in a comic book, understanding one part of a sentence often depends on understanding other parts. This makes the Transformer model really good at understanding and generating language.

Also, because the Transformer pays attention to all parts of the sentence at the same time, it can be faster than other models that read one word at a time. This is like being able to read a whole page of your comic book at once, instead of having to read each panel one by one.

Explaining it for a slightly older audience, a transformer is a type of artificial neural network designed for processing sequences, like sentences in a text. It's especially known for its use in natural language processing (NLP), which is the field of AI that deals with understanding and generating human language.

The Transformer is unique because it uses a mechanism called "attention" to understand the relationships between words in a sentence, which works like this:

(1) Encoding: First, the Transformer turns each word in a sentence into a list of numbers, called a vector. These vectors capture information about the word's meaning.

(2) Self-Attention: Next, for each word, the Transformer calculates a score for every other word in the sentence. These scores determine how much each word should contribute to the understanding of the current word. This is the "attention" part. For example, in the sentence "The cat, which is black, sat on the mat," the words "cat" and "black" would get high scores when trying to understand the word "black" because they are closely related.

(3) Aggregation: The Transformer then combines the vectors of all the words, weighted by their attention scores, to create a new vector for each word. This new vector captures both the meaning of the word itself and the context provided by the other words in the sentence.

(4) Decoding: Finally, in a task like translation, the Transformer uses the vectors from the encoding phase to generate a sentence in the target language. It again uses attention to decide which words in the original sentence are most relevant for each word it's trying to generate in the new sentence.

One key advantage of the Transformer is that it can calculate the attention scores for all pairs of words at the same time, rather than one at a time like previous models. This allows it to process sentences more quickly, which is important for large tasks like translating a whole book.
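If it helps to see the shape of steps (1)-(3) in code, here is a minimal numpy sketch; the word vectors and the projection matrices are random placeholders, not anything learned from data.

  import numpy as np

  rng = np.random.default_rng(0)
  words = ["The", "cat", "which", "is", "black", "sat"]
  d = 8                                    # size of each word vector (arbitrary)

  # (1) Encoding: one vector per word (random stand-ins here).
  X = rng.normal(size=(len(words), d))

  # Learned query/key/value projections (random stand-ins here).
  Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
  Q, K, V = X @ Wq, X @ Wk, X @ Wv

  # (2) Self-attention: a score for every pair of words, turned into weights.
  scores = Q @ K.T / np.sqrt(d)
  weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

  # (3) Aggregation: each word's new vector is a weighted mix of all value vectors.
  new_X = weights @ V
  print(weights.shape, new_X.shape)        # (6, 6) and (6, 8)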

Is there a way to have recursively constructed attentional architectures? It would seem like the same process that you describe could be even more useful if it could be applied at the level of sentences, paragraphs, etc.

The importance of the "Attention is All You Need" paper by Vaswani et al. (2017) is that it introduced the Transformer model architecture.

The model is so named because it "transforms" one sequence into another. For example, in a machine translation task, it transforms a sentence in one language into a sentence in another language.

The key innovation of the Transformer model is the use of self-attention mechanisms. This means that instead of processing the input sequence word by word, the model considers all the words in the sequence at the same time and learns to pay "attention" to the most important ones for the given task.

In essence, the Transformer model is a design for building network architectures that can process data in parallel and focus on different parts of the data depending on the task at hand. The Transformer model has proven to be highly effective and flexible, and has been adopted in many variants and applications, including BERT, GPT, T5, and many others.

Great explanation. Thank you all for contributing to our learning!

> in a machine translation task, it transforms a sentence in one language into a sentence in another language.

Here English is being translated to which language - I'm assuming vectors? Might be a silly question; I'm assuming that's where the origin of the word "Transformer" lies.

It's ChatGPT-generated. It even leaked part of the prompt in the intro.

I especially disagree with:

> natural language processing (NLP), which is the field of AI that deals with understanding and generating human language.

How is "attention" different from using tokens > vector database > cosine similarity?

If you're specifically focused on semantic similarity, I would say that attention adds to the dimensionality of the vector space. Distances between tokens can vary depending on context.

The Yannic Kilcher review is quite good.

https://youtu.be/iDulhoQ2pro

I can't ELI5 but I can ELI-junior-dev. Tl;dw:

Transformers work by basically being a differentiable lookup/hash table. First, your input is tokenized, and N tokens (this constitutes the attention frame) are encoded based both on token identity and on position in the attention frame.

Then there is an NxN matrix that is applied to your attention frame, "performing the lookup query" over all other tokens in the frame, so every token gets a "contextual semantic understanding" that takes in both all the other stuff in the attention frame and its relative position.

GPT is impressive because the N is really huge and it has many layers. A big N means you can potentially access information farther away. Each layer gives more opportunities to summarize and integrate long-range information in a fractal process.

Two key takeaways:

- differentiable hash tables

- encoding relative position using periodic functions (see the sketch below)

NB: the attention frame tokens are actually K-vectors (so the frame is a KxN matrix) and the query matrix is an NxNxK tensor IIRC but it's easier to describe it this way
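To make that second takeaway concrete, here is a minimal sketch of the sinusoidal positional encoding, roughly as in the original paper; the sizes are arbitrary examples, not taken from any particular model.

  import numpy as np

  def positional_encoding(n_positions, d_model):
      # Return an (n_positions, d_model) matrix of sine/cosine position signals.
      positions = np.arange(n_positions)[:, None]        # 0, 1, 2, ...
      dims = np.arange(0, d_model, 2)[None, :]           # even dimensions
      angles = positions / (10000 ** (dims / d_model))   # a different period per dimension
      pe = np.zeros((n_positions, d_model))
      pe[:, 0::2] = np.sin(angles)                       # even dims get sine
      pe[:, 1::2] = np.cos(angles)                       # odd dims get cosine
      return pe

  # Added to the token embeddings so the (otherwise order-blind) attention
  # can tell position 3 from position 30.
  pe = positional_encoding(n_positions=16, d_model=8)
  print(pe.shape)  # (16, 8)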

I appreciate the explanation, but I don't know that a junior dev would understand most of this. I may be just a web developer, but I couldn't understand most of it. I'd still have to read for 30 minutes to grok it all.

Yeah, sorry, it still requires math and probably some exposure to ML basics.

I think one hole in the simplified description is that "differentiable" is not an adjective that applies to hash tables.

Differentiable relative to what? What is (x) in the d(hashtable)/d(x) equation?

I think it applies because lookups can be done by multiplying one-hot vectors (or masks) with matrices; this is roughly analogous to what happens when we multiply Q with K^T in a self-attention head.

Read this: https://e2eml.school/transformers.html#table_lookup

And then read: https://e2eml.school/transformers.html#attention

Matrix multiplication is differentiable (it is just sums and products), so you can calculate partial derivatives of these operations. The ability to do that is what allows gradient descent optimization via the chain rule.

  > Differentiable relative to what?
  > What is (x) in the d(hashtable)/d(x) equation?
I think the derivative we actually care about is `∂L/∂(hashtable-lookup)` but here the `hashtable-lookup` is actually the matrix multiplication mentioned above. We want to know how much the "loss" (or error) changes with respect to each of our "lookups". Knowing how each of our "lookups" causes our model to err from the output we expect, allows us to optimise it.

Note: it's not really a "lookup" in the normal sense of the word. Hashtable lookups are discontinuous since we either have a value for a particular key or we do not. Our matrix multiplication is sort of like a fuzzy, probabilistic, continuous form of lookup in which nearby keys get partially activated based on their similarity to the query, and in which a small change in the query causes continuous changes to the output.

As far as my understanding of the self-attention equation (e.g. `softmax(QK^T / sqrt(d_k))V`) goes, it's actually quite important that we get this fuzzy output in which lots of keys get partially activated for a particular query. If it only picked the maximum similarity and ignored the rest, there would be less information propagating through the network and it'd be harder for the network to learn relationships/interactions between inputs. This is why we scale `QK^T` by `1/sqrt(d_k)`: to pass a tighter range of values into the `softmax()` function (which importantly generates probabilities that sum to 1, but contains exponentials which give it a tendency to over-emphasize the maximum value and ignore other values if they are too far apart).
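A small numpy sketch of that equation, with random stand-ins for Q, K and V, shows the "fuzzy lookup" behaviour: every key gets some weight, rather than a single hard hit.

  import numpy as np

  rng = np.random.default_rng(0)
  n, d_k = 4, 8                                # 4 tokens, key width (arbitrary)
  Q = rng.normal(size=(n, d_k))                # queries: "what am I looking for?"
  K = rng.normal(size=(n, d_k))                # keys:    "what do I advertise?"
  V = rng.normal(size=(n, d_k))                # values:  "what do I hand back?"

  def softmax(x):
      x = x - x.max(axis=-1, keepdims=True)    # numerical stability
      e = np.exp(x)
      return e / e.sum(axis=-1, keepdims=True)

  weights = softmax(Q @ K.T / np.sqrt(d_k))    # each row sums to 1: a "soft" address
  out = weights @ V                            # a blend of values, not a single hit

  # A hard hash lookup would instead do: out[i] = V[weights[i].argmax()]
  print(weights.round(2))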

That's exactly the point, though! It's surprising. A hashtable is a map from keys to values. Making it differentiable means that a small change in the key also makes a small change in the value!

What does it mean for a lookup/hash table to be differentiable?

I'm not an ML expert, but I know a bit about math.

It's "differentiable" in the same way that e.g. the "jump function" (Heaviside step function) is differentiable (not as a function from real numbers to real numbers, but as a distribution). It's derivative is the "point impulse function" (Dirac delta function), which, again, is a distribution, not a real function.

Distributions are nicely defined in math, but can't really be operated with numerically (at least not in the same way as real/float functions), but you can approximate them using continuous functions. So instead of having a function jump from 0 to 1, you "spread" the jump and implement it as a continuous transition from e.g. `0-epsilon` to `0+epsilon` for some tiny epsilon. Then you can differentiate it as usual, even numerically.

Similarly, hash table lookup is a discontinuous function - the result of `hash.get(lookup)` is just `value` (or `null`). To make it continuous, you "spread" the value, so that nearby keys (for some definition of "nearby") will return nearby values.

One way to do this, is to use the scalar product between `lookup` and all keys in the hashtable (normalized, the scalar product is close to 1 if the arguments are "nearby"), and use the result as weights to multiply with all values in the hashtable. That's what the transformer does.
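As a toy sketch of that "spread-out" lookup (made-up numbers, nothing trained): the lookup key is compared with every stored key via normalized dot products, and the result is a weighted mix of all stored values.

  import numpy as np

  rng = np.random.default_rng(1)
  d = 8
  keys = rng.normal(size=(5, d))                  # 5 stored keys
  values = rng.normal(size=(5, d))                # 5 stored values
  lookup = keys[2] + 0.1 * rng.normal(size=d)     # a key "near" the third stored key

  def unit(x):
      return x / np.linalg.norm(x, axis=-1, keepdims=True)

  similarity = unit(keys) @ unit(lookup)          # close to 1 for nearby keys
  weights = np.exp(similarity) / np.exp(similarity).sum()
  result = weights @ values                       # mostly values[2], plus a bit of the rest

  print(weights.round(2))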

Differentiable, in this context, means that it can be learned by gradient descent, which uses derivatives to adjust a neural network model's parameters to minimise some error measure. As for how that applies to a hash function, I think the lookup gives some sort of weighting for each possible output, for each input, with the largest weightings corresponding to a "match". But tbh I'm not certain on that last part...

> As for how that applies to a hash function, I think the lookup gives some sort of weighting for each possible output, for each input, with the largest weightings corresponding to a "match"

Note that in general one "matches" more than one thing in which case (IIRC) the result will be a weighted linear combination of the query results.

I do think that this collapses to a datastructure equivalent in function to the "normal hash table" (but not in performance, lol) in the degenerate case where every input yields strictly a single result with zero activation in all other results, so it's not invalid to call it a differentiable lookup table.

Going to go out on a limb and say they are probably referring to the gradient calculus required for updating the model.

https://en.wikipedia.org/wiki/Differentiable_programming

See automatic differentiation.

Correct, but note that if you subject a standard hash table algo to AD it won't magically become a transformer. (Hashes in the "normal construction" are discrete functions and thus aren't really continuous or differentiable, and neither are lookup tables.)

You can take a (calculus) derivative of the mathematical construct that represents the lookup table. And yeah, it is very not obvious how a "lookup/hash table" could be differentiable, based on how it's implemented, say, in a leetcode exercise. That's part of the genius of the transformer.

It's not really doing hashing, which is random, but instead it makes meaningful connections, like connecting an adjective to its noun, or a pronoun to a previous reference of the name. That is why it is called "Attention is all you need" and not "Hashing is all you need". Attention includes some meaning; it is attentive to some aspect or another.

Hashing does not have to be random, it just has to compress the domain. Modulo number is a perfectly valid hash for some use cases.

Edit: just looked it up, doesn't even have to compress the domain, identity hash is a thing.

I wanted to ask the same thing, and I've always been wondering: how is the meaning of the aforementioned 'differentiable' related to the same term in math?

Not an expert (so this could well be slightly off), but here is a physics analogy:

You're rolling a ball down a plank of wood and you want it to roll off the plank and continue to a certain spot on the ground. You know that if the ball rolls past the target, you have to angle the ramp lower. If it fails to reach the target, you angle the ramp higher. If it goes past by a lot, you make a big change to the ramp angle. If it goes past by a little, you make a small change to the ramp angle. In this way your error tells you something about the adjustment you need to make to the system to produce the desired output.

Think of a function from the ramp angle (network weights) to how far the ball lands from the target (the error). The derivative of this function tells you what kind of change in angle results in what kind of change in the error, and you can use that to update the ramp until you hit the target. To say that the model is differentiable is to say that it's possible to infer from the error what kind of changes you need to make to the weights.

Same thing. You need everything to be differentiable in order to run gradient descent. To a first approximation, training a neural network is just gradient descent.

The neural net is just a math function, continuous even, fully differentiable in all input points. In order to "learn" anything we compute gradients with respect to the function parameters. They get "nudged" slightly towards a better response, and we do this billions of times. It's like carving a raw stone block into a complex scene. If you put your data into the system it flows towards the desired output because the right path has been engraved during training.

<rant> This explains a bit how neural nets work but from this to chatGPT is another whole leap. You'd have to assign some of the merits of the AI to the training data itself, it's not just the algorithm for learning, but what is being learned that matters. The neural net is the same, but using 1T tokens of text is making it smart. What's so magic about this data that it can turn a random init into a language programmable system? And the same language information make a baby into a modern human, instead of just another animal. </>

No, it's the loss function we differentiate. The inputs to the loss function are the network weights. The inputs to the network are samples, and those we do not differentiate.

While it's true that we don't differentiate the input samples, we do differentiate the loss function's output with respect to each of the network weights. We use the chain rule to calculate each of these "gradients" and that process is known as backpropagation.

(You might have intended to say this, in which cases I'm just trying to add clarity.)

The lookup/hashtable can be viewed as a matrix which is accessed by multiplying it with a vector.

One thing that might be worth pointing out is that the transformer architecture owes a great deal of its success to the fact that it can be implemented in a way that can be massively parallelized very efficiently.

Compared to RNNs... maybe? The big NxN is really a killer.

I don't know how to judge the parallelizability of different DNN models; you're comparing apples to oranges.

When you train a transformer, you're training what the next expected token is. You can train all positions of the sequence each in parallel rather than having to sequentially build up the memory state as you generate the sequence with an LSTM. Mind you the inference portion of a transformer is still sequentially bottlenecked since you don't know what the output sequence is supposed to be.
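A toy sketch of that point (made-up token IDs): the inputs and the "next token" targets are just the same sequence shifted by one, so every position's loss can be computed in the same pass.

  import numpy as np

  tokens = np.array([12, 7, 99, 4, 31, 8])   # an already-tokenized training text
  inputs  = tokens[:-1]                       # [12,  7, 99,  4, 31]
  targets = tokens[1:]                        # [ 7, 99,  4, 31,  8]

  # One forward pass over `inputs` produces a prediction for every position at once;
  # the loss compares prediction i with targets[i] for all i simultaneously.
  # An RNN/LSTM would instead have to step through position 0, then 1, then 2, ...
  print(list(zip(inputs, targets)))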
I believe the other big thing is it allowed you to parallelize the training as well, so you could split those colossal training sets across many machines in a way you could not do with LSTMs.

I don't really believe this, because in theory it's possible to split up training of arbitrary neural nets using decoupled neural interfaces (not that it's easy, mind you), but at the point where you have OpenAI-scale resources the bottleneck to achieving that is the difficulty of Python.

I had to look up "tl;dw" and realized it meant "too long; didn't watch" and not my first AI-laden instinct of "too long, didn't write" :-D

I came here to post this video. It's a great primer on the topic and it gives you ideas to prompt GPT and have it output more.

It’s how I got an understanding of beam search, a technique employed in some of the response building.

Having read the paper myself, I'm impressed with the quality of your explanation. Well done!

What is a differentiable hash table? I understand differentiation; I don't understand a differentiable hash table... does that mean for every tiny gradient of a key, a corresponding gradient in value can be expected?

How does N relate to the number of parameters that is frequently mentioned?

In my screed, N is the attention width (how many tokens it looks at at a time). The number of parameters is O(KxNxNxL), where K is the vector size of your tokens and L is the # of layers. There are other parameters floating around, like in the encoder and decoder matrices, but the NxN matrix dominates.

This is an awesome explanation. You guys are the real heroes.

There are some good explanations here of the self-attention architecture that makes Transformers unique.

However, most people gloss over other aspects of the "Attention is all you need" paper, which is in a sense mis-titled.

For example, Andrej Karpathy pointed out that the paper had another significant improvement hidden in it: during training the gradients can take a "shortcut" so that the bottom layers are trained faster than in typical deep learning architectures. This enables very large and deep models to be trained in a reasonable time. Without this trick, the huge LLMs seen these days would not have been possible!

Andrej talks about it here: https://youtu.be/9uw3F6rndnA?t=238

Okay, here's my attempt!

First, we take a sequence of words and represent it as a grid of numbers: each column of the grid is a separate word, and each row of the grid is a measurement of some property of that word. Words with similar meanings are likely to have similar numerical values on a row-by-row basis.

(During the training process, we create a dictionary of all possible words, with a column of numbers for each of those words. More on this later!)

This grid is called the "context". Typical systems will have a context that spans several thousand columns and several thousand rows. Right now, context length (column count) is rapidly expanding (1k to 2k to 8k to 32k to 100k+!!) while the dimensionality of each word in the dictionary (row count) is pretty static at around 4k to 8k...

Anyhow, the Transformer architecture takes that grid and passes it through a multi-layer transformation algorithm. The functionality of each layer is identical: receive the grid of numbers as input, then perform a mathematical transformation on the grid of numbers, and pass it along to the next layer.

Most systems these days have around 64 or 96 layers.

After the grid of numbers has passed through all the layers, we can use it to generate a new column of numbers that predicts the properties of some word that would maximize the coherence of the sequence if we add it to the end of the grid. We take that new column of numbers and comb through our dictionary to find the actual word that most-closely matches the properties we're looking for.

That word is the winner! We add it to the sequence as a new column, remove the first column, and run the whole process again! That's how we generate long text-completions one word at a time :D
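Here is a toy, runnable sketch of that loop; the model here is just random scores over a tiny vocabulary, a stand-in for the real stack of layers, just to show the add-a-column-and-repeat structure.

  import numpy as np

  rng = np.random.default_rng(0)
  vocab = ["the", "cat", "sat", "on", "mat", "."]

  def toy_model(context):
      # Stand-in for the transformer stack: returns a score per dictionary word.
      return rng.normal(size=len(vocab))

  def generate(prompt, n_words, window=8):
      context = list(prompt)
      for _ in range(n_words):
          scores = toy_model(context[-window:])      # run the model over the current grid
          next_word = vocab[int(np.argmax(scores))]  # closest dictionary entry wins
          context.append(next_word)                  # add the winner as a new column, repeat
      return context

  print(" ".join(generate(["the", "cat"], 5)))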

So the interesting bits are located within that stack of layers. This is why it's called "deep learning".

The mathematical transformation in each layer is called "self-attention", and it involves a lot of matrix multiplications and dot-product calculations with a learned set of "Query, Key and Value" matrixes.

It can be hard to understand what these layers are doing linguistically, but we can use image-processing and computer-vision as a good metaphor, since images are also grids of numbers, and we've all seen how photo-filters can transform that entire grid in lots of useful ways...

You can think of each layer in the transformer as being like a "mask" or "filter" that selects various interesting features from the grid, and then tweaks the image with respect to those masks and filters.

In image processing, you might apply a color-channel mask (chroma key) to select all the green pixels in the background, so that you can erase the background and replace it with other footage. Or you might apply a "gaussian blur" that mixes each pixel with its nearest neighbors, to create a blurring effect. Or you might do the inverse of a gaussian blur, to create a "sharpening" operation that helps you find edges...

But the basic idea is that you have a library of operations that you can apply to a grid of pixels, in order to transform the image (or part of the image) for a desired effect. And you can stack these transforms to create arbitrarily-complex effects.

The same thing is true in a linguistic transformer, where a text sequence is modeled as a matrix.

The language-model has a library of "Query, Key and Value" matrixes (which were learned during training) that are roughly analogous to the "Masks and Filters" we use on images.

Each layer in the Transformer architecture attempts to identify some features of the incoming linguistic data, and then, having identified those features, it can subtract those features from the matrix, so that the next layer sees only the transformation, rather than the original.

We don't know exactly what each of these layers is doing in a linguistic model, but we can imagine it's probably doing things like: performing part-of-speech identification (in this context, is the word "ring" a noun or a verb?), reference resolution (who does the word "he" refer to in this sentence?), etc, etc.

And the "dot-product" calculations in each attention layer are there to make each word "entangled" with its neighbors, so that we can discover all the ways that each word is connected to all the other words in its context.

So... that's how we generate word-predictions (aka "inference") at runtime!

But why does it work?

To understand why it's so effective, you have to understand a bit about the training process.

The flow of data during inference always flows in the same direction. It's called a "feed-forward" network.

But during training, there's another step called "back-propagation".

For each document in our training corpus, we go through all the steps I described above, passing each word into our feed-forward neural network and making word-predictions. We start out with a completely randomized set of QKV matrixes, so the results are often really bad!

During training, when we make a prediction, we KNOW what word is supposed to come next. And we have a numerical representation of each word (4096 numbers in a column!) so we can measure the error between our predictions and the actual next word. Those "error" measurements are also represented as columns of 4096 numbers (because we measure the error in every dimension).

So we take that error vector and pass it backward through the whole system! Each layer needs to take the back-propagated error matrix and perform tiny adjustments to its Query, Key, and Value matrixes. Having compensated for those errors, it reverses its calculations based on the new QKV, and passes the resultant matrix backward to the previous layer. So we make tiny corrections on all 96 layers, and eventually to the word-vectors in the dictionary itself!

Like I said earlier, we don't know exactly what those layers are doing. But we know that they're performing a hierarchical decomposition of concepts.

Hope that helps!

This was a very helpful visualization, thank you!

The "entanglement" part intuitively makes sense to me, but one bit I always get caught up on the key, query, and value matrices. In every self-attention explanation I've read/watched they tend to get thrown out there and similar to what you did here but leave their usage/purpose a little vague.

Would you mind trying to explain those in more detail? I've heard the database analogy where you start with a query to get a set of keys which you then use to lookup a value, but that doesn't really compute with my mental model of neural networks.

Is it accurate to say that these separate QKV matrices are layers in the network? That doesn't seem exactly right since I think the self-attention layer as a whole contains these three different matrices. I would assume they got their names for a reason that should make it somewhat easy to explain their individual purposes and what they try to represent in the NN.

I'm still trying to get a handle on that part myself... But my ever-evolving understanding goes something like this:

The "Query" matrix is like a mask that is capable of selecting certain kinds of features from the context, while the "Key" matrix focuses the "Query" on specific locations in the context.

Using the Query + Key combination, we select and extract those features from the context matrix. And then we apply the "Value" matrix to those features in order to prepare them for feed-forward into the next layer.

There are multiple "Attention Heads" per layer (GPT-3 had 96 heads per layer), and each Head performs its own separate QKV operation. After applying those 96 Q+K->V attention operations per layer, the results are merged back into a single matrix so that they can be fed-forward into the next layer.

Or something like that...

I'm still trying to grok it myself, and if anyone here shed more light on the details, I'd be very grateful!

I'm still trying to understand, for example, how many QKV matrices are actually stored in a model with a particular number of parameters. For example, in a GPT-NeoX-20B model (with 20 billion params) how many distinct Q, K, and V matrices are there, and what is their dimensionality?

That was incredible. Thank you! If you made it into an article with images showing the mask/filter analogy, it might be one of the best/most unique explanations I've seen. Love the ground-up approach beginning with data's shape.

Reminded me of the style of a book on machine learning. If anyone liked this explanation, you may appreciate this book:

https://www.amazon.com/Applied-Machine-Learning-Engineers-Al...

ELI5 is tricky as details have to be sacrificed, but I'll try.

An attention mechanism is when you want a neural network to learn the function of how much attention to allocate to each item in a sequence, to learn which items should be looked at.

The transformer is a self-attention mechanism, where you ask the neural network to 'transform' each element by looking at its potential combination with every other element and using this (learnable, trainable) attention function to decide which combination(s) to apply.

And it turns out that this very general mechanism, although compute-intensive (it considers everything linking with everything, so complexity is quadratic in sequence length) and data-intensive (it has lots and lots of parameters, so needs huge amounts of data to be useful), can actually represent many of the things we care about, in a manner which can be trained with the deep learning algorithms we already had.

And, really, that's the two big things ML needs, a model structure where there exists some configuration of parameters which can actually represent the thing you want to calculate, and that this configuration can actually be determined from training data reasonably.

Well here is my (a bit cynical) take on it.

In the beginning, there was the matrix multiply. A simple neural network is a chain of matrix multiplies. Let's say you have your data A1 and weights W1 in a matrix. You produce A2 as A1xW1. Then you produce A3 as A2xW2, and so on. There are other operations in there like non-linearities (so that you can actually learn something interesting) and fancy batch norms, but let's forget about those for now.

The problem with this is, it's not very expressive. Let's say your A1 matrix has just 2 values, and you want the output to be their product. Can you learn a weight matrix that performs multiplication of these inputs? No you can't. Multiplication must be simulated by piecing together piecewise linear functions. To perform multiplication, the weight matrix W would also need to be produced by the network.

Transformers do basically that. In the product A*W you replace A with (AxW1), W with (AxW2), and multiply those together: (AxW1)x(AxW2). And then do it once more for good measure: (AxW1)x(AxW2)x(AxW3). Boom, Nobel prize. Now your network can multiply, not just add. OK it's actually a bit more complicated, there is for example a softmax in the middle to perform normalisation, which in general helps during numerical optimisation: softmax((AxW1)x(AxW2))x(AxW3).

There are then fancy explanations that try to retrospectively justify this as a "differentiable lookup table" or somesuch nonsense, calling the 3 parts "key", "query" and "value", which help make your paper more popular. But the basic idea is not so complicated. A Transformer then uses this operation as a building block (running them in parallel and in sequence) to build giant networks that can do really cool things. Maybe you can teach networks to divide next and then you get the next Nobel prize.
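For what it's worth, here is that construction in runnable form (random toy numbers, and with the transpose that the notation above glosses over): in a plain linear layer the weights are fixed, while here the effective mixing weights depend on the data itself.

  import numpy as np

  rng = np.random.default_rng(0)
  n, d = 3, 4
  A = rng.normal(size=(n, d))                       # data
  W1, W2, W3 = (rng.normal(size=(d, d)) for _ in range(3))

  def softmax(x):
      e = np.exp(x - x.max(axis=-1, keepdims=True))
      return e / e.sum(axis=-1, keepdims=True)

  plain = A @ W1                                     # fixed linear map: weights never depend on A
  attn = softmax((A @ W1) @ (A @ W2).T) @ (A @ W3)   # data-dependent mixing of the rows of A @ W3

  print(plain.shape, attn.shape)                     # both (3, 4)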

Technically speaking, the breakthrough was also the fact that it allowed for parallelization of running the computation. Instead of going word by word in a sequence, and optimizing for the next word, the approach shifted to looking at words independently and then applying the same statistical approach of finding the next word relative to that word or sequence. Then the final outcome was a weighted sum of these independent pieces.

I don't think you really want a 5yo's version, because that's "computers can be smart". But for an SDE1-level explanation I made this video which has gotten tons of compliments and keeps people watching year after year:

https://www.youtube.com/watch?v=S27pHKBEp30

It's in the context of NLP, which is where transformers started of course.

Prompt: "ELI5 the Attention is All We Need paper."

Imagine you have a box of toys. Some toys are more interesting to you than others, so you pay more attention to those toys and play with them more. The same thing happens in the "Attention is All You Need" paper, but instead of toys, we have words in a sentence.

Before this paper, when computers tried to translate one language to another, they would look at one word, translate it, then move to the next word. This works okay, but it's not great because sometimes the meaning of a word can depend on other words in the sentence.

The clever thing about the "Attention is All You Need" paper is that it taught the computer to pay attention to all the words in a sentence at the same time, but give more importance ("attention") to the words that matter most for understanding and translation.

This new way of translating languages using "attention" made computers much better at understanding and translating languages. It's like if you could become better at playing by understanding all your toys at once! This paper was a big deal in the field of machine learning and artificial intelligence because it improved how machines understand languages.

Those Computerphile videos[0] by Rob Miles helped me understand transformers. He specifically references the "Attention is all you need" paper.

And for a deeper dive, Andrej Karpathy has this hands-on video[1] where he builds a transformer from scratch. You can check out his other videos on NLP as well; they are all excellent.

[0] https://youtu.be/rURRYI66E54, https://youtu.be/89A4jGvaaKk

[1] https://youtu.be/kCc8FmEb1nY

It works like this:

First, convert the input text to a sequence of token numbers (2048 tokens with 50257 possible token values in GPT-3) by using a dictionary and for each token, create a vector with 1 at the token index and 0 elsewhere, transform it with a learned "embedding" matrix (50257x12288 in GPT-3) and sum it with a vector of sine and cosine functions with several different periodicities.

Then, for each layer, and each attention head (96 layers and 96 heads per layer in GPT-3), transform the input vector by query, key and value matrices (12288x128 in GPT-3) to obtain a query, key and value vector for each token. Then for each token, compute the dot product of its query vector with the key vectors of all previous tokens, scale by 1/sqrt of the vector dimension and normalize the results so they sum to 1 by using softmax (i.e. applying e^x and dividing by the sum), giving the attention coefficients; then, compute the attention head output by summing the value vectors of previous tokens weighted by the attention coefficients. Now, for each token, glue the outputs for all attention heads in the layer (each with its own key/query/value learned matrices), add the input and normalize (normalizing means that the vector values are biased and scaled so they have mean 0 and variance 1).

Next, for the feedforward layer, apply a learned matrix, add a learned vector and apply a ReLU-style nonlinearity (ReLU is f(x) = x for positive x and f(x) = 0 for negative x; GPT-3 actually uses the smoother GELU), and do that again (12288x49152 and 49152x12288 matrices in GPT-3; these actually account for around 70% of the parameters in GPT-3), then add the input before the feedforward layer and normalize.

Repeat the process for each layer, each with their own matrices, passing the output of the previous layer as input. Finally, apply the inverse of the initial embedding matrix and use softmax to get probabilities for the next token for each position. For training, train the network so that they are close to the actual next token in the text. For inference, output a next token according to the top K tokens in the probability distribution over a cutoff and repeat the whole thing to generate tokens until an end of text token is generated.
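A compact numpy sketch of one such layer, with tiny made-up sizes instead of GPT-3's (12288-wide vectors, 96 heads, 96 layers) and random weights, just to show the data flow:

  import numpy as np

  rng = np.random.default_rng(0)
  n_tokens, d_model, n_heads = 5, 16, 4
  d_head = d_model // n_heads

  def softmax(x):
      e = np.exp(x - x.max(axis=-1, keepdims=True))
      return e / e.sum(axis=-1, keepdims=True)

  def layer_norm(x):
      return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)

  x = rng.normal(size=(n_tokens, d_model))           # embeddings + positional signal

  # --- attention sub-layer (one layer, all of its heads) ---
  causal_mask = np.triu(np.full((n_tokens, n_tokens), -1e9), k=1)   # block attention to future tokens
  head_outputs = []
  for _ in range(n_heads):
      Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
      Q, K, V = x @ Wq, x @ Wk, x @ Wv
      att = softmax(Q @ K.T / np.sqrt(d_head) + causal_mask)        # attention coefficients
      head_outputs.append(att @ V)                                  # weighted sum of value vectors
  Wo = rng.normal(size=(d_model, d_model))
  x = layer_norm(x + np.concatenate(head_outputs, axis=-1) @ Wo)    # glue heads, add input, normalize

  # --- feedforward sub-layer ---
  W1, W2 = rng.normal(size=(d_model, 4 * d_model)), rng.normal(size=(4 * d_model, d_model))
  x = layer_norm(x + np.maximum(0, x @ W1) @ W2)                    # plain ReLU here for simplicity

  print(x.shape)  # (5, 16): same shape in, same shape out, ready for the next layer

Stacking many of these, with learned weights instead of random ones, is essentially the inference pass described above.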

- You can develop a very deep understanding of a sequence by observing how each element interacts with each other over many sequences.

- This understanding can be encapsulated in "compressed" low dimensional vector representation of a sequences.

- You can use this understanding for many different downstream tasks, especially predicting the next item in a sequence.

- This approach scales really well with lots of GPUs and data and is super applicable to generating text.

I'll throw my hat in the ring.

A transformer is a type of neural network that, like many networks before, is composed of two parts: the "encoder" that receives a text and builds an internal representation of what the text "means"[1], and the "decoder" that uses the internal representation built by the encoder to generate an output text. Let's say you want to translate the sentence "The train is arriving" to Spanish.

Both the encoder and decoder are built like Lego, with identical layers stacked on top of each other. The lowest layer of the encoder looks at the input text and identifies the role of individual words and how they interact with each other. This is passed to the layer above, which does the same but at a higher level. In our example it would be as if the first layer identified that "train" and "arrive" are important, then the second one identifies that "the train" and "is arriving" are core concepts, the third one links both concepts together, and so on.

All of these internal representations are then passed to the decoder (all of them, not just the last ones), which uses them to generate a single word, in this case "El". This word is then fed back to the decoder, which now needs to generate an appropriate continuation for "El", which in this case would be "tren". You repeat this procedure over and over until the transformer says "I'm done", hopefully having generated "El tren está llegando" in the process.

The attention mechanism already existed before transformers, typically coupled with an RNN. The key concept of the transformer was building an architecture that removed the RNN completely. The negative side is that it is a computationally inefficient architecture as there are plenty of n^2 operations on the length of the input [2]. Luckily for us, a bunch of companies started releasing for free giant models trained on lots of data, researchers learned how to "fine tune" them to specific tasks using way less data than what it would have taken to train from scratch, and transformers exploded in popularity.

[1] I use "mean" in quotes here because the transformer can only learn from word co-occurrences. It knows that "grass" and "green" go well together, but it doesn't have the data to properly say why. The paper "Climbing towards NLU" is a nice read if you care about the topic, but be aware that some people disagree with this point of view.

[2] The transformer is less efficient than an LSTM in the total number of operations but, simultaneously, it is easier to parallelize. If you are Google, this is the kind of problem you can easily solve by throwing a data center or two at the problem.

> The negative side is that it is a computationally inefficient architecture as there are plenty of n^2 operations on the length of the input

Is this the reason for the limited token windows?

Transformers are about converting some input data (usually text) to numeric representations, then modifying those representations through several layers to generate a target representation.

In LLMs, this means go from prompt to answer. I'll cover inference only, not training.

I can't quite ELI5, but the process is roughly:

  - Write a prompt
  - Convert each token in the prompt (roughly a word) into numbers.  So "the" might map to the number 45.
  - Get a vector representation of each word - go from 45 to [.1, -1, -2, ...]. These vector representations are how a transformer understands words.  
  - Combine vectors into a matrix, so the transformer can "see" the whole prompt at once.
  - Repeat the following several times (once for each layer):
  - Multiply the vectors by the other vectors.  This is attention - it's the magic of transformers, that enables combining information from multiple tokens together.  This generates a new matrix.
  - Feed the matrix into a linear regression.  Basically multiply each number in each vector by another number, then add them all together.  This will generate a new matrix, but with "projected" values.
  - Apply a nonlinear transformation like relu.  This helps model more complex functions (like text input -> output!)
Note that I really oversimplified the last few steps, and the ordering.

At the end, you'll have a matrix. You then convert this back into numbers, then into text.

I don't think this description of attention is correct.

You mean "Multiply the vectors by the other vectors. This is attention - it's the magic of transformers, that enables combining information from multiple tokens together. This generates a new matrix."?

It's really oversimplified, as I mentioned. A more granular look is:

  - Project the vectors with a linear regression.  In decoder-only attention (what we usually use), we project the same vectors twice with different coefficients.   We call the first projection queries, and the second keys.  This transforms the vectors linearly.
  - Find the dot product of each query vector against the key vectors (multiply them)
  - (training only) Mask out future vectors, so a token can't look at tokens that come after it
  - At this point, you will have a matrix indicating how important each query vector considers each other vector (how important each token considers the other tokens)
  - Take the softmax, which both ensures all of the attention values for a vector sum to 1, and penalizes small attention values
  - Use the softmax values to get a weighted sum of tokens according to the attention calc.
  - This will turn one vector into the weighted sum of the other vectors it considers important.
The goal of this is to incorporate information from multiple tokens into a single representation.
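A small sketch of the masking and weighting steps above (toy numbers, a single head, no learned projections): positions that come later in the sequence get a large negative score before the softmax, so they end up with effectively zero weight.

  import numpy as np

  rng = np.random.default_rng(0)
  n, d = 4, 8
  Q = rng.normal(size=(n, d))
  K = rng.normal(size=(n, d))
  V = rng.normal(size=(n, d))

  scores = Q @ K.T / np.sqrt(d)
  scores[np.triu_indices(n, k=1)] = -1e9          # mask out future tokens

  weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
  print(weights.round(2))                          # row i only attends to tokens 0..i

  out = weights @ V                                # weighted sum of the value vectors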
Uhm, nope, but here's a ~70 min talk I gave about it: https://github.com/fkarg/things-to-talk-about/blob/master/tr...

Not sure how well the slides can be understood by themselves, though I tried to be accommodating for that

Not sure I can give an ELI5 explanation, but for me these are the closest to what you are after: two short videos on the topic. At least they helped me get up to speed fast.

Hope they will do the same for you ;-)

Large Language Models from scratch https://www.youtube.com/watch?v=lnA9DMvHtfI

Large Language Models: Part 2 https://www.youtube.com/watch?v=YDiSFS-yHwk

I asked GPT (via Bing) for a five paragraph explanation. Here's what it wrote:

Sure, I'll try to explain what a transformer is in the context of AI tools like ChatGPT, and what attention references in this context. Here is a five paragraph explanation:

*Paragraph 1: What is a transformer?* A transformer is a type of neural network that can process sequential data, such as natural language text, by learning the context and meaning of the data elements. A neural network is a computer program that can learn from data and perform tasks such as classification, regression, generation, etc. A transformer can take an input sequence, such as a sentence or a paragraph, and convert it into another sequence, such as a translation or a summary. A transformer can also generate sequences from scratch, such as writing a poem or a story.

*Paragraph 2: How does a transformer work?* A transformer works by using two main components: encoders and decoders. An encoder takes an input sequence and converts it into a vector representation called an encoding. A decoder takes an encoding and converts it back into an output sequence. Both the encoder and the decoder use a technique called attention to focus on the relevant parts of the input and output sequences. Attention is a way of measuring how much each element in a sequence influences or depends on another element.

*Paragraph 3: What is attention?* Attention is a mathematical technique that allows a transformer to learn the relationships between the elements in a sequence. For example, in a machine translation task, attention allows the transformer to translate words like 'it' into the correct gender in another language by looking at all the relevant words in the original sentence. Attention also allows the transformer to handle long sequences without losing information or context. There are different types of attention, such as self-attention, cross-attention, global attention, local attention, etc.

*Paragraph 4: Why are transformers important?* Transformers are important because they are among the most powerful and versatile types of neural networks invented to date. They can handle many types of sequential data, such as text, speech, images, videos, etc. They can perform many tasks, such as machine translation, text summarization, text generation, image captioning, speech recognition, etc. They can also learn from large amounts of data and make accurate predictions that drive their wider use.

*Paragraph 5: What are some examples of transformers?* Some examples of transformers are:

- ChatGPT: A transformer that can generate realistic and engaging conversations based on a given topic or context.
- BERT: A transformer that can understand natural language and answer questions or perform other tasks based on it.
- GPT-3: A transformer that can generate text for any purpose or domain based on a few words or sentences.
- DALL-E: A transformer that can generate images from text descriptions.
- AlphaStar: A transformer that can play the strategy game StarCraft at a superhuman level.

I hope this explanation helps you understand what a transformer is and what attention references in this context.

Source: Conversation with Bing, 5/17/2023 (1) What Is a Transformer Model? | NVIDIA Blogs. https://blogs.nvidia.com/blog/2022/03/25/what-is-a-transform.... (2) Transformer Neural Network Definition | DeepAI. https://deepai.org/machine-learning-glossary-and-terms/trans.... (3) Generative AI: AI Transformers. https://lablab.ai/blog/generative-ai-ai-transformers. (4) The Ultimate Guide to Transformer Deep Learning - Turing. https://www.turing.com/kb/brief-introduction-to-transformers.... (5) How Transformers Work. Transformers are a type of neural… | by Giuliano .... https://towardsdatascience.com/transformers-141e32e69591.

Check this explanation from Karpathy which provides really good intuition https://youtu.be/9uw3F6rndnA

Basically, tokens "talk" to each other and say: this is what I have, and this is what I look for.

Your summary of this is very good. It's a really simple and intuitive way to express something very complex.

This link was posted here recently, and was the most understandable explanation I've found so far: https://e2eml.school/transformers.html

What is ELI5?

The idea behind the Transformer is nice - but by far not Nobel prize deserving.

Don't believe the hype or people like Yegge, whoever that is - in a few years a new architecture will be the "Nobel candidate".

Also, the original Transformer paper, if you read it, is horribly written.

Nobel prize for the neural network or perceptron, maybe?

Not really an ELI5, but if you really want to understand the topic, I've gathered the best resources I could find to help me do the same: https://blog.senko.net/learn-ai

If you're a programmer, start with Karpathy's video series. For a somewhat gentler intro, take a look at the MIT intro lectures first to build up on the fundamentals.

Then you're ready for The Illustrated Transformer, and afterwards, if you're brave enough, the Annotated Transformer.

It's a fascinating subject, more so when you have a solid grasp! And you'll be able to quickly spot people who kinda stumble their way through but have big gaps in understanding.

Not quite ELI5 and there's a few partially overlapping answers around already but here goes.

The key part is the attention mechanism, as the title of the paper may have spoiled. It works more or less like this:

- Start with an input sequence X1, X2 ... Xn. These are all vectors.

- Map the input sequence X into 3 new sequences of vectors: query (Q), key (K), and value (V), all of the same length as the input X. This is done using learnable mappings for each of the sequences (so one for X->Q, another for X->K and one for X->V).

- Compare the similarity of every query with every key. This gives you a weight for each query/key pair. Call them W(Q1, K2) and so forth.

- Compute the output Z as the sum of every _value_ weighted by the weight for the respective query/key pair (so Z1 = V1*W(Q1,K1) + V2*W(Q1,K2) + ... + Vn*W(Q1,Kn), Z2 = V1*W(Q2,K1) + V2*W(Q2,K2) + ...)

- and that's about it!

As throwawaymaths mentions, this is quite similar to a learnable hash table with the notable difference that the value fetched is also changed, so that it doesn't fetch "input at an index like i" but "whatever is important at an index like i".

Now a few implementation details on top of this:

- The description is for a single "attention head". Normally several, each with their own mappings for Q/K/V, are used, so the transformer can look at different "things" simultaneously. 8 attention heads seems pretty common.

- The description doesn't take the position in the sequence into account (W(Q1,K1) and W(Q1,Kn) are treated perfectly equally). To account for ordering, "positional encoding" is normally used. Usually this is just adding a bunch of scaled sine/cosine waves to the input. Works surprisingly well.

- The transformer architecture has a number of these "attention layers" stacked one after the other and also 2 different stacks (encoder, decoder). The paper is about machine translation, so the encoder is for the input text and the decoder for the output. Attention layers work just fine in other configurations as well.

The rest of the architecture is fairly standard stuff

There is no Nobel prize for math or computer science, though...

I worked on a few projects that were trying to develop foundation models for health care, aviation, and other domains. In particular I trained an LSTM model to write fake abstracts for clinical case reports.

We ran into many problems, but maybe the worst one with the LSTM is that a real document repeats itself a lot. For instance, somebody's name might turn up multiple times and the LSTM was very bad at that kind of copying. The LSTM community was arguing about solutions to this problem, but the attention mechanism in transformers makes it easy.

Thanks a lot for sharing. I liked the explanation, although it doesn't seem to be a perfect match between attention and a kernel (as far as I understand), since as I understand it x_o = Qx and x_i = Qx and y_i = Vx, but it doesn't map to Wu dot Wv.

Anyway, just curious: do you or someone else have more sources in this format?

At least this way it is very helpful to think about the methods. I kind of agree that the formula does look very, very similar to well-known methods. But on the other hand, the author doesn't explain transformers in a similarly simple way, since it is not obvious why you could stack kernels and get better results.

I've been working through [0]. Like a lot of math, the notation is daunting, but once you become familiar with it, it really is a nice tool for thought.

[0]: https://arxiv.org/abs/2207.09238

I argue that the attention mechanism is just Hopf algebra convolution. The QK mechanism selects possible candidates and the OV mechanism picks one of these possible candidates.

I have recently written a paper on this https://arxiv.org/abs/2302.01834

I also have a discord channel https://discord.cofunctional.ai.

Ironically, it's the same mechanism as what renormalization in QFT does. I'm getting increasingly convinced that it's also how the brain works.

Could you ELI5 (or maybe 15, with high-school maths) this concept? Particularly how it may relate to how our brains might work? Most interested.

ELI5: In neural networks we could take data numbers, multiply by a constant, add a constant, add them to each other, and then do the same to the results again and again. And you could learn the right constants using labeled data. But we couldn't multiply data numbers against each other. So in transformers you can. But you must multiply all numbers against all numbers in a specific sample, so it's heavy.

Feedforward: y=Wx

Attention: y=W(x)x

W is a matrix; x and y are vectors. In the second case, W is a function of the input.

You must be from a planet with very long years!

There is no way I can even begin to digest what you have said in your comment.

Sorry, maybe I should have added more explanation. One way to think about attention, which is the main distinguishing element in a transformer, is as an adaptable matrix. A feedforward layer is a matrix with static entries that do not change at inference time (only during training). The attention mechanism offers a way to have adaptable weight matrices at inference time (this is implemented by using three different matrices, K, Q & V, called key, query and value, in case you want to dig deeper).

I think in your notation it should have been:

y=Wx_0

y=W(x)x_0

I guess I was more thinking about self-attention, so yes. The more general case is covered by your notation!

Simplest explanation: you start with a fully connected network of N layers. This already gives you the functionality. Then you remove a lot of unnecessary connections. This gives you the performance.

It helps to start with recurrent neural networks first, since those were the previous standard way of doing next-token prediction. They worked, but training them was extremely slow because it couldn't be parallelized. Transformers are a way of getting mostly the same capabilities as RNNs but with a parallelizable architecture, so you can actually train them with huge parameter counts in a reasonable amount of time.

I agree it's very useful to look at RNNs. Even more so because attention mechanisms were already getting popular in LSTM/GRU models before the transformers paper. One of the main insights of the transformer paper is that you can do with just attention and you do not need the recurrent part of the network anymore (hence the title "Attention is all you need").

Has ELI5 lost its literal meaning? What five-year-old would understand anything in this thread?

It was never meant to be taken literally; it's just an idiomatic way of asking that something be explained as simply as possible.

An answer to OP's question targeted at a five year old level would likely be impossible given the subject matter, or else need to be so general and simplistic as to be useless on this forum of not actual five year olds.

I don't know. It's simple and straightforward, but still seems like it would go over the head of an actual five-year-old.

One or both of these might help:

Transformers, Explained: Understand the Model Behind GPT-3, BERT, and T5: https://daleonai.com/transformers-explained

Transformers from Scratch: https://e2eml.school/transformers.html

The first link covers Attention well.

So, if you got a time machine back to the year 2000 and knew how to code Transformers - could you do it? Your CPU/cloud/parallel processing hardware would be limited, but as I understand it there's nothing revolutionary being done here, just a lot of matrix math that produces results (and we still don't fully understand why that all happens).

"Matrix math" itself is very old, but I'm guessing what you're referring to here is probably neural networks or Markov chains, both of which are... a half-century to a century old, if I remember right?

Transformers on the other hand are new, less than a decade old.

I'd like to read the article. What are the required maths?

Here is an attempt that sacrifices technical precision for intuition.

We're going to represent words as vectors (a sequence of numbers). We would like it to be the case that the value of the numbers reflects the meaning of the words. Words that mean similar things should be near each other. We also want to represent higher level ideas, ideas that take multiple words to express, in the same way. You can think of all the possible vectors as the entire space of ideas.

To begin with, though, we just have a vector for each word. This is insufficient - does the word "bank" mean the edge of a river or a place to store money? Is it a noun or a verb? In order to figure out the correct vector for a particular instance of this word, we need to take into account its context.

A natural idea might be to look at the words next to it. This works okay, but it's not the best. In the sentence "I needed some money so I got in my car and took a drive down to the bank", the word that really tells me the most about "bank" is "money", even though it's far away in the sentence. What I really want is to find informative words based on their meaning.

This is what transformers and attention are for. The process works like this: For each word, I compose a "query" - in hand-wavy terms, this says "I'm looking for any other words out there that are X". X could be "related to money" or "near the end of the sentence" or "are adjectives". Next, for each word I also compute a "key", this is the counterpart of the query, and says "I have Y". For each query, I compare it to all the keys, and find which ones are most similar. This tells me which words (queries) should pay attention to which other words (keys). Finally, for each word I compute a "value". Whereas the "key" was sort of an advertisement saying what sort of information the word has, the "value" is the information itself. Under the hood, the "query", "key" and "value" are all just vectors. A query and a key match if their vectors are similar.
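In rough code, that matching step is just dot products followed by a weighted average of the values (a sketch that assumes the query, key and value vectors have already been computed somehow):

    import numpy as np

    def attend(query, keys, values):
        # How well does this word's query match every other word's key?
        scores = np.array([query @ k for k in keys])
        # Turn the scores into weights that sum to 1 (a softmax).
        weights = np.exp(scores - scores.max())
        weights = weights / weights.sum()
        # The result is a blend of the values, weighted by how well they matched.
        return sum(w * v for w, v in zip(weights, values))

In the example below, the key for "green" would score highly against the query from "thumb", so the value for "green" would dominate the blend.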

So, as an example, suppose that my sentence is "Steve has a green thumb". We want to understand the meaning of the word "thumb". Perhaps a useful step for understanding any noun would be to look for adjectives that modify it. We compute a "query" that says "I'm looking for words near the end of the sentence that are adjectives". When computing a "key" for the word green, maybe we compute "I'm near the end of the sentence, I'm a color, I'm an adjective or a noun". These match pretty well, so "thumb" attends to "green". We then compute a "value" for "green" that communicates its meaning.

By combining the information we got from the word "green" with the information for the word "thumb", we can have a better understanding of what it means in this particular sentence. If we repeat this process many times, we can build up stronger understanding of the whole sentence. We could also have a special empty word at the end that represents "what might come next?", and use that to generate more text.

But how did we know which queries, keys and values to compute? How did we know how to represent a word's meaning as numbers at all? These seemingly impossible questions are what is being "learned". How exactly that happens would require an equally big explanation of its own.

Keep in mind that this explanation is very fuzzy, and is only intended to convey the loose intuition of what is going on. It leaves out many technical details and even gets some details intentionally wrong to avoid confusion.

Let the model decide which words to focus on when producing a response
> I have zero AI/ML knowledge

This may make it difficult to explain, and I already see many incorrect explanations here and even more lazy ones (why post the first Google result? You're just adding noise).

> Steve Yegge on Medium thinks that the team behind Transformers deserves a Nobel

First, Yegge needs to be able to tell me what Attention and Transformers are. More importantly, he needs to tell me who invented them.

That actually gets to an important point, and to why there are so many bad answers here and elsewhere: you're missing a lot of context, and the definitions themselves are murky. This is also what makes it difficult to ELI5. I'll try, then point you to resources that give an actually good answer.

== Bad Answer (ELI5) ==

A transformer is an algorithm that considers the relationships between all parts of a piece of data. It does this through 4 mechanisms arranged in two parts. The first part is composed of a normalization block and an attention block. The normalization block scales the data and ensures that it is not too large. Then the attention mechanism takes all the data handed to it and considers how its parts relate to one another. This is called "self-attention" when we only consider one input and "cross-attention" when we have multiple inputs to compare. Both of these build relationships in a way that is similar to a lookup table. The second part is also composed of a normalization block, followed by a linear layer. The linear layer reprocesses all the relationships just computed and gives them context. But we haven't stated the 4th mechanism! This is the residual or "skip" layer. It allows the data to pass right on by each of the above parts without being processed, and this little side path is key to getting things to train efficiently.
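A numpy sketch of one such block, with untrained stand-ins for the attention and linear sub-layers (this only illustrates the structure described above, not a real trained model):

    import numpy as np

    def norm(x):
        # simplified layer norm: zero mean, unit variance per position
        return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)

    def softmax(z):
        e = np.exp(z - z.max(-1, keepdims=True))
        return e / e.sum(-1, keepdims=True)

    def attention(x):
        # untrained stand-in: every position attends to every other position
        return softmax(x @ x.T) @ x

    def feedforward(x):
        # untrained stand-in for the linear layer(s)
        return np.maximum(0.0, x)

    def transformer_block(x):
        x = x + attention(norm(x))      # part 1: norm -> attention, with a skip path
        x = x + feedforward(norm(x))    # part 2: norm -> linear, with a skip path
        return x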

Now that doesn't really do the work justice or give a good explanation of why or how things actually work. ELI5 isn't a good way to understand things for usage, but it is an okay place to start and learn abstract concepts. For the next level up I suggest Training Compact Transformers[0]. It'll give some illustrations and code to help you follow along. It is focused on vision transformers, but it is all the same. The next level up is Karpathy's video on GPT[1], where you will build transformers and he goes into a bit more depth. Both of these are good for novices and people with little mathematical background. For more of the lore, and to understand why we got here and the confusion over the definition of attention, I suggest Lilian Weng's blog[2] (everything she does is gold). For a lecture and more depth I suggest Pascal Poupart's class. Lecture 19[3] is the one on attention and transformers, but you should at minimum watch Lecture 18 first; if you actually have no ML experience or knowledge, you should probably start from the beginning.

The truth is that not everything can be explained in simple terms, at least not if one wants an adequate understanding. That misquotation of Einstein (probably originating from Nelson) is far from accurate, and I wouldn't expect someone who introduced a highly abstract concept with complex mathematics (to such a degree that physicists argued he was a mathematician) to say something so silly. A lot is lost when distilling a concept, and neither the listener nor the speaker should fool themselves into believing this makes them knowledgeable (armchair expertise is a frustrating feature of the internet and has gotten our society into a lot of trouble).

[0] https://medium.com/pytorch/training-compact-transformers-fro...

[1] https://www.youtube.com/watch?v=kCc8FmEb1nY

[2] https://lilianweng.github.io/posts/2018-06-24-attention/

[3] https://www.youtube.com/watch?v=OyFJWRnt_AY

I cannot do ELI5, but can do ELI14 for you.

A transformer is a building block (a part) of a language model. A "language model" is an algorithm that can predict the words following given words. For example, you can give a text to a model and get a summary of that text, or an answer to a question in the text, or a translation of the text.

Language models are often made of two parts - an encoder and a decoder. The encoder reads the input text (each word is encoded as a bunch of numbers, for example as a list of 512 floating-point numbers) and produces a "state" (also a large list of numbers) which is expected to encode the meaning of the text. Then the decoder reads the state and produces the output as words (to be exact, as probabilities for every possible word in the dictionary to appear at a certain position in the output).

Before Transformers, people tended to use the so-called "recurrent neural network" architecture. With this approach, the encoder processes the text word by word and updates the state after every word:

    state = initial_state
    for word in text:
        state = model(state, word)
model(...) here is a complicated mathematical function, often with millions of operations and parameters.

As I have written above, after reading the text, the state should encode the meaning of the text.

But it turned out that this approach doesn't scale well to long or complicated texts, because the information from the beginning of the text gets lost. The model tends to "forget" what it read before. So a new architecture, "Transformers", was proposed. The difference is that now we give the entire text (each word encoded as a bunch of numbers) to the model:

    state = model(input_text)
Now the model processes the whole text at once. But implementing this naively would result in a very large model with too many parameters, which would require too much memory and computing time. So the developers used a trick here - most of the time each input word is processed separately from the others (as in the recurrent model), but there are stages, called "attention", where the words are processed together (and those stages are relatively light), so it looks like this:
    # stage where all of the text is processed at once,
    # using a quick algorithm
    state1 = attention(input_text)
    # stages where each part of the state is processed independently,
    # with a lot of heavy calculations
    state2 = map(some_function, state1)
    state3 = attention(state2)
    state4 = map(some_function, state3)
    ...
To summarize, in Transformers the model processes the text all at once, but we have to employ tricks and split the processing into stages to make the calculation feasible. Probably that is why some people believe the authors should receive an award for their work.

I think this explanation is as far as one can get without learning ML.

Also, I think this thread is a good place to complain about the paper. The model is not described clearly. For example, try to find the size of the input data vector for the model in the paper - it is not specified. There is also a misleading phrase:
    All sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.
which makes the reader think that each block (Transformer) gets a 512-dimensional vector as input and produces 512 numbers at the output. But this is wrong. 512 numbers is just a single word, not the entire text or internal state. I could not understand this from reading just the original paper.

Also, it is not written where the keys, queries and values for attention come from.
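For what it's worth, this is how I eventually pictured the shapes (my reading, not something spelled out in the paper):

    import numpy as np

    seq_len, d_model = 10, 512          # 10 tokens, 512 numbers per token
    x = np.zeros((seq_len, d_model))    # input to a block: the WHOLE sequence

    # Every sub-layer (attention, feedforward) maps
    # (seq_len, 512) -> (seq_len, 512); the 512 is the size of one
    # token's vector, not of the entire text or internal state.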

You guys are cracking me up lol most comments here must have the smartest damn 5-year olds in mind!

“It’s simple ! Just tokenize the context and allow differentiated hash maps to best map the embeddings matrix. Duh!”

Explain Like I'm Five... Years Into A Doctorate
From the Yegge post:

> LoRA makes LLMs composable, piecewise, mathematically, so that if there are 10,000 LLMs in the wild, they will all eventually converge on having the same knowledge. This is what Geoffrey Hinton was referring to on his SkyNet tour.

I don't think that's right at all. LoRA freezes most of the large model and wouldn't let you simply combine large models. Instead, I'm pretty sure Hinton is referring to data-parallel training with batching:

> DataParallel (DP) - the same setup is replicated multiple times, and each being fed a slice of the data. The processing is done in parallel and all setups are synchronized at the end of each training step.

https://huggingface.co/docs/transformers/v4.15.0/parallelism

You can have many instances of the model training on different bits of data, and then just average the modified weights back together at the end. This combining of weights is what Hinton means when he says parallel copies of brains can learn things independently and then recombine later at huge bandwidth, whereas humans are limited to sharing separate experiences verbally or with something like a multimedia presentation.
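A toy sketch of that recombination step, assuming two replicas with identical parameter layouts (real data-parallel training typically averages gradients every step rather than weights at the end, but the spirit is the same):

    import numpy as np

    # Two replicas trained on different slices of data,
    # each holding the same set of named parameter arrays.
    replica_a = {"w1": np.random.randn(4, 4), "b1": np.random.randn(4)}
    replica_b = {"w1": np.random.randn(4, 4), "b1": np.random.randn(4)}

    # "Recombining" them is just element-wise averaging of the parameters.
    merged = {name: (replica_a[name] + replica_b[name]) / 2 for name in replica_a}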

Thanks to everyone in this thread for all the great links and explanations!!
I have some background but I'm probably not the best person in the world to explain.

The important thing about the transformers model is that it's the first one we have found which keeps unlocking more and more powerful and general cognitive abilities the more resources we throw at it (parameters, exaflops, datasets). I saw some interview with Ilya Sutskever where he says this; it almost certainly won't be the last or best one, but it was the first one.

Why was it the first one? How were these guys so clever and other ones couldn't figure it out?

OK so first you need some context. There is a lot of 'Newton standing on the shoulders of giants' going on here. If all of these giants were around in the 1970s, it probably would have been invented then. Heck for all we know something as good was invented in the 1970s but our computers were too smol to benefit from it. This is what John Carmack is currently looking into.

To really notice the scaling benefits of the transformer architecture, they needed to run billion parameter transformer models on linear-algebra-accelerating GPU chips using differentiable programming frameworks. These are some of the giants we are standing on. The research and development pipeline for these amazing GPUs like [thousands of tech companies -> ASML -> TSMC -> NVIDIA] didn't exist until not so long ago. The special properties of transformers wouldn't have been discovered so soon without this hardware stack.

Another giant we are standing on is the differentiable-programming linear algebra libraries and frameworks like Theano, TensorFlow, PyTorch or JAX. There have been things like this before under the name 'mathematical programming' (CPLEX, for example), but they weren't as accessible. 'Differentiable programming' is a newish term for what used to be called 'automatic differentiation', where 'differentiation' means essentially the same as the calculus derivative. Informally it means that these libraries can predict the effect of any tiny input change on the output as a computationally cheap side-effect of computing the given output, even for complicated calculations. This capability makes optimization easier; in particular it generalizes the 'backpropagation' algorithm of traditional artificial neural networks.
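A toy illustration of that 'cheap derivative as a side-effect' idea for a one-parameter model (real frameworks do the same bookkeeping automatically across billions of parameters):

    # loss for a one-parameter "model": L(w) = (w*x - t)^2
    x, t, w = 3.0, 7.0, 1.5

    loss = (w * x - t) ** 2
    grad = 2 * (w * x - t) * x    # derivative of the loss w.r.t. w, via the chain rule

    # A tiny nudge to w changes the loss by roughly grad * nudge:
    eps = 1e-4
    nudged_loss = ((w + eps) * x - t) ** 2
    print(grad * eps, nudged_loss - loss)   # the two numbers are approximately equal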

What is the transformer model, in more nerdy terms?

At one level, it's just a complicatedly parameterized function, where you fit the parameters by training on data. This viewpoint puts the importance on the computational power applied to training the model, with the help of differentiable programming. Some will probably guess that the details of the model architecture don't really matter as long as it has a sickening amount of parameters, exaflops and data. Some version of this viewpoint is probably true, in my opinion.

More specifically, the transformer architecture is like a chain of black box differentiable 'soft' lookup tables. The soft queries and keys and values are each lists of floating point numbers (for example a single soft query is a list of numbers, called a vector) and these vectors are stacked into matrices and the soft lookup is processed quickly with fast matrix multiplication tricks. Importantly, all of this is happening inside of a differentiable programming framework which lets you cheaply answer questions about how any small change to the input will affect the output. This capability is used for training, by making trillions of billions of tiny changes to the floating point numbers in the multiplication matrices in the boxes. At the end, the fully trained chain of black box functions can be used to compute a probability distribution over the next token in the message, which lets you generate messages or translate between languages or whatever.

It’s not really something you need to understand unless you’re an ML researcher.

I guess the ELI5 (with a BUNCH of details left out) is “Transformers: what if you didn’t have to process sentences as a sequence of words, but rather as a picture of words.”

>t’s not really something you need to understand unless you’re an ML researcher

What kind of an answer is this? No it’s something you should understand if you are at all interested.

It’s like saying “I have no basis in differential equations or complex numbers or Newtonian mechanics or experimental findings on the wavelike nature of light but could somebody explain the Schrödinger equation derivation like I’m five?”

Like, there’s a reason we don’t explain things to five year olds.

Also I have a good ELI5 answer, so I don’t know what else you want dude lol so petty

Who is Yegge and why should I care? All I found on google is that he wrote code for google and blogs.
He's been a popular tech writer for a couple of decades. He's written some very influential pieces in the past - here's a copy of one of his most famous (I don't think the original is available any more): https://gist.github.com/chitchcock/1281611
> (I don't think the original is available any more)

That's because this one was not a blog post, it was a Google+ post that he intended to share internally with Google only. He had other great rants on his actual blog though, about (vague memory here) Emacs, game development, language design, etc.

He's an engineer who's been involved in (or been close to) some of the most influential technologies/products of the last 10 years, and his blog posts are quite celebrated and respected.

Think of him as an essayist. Like Paul Graham (pg on this site)

I first heard about him after an internal post/rant of his about Google+ got leaked and went viral.

My impression is that he is a Kardashian - famous for being famous - in the software world.
