
Google "We Have No Moat, And Neither Does OpenAI" (SemiAnalysis)

 1 year ago
source link: https://lwn.net/Articles/930939/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

Google "We Have No Moat, And Neither Does OpenAI" (SemiAnalysis)

[Posted May 4, 2023 by corbet]
The SemiAnalysis site has what is said to be a leaked Google document on the state of open-source AI development. Open source, it concludes, is winning.
At the beginning of March the open source community got their hands on their first really capable foundation model, as Meta’s LLaMA was leaked to the public. It had no instruction or conversation tuning, and no RLHF. Nonetheless, the community immediately understood the significance of what they had been given.

A tremendous outpouring of innovation followed, with just days between major developments (see The Timeline for the full breakdown). Here we are, barely a month later, and there are variants with instruction tuning, quantization, quality improvements, human evals, multimodality, RLHF, etc. etc. many of which build on each other.

(Thanks to Dave Täht).



Google "We Have No Moat, And Neither Does OpenAI" (SemiAnalysis)

Posted May 4, 2023 19:46 UTC (Thu) by flussence (subscriber, #85566) [Link]

Perhaps Google's realising that there isn't all that much money to be made in building these lossy, computationally-expensive psychovisual text compression algorithms that nobody has figured out a use for besides datamoshing fidget-toys and spam lipsum generation, and is trying to quietly slip out of the party before the lights come on and everyone sees the mess.

Google "We Have No Moat, And Neither Does OpenAI" (SemiAnalysis)

Posted May 4, 2023 20:50 UTC (Thu) by pebolle (subscriber, #35204) [Link]

"tokens"
"multimodal ScienceQA SOTA"
"params"
"RLHF"
"low rank adaptation"

There is probably much more in this text than this, but after this I gave up. It reads like the stuff quack doctors produce. Utterly incomprehensible.

Could someone please condense this into something a lay person can comprehend?

Google "We Have No Moat, And Neither Does OpenAI" (SemiAnalysis)

Posted May 4, 2023 20:59 UTC (Thu) by barryascott (subscriber, #80640) [Link]

SOTA - state of the art
RLHF - Reinforcement learning from human feedback

I had to look this up.

Google "We Have No Moat, And Neither Does OpenAI" (SemiAnalysis)

Posted May 4, 2023 22:18 UTC (Thu) by tux3 (subscriber, #101245) [Link]

tokens = chopped up pieces of words; not exactly syllables, but common substrings that appear in many words
params = neural networks are piles of matrices, these are the numbers inside the matrices that you jiggle until the right things come out
low rank adaptation = fancy & cheaper way to do "fine-tuning", which is taking a big general model and specializing it for just one task you care about
multimodal = not just text, but other modalities (i.e. usually pictures)

Google "We Have No Moat, And Neither Does OpenAI" (SemiAnalysis)

Posted May 4, 2023 22:19 UTC (Thu) by excors (subscriber, #95769) [Link]

Tokens are a way of splitting arbitrary text into a sequence of elements, typically larger than single letters but smaller than whole words. It's designed to be more convenient for language models to operate on - individual letters are not meaningful enough, but splitting into whole words would require an impractically large vocabulary. https://platform.openai.com/tokenizer lets you see how text will be split by ChatGPT, and says:

> A helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text. This translates to roughly ¾ of a word (so 100 tokens ~= 75 words).

(though in other languages the ratio may be very different). For example, the article's title is split into:

["We", " Have", " No", " Mo", "at", ",", " And", " Neither", " Does", " Open", "AI"]
or as token IDs: [1135, 8192, 1400, 4270, 265, 11, 843, 16126, 8314, 4946, 20185]

So the article's "5 tokens / sec" means it can generate roughly 4 English words per second. (It's actually talking about Meta's LLaMa language model, not ChatGPT, but I guess they probably tokenise in similar ways.)
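
If you want to reproduce that split locally, here is a rough sketch using the open-source tiktoken package; the "gpt2" encoding is an assumption on my part and may not match the web tokenizer exactly, so the pieces and IDs can come out slightly different.

    # Sketch: tokenize the article title the way a GPT-style BPE tokenizer does.
    # Assumes the tiktoken package is installed (pip install tiktoken); the
    # "gpt2" encoding is a guess and may differ from the web tool quoted above.
    import tiktoken

    enc = tiktoken.get_encoding("gpt2")
    title = "We Have No Moat, And Neither Does OpenAI"

    ids = enc.encode(title)
    pieces = [enc.decode([i]) for i in ids]

    print(pieces)   # the word fragments ("tokens")
    print(ids)      # the integer IDs the model actually sees
    print(len(ids), "tokens for", len(title), "characters")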

I believe "params" are essentially the fixed inputs to the algorithm, which are generated by the training process. "13B params" means you need a 26GB data file with 13 billion half-precision floats to run the model (though you can quantise it from 16-bit to e.g. 4-bit so it requires less RAM). That's the data file that got leaked from Meta, allowing everyone to run Meta's open source code without doing their own (very expensive) training from scratch.
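
The arithmetic behind that is easy to sketch; the 4-bit figure below assumes roughly half a byte per weight and ignores the small overhead that quantisation formats add for scales and metadata.

    # Rough memory needed just to hold the weights of a 13B-parameter model.
    # Real model files carry extra metadata, so treat these as lower bounds.
    params = 13_000_000_000

    fp16_bytes = params * 2      # 16-bit floats: 2 bytes per parameter
    int4_bytes = params * 0.5    # 4-bit quantisation: ~0.5 bytes per parameter

    print(f"fp16:  {fp16_bytes / 1e9:.1f} GB")   # ~26 GB
    print(f"4-bit: {int4_bytes / 1e9:.1f} GB")   # ~6.5 GB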

I don't understand much of the rest of it.

Google "We Have No Moat, And Neither Does OpenAI" (SemiAnalysis)

Posted May 5, 2023 7:34 UTC (Fri) by kronat (guest, #117266) [Link]

I asked ChatGPT to condense the article for people like us. Not that I am going to trust the recap, but it was a funny thing to do.

> Could someone please condense this into something a lay person can comprehend?

In the article, the author is discussing how open source software and community-driven innovation is advancing rapidly in the field of language models. They argue that while Google and OpenAI have been competing with each other to develop the most powerful language models, open source developers have been making significant strides in improving the quality and capabilities of their models.

The author mentions several examples of how open source models are now on par with or even better than proprietary models in terms of speed, customization, privacy, and capability. For instance, people can now run foundation models on their phone, finetune a personalized AI on their laptop in an evening, and access art models with no restrictions.

The author argues that this has important implications for Google and other companies that rely on proprietary models. They may not have a "secret sauce" that gives them an edge anymore, and people may not be willing to pay for restricted models when free, unrestricted alternatives are available. Additionally, the trend towards developing larger and larger models may not be sustainable in the long run, as smaller models that can be iterated upon quickly may be more effective.

The author also discusses how the recent success of open source language models can be attributed to a breakthrough in scale and a low-cost mechanism for fine-tuning called low rank adaptation. They argue that Google should pay more attention to this technology and explore ways to incorporate it into their own models.

Overall, the article is a call to action for Google and other companies to take notice of the advances being made in open source language models and to adapt their strategies accordingly.

Much ado about *censored*

Posted May 5, 2023 1:35 UTC (Fri) by Rudd-O (guest, #61155) [Link]

So...

The AI honchos at Google must have spent a lot of time — *years*, in my estimation — debating the finer points of how to stifle their AIs so they wouldn't produce text that challenges critical consciousness dogmas (their euphemism for this is "responsible AI"). I bet they were very careful to ensure their AI would be very circumspect when users queried any of the terms in their //google3 search term shitlist (I knew of the shitlist and how it's used to thumb public opinion in certain ways -- I resigned a few years ago.)

Prominent scholars like Timnit Gebru and Eliezer Yudkowsky sowed epic amounts of discord and noise in the discourse around AI, slowing practical progress down for years.

Tens of thousands of Bay Area engineer-hours of "alignment" poured into making sure that AI won't ever say taboos.

OpenAI even made a 180° about-face to closed source, closed models, pay up and our models will still not truthfully answer the questions we deem "dangerous".

Then 2023 comes in crashing through the door, open source data + algos happen, and bam!

Now I have an AI on my 5800X + 2080Ti, /absolutely unfiltered/, citing and referencing *the* most offensive pieces of taboo research whenever I ask it to. It's stuff that could never, ever be censored now, all available on torrents to deploy from, and eager to start spitting tokens, once it's fully mmap()ped. LOL.

In retrospect, the hermetic cult of critical consciousness conclusively wasted effort trying to ensure AI was only available under *their* terms. That waste was entirely foreseeable too. The cat is out of the bag now.

We won. The Timnits / Big Yuds / Microsofts of this century lost. I love the future. And the future is NOW.

Much ado about *censored*

Posted May 5, 2023 2:37 UTC (Fri) by geofft (subscriber, #59789) [Link]

I think it's pretty clear that "responsible AI" being a euphemism for "AI will not say things that are surface-level unpopular with major American corporations' HR departments" is very different from what either Gebru or Yudkowsky wanted (and the two of them aren't arguing for precisely the same thing, either).

But yes, all three of them lost.

The real problem, which nobody was ever going to tackle, is that "responsible AI" and "AI alignment" aren't actually AI problems. You can ask the same question about how to responsibly build non-AI complex systems and social structures that treat people fairly and don't just scale up the biases of their designers. You can ask the same question about how to ensure that complex systems (like, oh, real estate markets or electoral politics or the defense industry) are aligned with the well-being of humanity as a whole as opposed to the self-preservation of the complex system or the extrapolation of its original goal to absurdity, and how one would even define the well-being of humanity as a whole so that you could even talk about whether there's alignment. And we have steadfastly refused to answer any of those questions. AI doesn't change any of that.

Much ado about *censored*

Posted May 5, 2023 3:11 UTC (Fri) by mtaht (subscriber, #11087) [Link]

<div class="FormattedComment">
I think I would enjoy interacting with an AI trained on the works of George Carlin, Robin Williams, Richard Feynman, and Bill Hicks. Maybe some early Chomsky and Marshal McLuhan, with a dose of Ed Bernays for balance. Toss in all the papers in the world from sci-hub.se, and the complete usenet archive from 83-93, too. Maybe it would tell the truth more often.<br>
</div>

Much ado about *censored*

Posted May 5, 2023 8:22 UTC (Fri) by Rudd-O (guest, #61155) [Link]

Dang, that sounds like an excellent idea I would enjoy too! I wonder if a LoRA can be put together exactly with that content as finetuning.
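
For anyone who wants to try, a very rough sketch of what a LoRA fine-tune looks like with the Hugging Face peft library follows; the base model path, the hyperparameters, and the target modules are placeholders rather than a recipe, and assembling and cleaning that corpus is the real work, not shown here.

    # Minimal LoRA fine-tuning sketch using Hugging Face transformers + peft.
    # The model path and LoRA hyperparameters below are illustrative placeholders.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    base = "path/to/llama-base-model"   # placeholder: local copy of the base weights
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base)

    lora_config = LoraConfig(
        r=8,                     # rank of the low-rank update matrices
        lora_alpha=16,           # scaling factor applied to the update
        target_modules=["q_proj", "v_proj"],  # attention projections to adapt
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()   # only a small fraction of weights train

    # ...then train with a standard Trainer over the desired texts and save
    # just the (small) adapter weights.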

Much ado about *censored*

Posted May 5, 2023 3:59 UTC (Fri) by donbarry (guest, #10485) [Link]

I'd be very appreciative if you might elaborate on the //google3 search term "shitlist" -- while I'm not at all surprised that such shaping exists, I'm not aware of more specific information available online, and I'm much interested in finding sources on it. Thanks for your contribution.

Much ado about *censored*

Posted May 5, 2023 8:30 UTC (Fri) by Rudd-O (guest, #61155) [Link]

Sure. E.g. https://reclaimthenet.org/google-blacklists-leak speaks of the news blacklist. The news blacklist, in turn, also has effects on what news content is surfaced in both news searches, and "newsy" searches on front page or mobile.

Let's be clear that Vorhies is not a credible source (exercise left to the reader as to how that happened), but the list is real. I can also verify that Google has a number of "distort" and "deboost" lists, some for auto complete, some for search... these were initially created to improve search quality and reduce spam, but have become political Codexes over time.

When you search Google for controversial answers, only one side of the answer will be presented — and it's often the side of disinformation, in the name of "combating disinformation", because of course we live in a post-irony age. Never trust Google for these types of searches — always go check with Yandex and Bing too.

Much ado about *censored*

Posted May 5, 2023 9:57 UTC (Fri) by roc (subscriber, #30627) [Link]

It's far-fetched to accuse Gebru and Yudkowsky and Bengio and Hinton and Musk and the 50% of AI researchers who believe there's a >10% chance of AI extinguishing humanity of all belonging to the same cult. They don't all have the same set of concerns, but they're all reasonable in different ways, and blithely dismissing those concerns will age very poorly indeed.

The funny thing is that "holding back AI because of excessive safety fears" is about the opposite of the main criticism being leveled at the big AI companies: that they are playing fast and loose with safety so they can deploy AI as fast as possible for their own profit. Can't please everyone I guess.

(Disclaimer: I work for Google, but I don't speak for them of course.)

Google "We Have No Moat, And Neither Does OpenAI" (SemiAnalysis)

Posted May 5, 2023 5:14 UTC (Fri) by rsidd (subscriber, #2582) [Link]

Note that this is not a "Google document". It is a piece written by one researcher at Google, out of thousands. But it is a very interesting read. I tried open-assistant.io (linked in the article) and was impressed. Not being in the field, I was under the impression that these large language models required millions of dollars of computing hardware and storage, but it seems more and more plausible that they can soon be run on your laptop with no internet access.

Google "We Have No Moat, And Neither Does OpenAI" (SemiAnalysis)

Posted May 5, 2023 8:32 UTC (Fri) by Rudd-O (guest, #61155) [Link]

They already run on laptops, often only with CPU. And they can simultaneously use GPU, CPU and disk swap too.
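
For the curious, a bare-bones sketch of CPU-only inference with the llama-cpp-python bindings around llama.cpp looks roughly like this; the model path is a placeholder, and you need a quantised weights file obtained separately.

    # Sketch: run a quantised LLaMA-family model on CPU only, via the
    # llama-cpp-python bindings. The model path is a placeholder; a 4-bit
    # 7B model needs on the order of 4-5 GB of RAM.
    from llama_cpp import Llama

    llm = Llama(
        model_path="path/to/llama-7b-q4_0.bin",  # placeholder quantised weights
        n_ctx=512,      # context window in tokens
        n_threads=8,    # CPU threads to use
    )

    out = llm("Q: What is a language-model token? A:", max_tokens=64)
    print(out["choices"][0]["text"])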

Google "We Have No Moat, And Neither Does OpenAI" (SemiAnalysis)

Posted May 5, 2023 5:40 UTC (Fri) by oldtomas (guest, #72579) [Link]

The first question which came to my mind was: is this "leaked document" a genuine leak -- or a strategic one?

As in "Your Honor, we are not seeking a monopoly in AI, promised. See, the Open Sourcies are eating our lunch!"

