
Show HN: I Remade the Fake Google Gemini Demo, Except Using GPT-4 and It's Real

 9 months ago
source link: https://news.ycombinator.com/item?id=38596953


The "magic" of the fake Gemini demo was the way it seemed like the LLM was continually receiving audio + video input and knew when to jump in with a response.

It appeared to be able to wait until the user had finished the drawing, or even to jump in slightly before the drawing finished. At one point the LLM was halfway through a response and then saw the user was now colouring the duck in blue, and started talking about how the duck appeared to be blue. The LLM also appeared to know when a response wasn't needed because the user was just agreeing with the LLM.

I'm not sure how many people noticed that on a conscious level, but I'm positive everyone noticed it subconsciously, and felt the interaction was much more natural, and much more advanced than current LLMs.

-----------------

Checking the source code, the demo takes screenshots of the video feed every 800ms, waits until the user finishes talking and then sends the last three screenshots.

While this demo is impressive, it kind of proves just how unnatural it feels to interact with an LLM in this manner when it doesn't have continuous audio-video input. It's been technically possible to do this kind of thing for a while, but there is a good reason why nobody has tried to present it as a product.
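For anyone curious what that capture-then-send flow looks like in practice, here is a rough Python sketch. The chat.completions call matches the public GPT-4V API shape of the time, but the helper names and the buffering logic are my assumptions, not the demo's actual code.

    # Hypothetical sketch of the capture-then-send loop; not the demo's actual code.
    import base64
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    frames = []        # rolling buffer of JPEG bytes grabbed from the webcam

    def on_frame(jpeg_bytes):
        """Called every ~800ms with the latest webcam frame."""
        frames.append(jpeg_bytes)
        del frames[:-3]  # keep only the last three frames

    def on_final_transcript(text):
        """Called once the speech recognizer marks the utterance as final."""
        content = [{"type": "text", "text": text}]
        for jpeg in frames:
            b64 = base64.b64encode(jpeg).decode()
            content.append({
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
            })
        resp = client.chat.completions.create(
            model="gpt-4-vision-preview",
            messages=[{"role": "user", "content": content}],
            max_tokens=300,
        )
        return resp.choices[0].message.content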

100%.

I made this demo in 2-3 hours, and I did use the "wait until the dictation results are finalized" technique which is safer (i.e. the dictation transcription is more robust) but slower.

For another demo - https://www.youtube.com/watch?v=fxS7OKh_4vc - I kept feeding the "in progress" transcription results into GPT and that was super super awesome & fast. It would just require more work to deal with all of the different timings going on (i.e. there's the speech itself from the person, the time to transcribe, sending the request to GPT, "sync'ing" it to where the person is (mentally/in their speech) at the point where GPT replies, etc.)

But yeah. Real time/continuous talk is absolutely where it's at. Should GPT be available as a websocket...?!

I have a rough demo of real time continuous voice chat here, ~1 second response latency: https://apps.microsoft.com/detail/9NC624PBFGB7

Basically it starts computing a response every time a word comes out of the speech recognizer, and if it is able to finish its response before it hears another word then it starts speaking. If more words come in then it stops speaking immediately; in other words, you can interrupt it. It feels so much more natural in conversation than ChatGPT's voice mode due to the low latency and continuous listening with the ability to interrupt.
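A minimal sketch of that "respond unless interrupted" loop, with placeholder llm/tts hooks rather than the actual app's internals:

    # Toy event loop for the barge-in behaviour described above; llm and tts are placeholders.
    import threading

    class BargeInChat:
        def __init__(self, llm, tts):
            self.llm = llm          # callable: transcript string -> reply string
            self.tts = tts          # object with .speak(text) and .stop()
            self.words = []
            self.generation = 0     # bumped every time a new word arrives

        def on_word(self, word):
            """Called by the streaming speech recognizer for each recognized word."""
            self.tts.stop()         # the user is talking: stop speaking immediately
            self.words.append(word)
            self.generation += 1
            gen = self.generation
            # start computing a response for the transcript as it stands right now
            threading.Thread(target=self._respond,
                             args=(" ".join(self.words), gen)).start()

        def _respond(self, transcript, gen):
            reply = self.llm(transcript)
            # speak only if no new word arrived while we were generating
            if gen == self.generation:
                self.tts.speak(reply)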

There are a lot of things that need improvement. Most important is probably that the speech recognition system (Whisper) wasn't designed for real time and is not that reliable or efficient in a real time mode. I think some more tweaking could improve reliability considerably. But also very important is that it doesn't know when not to respond. It will always jump in if you stop speaking for a second, and it will always try to get the last word. A first pass at fixing that would be to fine tune a language model to predict whose turn it is to speak.

There are also a lot of things that this architecture will never be able to do. It will never be able to correct your pronunciation (e.g. for language learning), it will never be able to identify your emotions based on vocal cues or express proper emotions in its voice, it will never be able to hear the tone of a singing voice or produce singing itself. The future is in eliminating the boundaries between speech-to-text and LLM and text-to-speech, with one unified model trained end-to-end. Such a system would be able to do everything I mentioned and more, if trained on enough data. And further integrating vision will help with conversation too, allowing it to see the emotions on your face and take conversational cues from your gaze direction and hand gestures, in addition to all the other obvious things you can do with vision such as chat about something the camera can see or something displayed on your screen.

Do you have a version that doesn't need Windows and/or a Microsoft account? Or an uncut video of someone using it?

What's the horizon after which you reset the input instead of appending to it? Does that happen if the user lets the system finish speaking?

Great question. Right now that happens, somewhat arbitrarily, if the user lets the system finish speaking the first sentence of its response. If the user interrupts before that, then it's considered a continuation of the previous input. If the user interrupts after that, it's still an interruption (and, importantly, the language model's response must be truncated in the conversation context because the user didn't hear it all), but it starts a new input to the LLM. This could be handled better as well. Basically any heuristics like this that are in the system should eventually be subsumed into the AI models so that they can be responsive to the conversation context.
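As a rough illustration of the bookkeeping described above (the function and argument names are invented, not taken from the app):

    # Hypothetical handling of an interruption, following the heuristic above.
    def handle_interruption(history, heard_reply, finished_first_sentence):
        """history: list of {"role", "content"} turns; heard_reply: the part of the
        model's reply that was actually spoken aloud before the user cut in."""
        # the model's turn is truncated to what the user actually heard
        history.append({"role": "assistant", "content": heard_reply})
        if finished_first_sentence:
            # past the first sentence: the new speech starts a fresh user input
            return "new_input"
        # otherwise the new speech is treated as a continuation of the previous input
        return "continuation"

In a real system these branches would ideally be learned rather than hard-coded, as the comment above notes.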
This would be super cool with Mistral on local machine

As a deaf person I have been watching "live" speech recognition demos for 20-30 years. All look great. Using it in day to day life is crazy cause if you have 1 mistake per 10 words it builds up over time to be supremely annoying.

Yeah my friend and I were just talking about continuous stream input multimodal LLMs. Does anyone know if there is a technical limitation preventing continuous stream input data? Like it’s listening to you practice guitar and then when you get to a certain point it says “okay let’s go back and practice that section again”. It seems the normal approach of next token prediction falls flat when there is a continuous stream of tokens and it only sometimes needs to produce output.

What is that type of input called in the literature and what research has been done on it? Thanks!

At a purely technical level, no, as long as the model can output a null token. E.g. imagine training using a transcript of two people talking. What would be a single text token is a tuple of two tokens, one per person. Each segment where a person is not talking is a series of null tokens, one per ‘tick’ of time. In an actual conversation, one token in the tuple is user input and one is GPT prediction. Just disregard the user half of the tuple when determining whether the GPT should ‘speak’.

The real world challenge is threefold. First, null tokens would be massively over-represented in training and, by extension, in outputs. Second, at a computational level, outputting a continuous stream of tokens would be absurdly expensive. Third, there is not nearly as much training data of interspersed conversations as of monologues (e.g. research papers, this comment, etc.).
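A toy illustration of that framing, with a made-up silence token and hand-written ticks rather than real training data:

    # Illustrative only: a two-speaker conversation discretized into fixed time ticks,
    # with an explicit (hypothetical) silence token for quiet ticks.
    NULL = "<null>"

    # one (user_token, assistant_token) pair per tick of time
    ticks = [
        ("how",  NULL),
        ("are",  NULL),
        ("you",  NULL),
        (NULL,   "fine,"),
        (NULL,   "thanks"),
        (NULL,   NULL),   # both silent for a tick
    ]

    # Build next-token training examples: given the history of pairs, predict the
    # assistant half of the next tick (which is usually NULL, hence the imbalance
    # mentioned above).
    examples = []
    history = []
    for user_tok, asst_tok in ticks:
        examples.append((list(history), asst_tok))
        history.append((user_tok, asst_tok))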

I think you should be able to do it out of the box if you just keep sending the tokens, and after that ask the GPT "is there a mistake? Respond with just "yes" or "no". Why does there have to be something like a "null" token?

It might seem expensive, yes, but at least it only has to respond with one token.

There's a null token because the question was about you not having to ask if there was a mistake. It would just default to constantly producing a null token until it had a real response.

Yeah it seems the notion of time is sort of not built in conceptually to current systems. You could pick a fixed time constant like 0.1 seconds or 1 second, but it's clear that it's sort of missing something more fundamental.

I think if the same LLM were trained on audio and video input instead of text, and produced audio output, including silence tokens, then the notion of time would get "built in". Audio continuation without translation to text has been shown to work. Mixing it with text is also possible. But all this would require a massive network that may even be difficult for the world's biggest companies to train and serve at any kind of scale. So it's more of an engineering problem than a theoretical one imho.

Also imho, I think until the context/memory problem is fully solved we won't really see the AI as having any kind of agency. But continuous, low latency interaction would certainly feel like a step towards that.

I think probably training on pause tokens or something similar would be the key to something like this. Maybe it's not even necessary. Maybe if you just tell GPT-4 to output something like .... every time it thinks it should wait for a response (you wouldn't need to wait for the user to finish then), things would be a lot smoother.

Yes, you could probably fine-tune (or even zero-shot) a LLM to handle the "knowing when to jump in" use case.

The real problem is that it's simply too computationally expensive to continually feed audio and video into one of these massive LLMs just in case it might decide to jump in.

I was wondering if you could train a lightweight monitoring model that continually watches the audio/video input and only tries to work out when the full-sized LLM might want to jump in and generate a response.
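A rough sketch of that two-stage idea, with invented features, weights and threshold standing in for an actually trained gating model:

    # Hypothetical "should the big model respond?" gate in front of the expensive LLM.
    import numpy as np

    def gate_features(audio_chunk, frame_diff):
        """Cheap signals: mean speech energy plus how much the last frame changed."""
        return np.array([float(np.abs(audio_chunk).mean()), float(frame_diff)])

    class ResponseGate:
        def __init__(self, weights=(5.0, 2.0), bias=-1.0):
            # stand-in for a small trained model (logistic regression, tiny transformer, ...)
            self.w = np.array(weights)
            self.b = bias

        def should_respond(self, feats):
            score = 1.0 / (1.0 + np.exp(-(feats @ self.w + self.b)))
            return score > 0.5

    # Only when the gate fires do we pay for the full multimodal LLM call:
    # if gate.should_respond(gate_features(audio, diff)): reply = big_llm(context)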

Since the human brain is a clump of regions, all interconnected and interacting (for example, one may focus their attention elsewhere until their name is called), having a light model wait for an important cue makes sense more than just fiscally.

One time I was so distracted, I missed an entire paragraph someone said to me, walked to my car, drove away, and 5 minutes later processed it.

Yeah, one thing I've noticed myself do is that when I'm focused on something else and someone suddenly gets my attention I'll replay the last few seconds of the conversation in my head to get context on what was being talked about before I respond. That seems pretty trivial to do with a LLM; it doesn't need to be using 100% of its "brainpower" at all times.

I wanted to plug a GPT4 chatbot into a group chat, so it could react to what people said. In the end I abandoned the idea because it was too hard for me to figure out when it should talk vs let people talk between them.

Couldn't you instruct the model to only say something when it is important or when it's being addressed directly, and otherwise output just an empty response which isn't rendered?
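Something like the following prompt-plus-filter arrangement is one way to sketch that (the PASS sentinel is just an example, not a tested recipe):

    # Hypothetical prompt-level version: the bot answers with a sentinel token when it
    # has nothing to add, and the sentinel is simply never rendered in the chat.
    SYSTEM_PROMPT = (
        "You are one participant in a group chat. Reply only when you are addressed "
        "directly or have something genuinely important to add. Otherwise reply with "
        "exactly the single word PASS."
    )

    def maybe_render(reply: str):
        """Return None for responses that should not be shown in the chat."""
        return None if reply.strip() == "PASS" else reply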
For now LLMs can only answer, but they will soon be able to prompt YOU.

True conversation is going to be very interesting.

One easy improvement would be to stop the video capture automatically via a combination of silence detection and motion detection
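Both checks can be quite crude; a rough sketch with invented thresholds:

    # Naive silence + motion detection to decide when to stop capturing; thresholds are guesses.
    import numpy as np

    def is_silent(audio_chunk, threshold=0.01):
        """audio_chunk: float samples in [-1, 1]; silent if the RMS energy is tiny."""
        rms = np.sqrt(np.mean(audio_chunk.astype(np.float64) ** 2))
        return rms < threshold

    def is_still(prev_frame, frame, threshold=8.0):
        """Grayscale uint8 frames; still if the mean absolute pixel change is small."""
        diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16))
        return diff.mean() < threshold

    # stop capture once both hold for, say, a second's worth of consecutive chunks/frames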
I don't get why companies lie like this. How much do they have to gain? It seems like they actually have a lot to lose.

What's crazy to me is that these tools are wildly impressive without the hype. As a ML researcher, there's a lot of cool things we've done but at the same time almost everything I see is vastly over hyped from papers to products. I think there's a kinda race to the bottom we've created and it's not helpful to any of us except maybe in the short term. Playing short term games isn't very smart, especially for companies like Google. Or maybe I completely misunderstand the environment we live in.

But then again, with the discussions in this thread[0], maybe there are a lot of people so ethically bankrupt that they don't even realize that what they're doing is deceptive. Which is an entirely different and worse problem.

[0] https://news.ycombinator.com/item?id=38559582

Because the same day they released the video our CEO was messaging me saying we have to get on Google's new stuff because it's so much better than GPT-4.

I said I was skeptical of the demo but, like all developments in the field, will try it out once they release it.

I have a daily call with my CEO and I'm counting down until he mentions Google's demo and asks why we're not using their AI technology. Then it will take an exorbitant amount of energy on my part to get him to understand how that demo was doctored.

This also seems like equally poor decision making. Wouldn't you want to at least try things out before you make a hard switch? Chasing hype is costly.

The important thing for Google is to be part of the short list right now before adoption crystalizes. Over the next year or two a large chunk of early (late?) adopters will firmly commit to one (or maybe two) vendors for their generative AI and those decisions will be sticky.

So now is the time to do whatever it takes to get into the conversation ... which Google successfully did I think.

Welcome to IT.... no seriously, this is how a lot of executives behave in IT.

Google stock rallied 5%-ish after the demo (though the stock didn't move immediately). Then it gave back about 1% once the news broke that it was faked.

That's not a great answer. We need to answer the counterfactual question of "How much would Google's stock have rallied after a realistic demo was given?" I would not have been surprised if the answer was also 5%. Almost certainly Google's stock would have risen after announcing Gemini. There are other long term questions too, like how this feeds into growing dissent against Google and erodes trust. But what the stock did is only a small part of a much bigger and far more important question.

Edit: Can someone explain the downvotes? Is there a error in my response? I'm glad to learn but I'd appreciate a bit better feedback signal so that I can do so better instead of guessing.

Economist here who studies exactly this type of counterfactual analysis. You are completely right: the effect of Gemini can only be estimated if we factor in what the Alphabet stock price would have been in the same time but in a world without Gemini. This is actually very standard in financial economics. This type of effect can be calculated with econometric techniques that compare before/after for “treated” vs “untreated” units, but in instances such as these, where only one or a few units were affected, like Alphabet stock amongst hundreds of other companies, one could use techniques such as “synthetic controls”. The intuition is to use other companies’ data to estimate before Gemini how Alphabet stock prices move over time, and then use that relationship to estimate a post-Gemini version of no-Gemini Google. The difference between the actual stock price and that counterfactual is the effect of interest; whether it is a significant effect or just random noise can be established by a number of auxiliary statistical tests. For more info, see [0].

[0] Abadie, A. (2021). Using synthetic controls: Feasibility, data requirements, and methodological aspects. Journal of Economic Literature, 59(2), 391-425.
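For intuition, a heavily simplified toy version of that procedure, with ordinary least squares standing in for the constrained weights the real method uses:

    # Toy synthetic-control sketch; real applications need the weight constraints and
    # inference procedures described in Abadie (2021).
    import numpy as np

    def synthetic_control(target_pre, donors_pre, donors_post):
        """target_pre: (T_pre,), donors_pre: (T_pre, J), donors_post: (T_post, J)."""
        # least-squares weights over donor stocks, fit on the pre-announcement window only
        w, *_ = np.linalg.lstsq(donors_pre, target_pre, rcond=None)
        counterfactual = donors_post @ w
        return w, counterfactual

    # estimated effect = actual post-announcement price - counterfactual price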

Well, I don't mean with/without Gemini; I mean the deceptive marketing of Gemini vs. a counterfactual where they produced non-deceptive marketing. Other than that nitpicking, I appreciate the backup and the source. Counterfactual testing is by no means easy, so good luck with your work! My gf is an economist but on the macro side. You guys have it tough. I'm a bit confused why people are disliking my comment but mostly that they are doing so without explanation lol.

The broad investor community was spooked by Gemini GA being delayed to Q1, so this stunt was a good stopgap / distraction.

And likely caused more long term harm, since if they had to fake this they're likely further behind.

That's tomorrow's problem, not today's. They are hoping to solve the issue by then, or find a new way to fake it for another quarter.

Also pretty critically it moves the reckoning past the Q4/Q1 boundary, where annual budgets and performance reviews are being decided. That could be the difference between layoffs vs. hiring in your department, between departures vs. high morale, or between re-orgs vs. continued productivity. If you can fake it for an extra couple weeks, it might mean getting another year of continued funding, and with an extra year of continued funding you might just deliver what you originally promised.

Haha, this is the right Realpolitik interpretation for sure.

> I would not have been surprised if the answer was also 5%

Given the sensitivity of the market with any company perceived to be lagging behind AI developments, I wouldn't be surprised that their stock price would drop by a couple % if the demo underwhelmed.

So the whole thing is basically a ethical question of "how far can we go with polishing our demo until it becomes an unacceptable fake?" In an ideal world you'd never want to be caught in such a dilemma.

One question I've long had is why short-term changes in stock prices seem to matter so much to companies like these. Is it just that the short-term changes are seen as harbingers of longer-term trends, or is there a concrete reason to play games to get temporary boosts to stock price?

I've been using the term Goodhart's Hell to describe what I see as a system of metric hacking and shortsightedness. I do think there are systems encouraging people to meet metrics for metrics' sake without stopping to understand the metric, what it actually measures, and most importantly, how that differs from the actual goals. Because all metrics are proxies and far from complete, even for things that sound really simple. I think this is because we've gotten so good at measuring that we've innately forgotten about the noisiness of the system, whereas when we weren't good at measuring that was just forced upon us in a clear manner.

But I really don't get it either. One of the things that really sets humans apart from other animals is our capacity to perform long term planning and prediction. Why abandon that skill instead of exploiting it to its maximum potential?

A CEO's/company's job is to maximize instantaneous shareholder value. Anything else is a waste of time, since investors are assumed to take long-term vs short-term risk preference into their own hands. The company is essentially a machine that investors can dip into and dip out of at any time, so it doesn't make sense to make decisions to move the stock price over a pre-planned certain time horizon. The reason companies invest in any long-term projects at all is because the net present value of those projects affects the stock price.

That certainly feels like how they're acting, but what about the actual incentive structures causes it to be this way? Why does the company benefit from being a good machine for generating short-term profits for short-term investors?

This is what I'm trying to get at with my questions. I think it is easy to give an answer of short term thinking and move on. But that's not a real answer. I want to understand the incentive structures that have led to this (and this "Goodhart's Hell") situation and if it's unstable as I presume.

There's a specific concern because of the large sentiment that the big difference between industry and academia is that in industry your products have to actually work. Ignoring the weird premise and ignoring the different TRL context, I'm not convinced that industry needs to actually make usable products. Aren't we also all complaining about how shit isn't working adequately? Google failing at seemingly simple things, and decreased quality of search. Amazon being a shitty monopoly and exploiting that to be anti-consumerism and not dealing with obvious spam and product manipulation that could be detected by a Naive Bayes filter. Or Twitter being overrun with spam bots that also could trivially be detected from a Naive Bayes filter or my block list, but blocking them actually decreases the visibility of my tweets so I'm actively encouraged by the platform to let these bots exist and follow me and like my comments.

And why would investors want this? I don't buy that there's an exclusive desire for quarterly profits; Wall Street does look to diversify its portfolios with long-term blue-chip stocks as well as short-term gambles. It would take absolute insanity for a CEO or board to let short-term incentives drive a well-established company.

I really do think there's something going on but we're afraid to ask the deeper questions because we don't know the answers but I want to be encouraging that discussion even if it is thinking out loud. But maybe that insanity exists and this frustration is a result of a demonstration of it. But I don't want to believe that because I think humans are capable of so much more. That even the average person is better than an LLM but there's just issues of communication. Because even idiocracy is driven by incentive structures so I don't buy the "lol people dumb" argument, even if using better words and wrapped up in a nice bow.

Short term changes definitely affect medium to long term prices. Because at one level the stock price is more like a casino and isn't actually related to the company's performance. e.g. See the 5 year history of GameStop. Its price once increased due to random activity from redditors and its stock price is still elevated because of that.

It made me feel, more than I have ever felt before, that Google is now run by non-technical business people who don't seem to understand something: many people have at least some awareness of how this technology works, and those people are probably going to be part of the decision-making process on whether to use it and other Google products, can immediately see that it is faked, and are often the type of people who react very negatively to such deceptive practices.

> What's crazy to me is that these tools are wildly impressive without the hype.

My wife and I were talking about this yesterday, and I made this exact point! I told her I’m convinced Google was deceptive like this for the Wall Street crowd and normies, because to techies and researchers who actually understand AI, the extra BS is unnecessary if the technology is legitimately impressive.

Google screws up every business opportunity, including wantonly buying small successful businesses and killing them. Dishonesty is a fundamental part of the company.

Relatedly, I saw in the thread that people call these types of deceptions “smoke and mirrors” or a “dog and pony show”. What happened to “Potemkin”?!

The nice thing about "Potemkin" is that there's a decent chance the video was also designed to fool their own CEOs (in response to an impossible request), just as the Potemkin Villages were used to fool the country's own ruler.

I had never previously heard of that term but it does seem apt. I think idioms are often cultural and can change rapidly; while one might seem ubiquitous in your group, it isn't in another. Another term I think might be apt, but a bit less so, is snake oil, or snake oil salesman.

Perhaps this is revealing my ignorance, but I've never even heard of Potemkin before

If you grew up post-USSR era it has probably fallen out of the lexicon for younger folk…

Here is the headline Business Today published, just in case you wonder why businesses do this:

"Google Gemini Outperforms Most Human Experts & GPT-4 I Artificial intelligence I Google’s DeepMind".

It's all marketing. Same reason why Satya publicly posted that Sama + others were joining a new team at MSFT to continue, should the OpenAI thing not work out.

I'm not sure how that really responds to my question with an explanation. I'm well aware that it's marketing and I'd hope my comment makes that clear. The question is why oversell the product, and frankly by a lot. Because people are going to find out, I mean the intent is that they use it after all.

I'm sure the marketing team can come up with good marketing that also isn't deceitful. The question is why pull a con when you've already got something of value that customers would legitimately buy?

> I'm well aware that it's marketing and I'd hope my comment makes that clear. The question is why oversell the product, and frankly by a lot.

Most marketing sells the dream, not the reality. There are just many shades of grey (although 50 tends to sell well).

I'm still not sure how that is responding to my comment. Have I said something that makes me seem naive of the existence of snake oil salesmen? I'm actually operating under the assumption that my comment, and especially followup, explicitly demonstrate my awareness of this.

Because it may significantly delay those customers from buying the other competing product. Yea, Google has something of value, but OpenAI seems to have something of more value and Google is frantically trying to keep OpenAI from eating the whole market.

> I don't get why companies lie like this.

The answer is always “money”. All you have to do is think “what line of thought would lead someone to believe that by lying in this manner they’ll either lose less money or make more of it?”

> Playing short term games isn't very smart, especially for companies like Google. Or maybe I completely misunderstand the environment we live in.

It could be the principal-agent problem. The agent (employee and management) is optimizing for short-term career benefits and has no loyalty to Google's shareholders. They can quit after 3 years, so reputation damage to Google doesn't matter that much. But the shareholders want agents to optimize for longer-term things like reputation. Aligning those incentives is difficult. Shareholders try with good governance and material incentives tied to the stock price with a vesting schedule, but you're still going to get a level of misalignment.

I suppose this is where a cult-like culture of mission alignment can deliver value. If you convince/select your agents (employees) into actually believing in the mission, alignment follows from that.

Yeah I think that makes some sense. But you would think the CEO and top execs of the company would be trying to balance these forces rather than letting one dominate. You need pressures for short term but you can't abandon long term planning for short term optimization. Anyone who's worked with basic RL systems should be keenly aware of that and I'm most certain they teach this in business school. I mean it's not like these things don't come up multiple times a year.

There are some other explanations too. Maybe they thought the deception would fly under the radar, so it was rational according to cost-benefit analysis given available information. Maybe they fell for the human psychological bias of overvaluing near-term costs/benefits and undervaluing long-term costs/benefits. Maybe some deception was used internally when the demo was communicated to senior execs. Maybe the ego of being second place to OpenAI was too much and the shame avoidance/prestige seeking kicked in.

I think it's because, while I think these LLMs are incredibly interesting and can be very useful, they're less than what the hype is and the valuations are based on the hype.

Thank you for creating this demo. This was the point I was trying to make when the Gemini launch happened. All that hoopla for no reason.

Yes - GPT-4V is a beast. I’d even encourage anyone who cares about vision or multi-modality to give LLaVA a serious shot (https://github.com/haotian-liu/LLaVA). I have been playing with the 7B q5_k variant for the last couple of days and I am seriously impressed with it. Impressed enough to build a demo app/proof-of-concept for my employer (will have to check the license first or I might only use it for the internal demo to drive a point).

Update: For anyone else facing the commercial use question on LLaVA - it is licensed under Apache 2.0. Can be used commercially with attribution: https://github.com/haotian-liu/LLaVA/blob/main/LICENSE

The code is licensed under Apache 2.0, but the weights are CC BY-NC 4.0 according to the README, so no commercial use unfortunately.

It's so great. I've been using this vision model to rename all the files in my Pictures folder. For example, the one-liner:
    llamafile --temp 0 \
        --image ~/Pictures/lemurs.jpg \
        -m llava-v1.5-7b-Q4_K.gguf \
        --mmproj llava-v1.5-7b-mmproj-Q4_0.gguf \
        --grammar 'root ::= [a-z]+ (" " [a-z]+)+' \
        -p $'### User: What do you see?\n### Assistant: ' \
        --silent-prompt 2>/dev/null |
      sed -e's/ /_/g' -e's/$/.jpg/'
Prints to standard output:
    a_baby_monkey_on_the_back_of_a_mother.jpg
This is something that's coming up in the next llamafile release. You have to build from source to have the ability to use grammar and --silent-prompt on a vision model right now.

Weights here: https://huggingface.co/jartine/llava-v1.5-7B-GGUF/tree/main

Sauce here: https://github.com/mozilla-Ocho/llamafile

Truly grateful for your work on cosmopolitan, cosmo libc, redbean, nudging POSIX towards realizing the unachieved dream and also for contributing to llama.cpp. It’s like wherever I look, you’ve already left your mark there!

To me, you exemplify and embody the spirit of OSS, and to top that - you seem to be just an amazing human. You are an inspiration for me and many others. And even though I know I’ll never ever get close, you make me want to try. Thank you. :)

That's cool! I've been a fan of your projects here since redbean was released, and if I understood C I would be more excited about the underlying tech that runs all these tools, but I'm more of an algorithm designer and back-end data processing system programmer (I use Python), so watching the progression of your technology is very impressive but I barely understand how it works :)
I am now convinced that Google DeepMind really had nothing in terms of SOTA LLMs. They were just bluffing.

I remember when ChatGPT was released, Google was saying that they had much much better models that they were not releasing because of AI safety. Then they released PaLM and PaLM 2, saying that it was time to release these models to beat ChatGPT. It was not a good model.

Then they hyped up Gemini, and if Gemini Ultra is the best they have, I am not convinced that they have a better model. So this is it.

So in one year, we went from "Google has to have the best model, they just do not want to release it" to "they have the infrastructure and data and the talent to make the best model." What they really had was nothing.

haha yes it was entirely possible with gpt4v. literally just screenshot and feed in the images and text in chat format, aka “interleaved”. made something similar at a hackathon recently. (https://x.com/swyx/status/1722662234680340823). the bizarre thing is that google couldve done what you did, and we wouldve all been appropriately impressed, but instead google chose to make a misleading marketing video for the general public and leave the rest of us frustrated nerds to do the nasty work of having to explain why the technology isnt as seen on tv yet; making it seem somehow our fault

i am curious about the running costs of something like this

I made 77 requests to the GPT-vision API while developing/demo'ing this, and that resulted in a $0.47 bill. Pretty reasonable!

Hi Greg,

Congratulations, great demo! The $0.47 bill seems reasonable for an experiment, but imagine someone doing a task of this complexity as a daily job - let's say 100x, or a little more than 4 hours - the bill would be $47/day. It feels like there's still an opportunity for a cheaper solution. Have you or someone else experimented with e.g. https://localai.io/ ?

if i did not have your comment history i'd have sworn you worked for localai.io

I’ve recently been trying to actually use Google’s AI conversational translation app that was released a while back and has had many updates and iterations since.

It’s completely unusable for real conversation. I’m actually in a situation where I could benefit from it and was excited to use it because I remember watching the demo and how natural it looked but was never able to actually try it myself.

Now having used it, I went back and watched their original demo and I’m 100% convinced all or part of it was faked. There is just no way this thing ever worked. If they can’t manage to make conversational live translation work (which is a lot more useful than drawing a picture of a duck) I have high doubts about this new AI.

Seems like the exact same situation to me. It’s insane to me how much nerve it must take to completely fake something like this.

[tangential to this really cool demo] JPEG images being the only possible interface to GPT-4 feels wasteful. the human eye works on the delta between "frames", not the image itself. I wonder if the next big step that would allow real-time video processing at high resolutions is to have the model's internal state operate on keyframes and deltas similar to how video codecs like MPEG work.
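A crude approximation with today's image-only interface would be to send a frame only when it differs enough from the last keyframe; purely illustrative:

    # Send a frame to the model only when it differs enough from the last sent keyframe.
    import numpy as np

    def select_keyframes(frames, threshold=12.0):
        """frames: iterable of grayscale uint8 arrays; yields the frames worth sending."""
        last_sent = None
        for frame in frames:
            if last_sent is None or \
               np.abs(frame.astype(np.int16) - last_sent.astype(np.int16)).mean() > threshold:
                last_sent = frame
                yield frame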
When Google talks about Gemini's "multi-modal", they include "video" in the list of modes. It's totally possible they don't actually mean video, and just mean frames like in this demo. They haven't elaborated on it anywhere that I've seen.

Their technical report clarifies that video is just a sequence of frames fed as images.

Lol at choosing the name Sagittarius, which is exactly across from Gemini in the Zodiac

I remember there was speculation that Facebook named their vaporware cryptocurrency Libra (later, “Diem”) as a jab at the longtime rival Winklevoss twins, who had started a crypto exchange called Gemini. I have no idea how astrologically clever that would be.

Libra is one of two other air signs - the other is Aquarius. They're 4 months offset from Gemini.

The latency is excusable as this is through the API. Inference on local infrastructure is almost instant so this demo would smoke everything else if this dude had access.
I am now convinced that Google DeepMind really had nothing in terms of state-of-the-art language models (SOTA LLMs). They were just bluffing. I remember when ChatGPT was released; Google was saying that they had much better models they were not releasing due to AI safety. Then they released Palm and Palm 2, saying it's time to beat ChatGPT with these models. However, it was not a good model.

They then hyped up Gemini, and if Gemini Ultra is the best they have, I am not convinced that they have a better model.

Sundar's code red was genuinely alarming because they had to dig deep to make this Gemini model work, and they still ended up with a fake video. Even if Gemini was legitimate, it did not beat GPT-4 by leaps and bounds, and now GPT-5 is on the horizon, putting them a year behind. It makes me question if they had a secret powerful model all along

Great demo, I laughed at the final GPT response too.

Honestly: it would be fun to self-host some code hooked up to a mic and speakers to let kids, or whoever, play around with GPT4. I’m thinking of doing this on my own under an agency[0] I’m starting up on the side. Seems like a no-brainer as an application.

[0]: https://www.divinatetech.com

Snader's Law: "Any sufficiently advanced technology is indistinguishable from a rigged demo."
I had been working on an idea for an interface "Sorting Hat" system to help kids at schools know whether something was for trash, compost, or recycling. While I had been hacking on it for a bit, Greg's "demo" was much better integrated than what I could do, so thanks Greg!

I did add ElevenLabs support to make it a little more snazzy sounding...

So, here it is: the "Compost/Trash/Recycle Sorting Hat, Built on Sagittarius" https://github.com/n8fr8/CompostSortingHatAI

You can see a realtime, unedited YouTube demo video of my kid testing it out here: https://www.youtube.com/watch?v=-9Ya5rLj64Q

Wow, this is super cool! From the code it seems like the speech to text and text to speech are using the browser’s built-in features. I always forget those capabilities even exist!

Looks like, again, this doesn't have GPT-4 processing video so much as a stack of video frames, concatenated and sent as a single image. But much closer to real!

I just found out it gets worse: turns out GPT-4 isn't processing images so much as arrays of pixels!

And worse: turns out GPT-4 isn't processing pixels so much as integers representing in a position in some color space like RGB!

And worse! turns out GPT-4 isn't processing integers so much as series of ones and zeroes!

Now that this is public knowledge, I'm willing to bet this was the ugly "less than candid" truth that the board sacked Sam Altman over.

There is a significant difference. Video has a temporal component (frames tend to be correlated with previous ones), and vision LLMs do not have some sort of hidden states to keep track of that temporal component.

Using captions to bridge this only works to a certain extent (you're giving text descriptions of what happened in the past, not what had _actually_ happened).

As they say, quantity has a quality all of its own. If the framerate of a video is so slow as to be a slideshow, then it’s arguably not video anymore. Video has a connotation of appearing temporally continuous to the naked eye.

How does a video differ from a stack of video frames? Isn’t that all a video is? A bunch of images stuck together and played back quickly?

I'd guess you'd miss any audio that way. But otherwise, yeah a video is a stack of images.

You could say that this demo is processing a 2.4s video that is 1.25fps.

In a technical sense most video is compressed using motion prediction algorithms so the preprocessing on the data is already significantly different to static images, containing more compressed information. Only the key frames are actually full images and they only make up 1-5% of the frames.

On top of that the video container usually provides synchronized audio packets.

Is "it's just processing frames" the new "it's just predicting the next token"?
s.gif
Nothing new here at all: trivializing the intellectual achievements of machines is SOP. This will continue until machines have surpassed every conceivable benchmark. At that point we'll be left with only our epic hubris.
s.gif
Since audio is processed separately this isn't just close to real. it is real. After all what is video if not a stack of frames! :D
s.gif
The video is an actual live demo without any editing or other tricks involved and even includes reasonable mistakes and the code used. It is not close to real, it's just real.
The part that really confuses me is the lack of a "*some sequences simulated" disclaimer.
s.gif
The Gemini demo had a disclaimer in the beginning, albeit not a very clear one.
s.gif
ANY indication the video was fake? I see none. The example at hand is nice and all, but if you say someone is faking you better have receipts.
s.gif
I’m not making the claim, OP is. It’s not hard to arrive at this conclusion yourself though. Here is Yannick’s take on it:

https://youtu.be/zut38E-BHH0?si=oj3S3qWLgvx3743I

If you need more than that, then I’ll admit to being more cynical about big tech companies than you are.

Indeed, they've been sad starting in 2010 and on (maybe before)... all the projects they kill, their IP theft, them doing evil, etc.
Lmao! So, presumably, they could have hired Greg to improvise almost the exact same demonstration, but with evidence it works. I don't know how much Greg costs, but I'll bet my ass it's less than the cost in investor sentiment after getting caught committing fraud. Not saying you're cheap. Just cheaper.