Google is facing criticism from AI experts over a deceptive Gemini promotional video released Wednesday that appears to show its new AI model recognizing visual cues and interacting vocally with a person in real time. As reported by Parmy Olson for Bloomberg, Google has admitted that was not the case. Instead, the researchers fed still images to the model and edited together successful responses, partially misrepresenting the model’s capabilities.
“We created the demo by capturing footage in order to test Gemini’s capabilities on a wide range of challenges,” a Google spokesperson told Olson. “Then we prompted Gemini using still image frames from the footage, & prompting via text.” As Olson points out, Google filmed a pair of human hands doing activities, then showed still images to Gemini Ultra, one by one. Google researchers interacted with the model through text, not voice, then picked the best interactions and edited them together with voice synthesis to make the video.
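For context, that frame-by-frame, text-prompt workflow is roughly what anyone can reproduce with Google's publicly accessible multimodal models. Below is a minimal sketch using the google-generativeai Python SDK and the gemini-pro-vision model (the Ultra model implied by the video is not publicly available); the model name, API key handling, and frame filename are illustrative assumptions, not details from Google's demo.

```python
# Minimal sketch: prompting a Gemini vision model with one still frame plus text,
# roughly mirroring the frame-by-frame workflow Google described.
# Assumes `pip install google-generativeai pillow` and an API key from Google AI Studio.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # illustrative placeholder

# The demo reportedly used Gemini Ultra internally; gemini-pro-vision is the
# publicly accessible multimodal model, used here purely as a stand-in.
model = genai.GenerativeModel("gemini-pro-vision")

frame = Image.open("frame_0042.png")  # a single still captured from video footage
response = model.generate_content(
    [frame, "What do you see in this image? Describe it briefly."]
)
print(response.text)
```

Each response in the ad would correspond to one call like this on one still frame, with the best outputs curated and stitched together afterward.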
Right now, running still images and text through massive large language models is computationally intensive, which makes real-time video interpretation largely impractical. That was one of the clues that first led AI experts to believe the video was misleading.
“Google’s video made it look like you could show different things to Gemini Ultra in real time and talk to it. You can’t,” Olson wrote in a tweet. A Google spokesperson said that “the user’s voiceover is all real excerpts from the actual prompts used to produce the Gemini output that follows.”
Playing catch-up with hype
Over the past year, upstart OpenAI has embarrassed Google by pulling ahead in generative AI technology, some of which traces its origins to Google research lab breakthroughs. The search giant has been scrambling to catch up since early this year, putting great effort into ChatGPT competitor Bard and large language models like PaLM 2. Google framed Gemini as the first true rival to OpenAI’s GPT-4, which is still widely seen as the market leader in large language models.
At first, it seemed like everything was going according to plan. After Google announced Gemini on Wednesday, the company’s stock rose 5 percent. But soon, AI experts began picking apart Google’s perhaps overhyped claims of “sophisticated reasoning capabilities,” including benchmarks that might not mean much, eventually focusing on the Gemini promotional video with its fudged results.
In the contested video, titled “Hands-on with Gemini: Interacting with multimodal AI,” we see a view of what the AI model apparently sees, with the model’s responses displayed on the right side of the screen. The researcher draws squiggly lines and sketches of ducks, then asks Gemini what it can see. The viewer hears a voice, apparently that of Gemini Ultra, responding to the questions.
As Olson points out in her Bloomberg piece, the video also does not specify that the recognition demo likely uses Gemini Ultra, which is not yet available. “Fudging such details points to the broader marketing effort here: Google wants us [to] remember that it’s got one of the largest teams of AI researchers in the world and access to more data than anyone else,” Olson wrote.
Taken alone, and if represented more accurately (as they are on this Google blog page), Gemini’s image recognition abilities are nothing to sneeze at. They seem roughly on par with the capabilities of OpenAI’s multimodal GPT-4V (GPT-4 with vision) AI model, which can also recognize the content of still images. But edited together seamlessly for promotional purposes, the responses made Google’s Gemini model seem more capable than it is, and that got many people hyped up.
“I can’t stop thinking about the implications of this demo,” tweeted TED organizer Chris Anderson on Thursday. “Surely it’s not crazy to think that sometime next year, a fledgling Gemini 2.0 could attend a board meeting, read the briefing docs, look at the slides, listen to every one’s words, and make intelligent contributions to the issues debated? Now tell me. Wouldn’t that count as AGI?”
“That demo was incredibly edited to suggest that Gemini is far more capable than it is,” replied pioneering software engineer Grady Booch. “You’ve been deceived, Chris. And shame on them for so doing.”