
Exclusive: Gemini’s data analysis capabilities aren’t as good as Google claims

One of the selling points of Google’s flagship generative AI models, Gemini 1.5 Pro and 1.5 Flash, is the amount of data they can supposedly process and analyze. In press briefings and demos, Google has repeatedly claimed that the models can accomplish previously impossible tasks thanks to their “long context,” such as summarizing multiple hundred-page documents or searching across scenes in film footage.

But new research suggests that the models aren’t actually very good at those things.

Two separate studies investigated how well Google’s Gemini models and others make sense of enormous amounts of data (think works the length of War and Peace). Both found that Gemini 1.5 Pro and 1.5 Flash struggle to answer questions about large datasets correctly; in one series of document-based tests, the models gave the right answer only 40% to 50% of the time.

“While models like Gemini 1.5 Pro can technically process long contexts, we have seen many cases indicating that the models don’t actually ‘understand’ the content,” Marzena Karpinska, a postdoc at UMass Amherst and co-author of one of the studies, told TechCrunch.

Gemini’s context window is lacking

A model’s context, or context window, refers to the input data (e.g., text) that the model considers before generating output (e.g., additional text). A simple question such as “Who won the 2020 U.S. presidential election?” can serve as context, as can a movie script, a show, or an audio clip. And as context windows grow, so does the size of the documents that fit into them.

The latest versions of Gemini can accept more than 2 million tokens as context. (“Tokens” are subdivided bits of raw data, such as the syllables “fan,” “tas,” and “tic” in the word “fantastic.”) That’s equivalent to roughly 1.4 million words, two hours of video, or 22 hours of audio, the largest context of any commercially available model.
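
For a rough sense of the arithmetic, here is a small sketch in plain Python that relates word counts to the 2-million-token window using Google’s own “2 million tokens is roughly 1.4 million words” figure. It is an illustration only: the ratio and helper names are assumptions, and a real tokenizer (Gemini’s included) splits text differently.

```python
# Rough illustration only: relate word counts to Gemini's advertised
# 2-million-token window using the "2M tokens ~= 1.4M words" figure above.
# Real tokenizers split text differently, so treat this as a ballpark estimate.

TOKENS_PER_WORD = 2_000_000 / 1_400_000  # ~1.43 tokens per word (assumed ratio)
CONTEXT_WINDOW = 2_000_000               # tokens, per Google's stated Gemini specs


def approx_tokens(word_count: int) -> int:
    """Estimate how many tokens a document of `word_count` words occupies."""
    return round(word_count * TOKENS_PER_WORD)


def fits_in_context(word_count: int) -> bool:
    """Check whether the estimated token count fits inside the context window."""
    return approx_tokens(word_count) <= CONTEXT_WINDOW


# War and Peace runs roughly 587,000 words: about 840,000 estimated tokens,
# so it fits in the window more than twice over.
print(approx_tokens(587_000), fits_in_context(587_000))
```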

At a briefing earlier this year, Google showed several pre-recorded demos designed to illustrate the potential of Gemini’s long context capabilities. One had Gemini 1.5 Pro search the transcript of the Apollo 11 moon landing telecast — about 402 pages — for quotes containing jokes, and then find a scene in the telecast that looked like a pencil sketch.

Google DeepMind vice president of research Oriol Vinyals, who led the briefing, described the model as “magical.”

“[1.5 Pro] does these kinds of reasoning tasks across every single page, every single word,” he said.

This may have been an exaggeration.

In one of the aforementioned studies testing these abilities, Karpinska, along with researchers from the Allen Institute for AI and Princeton, asked the models to evaluate true/false statements about fiction books written in English. The researchers chose recent works so the models couldn’t “cheat” by relying on prior knowledge, and they peppered the statements with references to specific details and plot points that would be impossible to grasp without reading the books in their entirety.

Given a statement like “Using his skills as an Apoth, Nusis is able to reverse-engineer the type of portal opened by the Reagent Key found in Rona’s wooden chest,” Gemini 1.5 Pro and 1.5 Flash, after ingesting the relevant book, had to say whether the statement was true or false and explain their reasoning.
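
For readers curious what such a query looks like in practice, below is a minimal sketch using the google-generativeai Python SDK. The model name, prompt wording, and helper function are illustrative assumptions, not the researchers’ actual evaluation code.

```python
# A minimal sketch (not the researchers' setup) of issuing one long-context
# claim-verification query with the google-generativeai SDK. The model name,
# prompt wording, and parsing are illustrative assumptions.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-1.5-pro")


def verify_claim(book_text: str, claim: str) -> str:
    """Ask the model whether a claim about the full book text is TRUE or FALSE."""
    prompt = (
        "Read the book below in its entirety, then decide whether the claim "
        "that follows is TRUE or FALSE and briefly explain your reasoning.\n\n"
        f"BOOK:\n{book_text}\n\n"
        f"CLAIM:\n{claim}\n"
    )
    response = model.generate_content(prompt)
    return response.text  # expected to begin with "TRUE" or "FALSE"
```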


Tested on one book around 260,000 words long (~520 pages), the researchers found that 1.5 Pro answered the true/false statements correctly 46.7% of the time, while Flash answered correctly only 20% of the time. That means a coin flip would answer questions about the book significantly more accurately than Google’s latest machine learning model. Averaging across all the benchmark results, neither model managed to achieve better-than-chance accuracy in answering the questions.

“We’ve noticed that the models have more difficulty verifying claims that require considering larger portions of the book, or even the entire book, compared to claims that can be resolved by retrieving sentence-level evidence,” Karpinska said. “Qualitatively, we also observed that the models struggle with verifying claims about implicit information that is clear to a human reader but not explicitly stated in the text.”

The second of the two studies, co-authored by researchers at the University of California, Santa Barbara, tested the ability of Gemini 1.5 Flash (but not 1.5 Pro) to “reason over” videos, that is, to search through and answer questions about their content.

The co-authors created a dataset of images (e.g., a photo of a birthday cake) paired with questions for the model to answer about the objects depicted in those images (e.g., “What cartoon character is on this cake?”). To evaluate the models, they picked one of the images at random and inserted “distractor” images before and after it to create slideshow-like footage.
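
To make the setup concrete, here is a small illustrative sketch, not the study’s code or data, of how one such “slideshow” item might be assembled: a target image is hidden at a random position among distractor frames, and the question refers only to the target.

```python
# Illustrative sketch (not the study's actual data or code): build one
# "slideshow" item by hiding a target image at a random position among
# distractor frames. File names are made up; the 25-frame count mirrors
# the article's description.
import random


def build_slideshow(target_image: str, distractors: list[str], num_frames: int = 25):
    """Return (frames, target_index) with the target hidden among distractor frames."""
    frames = random.sample(distractors, num_frames - 1)
    target_index = random.randrange(num_frames)
    frames.insert(target_index, target_image)
    return frames, target_index


frames, answer_index = build_slideshow(
    "birthday_cake.jpg",
    [f"distractor_{i:03d}.jpg" for i in range(100)],
)
question = "What cartoon character is on this cake?"
```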

Flash didn’t fare so well. In a test that had the model transcribe six handwritten digits from a “slideshow” of 25 images, Flash got around 50% of the transcriptions right. Accuracy dropped to around 30% with eight digits.

“On real-world question-answering tasks over images, it appears to be particularly hard for all the models we tested,” Michael Saxon, a PhD student at UC Santa Barbara and one of the study’s co-authors, told TechCrunch. “That small amount of reasoning, recognizing that a number is in a frame and reading it, might be what’s breaking the model.”

Google is overpromising with Gemini

Neither study has been peer-reviewed, and neither probes the releases of Gemini 1.5 Pro and 1.5 Flash with 2-million-token contexts. (Both tested the 1-million-token context releases.) And Flash isn’t meant to be as capable as Pro performance-wise; Google advertises it as a low-cost alternative.

Regardless, both studies add fuel to the fire that Google has overpromised, and underdelivered, with Gemini from the start. None of the models the researchers tested, including OpenAI’s GPT-4o and Anthropic’s Claude 3.5 Sonnet, performed well. But Google is the only model provider that has given the context window top billing in its advertisements.

“There’s nothing wrong with simply saying, ‘Our model can accept X number of tokens,’ based on the objective technical details,” Saxon said. “But the question is, what useful thing can you do with it?”

Generative AI in general is coming under increasing scrutiny as businesses (and investors) grow frustrated with the technology’s limitations.

In a pair of recent surveys by the Boston Consulting Group, about half of the respondents, all C-suite executives, said they don’t expect generative AI to bring about substantial productivity gains and that they’re worried about the potential for mistakes and data compromises arising from generative AI-powered tools. PitchBook recently reported that, for two consecutive quarters, generative AI dealmaking at the earliest stages has declined, plummeting 76% from its Q3 2023 peak.

Faced with meeting-summarizing chatbots that conjure up fictional details about people and AI search platforms that basically amount to plagiarism generators, customers are on the hunt for promising differentiators. Google, which has raced, at times clumsily, to catch up with its generative AI rivals, was desperate to make Gemini’s context one of those differentiators.

But the bet appears to have been premature.

“We haven’t settled on a way to really show that ‘reasoning’ or ‘understanding’ over long documents is taking place, and essentially every group releasing these models is cobbling together their own ad hoc evals to make these claims,” Karpinska said. “Without knowing how long-context processing is implemented, and companies do not share these details, it is hard to say how realistic these claims are.”

Google did not respond to a request for comment.

Both Saxon and Karpinska believe the antidotes to generative AI hype are better benchmarks and, in the same vein, a greater emphasis on third-party critique. Saxon notes that one of the more common long-context tests, “needle in a haystack” (cited liberally by Google in its marketing materials), only measures a model’s ability to retrieve particular information, such as names and numbers, from datasets, not to answer complex questions about that information.
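
For contrast, the basic “needle in a haystack” setup Saxon describes can be sketched in a few lines: a single fact is buried in long filler text and the model is asked only to retrieve it, with no deeper reasoning required. The needle, filler, and sizes below are made up for illustration.

```python
# Illustrative sketch of a plain "needle in a haystack" retrieval test:
# bury one fact inside repeated filler paragraphs and ask the model to
# repeat it. The needle, filler, and sizes are invented for this example.
def build_haystack(needle: str, filler: str, num_paragraphs: int, position: int) -> str:
    """Insert the needle sentence at `position` among repeated filler paragraphs."""
    paragraphs = [filler] * num_paragraphs
    paragraphs.insert(position, needle)
    return "\n\n".join(paragraphs)


haystack = build_haystack(
    needle="The secret ingredient is cardamom.",
    filler="The city archives record shipping manifests from the 1800s. " * 5,
    num_paragraphs=2_000,
    position=1_234,
)
question = "What is the secret ingredient mentioned in the document?"
# A retrieval-style test only checks that the model returns "cardamom";
# it says nothing about whether the model can reason over the whole text.
```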

“All of the scientists and most of the engineers using these models essentially agree that our existing benchmark culture is broken,” Saxon said, “so it’s important for the public to understand to take these giant reports containing numbers like ‘general intelligence across benchmarks’ with a massive grain of salt.”
