Google Data Leak Clarification

Over a holiday period in the United States, several posts circulated about an alleged leak of data related to Google rankings. The first publications about the leak focused on “confirming” beliefs long held by Rand Fishkin, but paid little attention to the context of the information and what it really means.

Context matters: Document AI Warehouse

The leaked document is linked to a public Google Cloud platform called the Document AI Warehouse, which is used to analyze, organize, search and store data. This public documentation is titled Document AI Warehouse overview. A Facebook post said the “leaked” data was the “internal version” of Document AI Warehouse’s publicly visible documentation. This is the context of this data.

Screenshot: Document AI Warehouse

@DavidGQuaid tweeted:

“I think it’s clear that this is an external API for building a document repository, as the name suggests”

This seems to throw cold water on the idea that the “leaked” data is inside information about Google Search.

As far as we know at this point, the “leaked data” resembles what is on the public Document AI Warehouse page.

Internal search data leak?

The original SparkToro post did not say that the data originated from Google Search. It says that the person who sent the data to Rand Fishkin is the one who made that claim.

One of the things I admire about Rand Fishkin is that he is meticulously precise in his writing, especially when it comes to caveats. Rand accurately notes that the person who provided the data makes the claim that the data originates from Google Search. No evidence, just an allegation.

He writes:

“I received an email from a person who claims to have access to a massive leak of API documentation from Google’s search department.”

Fishkin himself does not state that former Google employees confirmed the data originates from Google Search. He wrote that the person who emailed the data made that claim.

“The email further claims that these leaked documents have been authenticated by former Google employees and that these former employees and others have shared additional personal information about Google’s search operations.”

Fishkin wrote about a subsequent video meeting in which the leaker revealed that his contact with former Googlers came from meeting them at a search industry event. Again, we have to take the leaker's word about what the former Google employees said, and that they said it after a careful review of the data rather than as an offhand remark.

Fishkin writes that he contacted three former Googlers about it. Notably, these ex-Googlers have not explicitly confirmed that the data is internal to Google Search. They only confirmed that the data appeared to be internal Google information, not that it originated from Google Search.

Fishkin reports what the former Googlers told him:

  • “I didn’t have access to that code when I worked there. But this sure looks legit.”
  • “It has all the hallmarks of an internal Google API.”
  • “It’s a Java-based API. And someone has spent a lot of time following Google’s internal documentation and naming standards.”
  • “I’ll need more time to be sure, but it matches the internal documentation I’m familiar with.”
  • “Nothing I’ve seen in the brief review suggests it’s anything but legit.”

Saying something comes from Google Search and saying it comes from Google are two different things.

Keep an open mind

It is important to keep an open mind about the data because much about it is unconfirmed. For example, it is not known whether this is an internal Search team document. As such, it’s probably not a good idea to treat any of this data as actionable SEO advice.

Furthermore, it is not advisable to analyze the data to specifically confirm long-held beliefs. This is how one gets caught up in confirmation bias.

Definition of confirmation bias:

“Confirmation bias is the tendency to seek, interpret, prefer, and recall information in a way that confirms or supports one’s prior beliefs or values.”

Confirmation bias will cause one to deny things that are empirically true. For example, there has been the idea for decades that Google automatically prevents a new site from ranking, a theory called the Sandbox. People report every day that their new sites and new pages rank in the top ten of Google search almost immediately.

But if you’re a die-hard believer in the Sandbox, then an actual observed experience like this will be dismissed, no matter how many people report the same thing.

Brenda Malone, Freelance Senior SEO Technical Strategist and Web Developer, sent me a message about the Sandbox claims:

“I personally know, from actual experience, that the Sandbox theory is wrong. I just indexed in two days a personal blog with two posts. There is no way a small site with two posts could have been indexed under the Sandbox theory.”

The bottom line here is that if the documentation turns out to be from Google Search, the wrong way to analyze the data is to look for confirmation of long-held beliefs.

What is the Google data leak?

There are five things to keep in mind about the leaked data:

  1. The context of the leaked information is unknown. Is it related to Google Search? Is it for other purposes?
  2. The purpose of the data. Is the information used for real search results? Or was it used for internal management or data manipulation?
  3. Former Google employees did not confirm that the data was specific to Google Search. They only confirmed that it appears to be coming from Google.
  4. Keep an open mind. If you go hunting for vindication of long-held beliefs, guess what? You will find it everywhere. This is called confirmation bias.
  5. Evidence suggests that the data is linked to an external API to build a document repository.

What others are saying about the ‘leaked’ documents

Ryan Jones, who has both extensive SEO experience and a strong understanding of computer science, shared some insightful observations about the so-called data leak.

Ryan tweeted:

“We don’t know if this is for production or testing. I guess it’s mostly for testing potential changes.

We don’t know what is being used for the web or for other verticals. Some things may only be used for Google Home or news, etc.

We don’t know what is input to ML algo and what is used to train against. I assume that the clicks are not direct input, but are used to train a model how to predict clickability. (Except current boosts)

I’m also assuming that some of these fields only apply to training datasets, not all sites.

Am I saying Google didn’t lie? Not at all. But let us consider this leak objectively, not with any preconceived bias.”

@DavidGQuaid tweeted:

“We also don’t know if this is for Google search or Google cloud document retrieval

APIs seem to pick and choose – this is not how I expect the algorithm to be implemented – what if an engineer wants to skip all these quality checks – looks like I want to build a content warehouse app for my corporate knowledge base”

Is the “leaked” data related to Google Search?

At this point, there is no hard evidence that this “leaked” data is actually from Google Search. There is huge ambiguity about the purpose of the data. Notably, there are hints that this data is simply “an external API to build a document repository, as the name suggests,” and is in no way related to how websites rank in Google Search.

It cannot yet be said conclusively that this data did not originate from Google Search, but that is the direction in which the evidence seems to point.

Featured image by Shutterstock/Jaaak
