Has Google Gone Too Far with the Bias Toward Its Own Content?

Posted by ajfried

Since the beginning of SEO time, practitioners have been trying to crack the Google algorithm. Every once in a while, the industry gets a glimpse into how the search giant works and we have an opportunity to deconstruct it. We don’t get many of these opportunities, but when we do, assuming we spot them in time, we try to take advantage of them so we can “fix the Internet.”

On Feb. 16, 2015, news started to circulate that NBC would start removing images and references of Brian Williams from its website.

This was it!

A golden opportunity.

This was our chance to learn more about the Knowledge Graph.

Expectation vs. reality

Often it’s difficult to predict what Google is truly going to do. We expect something to happen, but in reality it’s nothing like we imagined.

Expectation

What we expected to see was that Google would change the source of the image. Typically, if you hover over the image in the Knowledge Graph, it reveals the location of the image.

Keanu-Reeves-Image-Location.gif

This would mean that if the image disappeared from its original source, then the image displayed in the Knowledge Graph would likely change or even disappear entirely.

Reality (February 2015)

The only problem was, there was no official source (this changed, as you will soon see) and identifying where the image was coming from proved extremely challenging. In fact, when you clicked on the image, it took you to an image search result that didn’t even include the image.

Could it be? Had Google started its own database of owned or licensed images and was giving it priority over any other sources?

In order to find the source, we tried taking the image from the Knowledge Graph and “search by image” in images.google.com to find others like it. For the NBC Nightly News image, Google failed to even locate a match to the image it was actually using anywhere on the Internet. For other television programs, it was successful. Here is an example of what happened for Morning Joe:

Morning_Joe_image_search.png

So we found the potential source. In fact, we found three potential sources. Seemed kind of strange, but this seemed to be the discovery we were looking for.

This looks like Google is using someone else’s content and not referencing it. These images have a source, but Google is choosing not to show it.

Then Google pulled the ol’ switcheroo.

New reality (March 2015)

Now things changed and Google decided to put a source to their images. Unfortunately, I mistakenly assumed that hovering over an image showed the same thing as the file path at the bottom, but I was wrong. The URL you see when you hover over an image in the Knowledge Graph is actually nothing more than the title. The source is different.

Morning_Joe_Source.png

Luckily, I still had two screenshots I took when I first saw this saved on my desktop. Success. One screen capture was from NBC Nightly News, and the other from the news show Morning Joe (see above) showing that the source was changed.

NBC-nightly-news-crop.png

(NBC Nightly News screenshot.)

The source is a Google-owned property: gstatic.com. You can clearly see the difference in the source change. What started as a hypothesis is now a fact. Google is certainly creating a database of images.
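For anyone who wants to repeat this check across many results, the comparison can be automated. The following is only a minimal sketch: it assumes you have already copied the image source URLs out of the Knowledge Graph by hand, and the function name and the list of Google-owned hosts are illustrative choices, not anything Google documents.

```python
from urllib.parse import urlparse

# Assumption: the only Google-owned image host we care about is gstatic.com.
GOOGLE_HOSTS = ("gstatic.com",)


def is_google_hosted(image_url: str) -> bool:
    """Return True if the URL's host is gstatic.com or one of its subdomains."""
    host = (urlparse(image_url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain)
               for domain in GOOGLE_HOSTS)


if __name__ == "__main__":
    # Hypothetical URLs for illustration only.
    for url in ("https://t0.gstatic.com/images?q=example",
                "https://www.example.com/show/poster.jpg"):
        print(url, "->", is_google_hosted(url))
```

Matching on the parsed hostname rather than on a substring avoids false positives from look-alike domains or from "gstatic.com" appearing in a path or query string.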

If this is the direction Google is moving, then it is creating all kinds of potential risks for brands and individuals. The implications are a loss of control for any brand that is looking to optimize its Knowledge Graph results. As well, it seems this poses a conflict of interest to Google, whose mission is to organize the world’s information, not license and prioritize it.

How do we think Google is supposed to work?

Google is an information-retrieval system tasked with sourcing information from across the web and supplying the most relevant results to users’ searches. In recent months, the search giant has taken a more direct approach by answering questions and assumed questions in the Answer Box, some of which come from un-credited sources. Google has clearly demonstrated that it is building a knowledge base of facts that it uses as the basis for its Answer Boxes. When it sources information from that knowledge base, it doesn’t necessarily reference or credit any source.

However, I would argue there is a difference between an un-credited Answer Box and an un-credited image. An un-credited Answer Box provides a fact that is indisputable, part of the public domain, and unlikely to change (e.g., what year was Abraham Lincoln shot? How long is the George Washington Bridge?). Answer Boxes that offer more than just a basic fact (an opinion, instructions, etc.) always credit their sources.

There are four possibilities when it comes to Google referencing content:

  • Option 1: It credits the content because someone else owns the rights to it
  • Option 2: It doesn’t credit the content because it’s part of the public domain, as seen in some Answer Box results
  • Option 3: It doesn’t reference it because it owns or has licensed the content. If you search for “Chicken Pox” or other diseases, Google appears to be using images from licensed medical illustrators. The same goes for song lyrics, which Eric Enge discusses here: Google providing credit for content. This adds to the speculation that Google is giving preference to its own content by displaying it over everything else.
  • Option 4: It doesn’t credit the content, but neither does it necessarily own the rights to the content. This is a very gray area, and is where Google seemed to be back in February. If this were the case, it would imply that Google is “stealing” content—which I find hard to believe, but felt was necessary to include in this post for the sake of completeness.

Is this an isolated incident?

At Five Blocks, whenever we see these anomalies in search results, we try to compare the term in question against others like it. This is a categorization concept we use to bucket individuals or companies into similar groups. When we do this, we uncover some incredible trends that help us determine what a search result “should” look like for a given group. For example, when looking at searches for a group of people or companies in an industry, this grouping gives us a sense of how much social media presence the group has on average or how much media coverage it typically gets.

Upon further investigation of terms similar to NBC Nightly News (other news shows), we noticed the un-credited image scenario appeared to be a trend in February, but now all of the images are being hosted on gstatic.com. When we broadened the categories further to TV shows and movies, the trend persisted. Rather than show an image in the Knowledge Graph and reference its actual source, Google tends to show an image and reference a copy in its own database of stored images.

And just to ensure this wasn’t a case of tunnel vision, we researched other categories, including sports teams, actors and video games, in addition to spot-checking other genres.

Unlike terms for specific TV shows and movies, terms in each of these other groups all link to the actual source in the Knowledge Graph.

Immediate implications

It’s easy to ignore this and say “Well, it’s Google. They are always doing something.” However, there are some serious implications to these actions:

  1. The TV shows/movies aren’t receiving their due credit because, from within the Knowledge Graph, there is no actual reference to the show’s official site
  2. The more Google moves toward licensing and then retrieving their own information, the more biased they become, preferring their own content over the equivalent—or possibly even superior—content from another source
  3. It feels wrong and misleading to get a Google Image Search result rather than an actual site because:
    • The search doesn’t include the original image
    • Considering how poor Image Search results are normally, it feels like a poor experience
  4. If Google is moving toward licensing as much content as possible, then it could make the Knowledge Graph infinitely more complicated when there is a “mistake” or something unflattering. How could one go about changing what Google shows about them?

Google is objectively becoming subjective

It is clear that Google is attempting to create databases of information, including lyrics stored in Google Play, photos, and, previously, facts in Freebase (which is now Wikidata and not owned by Google).

I am not normally one to point my finger and accuse Google of wrongdoing. But this really strikes me as an odd move, one bordering on a clear bias to direct users to stay within the search engine. The fact is, we trust Google with a heck of a lot of information with our searches. In return, I believe we should expect Google to return an array of relevant information for searchers to decide what they like best. The example cited above seems harmless, but what about determining which is the right religion? Or even who the prettiest girl in the world is?

Religion-and-beauty-queries.png

Questions such as these, which Google is returning credited answers for, could return results that are perceived as facts.

Should we next expect Google to decide who is objectively the best service provider (e.g., pizza chain, painter, or accountant), then feature them in an un-credited answer box? The direction Google is moving right now, it feels like we should be calling into question their objectivity.

But that’s only my (subjective) opinion.

Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don’t have time to hunt down but want to read!

Reblogged 3 years ago from tracking.feedpress.it

I Can’t Drive 155: Meta Descriptions in 2015

Posted by Dr-Pete

For years now, we (and many others) have been recommending keeping your Meta Descriptions shorter than about 155-160 characters. For months, people have been sending me examples of search snippets that clearly broke that rule, like this one (on a search for “hummingbird food”):

For the record, this one clocks in at 317 characters (counting spaces). So, I set out to discover if these long descriptions were exceptions to the rule, or if we need to change the rules. I collected the search snippets across the MozCast 10K, which resulted in 92,669 snippets. All of the data in this post was collected on April 13, 2015.

The Basic Data

The minimum snippet length was zero characters. There were 69 zero-length snippets, but most of these were the new generation of answer box, which appears organic but doesn’t have a snippet. To put it another way, these were misidentified as organic by my code. The other zero-length snippets were local one-boxes that appeared as organic but had no snippet, such as this one for “chichen itza”:

These zero-length snippets were removed from further analysis, but considering that they only accounted for 0.07% of the total data, they didn’t really impact the conclusions either way. The shortest legitimate, non-zero snippet was 7 characters long, on a search for “geek and sundry”, and appears to have come directly from the site’s meta description:

The maximum snippet length that day (this is a highly dynamic situation) was 372 characters. The winner appeared on a search for “benefits of apple cider vinegar”:

The average length of all of the snippets in our data set (not counting zero-length snippets) was 143.5 characters, and the median length was 152 characters. Of course, this can be misleading, since some snippets are shorter than the limit and others are being artificially truncated by Google. So, let’s dig a bit deeper.
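For readers who want to run the same kind of summary on their own data, here is a minimal sketch. It assumes the snippets are already available as a plain list of strings (the collection step is not shown), and the function name is a hypothetical one chosen for illustration.

```python
from statistics import mean, median


def length_summary(snippets):
    """Summarize display lengths in characters, skipping zero-length snippets."""
    lengths = [len(s) for s in snippets if len(s) > 0]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
    }


if __name__ == "__main__":
    # Made-up sample data; the empty string stands in for a zero-length snippet.
    sample = ["", "a" * 7, "b" * 150, "c" * 160]
    print(length_summary(sample))
```

Reporting the median next to the mean matters here because truncation piles many snippets up near the cut-off, while a few very short ones pull the average down.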

The Bigger Picture

To get a better idea of the big picture, let’s take a look at the display length of all 92,600 snippets (with non-zero length), split into 20-character buckets (0-20, 21-40, etc.):

Most of the snippets (62.1%) cut off as expected, right in the 141-160 character bucket. Of course, some snippets were shorter than that, and didn’t need to be cut off, and some broke the rules. About 1% (1,010) of the snippets in our data set measured 200 or more characters. That’s not a huge number, but it’s enough to take seriously.
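The bucketing itself is simple to reproduce. This is a sketch under the assumption that buckets follow the labels used above (0-20, 21-40, and so on, with each upper bound inclusive); the helper names are hypothetical.

```python
from collections import Counter


def bucket_label(length: int, width: int = 20) -> str:
    """Map a length to its bucket label: 0-20, 21-40, 41-60, ..."""
    index = max(length - 1, 0) // width
    low = 0 if index == 0 else index * width + 1
    return f"{low}-{(index + 1) * width}"


def bucket_counts(lengths, width: int = 20) -> Counter:
    """Count how many lengths fall into each bucket."""
    return Counter(bucket_label(n, width) for n in lengths)


if __name__ == "__main__":
    # Made-up lengths for illustration.
    counts = bucket_counts([7, 150, 152, 160, 317])
    for label, count in counts.items():
        print(label, count)
```

Passing width=5 or width=10 gives the narrower bins used when zooming in on a smaller range.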

That 141-160 character bucket is dwarfing everything else, so let’s zoom in a bit on the cut-off range, and just look at snippets in the 120-200 character range (in this case, by 5-character bins):

Zooming in, the bulk of the snippets are displaying at lengths between about 146-165 characters. There are plenty of exceptions to the 155-160 character guideline, but for the most part, they do seem to be exceptions.

Finally, let’s zoom in on the rule-breakers. This is the distribution of snippets displaying 191+ characters, bucketed in 10-character bins (191-200, 201-210, etc.):

Please note that the Y-axis scale is much smaller than in the previous 2 graphs, but there is a pretty solid spread, with a decent chunk of snippets displaying more than 300 characters.

Without looking at every original meta description tag, it’s very difficult to tell exactly how many snippets have been truncated by Google, but we do have a proxy. Snippets that have been truncated end in an ellipsis (…), which rarely appears at the end of a natural description. In this data set, more than half of all snippets (52.8%) ended in an ellipsis, so we’re still seeing a lot of meta descriptions being cut off.
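The ellipsis proxy is easy to express in code. A minimal sketch, assuming snippets are plain strings and that a truncated snippet ends in either the single ellipsis character or three periods; the function names are hypothetical.

```python
def is_truncated(snippet: str) -> bool:
    """Proxy only: treat a snippet ending in an ellipsis as cut off."""
    text = snippet.rstrip()
    return text.endswith("…") or text.endswith("...")


def truncated_share(snippets) -> float:
    """Fraction of snippets (0.0 to 1.0) that end in an ellipsis."""
    snippets = list(snippets)
    if not snippets:
        return 0.0
    return sum(is_truncated(s) for s in snippets) / len(snippets)


if __name__ == "__main__":
    # Made-up snippets for illustration.
    sample = ["Cut off here …", "A complete sentence.", "Also cut...", "Done"]
    print(f"{truncated_share(sample):.1%} end in an ellipsis")
```

Like the proxy itself, this will miscount the rare description that naturally ends in an ellipsis.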

I should add that, unlike titles/headlines, it isn’t clear whether Google is cutting off snippets by pixel width or character count, since that cut-off is done on the server-side. In most cases, Google will cut before the end of the second line, but sometimes they cut well before this, which could suggest a character-based limit. They also cut off at whole words, which can make the numbers a bit tougher to interpret.

The Cutting Room Floor

There’s another difficulty with telling exactly how many meta descriptions Google has modified – some edits are minor, and some are major. One minor edit is when Google adds some additional information to a snippet, such as a date at the beginning. Here’s an example (from a search for “chicken pox”):

With the date (and minus the ellipsis), this snippet is 164 characters long, which suggests Google isn’t counting the added text against the length limit. What’s interesting is that the rest comes directly from the meta description on the site, except that the site’s description starts with “Chickenpox.” and Google has removed that keyword. As a human, I’d say this matches the meta description, but a bot has a very hard time telling a minor edit from a complete rewrite.

Another minor rewrite occurs in snippets that start with search result counts:

Here, we’re at 172 characters (with spaces and minus the ellipsis), and Google has even let this snippet roll over to a third line. So, again, it seems like the added information at the beginning isn’t counting against the length limit.

All told, 11.6% of the snippets in our data set had some kind of Google-generated data, so this type of minor rewrite is pretty common. Even if Google honors most of your meta description, you may see small edits.
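If you want to compare a displayed snippet against your own meta description, it helps to strip this kind of added prefix first. The sketch below handles only one case and rests on an assumption: that the added date looks like "Jan 26, 2015 – " (abbreviated month, day, year, then a dash). Other Google-generated prefixes, such as result counts, would need their own patterns.

```python
import re

# Assumption: Google-added dates look like "Jan 26, 2015 – " at the very start.
DATE_PREFIX = re.compile(
    r"^(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
    r"\s+\d{1,2},\s+\d{4}\s+[–-]\s+"
)


def strip_date_prefix(snippet: str) -> str:
    """Remove a leading date prefix, if present, and return the rest."""
    return DATE_PREFIX.sub("", snippet, count=1)


if __name__ == "__main__":
    # Made-up snippet for illustration.
    snippet = "Jan 26, 2015 – A description that the site wrote itself."
    print(len(snippet), len(strip_date_prefix(snippet)))
```

Measuring the length after stripping gives a fairer picture of how much of the original description was actually shown.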

Let’s look at our big winner, the 372-character description. Here’s what we saw in the snippet:

Jan 26, 2015 – Health• Diabetes Prevention: Multiple studies have shown a correlation between apple cider vinegar and lower blood sugar levels. … • Weight Loss: Consuming apple cider vinegar can help you feel more full, which can help you eat less. … • Lower Cholesterol: … • Detox: … • Digestive Aid: … • Itchy or Sunburned Skin: … • Energy Boost:1 more items

So, what about the meta description? Here’s what we actually see in the tag:

Were you aware of all the uses of apple cider vinegar? From cleansing to healing, to preventing diabetes, ACV is a pantry staple you need in your home.

That’s a bit more than just a couple of edits. So, what’s happening here? Well, there’s a clue on that same page, where we see yet another rule-breaking snippet:

You might be wondering why this snippet is any more interesting than the other one. If you could see the top of the SERP, you’d know why, because it looks something like this:

Google is automatically extracting list-style data from these pages to fuel the expansion of the Knowledge Graph. In one case, that data is replacing a snippet and going directly into an answer box, but they’re performing the same translation even for some other snippets on the page.

So, does every 2nd-generation answer box yield long snippets? After 3 hours of inadvisable MySQL queries, I can tell you that the answer is a resounding “probably not”. You can have 2nd-gen answer boxes without long snippets and you can have long snippets without 2nd-gen answer boxes, but there does appear to be a connection between long snippets and Knowledge Graph in some cases.

One interesting connection is that Google has begun bolding keywords that seem like answers to the query (and not just synonyms for the query). Below is an example from a search for “mono symptoms”. There’s an answer box for this query, but the snippet below is not from the site in the answer box:

Notice the bolded words – “fatigue”, “sore throat”, “fever”, “headache”, “rash”. These aren’t synonyms for the search phrase; these are actual symptoms of mono. This data isn’t coming from the meta description, but from a bulleted list on the target page. Again, it appears that Google is trying to use the snippet to answer a question, and has gone well beyond just matching keywords.

Just for fun, let’s look at one more, where there’s no clear connection to the Knowledge Graph. Here’s a snippet from a search for “sons of anarchy season 4”:

This page has no answer box, and the information extracted is odd at best. The snippet bears little or no resemblance to the site’s meta description. The number string at the beginning comes out of a rating widget, and some of the text isn’t even clearly available on the page. This seems to be an example of Google acknowledging IMDb as a high-authority site and desperately trying to match any text they can to the query, resulting in a Frankenstein’s snippet.

The Final Verdict

If all of this seems confusing, that’s probably because it is. Google is taking a lot more liberties with snippets these days, whether to better match queries, to add details they feel are important, or to help build and support the Knowledge Graph.

So, let’s get back to the original question – is it time to revise the 155(ish) character guideline? My gut feeling is: not yet. To begin with, the vast majority of snippets are still falling in that 145-165 character range. In addition, the exceptions to the rule are not only atypical situations, but in most cases those long snippets don’t seem to represent the original meta description. In other words, even if Google does grant you extra characters, they probably won’t be the extra characters you asked for in the first place.

Many people have asked: “How do I make sure that Google shows my meta description as is?” I’m afraid the answer is: “You don’t.” If this is very important to you, I would recommend keeping your description below the 155-character limit, and making sure that it’s a good match to your target keyword concepts. I suspect Google is going to take more liberties with snippets over time, and we’re going to have to let go of our obsession with having total control over the SERPs.
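Checking your own pages against the guideline is straightforward. The following is a minimal sketch using only the Python standard library; it assumes you already have the page HTML as a string, treats 155 characters as an inclusive limit (my reading of the guideline, not an official threshold), and uses hypothetical class and function names.

```python
from html.parser import HTMLParser


class MetaDescriptionParser(HTMLParser):
    """Capture the content of the first <meta name="description"> tag."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.description is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "description":
            self.description = attributes.get("content") or ""


def meta_description(html: str):
    """Return the meta description text, or None if the page has none."""
    parser = MetaDescriptionParser()
    parser.feed(html)
    return parser.description


def fits_guideline(description: str, limit: int = 155) -> bool:
    """True if the description is at most `limit` characters, counting spaces."""
    return len(description) <= limit


if __name__ == "__main__":
    # Made-up page for illustration.
    page = '<head><meta name="description" content="Short and sweet."></head>'
    text = meta_description(page)
    print(text, fits_guideline(text))
```

Passing the check only means the description is unlikely to be cut off for length; as noted above, it does not guarantee that Google will display it as written.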

Whats New In SEO Trends 2015 Panel Discussion

2015 can be a record year for your digital marketing efforts. But, Google’s radical changes in 2014 – algorithm updates, Answer Boxes, Knowledge Graph – can be overwhelming to hardworking search…

Reblogged 3 years ago from www.youtube.com

How We Fixed the Internet (Ok, an Answer Box)

Posted by Dr-Pete

Last year, Google expanded the Knowledge Graph to use data extracted (*cough* scraped) from the index to create answer boxes. Back in October, I wrote about a failed experiment. One of my posts, an odd dive into Google’s revenue, was being answer-fied for the query “How much does Google make?”:

Objectively speaking, even I could concede that this wasn’t a very good answer in 2014. I posted it on Twitter, and David Iwanow asked the inevitable question:

Enthusiasm may have gotten the best of us. A few more people got involved (like my former Moz colleague Ruth Burr Reedy), and suddenly we were going to fix this once and for all:

There Was Just One Problem

I updated the post, carefully rewriting the first paragraph to reflect the new reality of Google’s revenue. I did my best to make the change user-friendly, adding valuable information but not disrupting the original post. I did, however, completely replace the old text that Google was scraping.

Within less than a day, Google had re-cached the content, and I just had to wait to see the new answer box. So, I waited, and waited… and waited. Two months later, still no change. Some days, the SERP showed no answer box at all (although I’ve since found these answer boxes are very dynamic), and I was starting to wonder if it was all a mistake.

Then, Something Happened

Last week, months after I had given up, I went to double-check this query for entirely different reasons, and I saw the following:

Google had finally updated the answer box with the new text, and they had even pulled an image from the post. It was a strange choice of images, but in fairness, it was a strange post.

Interestingly, Google also added the publication date of the post, perhaps recognizing that outdated answers aren’t always useful. Unfortunately, this doesn’t reflect the timing of the new content, but that’s understandable – Google doesn’t have easy access to that data.

It’s interesting to note that sometimes Google shows the image, and sometimes they don’t. This seems to be independent of whether the SERP is personalized or incognito. Here’s a capture of the image-free version, along with the #1 organic ranking:

You’ll notice that the #1 result is also my Moz post, and that result has an expanded meta description. So, the same URL is essentially double-dipping this SERP. This isn’t always the case – answers can be extracted from URLs that appear lower on page 1 (although almost always page 1, in my experience). Anecdotally, it’s also not always the case that these organic results end up getting an expanded meta description.

However, it definitely seems that some of the quality signals driving organic ranking and expanded meta descriptions are also helping Google determine whether a query deserves a direct answer. Put simply, it’s not an accident that this post was chosen to answer this question.

What Does This Mean for You?

Let’s start with the obvious – Yes, the v2 answer boxes (driven by the index, not Freebase/WikiData) can be updated. However, the update cycle is independent of the index’s refresh cycle. In other words, just because a post is re-cached, it doesn’t mean the answer box will update. Presumably, Google is creating a second Knowledge Graph, based on the index, and this data is only periodically updated.

It’s also entirely possible that updating could cause you to lose an answer box, if the new data weren’t a strong match to the question or the quality of the content came into question. Here’s an interesting question – on a query where a competitor has an answer box, could you change your own content enough to either replace them or knock out the answer box altogether? We are currently testing this question, but it may be a few more months before we have any answers.

Another question is what triggers this style of answer box in the first place? Eric Enge has an in-depth look at 850,000 queries that’s well worth your time, and in many cases Google is still triggering on obvious questions (“how”, “what”, “where”, etc.). Nouns that could be interpreted as ambiguous also can trigger the new answer boxes. For example, a search for “ruby” is interpreted by Google as roughly meaning “What is Ruby?”:

This answer box also triggers “Related topics” that use content pulled from other sites but drive users to more Google searches. The small, gray links are the source sites. The much more visible, blue links are more Google searches.

Note that these also have to be questions (explicit or implied) that Google can’t answer with their curated Knowledge Graph (based on sources like Freebase and WikiData). So, for example, the question “When is Mother’s Day?” triggers an older-style answer:

Sites offering this data aren’t going to have a chance to get attribution, because Google essentially already owns the answer to this question as part of their core Knowledge Graph.

Do You Want to Be An Answer?

This is where things get tricky. At this point, we have no clear data on how these answer boxes impact CTR, and it’s likely that the impact depends a great deal on the context. I think we’re facing a certain degree of inevitability – if Google is going to list an answer, better it’s your answer than someone else’s, IMO. On the other hand, what if that answer is so complete that it renders your URL irrelevant? Consider, for example, the SERP for “how to make grilled cheese”:

Sorry, Food Network, but making a grilled cheese sandwich isn’t really that hard, and this answer box doesn’t leave much to the imagination. As these answers get more and more thorough, expect CTRs to fall.

For now, I’d argue that it’s better to have your link in the box than someone else’s, but that’s cold comfort in many cases. These new answer boxes represent what I feel is a dramatic shift in the relationship between Google and webmasters, and they may be tipping the balance. For now, we can’t do much but wait, see, and experiment.

Reblogged 4 years ago from tracking.feedpress.it

What’s New In SEO for 2015: The Future of Digital Marketing – Pro Panel

2015 can be a record year for your digital marketing efforts. But, Google’s radical changes in 2014 – algorithm updates, Answer Boxes, Knowledge Graph – can …

Reblogged 4 years ago from www.youtube.com

More Google Answer Boxes, with Bonus Experiment!

Posted by Dr-Pete

Last week, drowned out by the Panda 4.1 rollout, the MozCast Feature Graph detected a significant jump in the presence of answer boxes (+42% day-over-day, up to +44% on September 30th):

This measurement includes all types of “answer” boxes – direct answers, stock quotes, weather forecasts, box scores, and even the new, attributed answer boxes. Digging into the data, it appears that almost the entirety of the jump is in the new style of answer boxes. These are the answers that are extracted from 3rd-party websites, and they look something like this:

The key distinction is that you’ll see a search-result-style title and link below the answer. Separating just this data, the same two-week graph looks like this:

The day-over-day increase from September 25-26 in new answer boxes was +98%, almost doubling the total number in our data set. This clearly represents a significant expansion in Google’s ability to extract and display answers.
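For reference, a day-over-day figure like this is just the relative change between two daily counts. A minimal sketch follows; the counts in the example are made up purely to show the arithmetic and are not the actual MozCast numbers.

```python
def day_over_day_change(previous: float, current: float) -> float:
    """Percentage change from one day's count to the next."""
    if previous == 0:
        raise ValueError("previous count must be non-zero")
    return (current - previous) * 100 / previous


if __name__ == "__main__":
    # Hypothetical counts: 100 answer boxes one day, 198 the next.
    print(f"{day_over_day_change(100, 198):+.0f}%")
```

A change of +98% means the count has nearly doubled, since a full doubling would be +100%.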

The “Winning” Queries

Over 100 queries picked up the new answer boxes in our data set. Below are 10 examples. Keep in mind that any given query may gain or lose its answer box for any given search, depending on factors such as search history, localization, and personalization:

  1. global warming
  2. mba
  3. steampunk
  4. dsl
  5. triathlon
  6. pollution
  7. firewall
  8. activex
  9. vegan
  10. project management

Many of these are general, informational answers, and quite a few of the new answer boxes in our data set seem to be coming directly from Wikipedia. With this update, Google also may have added a new capability – here’s the answer box for #3 above (“steampunk”):

The image on the right is being extracted directly from the article. While we’ve seen some examples of brand boxes with logos, the ability to directly add general images seems to be new. Other new answer boxes are more traditional, such as “mba”:

Many of these new queries seem to be broad, “head” queries, but that could be a result of our data set, which tends to be skewed toward shorter, commercial queries. One four-word query with a new answer box was “girl scout cookies types”:

It’s interesting to note that the more grammatically correct “girl scout cookie types” doesn’t seem to return an answer box. These new answers seem to be very dependent on query structure and how the query matches on-page keywords.

An Experiment in Answers

If Google is pulling more and more answers directly from the index (i.e. our sites), then it stands to reason we could update those answers. A couple of months ago, I noticed that one of my posts was producing an answer box for the search “how much does google make”:

Even as the author of this post, I had to admit that was a pretty terrible answer, especially being 3-4 years out of date. I quickly assembled a Twitter mob to deal with this problem (well, basically Ruth Burr Reedy and David Iwanow), and we unanimously decided something must be done:

I decided to edit the top of the post, adding a user-friendly update for new visitors that gave new numbers for 2013. This went up on July 10th – I posted the update on social, and by later that day the new page was cached.

Two weeks went by, and there was no change to the answer box. Naturally, I assumed this was because the old text was still in place (I had simply added new information). So, on July 24th, I carefully removed the old content (that appears in the answer box) and edited the META description. By the next day, the new page was cached and the new snippet was showing up in Google SERPs.

So, what does that answer box look like today, almost two months later? Look up four paragraphs, because it’s exactly the same. Even though the content used in this answer box is now completely gone, Google is still using it in search results.

While this is only one example, it seems to suggest that these answers are not being extracted and created in real-time – they’re being stored in some sort of internal Google knowledge base. This may sound familiar, if you’ve read anything over the last month about Google’s theoretical Knowledge Vault.

Unlike Freebase-based Knowledge panels and answers, this internal vault can’t be edited directly. Unlike organic results, where changes to our pages are generally reflected on the next crawl-and-cache, these answer boxes are being updated much less frequently. Since these new answers link directly to pages, they could be connecting to information that’s been mismatched for weeks or even months.

At this point, there’s very little anyone outside of Google can do but keep their eyes open. If this is truly the Knowledge Vault in action, it’s going to grow, impacting more queries and potentially drawing more traffic away from sites. At the same time, Google may be becoming more possessive of that information, and will probably try to remove any kind of direct, third-party editing (which is possible, if difficult, with the current Knowledge Graph).

Reblogged 4 years ago from feedproxy.google.com