Spam Score: Moz’s New Metric to Measure Penalization Risk

Posted by randfish

Today, I’m very excited to announce that Moz’s Spam Score, an R&D project we’ve worked on for nearly a year, is finally going live. In this post, you can learn more about how we’re calculating spam score, what it means, and how you can potentially use it in your SEO work.

How does Spam Score work?

Over the last year, our data science team, led by Dr. Matt Peters, examined a large number of potential factors that might predict that a site would be penalized or banned by Google. We found strong correlations with 17 unique factors we call “spam flags,” and turned them into a score.

Almost every subdomain in Mozscape (our web index) now has a Spam Score attached to it, and this score is viewable inside Open Site Explorer (and soon, the MozBar and other tools). The score is simple: it just counts the number of spam flags the subdomain triggers. Our correlations showed that no single flag was, on its own, a reliable sign that a domain was penalized/banned in Google, but firing many flags together had a very strong correlation (you can see the math below).

Spam Score currently operates only on the subdomain level—we don’t have it for pages or root domains. It’s been my experience, and the experience of many other SEOs in the field, that a great deal of link spam is tied to the subdomain level. There are plenty of exceptions—manipulative links can and do live on plenty of high-quality sites—but in our testing, we found that subdomain-level Spam Score was the best solution we could create at web scale. It does a solid job with the most obvious, nastiest spam, and a decent job highlighting risk in other areas, too.

How to access Spam Score

Right now, you can find Spam Score inside Open Site Explorer, both in the top metrics (just below domain/page authority) and in its own tab labeled “Spam Analysis.” Spam Score is available only to Pro subscribers right now, though in the future we may make the score in the metrics section available to everyone (if you’re not a subscriber, you can check it out with a free trial).

The current Spam Analysis page includes a list of subdomains or pages linking to your site. You can toggle the target to look at all links to a given subdomain on your site, to given pages, or to the entire root domain. You can further toggle the source tier to look at the Spam Score for incoming linking pages or subdomains (though in the case of pages, we’re still showing the Spam Score for the subdomain on which that page is hosted).

You can click on any Spam Score row and see the details about which flags were triggered. We’ll bring you to a page like this:

Back on the original Spam Analysis page, at the very bottom of the rows, you’ll find an option to export a disavow file, which is compatible with Google Webmaster Tools. You can choose to filter the file to contain only those sites with a given spam flag count or higher:

Disavow exports usually take less than 3 hours to finish. We can send you an email when it’s ready, too.

WARNING: Please do not export this file and simply upload it to Google! You can really, really hurt your site’s ranking and there may be no way to recover. Instead, carefully sort through the links therein and make sure you really do want to disavow what’s in there. You can easily remove/edit the file to take out links you feel are not spam. When Moz’s Cyrus Shepard disavowed every link to his own site, it took more than a year for his rankings to return!

We’ve actually made the file not wholly ready for upload to Google, in order to be sure folks aren’t too cavalier with this particular step. You’ll need to open it up and make some edits (specifically to lines at the top of the file) to ready it for Webmaster Tools.
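To make that review-and-filter step concrete, here’s a minimal sketch of threshold-based filtering, assuming a hypothetical CSV export with `domain` and `spam_flags` columns (the real export’s layout may differ); the `domain:` directive and `#` comment lines are the format Google’s disavow tool expects:

```python
import csv
import io

def build_disavow(csv_text, min_flags=7):
    """Keep only domains at or above a flag threshold and emit
    Google's disavow format (one 'domain:' directive per line)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    domains = sorted(
        row["domain"] for row in reader
        if int(row["spam_flags"]) >= min_flags
    )
    lines = ["# Reviewed by hand before upload"]
    lines += [f"domain:{d}" for d in domains]
    return "\n".join(lines)

# Toy export, not real Moz data:
export = "domain,spam_flags\nshadylinks.example,11\nfriendlyblog.example,2\n"
print(build_disavow(export, min_flags=7))
```

The threshold of 7 is an arbitrary illustration—pick your own cutoff after reviewing the links by hand.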

In the near future, we hope to have Spam Score in the Mozbar as well, which might look like this: 

Sweet, right? 🙂

Potential use cases for Spam Analysis

This list probably isn’t exhaustive, but these are a few of the ways we’ve been playing around with the data:

  1. Checking for spammy links to your own site: Almost every site has at least a few bad links pointing to it, but until now it’s been hard to know how many potentially harmful links you might have. Run a quick spam analysis and see if there’s enough there to cause concern.
  2. Evaluating potential links: This is a big one where we think Spam Score can be helpful. It’s not going to catch every potentially bad link, and you should certainly still use your brain for evaluation too, but as you’re scanning a list of link opportunities or surfing to various sites, having the ability to see if they fire a lot of flags is a great warning sign.
  3. Link cleanup: Link cleanup projects can be messy, involved, precarious, and massively tedious. Spam Score might not catch everything, but sorting links by it can be hugely helpful in identifying potentially nasty stuff, and filtering out the links that are more probably clean.
  4. Disavow files: Again, because Spam Score won’t perfectly catch everything, you will likely need to do some additional work here (especially if the site you’re working on has done some link buying on more generally trustworthy domains), but it can save you a heap of time evaluating and listing the worst and most obvious junk.

Over time, we’re also excited about using Spam Score to help improve the PA and DA calculations (it’s not currently in there), as well as adding it to other tools and data sources. We’d love your feedback and insight about where you’d most want to see Spam Score get involved.

Details about Spam Score’s calculation

This section comes courtesy of Moz’s head of data science, Dr. Matt Peters, who created the metric and deserves (at least in my humble opinion) a big round of applause. – Rand

Definition of “spam”

Before diving into the details of the individual spam flags and their calculation, it’s important to first describe our data gathering process and “spam” definition.

For our purposes, we followed Google’s definition of spam and gathered labels for a large number of sites as follows:

  • First, we randomly selected a large number of subdomains from the Mozscape index stratified by mozRank.
  • Then we crawled the subdomains and threw out any that didn’t return a “200 OK” (redirects, errors, etc.).
  • Finally, we collected the top 10 de-personalized, geo-agnostic Google-US search results using the full subdomain name as the keyword and checked whether any of those results matched the original keyword. If they did not, we called the subdomain “spam,” otherwise we called it “ham.”

We performed the most recent data collection in November 2014 (after the Penguin 3.0 update) for about 500,000 subdomains.
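The labeling rule above can be sketched in a few lines; `google_top10` here is a hypothetical stand-in for the de-personalized, geo-agnostic search call, not a real API:

```python
def label_subdomain(subdomain, google_top10):
    """Label a subdomain 'spam' or 'ham' using the rule described above:
    search for the full subdomain name and check whether any of the
    top 10 results actually comes from that subdomain."""
    results = google_top10(subdomain)
    if any(subdomain in url for url in results):
        return "ham"
    return "spam"

# Toy stand-in for the search call:
fake_results = {
    "goodsite.example.com": ["https://goodsite.example.com/about"],
    "spamsite.example.com": ["https://unrelated.example.org/"],
}
print(label_subdomain("goodsite.example.com", lambda q: fake_results[q]))  # prints "ham"
print(label_subdomain("spamsite.example.com", lambda q: fake_results[q]))  # prints "spam"
```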

Relationship between number of flags and spam

The overall Spam Score is currently an aggregate of 17 different “flags.” You can think of each flag as a potential “warning sign” that signals that a site may be spammy. The overall likelihood of spam increases as a site accumulates more and more flags, so the total number of flags is a strong predictor of spam. Accordingly, the flags are designed to be used together—no single flag, or even a few flags, is cause for concern (and indeed most sites will trigger at least a few flags).

The following table shows the relationship between the number of flags and the percentage of sites with that many flags that we found Google had penalized or banned:

ABOVE: The overall probability of spam vs. the number of spam flags. Data collected in Nov. 2014 for approximately 500K subdomains. The table also highlights the three overall danger levels: low/green (<10%), moderate/yellow (10-50%), and high/red (>50%).

The overall spam percentage, averaged across a large number of sites, increases in lockstep with the number of flags; however, there are outliers in every category. For example, a small number of sites with very few flags are tagged as spam by Google, and conversely, a small number of sites with many flags are not spam.
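The flag-count-to-spam-rate relationship in the table is a simple aggregation over the labeled data; here’s a sketch with toy records (not Moz’s actual numbers), bucketed into the three danger levels:

```python
from collections import defaultdict

def spam_rate_by_flags(records):
    """records: iterable of (flag_count, is_spam) pairs.
    Returns {flag_count: fraction of sites with that many flags
    that were penalized/banned}."""
    totals = defaultdict(int)
    spam = defaultdict(int)
    for flags, is_spam in records:
        totals[flags] += 1
        spam[flags] += is_spam
    return {f: spam[f] / totals[f] for f in sorted(totals)}

def danger_level(rate):
    """Bucket a spam rate into the three levels shown in the table."""
    if rate < 0.10:
        return "low/green"
    if rate <= 0.50:
        return "moderate/yellow"
    return "high/red"

# Toy labeled data: (flag count, penalized?)
toy = [(2, 0), (2, 0), (2, 0), (2, 1), (9, 1), (9, 1), (9, 0)]
rates = spam_rate_by_flags(toy)
print({f: danger_level(r) for f, r in rates.items()})
```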

Spam flag details

The individual spam flags capture a wide range of spam signals: link profiles, anchor text, on-page signals, and properties of the domain name. At a high level, the process to determine the spam flags for each subdomain is:

  • Collect link metrics from Mozscape (mozRank, mozTrust, number of linking domains, etc.).
  • Collect anchor text metrics from Mozscape (top anchor text phrases sorted by number of links).
  • Collect the top five pages by Page Authority on the subdomain from Mozscape.
  • Crawl the top five pages plus the home page and process them to extract on-page signals.
  • Provide the output for Mozscape to include in the next index release cycle.

Since the spam flags are incorporated into the Mozscape index, fresh data is released with each new index. Right now, we crawl and process the spam flags for each subdomain every two to three months, although this may change in the future.

Link flags

The following table lists the link and anchor text related flags along with the odds ratio for each flag. For each flag, we can compute two percentages: the percent of sites with that flag that are penalized by Google, and the percent of sites without that flag that are penalized. The odds ratio is the ratio of these two percentages and gives the increase in likelihood that a site is spam if it has the flag. For example, the first row says that a site with that flag is 12.4 times more likely to be spam than one without the flag.

ABOVE: Description and odds ratio of the link and anchor text related spam flags. The odds ratio gives the overall increase in spam likelihood if the flag is present.
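The ratio arithmetic is easy to check with made-up counts (these are illustrative, not the counts behind the real table):

```python
def flag_odds_ratio(spam_with_flag, total_with_flag,
                    spam_without_flag, total_without_flag):
    """Ratio described above: percent of flagged sites penalized,
    divided by percent of unflagged sites penalized."""
    p_flag = spam_with_flag / total_with_flag
    p_no_flag = spam_without_flag / total_without_flag
    return p_flag / p_no_flag

# With made-up counts: 62 of 100 flagged sites penalized,
# vs. 5 of 100 unflagged sites penalized:
print(flag_odds_ratio(62, 100, 5, 100))  # ≈ 12.4
```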

Working down the table, the flags are:

  • Low mozTrust to mozRank ratio: Sites with low mozTrust compared to mozRank are likely to be spam.
  • Large site with few links: Large sites with many pages tend to also have many links, and large sites without a correspondingly large number of links are likely to be spam.
  • Site link diversity is low: If a large percentage of links to a site are from a few domains, it is likely to be spam.
  • Ratio of followed to nofollowed subdomains/domains (two separate flags): Sites with a large number of followed links relative to nofollowed links are likely to be spam.
  • Small proportion of branded links (anchor text): Organically occurring links tend to contain a disproportionate amount of branded keywords. If a site does not have a lot of branded anchor text, it’s a signal that the links are not organic.
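Two of the flags above can be sketched as toy heuristics; the thresholds here are illustrative assumptions, not Moz’s actual cutoffs:

```python
def link_flags(links, diversity_threshold=0.5, followed_threshold=0.9):
    """Toy versions of two of the flags above. `links` is a list of
    (source_domain, is_followed) pairs; thresholds are illustrative."""
    flags = set()
    domains = {d for d, _ in links}
    # Site link diversity is low: few unique domains relative to link count
    if links and len(domains) / len(links) < diversity_threshold:
        flags.add("low_link_diversity")
    # Ratio of followed to nofollowed: almost all links are followed
    followed = sum(1 for _, f in links if f)
    if links and followed / len(links) > followed_threshold:
        flags.add("high_followed_ratio")
    return flags

# 10 followed links from only 2 domains fires both toy flags:
spammy = [("linkfarm.example", True)] * 8 + [("other.example", True)] * 2
print(sorted(link_flags(spammy)))
```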

On-page flags

Similar to the link flags, the following table lists the on-page and domain name related flags:

ABOVE: Description and odds ratio of the on-page and domain name related spam flags. The odds ratio gives the overall increase in spam likelihood if the flag is present.

  • Thin content: If a site has a relatively small ratio of content to navigation chrome, it’s likely to be spam.
  • Site mark-up is abnormally small: Non-spam sites tend to invest in rich user experiences with CSS, JavaScript, and extensive mark-up. Accordingly, a large ratio of text to mark-up is a spam signal.
  • Large number of external links: A site with a large number of external links may look spammy.
  • Low number of internal links: Real sites tend to link heavily to themselves via internal navigation, and a relative lack of internal links is a spam signal.
  • Anchor text-heavy page: Sites with a lot of anchor text are more likely to be spam than those with more content and fewer links.
  • External links in navigation: Spam sites may hide external links in the sidebar or footer.
  • No contact info: Real sites prominently display their social and other contact information.
  • Low number of pages found: A site with only one or a few pages is more likely to be spam than one with many pages.
  • TLD correlated with spam domains: Certain TLDs are more spammy than others (e.g., .pw).
  • Domain name length: A long subdomain name like “bycheapviagra.freeshipping.onlinepharmacy.com” may indicate keyword stuffing.
  • Domain name contains numerals: Domain names with numerals may be automatically generated and therefore spam.
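A few of these on-page and domain-name flags can likewise be sketched as toy heuristics; the threshold and the TLD list are illustrative assumptions, not Moz’s actual values:

```python
import re

def onpage_flags(html, domain, markup_ratio_threshold=0.95,
                 spammy_tlds=(".pw",)):
    """Toy versions of a few of the flags above; the threshold and
    TLD list are illustrative stand-ins."""
    flags = set()
    text = re.sub(r"<[^>]+>", "", html)  # strip tags, keep visible text
    # Site mark-up is abnormally small: text dominates the raw HTML
    if html and len(text) / len(html) > markup_ratio_threshold:
        flags.add("markup_abnormally_small")
    # Domain name contains numerals
    if re.search(r"\d", domain):
        flags.add("domain_contains_numerals")
    # TLD correlated with spam domains
    if domain.endswith(spammy_tlds):
        flags.add("spam_correlated_tld")
    return flags

print(sorted(onpage_flags("<p>" + "cheap pills " * 200 + "</p>",
                          "buy4less.example.pw")))
```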

If you’d like some more details on the technical aspects of the Spam Score, check out the video of Matt’s 2012 MozCon talk about Algorithmic Spam Detection or the slides (many of the details have evolved, but the overall ideas are the same):

We’d love your feedback

As with all metrics, Spam Score won’t be perfect. We’d love to hear your feedback and ideas for improving the score, as well as what you’d like to see from its in-product application in the future. Feel free to leave comments on this post, or to email Matt (matt at moz dot com) and me (rand at moz dot com) privately with any suggestions.

Good luck cleaning up and preventing link spam!



Not a Pro Subscriber? No problem!



Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don’t have time to hunt down but want to read!

Reblogged 4 years ago from tracking.feedpress.it

How We Fixed the Internet (Ok, an Answer Box)

Posted by Dr-Pete

Last year, Google expanded the Knowledge Graph to use data extracted (*cough* scraped) from the index to create answer boxes. Back in October, I wrote about a failed experiment. One of my posts, an odd dive into Google’s revenue, was being answer-fied for the query “How much does Google make?”:

Objectively speaking, even I could concede that this wasn’t a very good answer in 2014. I posted it on Twitter, and David Iwanow asked the inevitable question:

Enthusiasm may have gotten the best of us, a few more people got involved (like my former Moz colleague Ruth Burr Reedy), and suddenly we were going to fix this once and for all:

There Was Just One Problem

I updated the post, carefully rewriting the first paragraph to reflect the new reality of Google’s revenue. I did my best to make the change user-friendly, adding valuable information but not disrupting the original post. I did, however, completely replace the old text that Google was scraping.

In less than a day, Google had re-cached the content, and I just had to wait to see the new answer box. So, I waited, and waited… and waited. Two months later, still no change. Some days, the SERP showed no answer box at all (although I’ve since found these answer boxes are very dynamic), and I was starting to wonder if it was all a mistake.

Then, Something Happened

Last week, months after I had given up, I went to double-check this query for entirely different reasons, and I saw the following:

Google had finally updated the answer box with the new text, and they had even pulled an image from the post. It was a strange choice of images, but in fairness, it was a strange post.

Interestingly, Google also added the publication date of the post, perhaps recognizing that outdated answers aren’t always useful. Unfortunately, this doesn’t reflect the timing of the new content, but that’s understandable – Google doesn’t have easy access to that data.

It’s interesting to note that sometimes Google shows the image, and sometimes they don’t. This seems to be independent of whether the SERP is personalized or incognito. Here’s a capture of the image-free version, along with the #1 organic ranking:

You’ll notice that the #1 result is also my Moz post, and that result has an expanded meta description. So, the same URL is essentially double-dipping on this SERP. This isn’t always the case – answers can be extracted from URLs that appear lower on page 1 (although almost always page 1, in my experience). Anecdotally, it’s also not always the case that the organic result ends up getting an expanded meta description.

However, it definitely seems that some of the quality signals driving organic ranking and expanded meta descriptions are also helping Google determine whether a query deserves a direct answer. Put simply, it’s not an accident that this post was chosen to answer this question.

What Does This Mean for You?

Let’s start with the obvious – yes, the v2 answer boxes (driven by the index, not Freebase/WikiData) can be updated. However, the update cycle is independent of the index’s refresh cycle. In other words, just because a post is re-cached doesn’t mean the answer box will update. Presumably, Google is creating a second Knowledge Graph based on the index, and this data is only periodically updated.

It’s also entirely possible that updating could cause you to lose an answer box, if the new content weren’t a strong match for the question or if its quality came into doubt. Here’s an interesting question – on a query where a competitor has an answer box, could you change your own content enough to either replace them or knock out the answer box altogether? We’re currently testing this, but it may be a few more months before we have any answers.

Another question is what triggers this style of answer box in the first place. Eric Enge has an in-depth look at 850,000 queries that’s well worth your time, and in many cases Google is still triggering on obvious question words (“how,” “what,” “where,” etc.). Ambiguous nouns can also trigger the new answer boxes. For example, a search for “ruby” is interpreted by Google as roughly meaning “What is Ruby?”:

This answer box also triggers “Related topics” that use content pulled from other sites but drive users to more Google searches. The small, gray links are the source sites. The much more visible, blue links are more Google searches.

Note that these also have to be questions (explicit or implied) that Google can’t answer with their curated Knowledge Graph (based on sources like Freebase and WikiData). So, for example, the question “When is Mother’s Day?” triggers an older-style answer:

Sites offering this data aren’t going to have a chance to get attribution, because Google essentially already owns the answer to this question as part of their core Knowledge Graph.

Do You Want to Be An Answer?

This is where things get tricky. At this point, we have no clear data on how these answer boxes impact CTR, and it’s likely that the impact depends a great deal on the context. I think we’re facing a certain degree of inevitability – if Google is going to list an answer, better that it’s your answer than someone else’s, IMO. On the other hand, what if that answer is so complete that it renders your URL irrelevant? Consider, for example, the SERP for “how to make grilled cheese”:

Sorry, Food Network, but making a grilled cheese sandwich isn’t really that hard, and this answer box doesn’t leave much to the imagination. As these answers get more and more thorough, expect CTRs to fall.

For now, I’d argue that it’s better to have your link in the box than someone else’s, but that’s cold comfort in many cases. These new answer boxes represent what I feel is a dramatic shift in the relationship between Google and webmasters, and they may be tipping the balance. For now, we can’t do much but wait, see, and experiment.

