Big Data, Big Problems: 4 Major Link Indexes Compared

Posted by russangular

Given this blog’s readership, chances are good you will spend some time this week looking at backlinks in one of the growing number of link data tools. We know backlinks continue to be one of the most important parts of Google’s ranking algorithm, if not the most important. We tend to take these link data sets at face value, though, in part because they are all we have. But when your rankings are on the line, is there a better way to determine which data set is the best? How should we go about assessing the quality of link indexes like Moz, Majestic, Ahrefs and SEMrush? Historically, there have been four common approaches to this question of index quality…

  • Breadth: We might choose to look at the number of linking root domains any given service reports. We know
    that the number of referring domains correlates strongly with search rankings, so it makes sense to judge a link index by how many unique domains it has
    discovered and indexed.
  • Depth: We also might choose to look at how deep the web has been crawled, looking more at the total number of URLs
    in the index, rather than the diversity of referring domains.
  • Link Overlap: A more sophisticated approach might count the number of links an index has in common with Google Webmaster
    Tools.
  • Freshness: Finally, we might choose to look at the freshness of the index. What percentage of links in the index are
    still live?
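
As a rough illustration, the four measures above can be computed from plain sets of links. This is only a sketch on made-up data: the function name, the (source URL, target URL) representation, and the use of the URL host as a stand-in for the root domain are all assumptions, not how any of the tools or studies actually work.

```python
from urllib.parse import urlparse


def host(url):
    """Host part of a URL, used here as a crude stand-in for the root domain."""
    return urlparse(url).netloc


def index_measures(index_links, reference_links, live_links):
    """Breadth, depth, overlap and freshness for one hypothetical link index.

    Each argument is an iterable of (source_url, target_url) pairs.
    """
    index_links = set(index_links)
    sources = {src for src, _ in index_links}
    still_live = index_links & set(live_links)
    return {
        # Breadth: number of unique linking hosts.
        "breadth": len({host(src) for src in sources}),
        # Depth: number of unique linking URLs (a proxy for URLs crawled).
        "depth": len(sources),
        # Link overlap: links also present in the reference set (e.g. GWT).
        "overlap": len(index_links & set(reference_links)),
        # Freshness: share of indexed links that are still live.
        "freshness": len(still_live) / len(index_links) if index_links else 0.0,
    }


# Made-up example data.
index = [
    ("http://a.com/1", "http://x.com/"),
    ("http://a.com/2", "http://x.com/"),
    ("http://b.com/1", "http://x.com/"),
]
reference = [("http://a.com/1", "http://x.com/"), ("http://c.com/1", "http://x.com/")]
live = [("http://a.com/1", "http://x.com/"), ("http://b.com/1", "http://x.com/")]
measures = index_measures(index, reference, live)
```

Note that none of these four numbers says anything about whether the index is proportional to the reference set, which is the gap discussed next.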

There are a number of really good studies (some newer than others) using these techniques that are worth checking out when you get a chance:

  • BuiltVisible analysis of Moz, Majestic, GWT, Ahrefs and Search Metrics
  • SEOBook comparison of Moz, Majestic, Ahrefs, and Ayima
  • MatthewWoodward study of Ahrefs, Majestic, Moz, Raven and SEO Spyglass
  • Marketing Signals analysis of Moz, Majestic, Ahrefs, and GWT
  • RankAbove comparison of Moz, Majestic, Ahrefs and Link Research Tools
  • StoneTemple study of Moz and Majestic

While these are all excellent at addressing the methodologies above, there is a particular limitation with all of them. They miss one of the
most important metrics we need to determine the value of a link index: proportional representation to Google’s link graph.
So here at Angular Marketing, we decided to take a closer look.

Proportional representation to Google Search Console data

So, why is it important to determine proportional representation? Many of the most important and valued metrics we use are built on proportional
models. PageRank, MozRank, CitationFlow and Ahrefs Rank are proportional in nature. The score of any one URL in the data set is relative to the
other URLs in the data set. If the data set is biased, the results are biased.

A Visualization

Link graphs are biased by their crawl prioritization. Because there is no full representation of the Internet, every link graph, even Google’s,
is a biased sample of the web. Imagine for a second that the picture below is of the web. Each dot represents a page on the Internet,
and the dots surrounded by green represent a fictitious index by Google of certain sections of the web.

Of course, Google isn’t the only organization that crawls the web. Other organizations like Moz,
Majestic, Ahrefs, and SEMrush
have their own crawl prioritizations which result in different link indexes.

In the example above, you can see different link providers trying to index the web like Google. Link data provider 1 (purple) does a good job
of building a model that is similar to Google. It isn’t very big, but it is proportional. Link data provider 2 (blue) has a much larger index,
and likely has more links in common with Google than link data provider 1, but it is highly disproportional. So, how would we go about measuring
this proportionality? And which data set is the most proportional to Google?

Methodology

The first step is to determine a measurement of relativity for analysis. Google doesn’t give us very much information about their link graph.
All we have is what is in Google Search Console. The best source we can use is referring domain counts. In particular, we want to look at
what we call referring domain link pairs. A referring domain link pair would be something like ask.com->mlb.com: 9,444, which means
that ask.com links to mlb.com 9,444 times.

Steps

  1. Determine the root linking domain pairs and values to 100+ sites in Google Search Console
  2. Determine the same for Ahrefs, Moz, Majestic Fresh, Majestic Historic, SEMrush
  3. Compare the referring domain link pairs of each data set to Google, assuming a Poisson distribution
  4. Run simulations of each data set’s performance against each other (i.e., Moz vs Majestic, Ahrefs vs SEMrush, Moz vs SEMrush, et al.)
  5. Analyze the results
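
The post does not spell out the exact statistical procedure, so the following is only one plausible way to sketch step 3, not the method actually used in the study. It assumes each provider's pair counts are rescaled to Google's total and then used as the Poisson mean for Google's observed count. The function names, the `floor` smoothing value, and all counts other than the ask.com->mlb.com example are made up for illustration.

```python
import math


def poisson_log_pmf(k, lam):
    """Log of the Poisson probability of observing count k given mean lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def proportionality_score(google_pairs, provider_pairs, floor=0.5):
    """Log-likelihood of Google's pair counts under a provider-derived model.

    Both arguments map (referring_domain, target_domain) -> link count.
    Provider counts are rescaled so both data sets have the same total, and
    the rescaled value becomes the Poisson mean for Google's count.
    `floor` is an arbitrary smoothing value for pairs the provider is missing.
    Assumes provider_pairs has a non-zero total.
    """
    scale = sum(google_pairs.values()) / sum(provider_pairs.values())
    total = 0.0
    for pair, google_count in google_pairs.items():
        lam = max(provider_pairs.get(pair, 0) * scale, floor)
        total += poisson_log_pmf(google_count, lam)
    return total


# Hypothetical counts; only the ask.com pair comes from the example above.
google = {
    ("ask.com", "mlb.com"): 9444,
    ("espn.com", "mlb.com"): 1200,
    ("blog.example", "mlb.com"): 30,
}
small_but_proportional = {
    ("ask.com", "mlb.com"): 944,
    ("espn.com", "mlb.com"): 120,
    ("blog.example", "mlb.com"): 3,
}
large_but_skewed = {
    ("ask.com", "mlb.com"): 20000,
    ("espn.com", "mlb.com"): 30000,
    ("blog.example", "mlb.com"): 5000,
}

score_small = proportionality_score(google, small_but_proportional)
score_large = proportionality_score(google, large_but_skewed)
```

A higher (less negative) score means the provider's counts, once rescaled, look more like Google's. In this toy data the small index scores higher than the large one because its counts keep the same ratios as Google's.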

Results

When placed head-to-head, there seem to be some clear winners at first glance. In head-to-head, Moz edges out Ahrefs, but across the board, Moz and Ahrefs fare quite evenly. Moz, Ahrefs and SEMrush seem to be far better than Majestic Fresh and Majestic Historic. Is that really the case? And why?

It turns out there is an inversely proportional relationship between index size and proportional relevancy. This might seem counterintuitive:
shouldn’t the bigger indexes be closer to Google? Not exactly.

What does this mean?

Each organization has to create a crawl prioritization strategy. When you discover millions of links, you have to prioritize which ones you
might crawl next. Google has a crawl prioritization, and so do Moz, Majestic, Ahrefs and SEMrush. There are lots of different things you might
choose to prioritize…

  • You might prioritize link discovery. If you want to build a very large index, you could prioritize crawling pages on sites that
    have historically provided new links.
  • You might prioritize content uniqueness. If you want to build a search engine, you might prioritize finding pages that are unlike
    any you have seen before. You could choose to crawl domains that historically provide unique data and little duplicate content.
  • You might prioritize content freshness. If you want to keep your search engine recent, you might prioritize crawling pages that
    change frequently.
  • You might prioritize content value, crawling the most important URLs first based on the number of inbound links to that page.
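
To make the idea concrete, here is a toy sketch of how the same set of discovered pages gets visited in a different order under different priorities. The signal names and numbers are invented, and a real crawler would keep adding newly discovered URLs to the queue as it goes. This only ranks a fixed set.

```python
import heapq


def crawl_order(pages, priority, limit):
    """Return the first `limit` URLs visited under a given priority function.

    `pages` maps URL -> dict of hypothetical signals; `priority` scores a
    page's signals, and higher scores are crawled first.
    """
    # heapq is a min-heap, so negate the score to pop the highest first.
    heap = [(-priority(signals), url) for url, signals in pages.items()]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(min(limit, len(heap)))]


# Made-up signals for three discovered pages.
pages = {
    "a.example/": {"new_links": 50, "inbound": 5, "changes_per_week": 1},
    "b.example/": {"new_links": 5, "inbound": 90, "changes_per_week": 2},
    "c.example/": {"new_links": 10, "inbound": 20, "changes_per_week": 30},
}

by_discovery = crawl_order(pages, lambda s: s["new_links"], limit=2)
by_value = crawl_order(pages, lambda s: s["inbound"], limit=2)
by_freshness = crawl_order(pages, lambda s: s["changes_per_week"], limit=2)
```

With a budget of two pages, each strategy ends up with a different pair of URLs, even though all three started from the same discovered set.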

Chances are, an organization’s crawl priority will blend some of these features, but it’s difficult to design one exactly like Google. Imagine
for a moment that instead of crawling the web, you want to climb a tree. You have to come up with a tree climbing strategy.

  • You decide to climb the longest branch you see at each intersection.
  • One friend of yours decides to climb the first new branch he reaches, regardless of how long it is.
  • Your other friend decides to climb the first new branch she reaches only if she sees another branch coming off of it.

Despite having different climb strategies, everyone chooses the same first branch, and everyone chooses the same second branch. There are only
so many different options early on.

But as the climbers go further and further along, their choices eventually produce differing results. This is exactly the same for web crawlers
like Google, Moz, Majestic, Ahrefs and SEMrush. The bigger the crawl, the more the crawl prioritization will cause disparities. This is not a
deficiency; this is just the nature of the beast. However, we aren’t completely lost. Once we know how index size is related to disparity, we
can make some inferences about how similar a crawl priority may be to Google.

Unfortunately, we have to be careful in our conclusions. We only have a few data points with which to work, so it is very difficult to be
certain regarding this part of the analysis. In particular, it seems strange that Majestic would get better relative to its index size as it grows,
unless Google holds on to old data (which might be an important discovery in and of itself). It is most likely that at this point we can’t make
this level of conclusion.

So what do we do?

Let’s say you have a list of domains or URLs for which you would like to know their relative values. Your process might look something like
this…

  • Check Open Site Explorer to see if all URLs are in their index. If so, you are looking at the metrics most likely to be proportional to Google’s link graph.
  • If any of the links do not occur in the index, move to Ahrefs and use their Ahrefs ranking if all you need is a single PageRank-like metric.
  • If any of the links are missing from Ahrefs’s index, or you need something related to trust, move on to Majestic Fresh.
  • Finally, use Majestic Historic for (by leaps and bounds) the largest coverage available.
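
The process above boils down to "use the first index, in order of proportionality, that covers every URL you care about." A minimal sketch of that fallback logic follows. The `pick_index` function and the URL sets are hypothetical, and it ignores the "need something related to trust" branch.

```python
def pick_index(urls, indexes):
    """Return the name of the first index that contains every URL.

    `indexes` is an ordered list of (name, set_of_known_urls), most
    proportional first. Falls back to the last (largest) index.
    """
    for name, known in indexes:
        if all(url in known for url in urls):
            return name
    return indexes[-1][0]


# Made-up coverage: each later index knows about more URLs.
indexes = [
    ("Moz", {"a", "b"}),
    ("Ahrefs", {"a", "b", "c"}),
    ("Majestic Fresh", {"a", "b", "c", "d"}),
    ("Majestic Historic", {"a", "b", "c", "d", "e"}),
]

choice = pick_index(["a", "c"], indexes)
```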

It is important to point out that the likelihood that all the URLs you want to check are in a single index increases as the accuracy of the metric decreases. Considering the size of Majestic’s data, you can’t ignore them because you are less likely to get null value answers from their data than from the others. If anything rings true, it is that once again it makes sense to get data from as many sources as possible. You won’t get the most proportional data without Moz, the broadest data without Majestic, or everything in-between without Ahrefs.

What about SEMrush? They are making progress, but they don’t publish any relative statistics that would be useful in this particular
case. Maybe we can hope to see more from them soon given their already promising index!

Recommendations for the link graphing industry

All we hear about these days is big data; we almost never hear about good data. I know that the teams at Moz, Majestic, Ahrefs, SEMrush and others are interested in mimicking Google, but I would love to see some organization stand up against the allure of more data in favor of better data: data more like Google’s. It could begin with testing various crawl strategies to see if they produce a result more similar to that of data shared in Google Search Console. Having the most Google-like data is certainly a crown worth winning.

Credits

Thanks to Diana Carter at Angular for assistance with data acquisition and Andrew Cron with statistical analysis. Thanks also to the representatives from Moz, Majestic, Ahrefs, and SEMrush for answering questions about their indices.

Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don’t have time to hunt down but want to read!


Calculating Estimated ROI for a Specific Site & Body of Keywords

Posted by shannonskinner

One of the biggest challenges for SEO is proving its worth. We all know it’s valuable, but it’s important to convey its value in terms that key stakeholders (up to and including CEOs) understand. To do that, I put together a process to calculate an estimate of ROI for implementing changes to keyword targeting.

In this post, I will walk through that process, so hopefully you can do the same for your clients (or as an in-house SEO to get buy-in), too!

Overview

  1. Gather your data
    1. Keyword Data
    2. Strength of your Preferred URLs
    3. Competition URLs by Keyword
    4. Strength of Competition URLs
  2. Analyze the Data by Keyword
  3. Calculate your potential opportunity

What you need

There are quite a few parts to this recipe, and while the calculation part is pretty easy, gathering the data to throw in the mix is the challenging part. I’ll list each section here, including the components of each, and then we can go through how to retrieve each of them. 

  • Keyword data

    • list of keywords
    • search volumes for each keyword
    • preferred URLs on the site for which you’re estimating ROI
    • current rank
    • current ranking URL
  • Strength of your preferred URLs

    • De-duplicated list of preferred URLs
    • Page Authorities for each preferred URL
    • BONUS: External & Internal Links for each URL. You can include any measure you like here, as long as it’s something that can be compared (i.e. a number).
  • Where the competition sits

    • For each keyword, the sites that are ranking 1-10 in search currently
  • Strength of the competition

    • De-duplicated list of competing URLs
    • Page Authorities and Domain Authorities for each competing URL
    • BONUS: External & Internal Links for each competing URL. Include any measure you’ve included on the Strength of Your Preferred URLs list.


How to get what you need


There has been quite a lot written about keyword research, so I won’t go into too much detail here. For the Keyword data list, the important thing is to get whatever keywords you’d like to assess into a spreadsheet, and include all the information listed above. You’ll have to select the preferred URLs based on what you think the strongest-competing and most appropriate URL would be for each keyword. 


For the Preferred URLs list, you’ll want to use the data that’s in your keyword data under the preferred URL.

  1. Copy the preferred URL data from your Keyword Data into a new tab. 
  2. Use the Remove Duplicates tool (Data>Data Tools in Excel) to remove any duplicated URLs
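
If you would rather script this step than do it in Excel, de-duplicating while keeping the original order is a one-liner in Python. This is just an alternative sketch, and the URLs are placeholders.

```python
def dedupe(urls):
    """Remove duplicate URLs while keeping the first occurrence of each."""
    # dict keys are unique and preserve insertion order.
    return list(dict.fromkeys(urls))


# Placeholder preferred URLs, one per keyword row.
preferred = [
    "https://example.com/widgets",
    "https://example.com/gadgets",
    "https://example.com/widgets",
]
unique_preferred = dedupe(preferred)
```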

Once you have the list of de-duplicated preferred URLs, you’ll need to pull the data from Open Site Explorer for these URLs. I prefer using the Moz API with SEOTools. You’ll have to install it to use it for Excel, or if you’d like to take a stab at using it in Google Docs, there are some resources available for that. Unfortunately, with the most recent update to Google Spreadsheets, I’ve had some difficulty with this method, so I’ve gone with Excel for now. 

Once you’ve got SEOTools installed, you can make the call “=MOZ_URLMetrics_toFit([enter your cells])”. This should give you a list of URL titles, canonical URLs, External & Internal links, as well as a few other metrics and DA/PA. 


For the Where the competition sits list, you’ll first need to perform a search for each of your keywords. You could do this manually, or, if you have exportable data from a keyword ranking tool and you’ve been tracking the keywords you’d like to look at, you could export the results from there. If neither option works for you, you can use the hacky method that I did: use the ImportXML command in Google Spreadsheets to grab the top ranking URLs for each query.

I’ve put a sample version of this together, which you can access here. A few caveats: you should be able to run many searches in a row (I had about 850 for my data, and they ran fine), but Google will block your IP address if you run too many. I also found that the results timed out relatively quickly, so I needed to copy them out as values into a different spreadsheet once I’d gotten them. You can just paste them into the Excel spreadsheet you’re building to make the ROI calculations (you’ll need them there anyway!).


From this list, you can pull each URL into a single list and de-duplicate it as explained for the preferred URLs list to generate the Strength of the Competition list. Then run the same SEOTools for Excel analysis on these URLs that you ran on the preferred URLs.


Making your data work for you

Once you’ve got these lists, you can use some VLOOKUP magic to pull in the information you need. I used the Where the competition sits list as the foundation of my work.

From there, I pulled in the corresponding preferred URL and its Page Authority, as well as the PAs and DAs for each URL currently ranking 1-10. I then was able to calculate an average PA & DA for each query, and could compare the page I want to rank to this. I estimated the chances that the page I wanted to rank (given that I’d already determined these were relevant pages) could rank with better keyword targeting.
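
For a single keyword, the comparison described above amounts to averaging the PA and DA of the current top 10 and seeing how the preferred page stacks up. Here is a small sketch of that arithmetic. The function name and the scores are made up, and how big a gap is acceptable is a judgment call this code does not make for you.

```python
from statistics import mean


def compare_to_competition(preferred_pa, competitors):
    """Average PA/DA of ranking URLs and the preferred page's PA gap.

    `competitors` is a list of (page_authority, domain_authority) tuples
    for the URLs currently ranking for one keyword.
    """
    avg_pa = mean(pa for pa, _ in competitors)
    avg_da = mean(da for _, da in competitors)
    return {
        "avg_pa": avg_pa,
        "avg_da": avg_da,
        # Positive gap: the preferred page is stronger than the average.
        "pa_gap": preferred_pa - avg_pa,
    }


# Hypothetical scores for three ranking URLs and one preferred page.
result = compare_to_competition(45, [(40, 60), (50, 70), (30, 50)])
```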

Here’s where things get interesting. You can be rather conservative, and only sum search volumes of keywords you’re fairly confident your site can rank for, which is my preferred method. That’s because I use this method primarily to determine if I’m on the right track: whether making these recommendations is really worth the time it takes to get them implemented. So I’m going to move forward assuming I’m counting only the search volumes of terms I think I’m quite competitive for, AND that I’m not yet ranking for on page 1.


Now, you want to move to your analytics data in order to calculate a few things: 

  • Conversion Rate
  • Average order value
  • Previous year’s revenue (for the section you’re looking at)

I’ve set up my sample data in this spreadsheet that you can refer to or use to make your own calculations. 

Each of the assumptions can be adjusted depending on the actual site data, or using estimates. I’m using very generic overall CTR estimates, but you can select which you’d like and get as granular as you want! The main point for me is really getting to two numbers that I can stand by as pretty good estimates:

  • Annual Impact (Revenue $$)
  • Increase in Revenue ($$) from last year
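
The spreadsheet itself isn’t reproduced here, so the following is only a sketch of the kind of arithmetic that gets you to those two numbers. It assumes search volumes are monthly, that one blended CTR applies to all the keywords you’ve summed, and that the increase is expressed relative to last year’s revenue. All the input values are made up.

```python
def estimate_annual_impact(monthly_search_volume, ctr, conversion_rate,
                           average_order_value):
    """Estimated yearly revenue from the summed keyword search volume."""
    monthly_visits = monthly_search_volume * ctr
    monthly_orders = monthly_visits * conversion_rate
    return monthly_orders * average_order_value * 12


def increase_over_last_year(annual_impact, previous_year_revenue):
    """Annual impact as a fraction of the previous year's revenue."""
    return annual_impact / previous_year_revenue


# Hypothetical inputs: 10,000 searches/month, 10% CTR, 2% conversion
# rate, $50 average order value, $100,000 revenue last year.
impact = estimate_annual_impact(10_000, 0.10, 0.02, 50.0)
increase = increase_over_last_year(impact, 100_000.0)
```

With these inputs the estimate works out to roughly $12,000 a year, or about a 12% increase over last year. Swap in your own CTR, conversion rate, and order value to see how sensitive the result is to each assumption.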

This is because, for higher-up folks, money talks. Obviously, this won’t be something you can promise, but it gives them a metric that they understand to really wrap their head around the value that you’re potentially bringing to the table if the changes you’re recommending can be made.

There are some great tools for estimating this kind of stuff on a smaller scale, but for a massive body of keyword data, hopefully you will find this process useful as well. Let me know what you think, and I’d love to see what parts anyone else can streamline or make even more efficient. 
