Optimizing AngularJS Single-Page Applications for Googlebot Crawlers

Posted by jrridley

It’s almost certain that you’ve encountered AngularJS on the web somewhere, even if you weren’t aware of it at the time. Here’s a list of just a few sites using Angular:

  • Upwork.com
  • Freelancer.com
  • Udemy.com
  • Youtube.com

Any of those look familiar? If so, it’s because AngularJS is taking over the Internet. There’s a good reason for that: Angular, React, and similar frameworks make for a better user and developer experience on a site. For background, AngularJS and ReactJS are part of a web design movement called single-page applications, or SPAs. A traditional website loads each individual page as the user navigates the site, which involves calls to the server and cache, loading resources, and rendering the page. SPAs cut out much of that back-end activity by loading the entire site when a user first lands on a page. Instead of loading a new page each time you click on a link, the site dynamically updates a single HTML page as the user interacts with it.

Image c/o Microsoft

Why is this movement taking over the Internet? With SPAs, users are treated to a screaming fast site through which they can navigate almost instantaneously, while developers have a template that allows them to customize, test, and optimize pages seamlessly and efficiently. AngularJS and ReactJS use advanced JavaScript templates to render the site, which means the HTML/CSS page speed overhead is almost nothing. All site activity runs behind the scenes, out of view of the user.

Unfortunately, anyone who’s tried performing SEO on an Angular or React site knows that the site activity is hidden from more than just site visitors: it’s also hidden from web crawlers. Crawlers like Googlebot rely heavily on HTML/CSS data to render and interpret the content on a site. When that HTML content is hidden behind website scripts, crawlers have no website content to index and serve in search results.

Of course, Google claims they can crawl JavaScript (and SEOs have tested and supported this claim), but even if that is true, Googlebot still struggles to crawl sites built on a SPA framework. One of the first issues we encountered when a client approached us with an Angular site was that nothing beyond the homepage was appearing in the SERPs. Screaming Frog crawls uncovered the homepage and a handful of other JavaScript resources, and that was it.

Another common issue is recording Google Analytics data. Think about it: Analytics data is tracked by recording pageviews every time a user navigates to a page. How can you track site analytics when there’s no HTML response to trigger a pageview?

After working with several clients on their SPA websites, we’ve developed a process for performing SEO on those sites. By using this process, we’ve not only enabled SPA sites to be indexed by search engines, but even to rank on the first page for keywords.

5-step solution to SEO for AngularJS

  1. Make a list of all pages on the site
  2. Install Prerender
  3. “Fetch as Google”
  4. Configure Analytics
  5. Recrawl the site

1) Make a list of all pages on your site

If this sounds like a long and tedious process, that’s because it definitely can be. For some sites, this will be as easy as exporting the XML sitemap for the site. For other sites, especially those with hundreds or thousands of pages, creating a comprehensive list of all the pages on the site can take hours or days. However, I cannot emphasize enough how helpful this step has been for us. Having an index of all pages on the site gives you a guide to reference and consult as you work on getting your site indexed. It’s almost impossible to predict every issue that you’re going to encounter with an SPA, and if you don’t have an all-inclusive list of content to reference throughout your SEO optimization, it’s highly likely you’ll leave some part of the site un-indexed by search engines inadvertently.

One solution that might enable you to streamline this process is to divide content into directories instead of individual pages. For example, if you know that you have a list of storeroom pages, include your /storeroom/ directory and make a note of how many pages that includes. Or if you have an e-commerce site, make a note of how many products you have in each shopping category and compile your list that way (though if you have an e-commerce site, I hope for your own sake you have a master list of products somewhere). Regardless of what you do to make this step less time-consuming, make sure you have a full list before continuing to step 2.
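As a rough illustration of the directory-based inventory described above, the sketch below counts sitemap URLs by their first path segment. It assumes the site exposes a standard XML sitemap; the example URLs and the function name `pages_per_directory` are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from urllib.parse import urlparse

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def pages_per_directory(sitemap_xml: str) -> Counter:
    """Count sitemap URLs by their first path segment."""
    root = ET.fromstring(sitemap_xml)
    counts = Counter()
    for loc in root.iter(f"{SITEMAP_NS}loc"):
        path = urlparse(loc.text.strip()).path
        segments = [s for s in path.split("/") if s]
        # URLs that are not nested inside a directory are counted under "/".
        directory = f"/{segments[0]}/" if len(segments) > 1 else "/"
        counts[directory] += 1
    return counts


# Hypothetical sitemap content, for illustration only.
EXAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
  <url><loc>https://example.com/storeroom/a1</loc></url>
  <url><loc>https://example.com/storeroom/b2</loc></url>
</urlset>"""

if __name__ == "__main__":
    for directory, total in sorted(pages_per_directory(EXAMPLE_SITEMAP).items()):
        print(directory, total)
```

A per-directory count like this is only a starting point; pages that never made it into the sitemap still have to be added by hand.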

2) Install Prerender

Prerender is going to be your best friend when performing SEO for SPAs. Prerender is a service that will render your website in a virtual browser, then serve the static HTML content to web crawlers. From an SEO standpoint, this is as good a solution as you can hope for: users still get the fast, dynamic SPA experience while search engine crawlers can identify indexable content for search results.

Prerender’s pricing varies based on the size of your site and the freshness of the cache served to Google. Smaller sites (up to 250 pages) can use Prerender for free, while larger sites (or sites that update constantly) may need to pay as much as $200+/month. However, having an indexable version of your site that enables you to attract customers through organic search is invaluable. This is where that list you compiled in step 1 comes into play: if you can prioritize what sections of your site need to be served to search engines, or with what frequency, you may be able to save a little bit of money each month while still achieving SEO progress.
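To make the idea concrete, here is a minimal sketch of the kind of decision a prerendering layer has to make on each request: crawlers asking for a page get the static snapshot, everyone else gets the normal SPA. This is not Prerender’s actual middleware; the token list, the extension list, and the function name are assumptions chosen for illustration.

```python
from typing import Optional

# Hypothetical, abbreviated lists; a real deployment would maintain longer ones.
CRAWLER_TOKENS = ("googlebot", "bingbot", "yandex", "baiduspider")
STATIC_EXTENSIONS = (".js", ".css", ".png", ".jpg", ".gif", ".svg", ".ico")


def should_serve_snapshot(user_agent: Optional[str], path: str) -> bool:
    """Return True when a request looks like a crawler asking for a page."""
    if not user_agent:
        return False
    if path.lower().endswith(STATIC_EXTENSIONS):
        return False  # assets are served as-is to everyone
    ua = user_agent.lower()
    return any(token in ua for token in CRAWLER_TOKENS)
```

For example, a request from `Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)` for `/storeroom/a1` would be routed to the snapshot, while the same user agent asking for `/app.js` would not.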

3) “Fetch as Google”

Within Google Search Console is an incredibly useful feature called “Fetch as Google.” “Fetch as Google” allows you to enter a URL from your site and fetch it as Googlebot would during a crawl. “Fetch” returns the HTTP response from the page, which includes a full download of the page source code as Googlebot sees it. “Fetch and Render” will return the HTTP response and will also provide a screenshot of the page as Googlebot saw it and as a site visitor would see it.

This has powerful applications for AngularJS sites. Even with Prerender installed, you may find that Google is still only partially displaying your website, or it may be omitting key features of your site that are helpful to users. Plugging the URL into “Fetch as Google” will let you review how your site appears to search engines and what further steps you may need to take to optimize your keyword rankings. Additionally, after requesting a “Fetch” or “Fetch and Render,” you have the option to “Request Indexing” for that page, which can be a handy catalyst for getting your site to appear in search results.

4) Configure Google Analytics (or Google Tag Manager)

As I mentioned above, SPAs can have serious trouble with recording Google Analytics data since they don’t track pageviews the way a standard website does. Instead of the traditional Google Analytics tracking code, you’ll need to install Analytics through some kind of alternative method.

One method that works well is to use the Angulartics plugin. Angulartics replaces standard pageview events with virtual pageview tracking, which tracks the entire user navigation across your application. Since SPAs dynamically load HTML content, these virtual pageviews are recorded based on user interactions with the site, which ultimately tracks the same user behavior as you would through traditional Analytics. Other people have found success using Google Tag Manager “History Change” triggers or other innovative methods, which are perfectly acceptable implementations. As long as your Google Analytics tracking records user interactions instead of conventional pageviews, your Analytics configuration should suffice.
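The virtual pageview idea can be sketched in a few lines: instead of waiting for a full page load, the tracker fires a pageview whenever the application’s route changes. The example below is a conceptual model only, not how Angulartics is implemented; the class name is hypothetical and the hit fields (`hitType`, `page`, `title`) are loosely modeled on analytics.js.

```python
from typing import Callable, List, Optional


class VirtualPageviewTracker:
    """Records one pageview per route change in a single-page app."""

    def __init__(self, send: Callable[[dict], None]) -> None:
        self._send = send
        self._last_path: Optional[str] = None

    def on_route_change(self, path: str, title: str = "") -> None:
        if path == self._last_path:
            return  # the same view re-rendered; not a new pageview
        self._last_path = path
        self._send({"hitType": "pageview", "page": path, "title": title})


# Simulated navigation: the repeated "/storeroom" route is ignored.
hits: List[dict] = []
tracker = VirtualPageviewTracker(hits.append)
for route in ["/", "/storeroom", "/storeroom", "/storeroom/a1"]:
    tracker.on_route_change(route)
```

In a real application the `send` callback would forward the hit to the analytics library rather than append it to a list.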

5) Recrawl the site

After working through steps 1–4, you’re going to want to crawl the site yourself to find those errors that not even Googlebot was anticipating. One issue we discovered early with a client was that after installing Prerender, our crawlers were still running into a spider trap:

As you can probably tell, there were not actually 150,000 pages on that particular site. Our crawlers just found a recursive loop that kept generating longer and longer URL strings for the site content. This is something we would not have found in Google Search Console or Analytics. SPAs are notorious for causing tedious, inexplicable issues that you’ll only uncover by crawling the site yourself. Even if you follow the steps above and take as many precautions as possible, I can still almost guarantee you will come across a unique issue that can only be diagnosed through a crawl.
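A crawl export can be screened for this kind of loop with a simple heuristic: flag URLs whose paths are unusually deep or repeat the same segment several times. The thresholds and the function name below are assumptions, and the check is a sketch rather than a complete trap detector.

```python
from collections import Counter
from urllib.parse import urlparse


def looks_like_spider_trap(url: str, max_repeats: int = 2, max_depth: int = 10) -> bool:
    """Flag URLs with very deep paths or heavily repeated path segments."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) > max_depth:
        return True
    # A segment appearing more often than max_repeats suggests a recursive loop.
    return any(count > max_repeats for count in Counter(segments).values())
```

Running every crawled URL through a check like this and sorting the flagged ones by length usually makes a recursive pattern easy to spot.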

If you’ve come across any of these unique issues, let me know in the comments! I’d love to hear what other issues people have encountered with SPAs.

Results

As I mentioned earlier in the article, the process outlined above has enabled us to not only get client sites indexed, but even to get those sites ranking on the first page for various keywords. Here’s an example of the keyword progress we made for one client with an AngularJS site:

Also, the organic traffic growth for that client over the course of seven months:

All of this goes to show that although SEO for SPAs can be tedious, laborious, and troublesome, it is not impossible. Follow the steps above, and you can have SEO success with your single-page app website.

Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don’t have time to hunt down but want to read!

Reblogged 2 years ago from tracking.feedpress.it

Stop Ghost Spam in Google Analytics with One Filter

Posted by CarloSeo

The spam in Google Analytics (GA) is becoming a serious issue. Due to a deluge of referral spam from social buttons, adult sites, and many, many other sources, people are starting to become overwhelmed by all the filters they are setting up to manage the useless data they are receiving.

The good news is, there is no need to panic. In this post, I’m going to focus on the most common mistakes people make when fighting spam in GA, and explain an efficient way to prevent it.

But first, let’s make sure we understand how spam works. A couple of months ago, Jared Gardner wrote an excellent article explaining what referral spam is, including its intended purpose. He also pointed out some great examples of referral spam.

Types of spam

The spam in Google Analytics can be categorized by two types: ghosts and crawlers.

Ghosts

The vast majority of spam is this type. They are called ghosts because they never access your site. It is important to keep this in mind, as it’s key to creating a more efficient solution for managing spam.

As unusual as it sounds, this type of spam doesn’t have any interaction with your site at all. You may wonder how that is possible since one of the main purposes of GA is to track visits to our sites.

They do it by using the Measurement Protocol, which allows people to send data directly to Google Analytics’ servers. Using this method, and probably randomly generated tracking codes (UA-XXXXX-1) as well, the spammers leave a “visit” with fake data, without even knowing who they are hitting.

Crawlers

This type of spam, the opposite to ghost spam, does access your site. As the name implies, these spam bots crawl your pages, ignoring rules like those found in robots.txt that are supposed to stop them from reading your site. When they exit your site, they leave a record on your reports that appears similar to a legitimate visit.

Crawlers are harder to identify because they know their targets and use real data. But it is also true that new ones seldom appear. So if you detect a referral in your analytics that looks suspicious, researching it on Google or checking it against this list might help you answer the question of whether or not it is spammy.

Most common mistakes made when dealing with spam in GA

I’ve been following this issue closely for the last few months. According to the comments people have made on my articles and conversations I’ve found in discussion forums, there are primarily three mistakes people make when dealing with spam in Google Analytics.

Mistake #1. Blocking ghost spam from the .htaccess file

One of the biggest mistakes people make is trying to block Ghost Spam from the .htaccess file.

For those who are not familiar with this file, one of its main functions is to allow/block access to your site. Now we know that ghosts never reach your site, so adding them here won’t have any effect and will only add useless lines to your .htaccess file.

Ghost spam usually shows up for a few days and then disappears. As a result, sometimes people think that they successfully blocked it from here when really it’s just a coincidence of timing.

Then when the spammers later return, they get worried because the solution is not working anymore, and they think the spammer somehow bypassed the barriers they set up.

The truth is, the .htaccess file can only effectively block crawlers such as buttons-for-website.com and a few others since these access your site. Most of the spam can’t be blocked using this method, so there is no other option than using filters to exclude them.

Mistake #2. Using the referral exclusion list to stop spam

Another error is trying to use the referral exclusion list to stop the spam. The name may confuse you, but this list is not intended to exclude referrals in the way we want to for the spam. It has other purposes.

For example, when a customer buys something, sometimes they get redirected to a third-party page for payment. After making a payment, they’re redirected back to your website, and GA records that as a new referral. It is appropriate to use the referral exclusion list to prevent this from happening.

If you try to use the referral exclusion list to manage spam, however, the referral part will be stripped since there is no preexisting record. As a result, a direct visit will be recorded, and you will have a bigger problem than the one you started with, since you will still have spam and direct visits are harder to track.

Mistake #3. Worrying that bounce rate changes will affect rankings

When people see that the bounce rate changes drastically because of the spam, they start worrying about the impact that it will have on their rankings in the SERPs.

This is another mistake commonly made. With or without spam, Google doesn’t take into consideration Google Analytics metrics as a ranking factor. Here is an explanation about this from Matt Cutts, the former head of Google’s web spam team.

And if you think about it, Cutts’ explanation makes sense; because although many people have GA, not everyone uses it.

Assuming your site has been hacked

Another common concern when people see strange landing pages coming from spam on their reports is that they have been hacked.

The page that the spam shows on the reports doesn’t exist, and if you try to open it, you will get a 404 page. Your site hasn’t been compromised.

But you have to make sure the page doesn’t exist, because there are cases (not spam) where sites have a security breach and get injected with pages full of bad keywords to defame the website.

What should you worry about?

Now that we’ve discarded security issues and their effects on rankings, the only thing left to worry about is your data. The fake trail that the spam leaves behind pollutes your reports.

It might have greater or lesser impact depending on your site traffic, but everyone is susceptible to the spam.

Small and midsize sites are the most easily impacted – not only because a big part of their traffic can be spam, but also because usually these sites are self-managed and sometimes don’t have the support of an analyst or a webmaster.

Big sites with a lot of traffic can also be impacted by spam, and although the impact can be insignificant, invalid traffic means inaccurate reports no matter the size of the website. As an analyst, you should be able to explain what’s going on in even the most granular reports.

You only need one filter to deal with ghost spam

Usually it is recommended to add the referral to an exclusion filter after it is spotted. Although this is useful for a quick action against the spam, it has three big disadvantages.

  • Making filters every week for every new spam detected is tedious and time-consuming, especially if you manage many sites.
  • By the time you apply the filter and it starts working, you already have some affected data.
  • Some of the spammers use direct visits along with the referrals. These direct hits won’t be stopped by the filter, so even if you are excluding the referral you will still be receiving invalid traffic, which explains why some people have seen an unusual spike in direct traffic.

Luckily, there is a good way to prevent all these problems. Most of the spam (ghost) works by hitting random GA tracking IDs, meaning the offender doesn’t really know who the target is, and for that reason either the hostname is not set or it uses a fake one. (See report below)

[Screenshot: hostname report showing ghost spam entries]

You can see that they use some weird names or don’t even bother to set one. Although there are some known names in the list, these can be easily added by the spammer.

On the other hand, valid traffic will always use a real hostname. In most cases, this will be the domain. But it can also come from paid services, translation services, or any other place where you’ve inserted GA tracking code.

Based on this, we can make a filter that will include only hits that use real hostnames. This will automatically exclude all hits from ghost spam, whether it shows up as a referral, keyword, or pageview; or even as a direct visit.

To create this filter, you will need to find the report of hostnames. Here’s how:

  1. Go to the Reporting tab in GA
  2. Click on Audience in the lefthand panel
  3. Expand Technology and select Network
  4. At the top of the report, click on Hostname

You will see a list of all hostnames, including the ones that the spam uses. Make a list of all the valid hostnames you find, as follows:

  • yourmaindomain.com
  • blog.yourmaindomain.com
  • es.yourmaindomain.com
  • payingservice.com
  • translatetool.com
  • anotheruseddomain.com

For small to medium sites, this list of hostnames will likely consist of the main domain and a couple of subdomains. After you are sure you got all of them, create a regular expression similar to this one:

yourmaindomain\.com|anotheruseddomain\.com|payingservice\.com|translatetool\.com

You don’t need to put all of your subdomains in the regular expression. The main domain will match all of them. If you don’t have a view set up without filters, create one now.
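Before pasting the expression into GA, it can help to check it against the hostnames from your report. The sketch below rebuilds the expression from a list and tests a few hostnames against it. It assumes the filter pattern is applied as a partial match (which is what the advice about subdomains implies); the helper names and the spam hostnames are hypothetical.

```python
import re

# Valid hostnames taken from the example list above (main domains only).
VALID_HOSTNAMES = [
    "yourmaindomain.com",
    "anotheruseddomain.com",
    "payingservice.com",
    "translatetool.com",
]


def build_hostname_pattern(hostnames):
    """Join escaped hostnames into a single alternation expression."""
    return "|".join(re.escape(h) for h in hostnames)


PATTERN = build_hostname_pattern(VALID_HOSTNAMES)


def is_valid_hostname(hostname: str) -> bool:
    # re.search is used to mimic a partial match, so subdomains are covered.
    return re.search(PATTERN, hostname) is not None


if __name__ == "__main__":
    for host in ["blog.yourmaindomain.com", "(not set)", "free-share-buttons.example"]:
        print(host, is_valid_hostname(host))
```

One consequence of partial matching is that any hostname merely containing one of your domains also passes; anchoring the expression is possible, but then each subdomain has to be accounted for explicitly.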

Then create a Custom Filter.

Make sure you select INCLUDE, then select “Hostname” on the filter field, and copy your expression into the Filter Pattern box.

You might want to verify the filter before saving to check that everything is okay. Once you’re ready, set it to save, and apply the filter to all the views you want (except the view without filters).

This single filter will get rid of future occurrences of ghost spam that use invalid hostnames, and it doesn’t require much maintenance. But it’s important that every time you add your tracking code to a new service, you add that service’s hostname to the end of the filter expression.

Now you should only need to take care of the crawler spam. Since crawlers access your site, you can block them by adding these lines to the .htaccess file:

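## STOP REFERRER SPAM
# mod_rewrite must be switched on for the rules below to take effect
RewriteEngine On
RewriteCond %{HTTP_REFERER} semalt\.com [NC,OR]
RewriteCond %{HTTP_REFERER} buttons-for-website\.com [NC]
# Return 403 Forbidden to any request whose referrer matched above
RewriteRule .* - [F]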

It is important to note that this file is very sensitive, and misplacing a single character in it can bring down your entire site. Therefore, make sure you create a backup copy of your .htaccess file prior to editing it.
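Because a single misplaced character matters so much here, one way to reduce typing errors is to generate the conditions from a plain list of domains. The sketch below produces lines in the same shape as the rules above; the function name is hypothetical, and it assumes mod_rewrite is already switched on with `RewriteEngine On` elsewhere in the file.

```python
def referrer_block_rules(domains):
    """Build .htaccess lines that return 403 for the given referrer domains."""
    if not domains:
        return []
    lines = ["## STOP REFERRER SPAM"]
    for index, domain in enumerate(domains):
        escaped = domain.replace(".", r"\.")  # dots are literal in the pattern
        # Every condition except the last is chained with OR.
        flags = "[NC]" if index == len(domains) - 1 else "[NC,OR]"
        lines.append(f"RewriteCond %{{HTTP_REFERER}} {escaped} {flags}")
    lines.append("RewriteRule .* - [F]")
    return lines


if __name__ == "__main__":
    print("\n".join(referrer_block_rules(["semalt.com", "buttons-for-website.com"])))
```

The important detail is that the final condition carries no `OR` flag; leaving one there would chain it to nothing and change which requests are blocked.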

If you don’t feel comfortable messing around with your .htaccess file, you can alternatively make an expression with all the crawlers, then add it to an exclude filter by Campaign Source.

Implement these combined solutions, and you will worry much less about spam contaminating your analytics data. This will have the added benefit of freeing up more time for you to spend actually analyzing your valid data.

After stopping spam, you can also get clean reports from the historical data by using the same expressions in an Advanced Segment to exclude all the spam.

Bonus resources to help you manage spam

If you still need more information to help you understand and deal with the spam on your GA reports, you can read my main article on the subject here: http://www.ohow.co/what-is-referrer-spam-how-stop-it-guide/.

Additional information on how to stop spam can be found at these URLs:

In closing, I am eager to hear your ideas on this serious issue. Please share them in the comments below.

(Editor’s Note: All images featured in this post were created by the author.)

Reblogged 4 years ago from tracking.feedpress.it

Controlling Search Engine Crawlers for Better Indexation and Rankings – Whiteboard Friday

Posted by randfish

When should you disallow search engines in your robots.txt file, and when should you use meta robots tags in a page header? What about nofollowing links? In today’s Whiteboard Friday, Rand covers these tools and their appropriate use in four situations that SEOs commonly find themselves facing.

For reference, here’s a still of this week’s whiteboard. Click on it to open a high resolution image in a new tab!

Video transcription

Howdy Moz fans, and welcome to another edition of Whiteboard Friday. This week we’re going to talk about controlling search engine crawlers, blocking bots, sending bots where we want, restricting them from where we don’t want them to go. We’re going to talk a little bit about crawl budget and what you should and shouldn’t have indexed.

As a start, what I want to do is discuss the ways in which we can control robots. Those include the three primary ones: robots.txt, meta robots, and—well, the nofollow tag is a little bit less about controlling bots.

There are a few others that we’re going to discuss as well, including Webmaster Tools (Search Console) and URL status codes. But let’s dive into those first few first.

Robots.txt lives at yoursite.com/robots.txt. It tells crawlers what they should and shouldn’t access, but it doesn’t always get respected by Google and Bing. So a lot of folks say, “hey, disallow this,” and then suddenly see those URLs popping up and wonder what’s going on. Look, Google and Bing oftentimes think that they just know better. They think that maybe you’ve made a mistake; they think, “hey, there’s a lot of links pointing to this content, there’s a lot of people who are visiting and caring about this content, maybe you didn’t intend for us to block it.” The more specific you get about an individual URL, the better they usually are about respecting it. The less specific, meaning the more you use wildcards or say “everything behind this entire big directory,” the worse they are about necessarily believing you.

Meta robots—a little different—that lives in the headers of individual pages, so you can only control a single page with a meta robots tag. That tells the engines whether or not they should keep a page in the index, and whether they should follow the links on that page, and it’s usually a lot more respected, because it’s at an individual-page level; Google and Bing tend to believe you about the meta robots tag.

And then the nofollow tag, that lives on an individual link on a page. It doesn’t tell engines where to crawl or not to crawl. All it’s saying is whether you editorially vouch for a page that is being linked to, and whether you want to pass the PageRank and link equity metrics to that page.

Interesting point about meta robots and robots.txt working together (or not working together so well)—many, many folks in the SEO world do this and then get frustrated.

What if, for example, we take a page like “blogtest.html” on our domain and we say “all user agents, you are not allowed to crawl blogtest.html”? Okay, that’s a good way to keep that page away from being crawled, but just because something is not crawled doesn’t necessarily mean it won’t be in the search results.

So then we have our SEO folks go, “you know what, let’s make doubly sure that doesn’t show up in search results; we’ll put in the meta robots tag:”

<meta name="robots" content="noindex, follow">

So, “noindex, follow” tells the search engine crawler they can follow the links on the page, but they shouldn’t index this particular one.

Then, you go and run a search for “blog test” in this case, and everybody on the team’s like “What the heck!? WTF? Why am I seeing this page show up in search results?”

The answer is, you told the engines that they couldn’t crawl the page, so they didn’t. But they are still putting it in the results. They’re actually probably not going to include a meta description; they might have something like “we can’t include a meta description because of this site’s robots.txt file.” The reason it’s showing up is because they can’t see the noindex; all they see is the disallow.

So, if you want something truly removed, unable to be seen in search results, you can’t just disallow a crawler. You have to say meta “noindex” and you have to let them crawl it.
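The interaction can be modeled in a few lines: a crawler that obeys robots.txt never fetches a disallowed page, so it never gets to read that page’s meta robots tag. The sketch below uses Python’s standard `urllib.robotparser` and `html.parser`; the helper names and the example.com URL are hypothetical, and real search engines are of course more nuanced than this model.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class MetaRobotsParser(HTMLParser):
    """Collects the directives from <meta name="robots"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def crawler_sees_noindex(robots_txt: str, url: str, html: str, agent: str = "*") -> bool:
    """Return True only if the page may be crawled AND carries a noindex."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    if not rules.can_fetch(agent, url):
        return False  # the page is never fetched, so its meta tag is never read
    parser = MetaRobotsParser()
    parser.feed(html)
    return "noindex" in parser.directives


PAGE = '<html><head><meta name="robots" content="noindex, follow"></head><body>test</body></html>'
URL = "https://example.com/blogtest.html"
BLOCKED = "User-agent: *\nDisallow: /blogtest.html"
OPEN = "User-agent: *\nDisallow:"
```

With the `BLOCKED` rules the noindex is invisible to the crawler; with the `OPEN` rules it can be read and honored.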

So this creates some complications. Robots.txt can be great if we’re trying to save crawl bandwidth, but it isn’t necessarily ideal for preventing a page from being shown in the search results. I would not recommend, by the way, that you do what we think Twitter recently tried to do, where they tried to canonicalize www and non-www by saying “Google, don’t crawl the www version of twitter.com.” What you should be doing is rel canonical-ing or using a 301.

Meta robots—that can allow crawling and link-following while disallowing indexation, which is great, but it requires crawl budget and you can still conserve indexing.

The nofollow tag, generally speaking, is not particularly useful for controlling bots or conserving indexation.

Webmaster Tools (now Google Search Console) has some special things that allow you to restrict access or remove a result from the search results. For example, if you have 404’d something or if you’ve told them not to crawl something but it’s still showing up in there, you can manually say “don’t do that.” There are a few other crawl protocol things that you can do.

And then URL status codes—these are a valid way to do things, but they’re going to obviously change what’s going on on your pages, too.

If you’re not having a lot of luck using a 404 to remove something, you can use a 410 to permanently remove something from the index. Just be aware that once you use a 410, it can take a long time if you want to get that page re-crawled or re-indexed, and you want to tell the search engines “it’s back!” 410 is permanent removal.

301—permanent redirect, we’ve talked about those here—and 302, temporary redirect.

Now let’s jump into a few specific use cases of “what kinds of content should and shouldn’t I allow engines to crawl and index” in this next version…

[Rand moves at superhuman speed to erase the board and draw part two of this Whiteboard Friday. Seriously, we showed Roger how fast it was, and even he was impressed.]

Four crawling/indexing problems to solve

So we’ve got these four big problems that I want to talk about as they relate to crawling and indexing.

1. Content that isn’t ready yet

The first one here is around, “If I have content of quality I’m still trying to improve—it’s not yet ready for primetime, it’s not ready for Google, maybe I have a bunch of products and I only have the descriptions from the manufacturer and I need people to be able to access them, so I’m rewriting the content and creating unique value on those pages… they’re just not ready yet—what should I do with those?”

My options around crawling and indexing? If I have a large quantity of those—maybe thousands, tens of thousands, hundreds of thousands—I would probably go the robots.txt route. I’d disallow those pages from being crawled, and then eventually as I get (folder by folder) those sets of URLs ready, I can then allow crawling and maybe even submit them to Google via an XML sitemap.

If I’m talking about a small quantity—a few dozen, a few hundred pages—well, I’d probably just use the meta robots noindex, and then I’d pull that noindex off of those pages as they are made ready for Google’s consumption. And then again, I would probably use the XML sitemap and start submitting those once they’re ready.

2. Dealing with duplicate or thin content

What about, “Should I noindex, nofollow, or potentially disallow crawling on largely duplicate URLs or thin content?” I’ve got an example. Let’s say I’m an ecommerce shop, I’m selling this nice Star Wars t-shirt which I think is kind of hilarious, so I’ve got starwarsshirt.html, and it links out to a larger version of an image, and that’s an individual HTML page. It links out to different colors, which change the URL of the page, so I have a gray, blue, and black version. Well, these four pages are really all part of this same one, so I wouldn’t recommend disallowing crawling on these, and I wouldn’t recommend noindexing them. What I would do there is a rel canonical.

Remember, rel canonical is one of those things that can be precluded by disallowing. So, if I were to disallow these from being crawled, Google couldn’t see the rel canonical back, so if someone linked to the blue version instead of the default version, now I potentially don’t get link credit for that. So what I really want to do is use the rel canonical, allow the indexing, and allow it to be crawled. If you really feel like it, you could also put a meta “noindex, follow” on these pages, but I don’t really think that’s necessary, and again that might interfere with the rel canonical.
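When auditing variant pages like these, it is useful to confirm that each one actually declares the canonical URL you expect. The following sketch extracts the rel canonical from a page’s HTML with the standard library; the class and function names, and the example markup, are assumptions for illustration.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalParser(HTMLParser):
    """Captures the href of the first <link rel="canonical"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel and self.canonical is None:
            self.canonical = attr.get("href")


def canonical_url(html: str) -> Optional[str]:
    """Return the canonical URL declared in the page, or None if absent."""
    parser = CanonicalParser()
    parser.feed(html)
    return parser.canonical


# Hypothetical markup for the blue color variant of the shirt page.
BLUE_VARIANT = (
    '<html><head><link rel="canonical" '
    'href="https://example.com/starwarsshirt.html"></head><body>Blue</body></html>'
)
```

If the variant pages were blocked in robots.txt, a crawler would never download this markup, which is why the canonical only helps on pages that remain crawlable.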

3. Passing link equity without appearing in search results

Number three: “If I want to pass link equity (or at least crawling) through a set of pages without those pages actually appearing in search results—so maybe I have navigational stuff, ways that humans are going to navigate through my pages, but I don’t need those appearing in search results—what should I use then?”

What I would say here is, you can use the meta robots to say “don’t index the page, but do follow the links that are on that page.” That’s a pretty nice, handy use case for that.

Do NOT, however, disallow those in robots.txt. Many, many folks make this mistake. If you disallow crawling on those pages, Google can’t see the noindex, and they don’t know that they can follow the links. Granted, as we talked about before, sometimes Google doesn’t obey the robots.txt, but you can’t rely on that behavior. Trust that the disallow in robots.txt will prevent them from crawling. So I would say, the meta robots “noindex, follow” is the way to do this.

4. Search results-type pages

Finally, fourth, “What should I do with search results-type pages?” Google has said many times that they don’t like your search results from your own internal engine appearing in their search results, and so this can be a tricky use case.

Sometimes a search result page—a page that lists many types of results that might come from a database of types of content that you’ve got on your site—could actually be a very good result for a searcher who is looking for a wide variety of content, or who wants to see what you have on offer. Yelp does this: When you say, “I’m looking for restaurants in Seattle, WA,” they’ll give you what is essentially a list of search results, and Google does want those to appear because that page provides a great result. But you should be doing what Yelp does there, and make the most common or popular individual sets of those search results into category-style pages. A page that provides real, unique value, that’s not just a list of search results, that is more of a landing page than a search results page.

However, that being said, if you’ve got a long tail of these, or if you’d say “hey, our internal search engine, that’s really for internal visitors only; it’s not useful to have those pages show up in search results, and we don’t think we need to make the effort to make those into category landing pages,” then you can use the disallow in robots.txt to prevent those from being crawled.

Just be cautious here, because I have sometimes seen an over-swinging of the pendulum toward blocking all types of search results, and sometimes that can actually hurt your SEO and your traffic. Sometimes those pages can be really useful to people. So check your analytics, and make sure those aren’t valuable pages that should be served up and turned into landing pages. If you’re sure, then go ahead and disallow all your search results-style pages. You’ll see a lot of sites doing this in their robots.txt file.
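
As a rough sketch of what that disallow might look like, the snippet below assumes internal search lives under a `/search` path (an assumption; your site's URL structure may differ) and uses Python's standard `urllib.robotparser` to confirm that search pages are blocked while a category-style landing page stays crawlable.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block internal search results, leave the rest open.
# Note that "Disallow: /search" is a prefix match, so it would also block
# paths such as /search-tips.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


def crawlable(path, agent="Googlebot"):
    """True if a compliant crawler may fetch the given path on example.com."""
    return rules.can_fetch(agent, "https://example.com" + path)


print(crawlable("/search?q=blue+widgets"))   # internal search result
print(crawlable("/restaurants/seattle-wa"))  # category-style landing page
```

Running a check like this over the URLs that earn traffic in your analytics is one way to catch an over-broad rule before it ships.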

That being said, I hope you have some great questions about crawling and indexing, controlling robots, blocking robots, allowing robots, and I’ll try and tackle those in the comments below.

We’ll look forward to seeing you again next week for another edition of Whiteboard Friday. Take care!

Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don’t have time to hunt down but want to read!

Reblogged 4 years ago from tracking.feedpress.it

Big Data, Big Problems: 4 Major Link Indexes Compared

Posted by russangular

Given this blog’s readership, chances are good you will spend some time this week looking at backlinks in one of the growing number of link data tools. We know backlinks continue to be one of, if not the most important, parts of Google’s ranking algorithm. We tend to take these link data sets at face value, though, in part because they are all we have. But when your rankings are on the line, is there a better way to get at which data set is the best? How should we go about assessing these different link indexes like Moz, Majestic, Ahrefs and SEMrush for quality? Historically, there have been 4 common approaches to this question of index quality…

  • Breadth: We might choose to look at the number of linking root domains any given service reports. We know
    that referring domains correlates strongly with search rankings, so it makes sense to judge a link index by how many unique domains it has
    discovered and indexed.
  • Depth: We also might choose to look at how deep the web has been crawled, looking more at the total number of URLs
    in the index, rather than the diversity of referring domains.
  • Link Overlap: A more sophisticated approach might count the number of links an index has in common with Google Webmaster
    Tools.
  • Freshness: Finally, we might choose to look at the freshness of the index. What percentage of links in the index are
    still live?
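
The four measures above can be sketched in a few lines of Python. This is only an illustration: the link sets are invented, the function names are hypothetical, and the root-domain helper is deliberately naive (it keeps the last two host labels, which is wrong for suffixes such as .co.uk).

```python
from urllib.parse import urlparse


def root_domain(url):
    """Naive root domain: the last two labels of the host name."""
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def index_quality(links, reference_links, live_links):
    """Score one link index. Every argument is a set of (source_url, target_url).

    reference_links stands in for the links Google reports, and live_links
    for the links confirmed to still exist.
    """
    return {
        "breadth": len({root_domain(src) for src, _ in links}),  # linking root domains
        "depth": len({src for src, _ in links}),                 # distinct linking URLs
        "overlap": len(links & reference_links),                 # links shared with Google
        "freshness": len(links & live_links) / len(links) if links else 0.0,
    }


# Invented data for illustration.
links = {
    ("http://a.example.com/1", "http://target.com/"),
    ("http://a.example.com/2", "http://target.com/"),
    ("http://blog.other.org/post", "http://target.com/"),
    ("http://third.net/x", "http://target.com/about"),
}
reference = {
    ("http://a.example.com/1", "http://target.com/"),
    ("http://third.net/x", "http://target.com/about"),
    ("http://unseen.io/", "http://target.com/"),
}
live = links - {("http://a.example.com/2", "http://target.com/")}

result = index_quality(links, reference, live)
print(result)
```

Note that none of these four numbers says anything about whether the index weights domains in the same proportions as Google does.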

There are a number of really good studies (some newer than others) using these techniques that are worth checking out when you get a chance:

  • BuiltVisible analysis of Moz, Majestic, GWT, Ahrefs and Search Metrics
  • SEOBook comparison of Moz, Majestic, Ahrefs, and Ayima
  • MatthewWoodward
    study of Ahrefs, Majestic, Moz, Raven and SEO Spyglass
  • Marketing Signals analysis of Moz, Majestic, Ahrefs, and GWT
  • RankAbove comparison of Moz, Majestic, Ahrefs and Link Research Tools
  • StoneTemple study of Moz and Majestic

While these are all excellent at addressing the methodologies above, there is a particular limitation with all of them. They miss one of the
most important metrics we need to determine the value of a link index: proportional representation to Google’s link graph.
So here at Angular Marketing, we decided to take a closer look.

Proportional representation to Google Search Console data

So, why is it important to determine proportional representation? Many of the most important and valued metrics we use are built on proportional
models. PageRank, MozRank, CitationFlow and Ahrefs Rank are proportional in nature. The score of any one URL in the data set is relative to the
other URLs in the data set. If the data set is biased, the results are biased.

A Visualization

Link graphs are biased by their crawl prioritization. Because there is no full representation of the Internet, every link graph, even Google’s,
is a biased sample of the web. Imagine for a second that the picture below is of the web. Each dot represents a page on the Internet,
and the dots surrounded by green represent a fictitious index by Google of certain sections of the web.

Of course, Google isn’t the only organization that crawls the web. Other organizations like Moz,
Majestic, Ahrefs, and SEMrush
have their own crawl prioritizations which result in different link indexes.

In the example above, you can see different link providers trying to index the web like Google. Link data provider 1 (purple) does a good job
of building a model that is similar to Google. It isn’t very big, but it is proportional. Link data provider 2 (blue) has a much larger index,
and likely has more links in common with Google than link data provider 1, but it is highly disproportional. So, how would we go about measuring
this proportionality? And which data set is the most proportional to Google?

Methodology

The first step is to determine a measurement of relativity for analysis. Google doesn’t give us very much information about their link graph.
All we have is what is in Google Search Console. The best source we can use is referring domain counts. In particular, we want to look at
what we call referring domain link pairs. A referring domain link pair would be something like ask.com->mlb.com: 9,444, which means
that ask.com links to mlb.com 9,444 times.

Steps

  1. Determine the root linking domain pairs and values to 100+ sites in Google Search Console
  2. Determine the same for Ahrefs, Moz, Majestic Fresh, Majestic Historic, SEMrush
  3. Compare the referring domain link pairs of each data set to Google, assuming a Poisson distribution
  4. Run simulations of each data set’s performance against each other (i.e., Moz vs. Majestic, Ahrefs vs. SEMrush, Moz vs. SEMrush, etc.)
  5. Analyze the results
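
The post does not publish the exact statistical procedure, so the following is only one plausible reading of step 3, sketched under stated assumptions. It treats each index's count for a referring domain link pair as a Poisson draw whose mean is Google's count scaled to the index's overall size, and reports the average log-likelihood. All counts other than the ask.com->mlb.com figure are invented, and the function names are hypothetical.

```python
import math


def poisson_log_pmf(k, lam):
    """log P(X = k) for X ~ Poisson(lam), with lam > 0."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def proportionality_score(google_pairs, index_pairs):
    """Mean Poisson log-likelihood of an index's counts given Google's proportions.

    Both arguments map "source->target" pairs to link counts. Values closer
    to zero mean the index mirrors Google's proportions more closely. This is
    an assumed scoring rule, not the authors' documented method.
    """
    google_total = sum(google_pairs.values())
    index_total = sum(index_pairs.get(pair, 0) for pair in google_pairs)
    scale = index_total / google_total
    if scale == 0:
        return float("-inf")  # the index shares no pairs with Google
    log_likelihood = sum(
        poisson_log_pmf(index_pairs.get(pair, 0), count * scale)
        for pair, count in google_pairs.items()
    )
    return log_likelihood / len(google_pairs)


# Only the first count comes from the article; the rest are made up.
google = {"ask.com->mlb.com": 9444, "espn.com->mlb.com": 3000, "blog.example->mlb.com": 56}
proportional = {"ask.com->mlb.com": 4722, "espn.com->mlb.com": 1500, "blog.example->mlb.com": 28}
skewed = {"ask.com->mlb.com": 500, "espn.com->mlb.com": 5000, "blog.example->mlb.com": 750}

print(proportionality_score(google, proportional))
print(proportionality_score(google, skewed))
```

In this toy data both indexes hold the same number of links, yet the one that keeps Google's ratios scores far better. A raw log-likelihood also tends to drop as counts grow, so comparing indexes of very different sizes would need a size-aware correction.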

Results

When placed head-to-head, there seem to be some clear winners at first glance. In head-to-head, Moz edges out Ahrefs, but across the board, Moz and Ahrefs fare quite evenly. Moz, Ahrefs and SEMrush seem to be far better than Majestic Fresh and Majestic Historic. Is that really the case? And why?

It turns out there is an inversely proportional relationship between index size and proportional relevancy. This might seem counterintuitive:
shouldn’t the bigger indexes be closer to Google? Not exactly.

What does this mean?

Each organization has to create a crawl prioritization strategy. When you discover millions of links, you have to prioritize which ones you
might crawl next. Google has a crawl prioritization, and so do Moz, Majestic, Ahrefs and SEMrush. There are lots of different things you might
choose to prioritize…

  • You might prioritize link discovery. If you want to build a very large index, you could prioritize crawling pages on sites that
    have historically provided new links.
  • You might prioritize content uniqueness. If you want to build a search engine, you might prioritize finding pages that are unlike
    any you have seen before. You could choose to crawl domains that historically provide unique data and little duplicate content.
  • You might prioritize content freshness. If you want to keep your search engine recent, you might prioritize crawling pages that
    change frequently.
  • You might prioritize content value, crawling the most important URLs first based on the number of inbound links to that page.
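
To see how the choice of priority shapes an index, here is a toy simulation. The site graph, the two strategies, and the crawl budget are all invented for illustration, and the "most outlinks" strategy peeks at a page's outlink count before fetching it, standing in for a crawler's historical knowledge of which pages tend to yield new links.

```python
import heapq

# A made-up site graph: page -> pages it links to.
GRAPH = {
    "home": ["news", "shop"],
    "news": ["story1", "story2", "archive"],
    "shop": ["item1"],
    "archive": ["old1", "old2", "old3"],
    "item1": ["review"],
    "story1": [], "story2": [], "review": [],
    "old1": [], "old2": [], "old3": [],
}


def discovery_order(graph, page, order):
    """Crawl pages in the order they were discovered (breadth-first)."""
    return order


def most_outlinks(graph, page, order):
    """Crawl link-rich pages first (lower value = higher priority)."""
    return -len(graph.get(page, []))


def crawl(graph, seed, priority, budget):
    """Fetch up to `budget` pages, always taking the best-priority frontier page."""
    crawled, seen = [], {seed}
    frontier = [(priority(graph, seed, 0), 0, seed)]
    order = 1  # unique discovery counter, also breaks priority ties
    while frontier and len(crawled) < budget:
        _, _, page = heapq.heappop(frontier)
        crawled.append(page)
        for link in graph.get(page, []):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (priority(graph, link, order), order, link))
                order += 1
    return crawled


bfs = crawl(GRAPH, "home", discovery_order, budget=5)
rich = crawl(GRAPH, "home", most_outlinks, budget=5)
print(bfs)
print(rich)
```

Both crawlers start with the same pages, but with the same budget they end up holding different slices of the site, which is the disparity the size of a real crawl magnifies.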

Chances are, an organization’s crawl priority will blend some of these features, but it’s difficult to design one exactly like Google. Imagine
for a moment that instead of crawling the web, you want to climb a tree. You have to come up with a tree climbing strategy.

  • You decide to climb the longest branch you see at each intersection.
  • One friend of yours decides to climb the first new branch he reaches, regardless of how long it is.
  • Your other friend decides to climb the first new branch she reaches only if she sees another branch coming off of it.

Despite having different climb strategies, everyone chooses the same first branch, and everyone chooses the same second branch. There are only
so many different options early on.

But as the climbers go further and further along, their choices eventually produce differing results. This is exactly the same for web crawlers
like Google, Moz, Majestic, Ahrefs and SEMrush. The bigger the crawl, the more the crawl prioritization will cause disparities. This is not a
deficiency; this is just the nature of the beast. However, we aren’t completely lost. Once we know how index size is related to disparity, we
can make some inferences about how similar a crawl priority may be to Google.

Unfortunately, we have to be careful in our conclusions. We only have a few data points with which to work, so it is very difficult to be
certain regarding this part of the analysis. In particular, it seems strange that Majestic would get better relative to its index size as it grows,
unless Google holds on to old data (which might be an important discovery in and of itself). It is most likely that at this point we can’t make
this level of conclusion.

So what do we do?

Let’s say you have a list of domains or URLs for which you would like to know their relative values. Your process might look something like
this…

  • Check Open Site Explorer to see if all URLs are in their index. If so, you are looking at the metrics most likely to be proportional to Google’s link graph.
  • If any of the links do not occur in the index, move to Ahrefs and use their Ahrefs ranking if all you need is a single PageRank-like metric.
  • If any of the links are missing from Ahrefs’s index, or you need something related to trust, move on to Majestic Fresh.
  • Finally, use Majestic Historic for (by leaps and bounds) the largest coverage available.
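
The fallback order above can be expressed as a simple cascade. In this sketch the coverage sets are invented placeholders (in practice you would query each tool for your URLs), the function name is hypothetical, and the "need something related to trust" branch is left out for brevity.

```python
def pick_index(urls, indexes):
    """Return the first index, in preference order, that contains every URL.

    `indexes` is an ordered list of (name, set_of_known_urls), most
    proportional first. If no index covers everything, fall back to the
    last one, which is assumed to have the broadest coverage.
    """
    wanted = set(urls)
    for name, known_urls in indexes:
        if wanted <= known_urls:
            return name
    return indexes[-1][0]


# Hypothetical coverage data, for illustration only.
INDEXES = [
    ("Open Site Explorer", {"https://example.com/a"}),
    ("Ahrefs", {"https://example.com/a", "https://example.com/b"}),
    ("Majestic Fresh", {"https://example.com/a", "https://example.com/b",
                        "https://example.com/c"}),
    ("Majestic Historic", {"https://example.com/a", "https://example.com/b",
                           "https://example.com/c", "https://example.com/d"}),
]

print(pick_index(["https://example.com/a"], INDEXES))
print(pick_index(["https://example.com/a", "https://example.com/c"], INDEXES))
```

The cascade makes the trade-off explicit: the further down the list you have to go to find every URL, the less proportional the resulting metric is likely to be.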

It is important to point out that the likelihood that all the URLs you want to check are in a single index increases as the accuracy of the metric
decreases. Considering the size of Majestic’s data, you can’t ignore them because you are less likely to get null value answers from their data than
the others. If anything rings true, it is that once again it makes sense to get data from as many sources as possible. You won’t
get the most proportional data without Moz, the broadest data without Majestic, or everything in-between without Ahrefs.

What about SEMrush? They are making progress, but they don’t publish any relative statistics that would be useful in this particular
case. Maybe we can hope to see more from them soon given their already promising index!

Recommendations for the link graphing industry

All we hear about these days is big data; we almost never hear about good data. I know that the teams at Moz,
Majestic, Ahrefs, SEMrush and others are interested in mimicking Google, but I would love to see some organization stand up against the
allure of more data in favor of better data: data more like Google’s. It could begin with testing various crawl strategies to see if they produce
a result more similar to that of data shared in Google Search Console. Having the most Google-like data is certainly a crown worth winning.

Credits

Thanks to Diana Carter at Angular for assistance with data acquisition and Andrew Cron with statistical analysis. Thanks also to the representatives from Moz, Majestic, Ahrefs, and SEMrush for answering questions about their indices.

Reblogged 4 years ago from tracking.feedpress.it

The #LocalUp Advanced 2015 Agenda Is Here

Posted by EricaMcGillivray

You may have heard that in partnership with Local U, we’re putting on a local SEO conference called LocalUp Advanced on Saturday, February 7. We’re super-thrilled to be able to dive more into the local SEO space and bring you top speakers in the field for a one-day knowledge explosion. We’re expecting around 125-150 people at our Seattle headquarters, so this is your chance to really chat with speakers and attendees one-to-one with a huge return on investment.

Moz Pro or Local U Subscribers $699

General Admission $999


LocalUp Advanced 2015 Agenda


8:00-9:00am Breakfast
9:00-9:05am Welcome to LocalUp Advanced 2015! with David Mihm
9:05-9:30am

Pigeons, Packs, & Paid: Google Local 2015 with Dr. Pete Meyers
In the past year, Google shook the local SEO world with the Pigeon update, rolled out an entirely new local pack, and has aggressively dabbled in local advertising. Dr. Pete covers the year in review, how it’s impacted the local landscape, and what to expect in 2015.

Dr. Pete Meyers is the Marketing Scientist for Moz, where he works with the marketing and data science teams on product research and data-driven content. He’s spent the past two years building research tools to monitor Google, including the MozCast project, and he curates the Google Algorithm History.

Pete Meyers

9:30-9:55am

Local Battlegrounds – Tactics, Trenches, and Ghosts with Mike Blumenthal
Join Professor Maps and take a ride in the Way Back Whacky Machine to look at Google’s technologies, tactics, and play books used to create, shape, and dominate the local ecosystem in their image. Learn what’s relevant to marketing today and how these changes are shaping Google’s coming battles in the space.

If you’re in Local, then you know Mike Blumenthal, and here is your chance to learn from this pioneer in local SEO, whose years of industry research and documentation have earned him the fond and respectful nickname ‘Professor Maps.’ Mike’s blog has been the go-to spot for local SEOs since the early days of Google Maps. It’s safe to say that there are few people on the planet who know more about this area of marketing than Mike. He’s also the co-founder of GetFiveStars, an innovative review and testimonial software. Additionally, Mike loves biking, x-country skiing, and home cooking.

Mike Blumenthal

9:55-10:10am Q&A with Dr. Peter Meyers and Mike Blumenthal
10:10-10:45am

Going Local with Google with Jade Wang
Learn about local search with Google. We’ll chat about the potential of local search and discuss how business information gets on Google.

If you’ve gone to the Google and Your Business Forum for help (and, of course, you have!), then you know how quickly an answer from Google staffer Jade Wang can clear up even the toughest problems. She has been helping business owners get their information listed on Google since joining the team in 2012.

Jade Wang

10:45-11:05am AM Break
11:05-11:25am

Getting Local Keyword Research and On-page Optimization Right with Mary Bowling
Local keyword data is often difficult to find, analyze, and prioritize. Get tips, tools, and processes for zeroing in on the best terms to target when optimizing your website and directory listings, and learn how and why to structure your website around them.

Mary Bowling’s been specializing in SEO and local search since 2003. She works as a consultant at Optimized!, is a partner at a small agency called Ignitor Digital, is a partner in Local U, and is also a trainer and writer for Search Engine News. Mary spends her days interacting directly with local business owners and understands holistic local needs.

Mary Bowling

11:25-11:50am

Local Content + Scale + Creativity = Awesome with Mike Ramsey
If you are wondering who is crushing it with local content and how you can scale such efforts, then tune in as Mike Ramsey walks through ideas, examples, and lessons he has learned along the way.

Mike Ramsey is the president of Nifty Marketing with offices in Burley and Boise, Idaho. He is also a Partner at Local U and many other ventures. Mike has an awesome wife and three kids who put up with all his talk about search.

Mike Ramsey

11:50am-12:15pm

Review Acquisition Strategies That Work with Darren Shaw
Darren Shaw will walk you through multiple real-world examples of businesses that are killing it with review acquisition. He’ll detail exactly how they manage to get so many more reviews than their competitors and how you can use their methods to improve your own local search visibility.

Darren Shaw is the President and Founder of Whitespark, a company that builds software and provides services to help businesses with local search. He’s widely regarded in the local SEO community as an innovator, one whose years of experience working with massive local data sets have given him uncommon insights into the inner workings of the world of citation-building and local search marketing. Darren has been working on the web for over 16 years and loves everything about local SEO.

Darren Shaw

12:15-12:30pm Q&A with Mary Bowling, Mike Ramsey, and Darren Shaw
12:30-1:30pm Lunch
1:30-1:55pm

The Down-Low on LoMo (Local Mobile) SEO with Cindy Krum
Half of all local searches happen on mobile, and that stat is just growing! Map search results are great, but your mobile site has to be great too. Cindy Krum will review the best practices for making your local site look perfect to mobile users and crawlers alike. No mobile site? No problem as you’ll also get tips for how to make the most of mobile searches without one.

Cindy Krum is the CEO and Founder of MobileMoxie, LLC, a mobile marketing consultancy and host of the most cutting-edge online mobile marketing toolset available today. Cindy is the author of Mobile Marketing: Finding Your Customers No Matter Where They Are, published by Que Publishing.

Cindy Krum

1:55-2:20pm

Thriving in the Mobile Ecosystem with Aaron Weiche
A look into the opportunity of creating and growing the mobile experience between your customers and your brand: one strong enough to delight fingers, change minds, and win hearts.

Aaron Weiche is a digital marketing geek focused on web design, mobile, and search marketing. Aaron is the COO of Spyder Trap in Minneapolis, Local U faculty member, founding board member of MnSearch, and a Local Search Ranking Factors Contributor since 2010.

Aaron Weiche

2:20-2:45pm

Content, Conversations, and Conversions with Will Scott
How local businesses, and the marketers who love them, can use social media to bring home the bacon.

Helping small businesses succeed online since 1994, Will Scott has led teams responsible for thousands of websites, hundreds of thousands of pages in online directories, and millions of visits from search. Today, Will leads nearly 100 professionals at Search Influence putting results first and helping customers successfully market online.

Will Scott

2:45-3:10pm

Segmentation Domination with Ed Reese
Learn how to gain powerful insight by creating creative custom segments in Google Analytics. This session shows several real-world examples in action and walks you through the brainstorming, implementation, and discovery process to utilize segmentation like never before.

Ed Reese leads a talented analytics and usability team at his firm Sixth Man Marketing, is a co-founder of Local U, and an adjunct professor of digital marketing at Gonzaga University. In his free time, he optimizes his foosball and disc golf technique and spends time with his wife and two boys.

Ed Reese

3:10-3:30pm PM Break
3:30-4:00pm

Playing to Your Local Strengths with David Mihm
Historically, local search has been one of the most level playing fields on the web with smaller, nimbler businesses having an advantage as larger enterprises struggled to adapt and keep up. Today, companies of both sizes can benefit from tactics that the other simply can’t leverage. David will share some of the most valuable tactics that scale—and don’t scale—in a presentation packed with actionable takeaways, no matter what size business you work with.

David Mihm is one of the world’s leading practitioners of local search engine marketing. He has created and promoted search-friendly websites for clients of all sizes since the early 2000s. David co-founded GetListed.org, which he sold to Moz in November 2012. Since then, he’s served as our Director of Local Search Marketing, imparting his wisdom everywhere!

David Mihm

4:00-4:25pm

Don’t Just Show Up, Stand Out with Dana DiTomaso
Learn how to destroy your competitors by bringing personality to your marketing. Confront the challenges of making HIPPOs comfortable with a unique voice, keep brand standards while injecting some fun, and stay in the forefront of your audience’s mind.

Whether at a conference, on the radio, or in a meeting, Dana DiTomaso likes to impart wisdom to help you turn a lot of marketing BS into real strategies to grow your business. After 10+ years and with a focus on local SMBs, she’s seen (almost) everything. In her spare time, Dana drinks tea and yells at the Hamilton Tiger-Cats.

Dana DiTomaso

4:25-4:40pm Q&A with David Mihm and Dana DiTomaso
4:40-5:20pm

Exposing the Non-Obvious Elements of Local Businesses That Dominate on the Web with Rand Fishkin
In some categories and geographies, a local small business wholly dominates the rankings and visibility across channels. What are the secrets to this success, and how can small businesses with remarkable products/services showcase their traits best online? In this presentation, Rand will dig deep into examples and highlight the recurring elements that help the best of the best stand out.

Rand Fishkin is the founder of Moz. Traveler, blogger, social media addict, feminist, and husband.

Rand Fishkin

And if that doesn’t quite tickle your fancy… Workshops!

We’ll also be hosting workshops with our speakers, which are amazing opportunities for you to dig into your specific questions and issues. I know, sometimes I get a little shy to ask questions in front of a crowd or just want to socialize at the after party, so this is a great opportunity to get direct feedback.

Time Workshop Option A Workshop Option B
1:30-1:55pm

Reporting Q&A with Ed Reese and Dana DiTomaso
Need help with your reporting? Ed and Dana will make sure you’re on the right track and tracking the right things.

Google My Business Q&A with Jade Wang
Google My Business can be confusing, but Jade Wang is here to lend a hand. She’ll look over your specific problems and help you troubleshoot.

1:55-2:20pm

How to Troubleshoot All Things Local with Mike Blumenthal and Mary Bowling
No Local SEO problem can get by the combined powers of Mike and Mary. This dynamic duo will assist you in diving into your specific questions, problems, and concerns.

Google My Business Q&A with Jade Wang
Google My Business can be confusing, but Jade Wang is here to lend a hand. She’ll look over your specific problems and help you troubleshoot.

2:20-2:45pm

Citation Q&A with David Mihm and Darren Shaw
Getting the right citations for your business can be a powerful boost. David and Darren will show you how to wield citations correctly and creatively for your business.

Google My Business Q&A with Jade Wang
Google My Business can be confusing, but Jade Wang is here to lend a hand. She’ll look over your specific problems and help you troubleshoot.

2:45-3:10pm

Mobile Q&A with Aaron Weiche and Cindy Krum
Local and mobile go hand-in-hand, but mobile implementation, optimization, and perfection can be tricky. Aaron and Cindy will help guide you and your business.

Google My Business Q&A with Jade Wang
Google My Business can be confusing, but Jade Wang is here to lend a hand. She’ll look over your specific problems and help you troubleshoot.


See you in February, friends. And please, don’t hesitate to reach out if you have any questions!

Reblogged 5 years ago from moz.com

The Final Word – The Ultimate WP SEO plugin

The Final Word is the first, and one of a kind, WP plugin that lets you optimize content specifically for crawlers/spiders. Our plugin detects visits from …

Reblogged 5 years ago from www.youtube.com