Controlling Search Engine Crawlers for Better Indexation and Rankings – Whiteboard Friday

Posted by randfish

When should you disallow search engines in your robots.txt file, and when should you use meta robots tags in a page header? What about nofollowing links? In today’s Whiteboard Friday, Rand covers these tools and their appropriate use in four situations that SEOs commonly find themselves facing.

For reference, here’s a still of this week’s whiteboard. Click on it to open a high resolution image in a new tab!

Video transcription

Howdy Moz fans, and welcome to another edition of Whiteboard Friday. This week we’re going to talk about controlling search engine crawlers, blocking bots, sending bots where we want, restricting them from where we don’t want them to go. We’re going to talk a little bit about crawl budget and what you should and shouldn’t have indexed.

As a start, what I want to do is discuss the ways in which we can control robots. There are three primary ones: robots.txt, meta robots, and the nofollow tag, though that last one is a little bit less about controlling bots.

There are a few others that we’re going to discuss as well, including Webmaster Tools (Search Console) and URL status codes. But let’s dive into those first few first.

Robots.txt lives at yoursite.com/robots.txt. It tells crawlers what they should and shouldn’t access, but it doesn’t always get respected by Google and Bing. So a lot of folks find that when they say, "hey, disallow this," they suddenly see those URLs popping up anyway and wonder what’s going on. Look—Google and Bing oftentimes think that they just know better. They think that maybe you’ve made a mistake; they think, "hey, there’s a lot of links pointing to this content, there’s a lot of people visiting and caring about this content, maybe you didn’t intend for us to block it." The more specific you get about an individual URL, the better they usually are about respecting it. The less specific, meaning the more you use wildcards or say "everything behind this entire big directory," the worse they are about necessarily believing you.
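
As a quick illustration (the paths below are made up for the example), a robots.txt file might look like this. Note the difference between a very specific disallow and a broad, directory-wide one:

    User-agent: *
    # A specific URL: engines usually respect this
    Disallow: /some-page.html
    # A broad, directory-wide rule: engines are more likely to second-guess it
    Disallow: /big-directory/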

Meta robots—a little different—that lives in the headers of individual pages, so you can only control a single page with a meta robots tag. That tells the engines whether or not they should keep a page in the index, and whether they should follow the links on that page, and it’s usually a lot more respected, because it’s at an individual-page level; Google and Bing tend to believe you about the meta robots tag.

And then the nofollow tag, that lives on an individual link on a page. It doesn’t tell engines where to crawl or not to crawl. All it’s saying is whether you editorially vouch for a page that is being linked to, and whether you want to pass the PageRank and link equity metrics to that page.
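
For reference, a nofollowed link is just an ordinary anchor tag with a rel attribute added (the URL here is a placeholder):

    <a href="http://example.com/some-page" rel="nofollow">some page</a>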

Interesting point about meta robots and robots.txt working together (or not working together so well)—many, many folks in the SEO world do this and then get frustrated.

What if, for example, we take a page like "blogtest.html" on our domain and we say, "All user agents, you are not allowed to crawl blogtest.html"? Okay—that’s a good way to keep that page from being crawled, but just because something is not crawled doesn’t necessarily mean it won’t be in the search results.

So then we have our SEO folks go, “you know what, let’s make doubly sure that doesn’t show up in search results; we’ll put in the meta robots tag:”

<meta name="robots" content="noindex, follow">

So, “noindex, follow” tells the search engine crawler they can follow the links on the page, but they shouldn’t index this particular one.

Then, you go and run a search for “blog test” in this case, and everybody on the team’s like “What the heck!? WTF? Why am I seeing this page show up in search results?”

The answer is, you told the engines that they couldn’t crawl the page, so they didn’t. But they are still putting it in the results. They’re actually probably not going to include a meta description; they might have something like “we can’t include a meta description because of this site’s robots.txt file.” The reason it’s showing up is because they can’t see the noindex; all they see is the disallow.

So, if you want something truly removed, unable to be seen in search results, you can’t just disallow a crawler. You have to say meta “noindex” and you have to let them crawl it.

So this creates some complications. Robots.txt can be great if we’re trying to save crawl bandwidth, but it isn’t necessarily ideal for preventing a page from being shown in the search results. I would not recommend, by the way, that you do what we think Twitter recently tried to do, where they tried to canonicalize www and non-www by saying “Google, don’t crawl the www version of twitter.com.” What you should be doing is rel canonical-ing or using a 301.
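
In other words, rather than a robots.txt disallow, each www page would carry a canonical tag pointing at its non-www equivalent (or the server would 301 the www host to the non-www host). Something like this in the head of the www version of the homepage:

    <link rel="canonical" href="http://yoursite.com/" />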

Meta robots—that can allow crawling and link-following while disallowing indexation, which is great, but it still uses crawl budget (the page has to be crawled for the tag to be seen), even though it does keep the page out of the index.

The nofollow tag, generally speaking, is not particularly useful for controlling bots or conserving indexation.

Webmaster Tools (now Google Search Console) has some special things that allow you to restrict access or remove a result from the search results. For example, if you have 404’d something or if you’ve told them not to crawl something but it’s still showing up in there, you can manually say “don’t do that.” There are a few other crawl protocol things that you can do.

And then URL status codes—these are a valid way to do things, but they’re going to obviously change what’s going on on your pages, too.

If you’re not having a lot of luck using a 404 to remove something, you can use a 410 to permanently remove something from the index. Just be aware that once you use a 410, it can take a long time if you want to get that page re-crawled or re-indexed, and you want to tell the search engines “it’s back!” 410 is permanent removal.

301—permanent redirect, we’ve talked about those here—and 302, temporary redirect.
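
If your site happens to run on Apache, for instance, mod_alias can return all of these directly; the paths below are placeholders:

    # 410: permanently gone; engines tend to drop this faster than a 404
    Redirect gone /discontinued-page.html
    # 301: permanent redirect
    Redirect 301 /old-page.html /new-page.html
    # 302: temporary redirect
    Redirect 302 /sale-page.html /holding-page.html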

Now let’s jump into a few specific use cases of “what kinds of content should and shouldn’t I allow engines to crawl and index” in this next version…

[Rand moves at superhuman speed to erase the board and draw part two of this Whiteboard Friday. Seriously, we showed Roger how fast it was, and even he was impressed.]

Four crawling/indexing problems to solve

So we’ve got these four big problems that I want to talk about as they relate to crawling and indexing.

1. Content that isn’t ready yet

The first one here is around, "If I have content whose quality I’m still trying to improve—it’s not yet ready for primetime, it’s not ready for Google; maybe I have a bunch of products and I only have the descriptions from the manufacturer, and I need people to be able to access them, so I’m rewriting the content and creating unique value on those pages… they’re just not ready yet—what should I do with those?"

My options around crawling and indexing? If I have a large quantity of those—maybe thousands, tens of thousands, hundreds of thousands—I would probably go the robots.txt route. I’d disallow those pages from being crawled, and then eventually as I get (folder by folder) those sets of URLs ready, I can then allow crawling and maybe even submit them to Google via an XML sitemap.
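
Roughly speaking (the folder name is invented for the example), that robots.txt might look like the following, with the Disallow line removed folder by folder as each batch of pages is rewritten and the sitemap submitted:

    User-agent: *
    # Remove this line once the rewritten product pages are ready
    Disallow: /products-in-progress/

    Sitemap: http://yoursite.com/sitemap.xml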

If I’m talking about a small quantity—a few dozen, a few hundred pages—well, I’d probably just use the meta robots noindex, and then I’d pull that noindex off of those pages as they are made ready for Google’s consumption. And then again, I would probably use the XML sitemap and start submitting those once they’re ready.

2. Dealing with duplicate or thin content

What about, “Should I noindex, nofollow, or potentially disallow crawling on largely duplicate URLs or thin content?” I’ve got an example. Let’s say I’m an ecommerce shop, I’m selling this nice Star Wars t-shirt which I think is kind of hilarious, so I’ve got starwarsshirt.html, and it links out to a larger version of an image, and that’s an individual HTML page. It links out to different colors, which change the URL of the page, so I have a gray, blue, and black version. Well, these four pages are really all part of this same one, so I wouldn’t recommend disallowing crawling on these, and I wouldn’t recommend noindexing them. What I would do there is a rel canonical.

Remember, rel canonical is one of those things that can be precluded by disallowing. So, if I were to disallow these from being crawled, Google couldn’t see the rel canonical pointing back to the original, so if someone linked to the blue version instead of the default version, I potentially wouldn’t get link credit for that. So what I really want to do is use the rel canonical, allow the indexing, and allow it to be crawled. If you really feel like it, you could also put a meta "noindex, follow" on these pages, but I don’t really think that’s necessary, and again that might interfere with the rel canonical.
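
Concretely, each of the color-variant pages would carry a canonical tag in its head pointing back at the default version, along these lines (the filename matches the example above):

    <link rel="canonical" href="http://yoursite.com/starwarsshirt.html" />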

3. Passing link equity without appearing in search results

Number three: “If I want to pass link equity (or at least crawling) through a set of pages without those pages actually appearing in search results—so maybe I have navigational stuff, ways that humans are going to navigate through my pages, but I don’t need those appearing in search results—what should I use then?”

What I would say here is, you can use the meta robots to say “don’t index the page, but do follow the links that are on that page.” That’s a pretty nice, handy use case for that.

Do NOT, however, disallow those pages in robots.txt—many, many folks make this mistake. If you disallow crawling on them, Google can’t see the noindex, and they don’t know that they can follow the links. Granted, as we talked about before, sometimes Google doesn’t obey the robots.txt, but you can’t rely on that behavior; trust that the disallow in robots.txt will prevent them from crawling. So I would say the meta robots "noindex, follow" is the way to do this.

4. Search results-type pages

Finally, fourth, “What should I do with search results-type pages?” Google has said many times that they don’t like your search results from your own internal engine appearing in their search results, and so this can be a tricky use case.

Sometimes a search result page—a page that lists many types of results that might come from a database of types of content that you’ve got on your site—could actually be a very good result for a searcher who is looking for a wide variety of content, or who wants to see what you have on offer. Yelp does this: When you say, “I’m looking for restaurants in Seattle, WA,” they’ll give you what is essentially a list of search results, and Google does want those to appear because that page provides a great result. But you should be doing what Yelp does there, and make the most common or popular individual sets of those search results into category-style pages. A page that provides real, unique value, that’s not just a list of search results, that is more of a landing page than a search results page.

However, that being said, if you’ve got a long tail of these, or if you’d say, "hey, our internal search engine is really for internal visitors only—it’s not useful to have those pages show up in search results, and we don’t think we need to make the effort to turn them into category landing pages," then you can use the disallow in robots.txt to prevent those from being crawled.
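
A typical pattern for that looks like the following; the path and parameter name are just examples, so match them to however your internal search URLs are actually built:

    User-agent: *
    # Keep crawlers out of the long tail of internal search result pages
    Disallow: /search/
    Disallow: /*?q=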

Just be cautious here, because I have sometimes seen an over-swinging of the pendulum toward blocking all types of search results, and sometimes that can actually hurt your SEO and your traffic. Sometimes those pages can be really useful to people. So check your analytics, and make sure those aren’t valuable pages that should be served up and turned into landing pages. If you’re sure, then go ahead and disallow all your search results-style pages. You’ll see a lot of sites doing this in their robots.txt file.

That being said, I hope you have some great questions about crawling and indexing, controlling robots, blocking robots, allowing robots, and I’ll try and tackle those in the comments below.

We’ll look forward to seeing you again next week for another edition of Whiteboard Friday. Take care!

Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don’t have time to hunt down but want to read!

Reblogged 4 years ago from tracking.feedpress.it

How does your competitor get so many dot gov links?

Some sites get a lot more government links than others, and it’s a pretty natural thing to ask why. Was it just luck? Do they have some content that is particularly attractive to government sites? Perhaps they’ve undertaken a specific campaign to win such links? These are questions we’ve probably all asked ourselves at some stage –…

The post How does your competitor get so many dot gov links? appeared first on Majestic Blog.

Reblogged 4 years ago from blog.majestic.com

​We Want Your Stories: Accepting MozCon Ignite Pitches

Posted by EricaMcGillivray

We’re thrilled to announce the addition of a networking and Ignite-style event for attendees on Tuesday night at MozCon. For years, you’ve asked us for more networking and relaxing times, and this is what we’ve dreamed up. But we need your help!

We want you to share your stories, passions, and experiences. There are 16—yes, 16—speaking slots. Ignite-style talks are 5 minutes in length and slides auto-advance. That’s right, there’s no going back, and once it’s done, it’s done!

In order to encourage relaxation, none of these talks will be about online marketing. Instead, we want to use this opportunity to get to know our fellow community members better. We want to hear about your passion projects, interests, and the things that fascinate you outside marketing. Tell us about how you spend weekends making support banners for your favorite soccer team or why you mentor high school students, for example.

The basic details

  • To submit, just fill out the form below.
  • Please only submit one talk! We want the one you’re most excited about.
  • Talks cannot be about online marketing.
  • They are only 5 minutes in length, so plan accordingly.
  • If you are already speaking on the MozCon stage, you cannot pitch for this event.
  • Submissions close on Sunday, May 17 at 5pm PDT.
  • Selection decisions are final and will be made in late May / early June.
  • All presentations must adhere to the MozCon Code of Conduct.
  • You must attend MozCon, July 13-15, and the Tuesday night event in person, in Seattle.

If selected, you will get the following

  • 5 minutes on the Tuesday night stage to share with our audience. The event lasts from 7-10pm and will be at Benaroya Hall (where the Seattle Symphony plays).
  • $300 off a regular-priced ticket to MozCon. (If you already purchased yours, we’ll issue a $300 refund for a regular-priced ticket or $100 for an early bird ticket. Discount not available for the super early bird special.)
  • We will work with you to hone your talk!


As we want to ensure every single speaker feels comfortable and gives their best talk possible, Matt Roney and I are here to help you. We’ll review your topic, settle on the title, walk through your presentation with you, and give you a tour of the stage earlier in the evening. While you do the great work, we’re here to help in any way possible.

Unfortunately, we cannot provide travel coverage for these MozCon Ignite speaking slots.

What makes a great pitch

  • Focus on the five minute length.
  • Be passionate about what you’re speaking about. Tell us what’s great about it.
  • For extra credit, include links to videos of you doing public speaking.
  • Follow the guidelines. Yes, the word counts are limited on purpose. Do not submit links to Google Docs, etc. for more information. Tricky and multiple submissions will be disqualified.

We’re all super-excited about these talks, and we can’t wait to hear what you might talk about. Whether you want to tell us about how Frenchies are really English dogs or which coffee shop is the best in Seattle, this is going to be a blast! The amazing Geraldine DeRuiter, known for her travel blogging and witty ways, will be emceeing this event.

If you’re still needing inspiration or a little confused about an Ignite talk, watch Geraldine’s talk from a few years ago about sharing personal news online:

Like our other speaker selections, we have a small committee at Moz running through these topics to get the best variety and fun possible. While we cannot vet your topic, feel free to ask questions in the comments.

Everyone who submits an Ignite pitch will be informed either way. Best of luck!


Haven’t bought your MozCon ticket?

Buy your ticket now


Reblogged 4 years ago from tracking.feedpress.it

​The 3 Most Common SEO Problems on Listings Sites

Posted by Dom-Woodman

Listings sites have a very specific set of search problems that you don’t run into everywhere else. By day I’m one of Distilled’s analysts, but by night I run a job listings site, teflSearch. So, for my first Moz Blog post I thought I’d cover the three search problems with listings sites that I spent far too long agonising about.

Quick clarification time: What is a listings site (i.e. will this post be useful for you)?

The classic listings site is Craigslist, but plenty of other sites act like listing sites:

  • Job sites like Monster
  • E-commerce sites like Amazon
  • Matching sites like Spareroom

1. Generating quality landing pages

The landing pages on listings sites are incredibly important. These pages are usually the primary drivers of converting traffic, and they’re usually generated automatically (or are occasionally custom category pages).

For example, if I search "Jobs in Manchester", you can see nearly every result is an automatically generated landing page or category page.

There are three common ways to generate these pages (occasionally a combination of more than one is used):

  • Faceted pages: These are generated by facets—groups of preset filters that let you filter the current search results. They usually sit on the left-hand side of the page.
  • Category pages: These pages are listings which have already had a filter applied and can’t be changed. They’re usually custom pages.
  • Free-text search pages: These pages are generated by a free-text search box.

Those definitions are still a bit general; let’s clear them up with some examples:

Amazon uses a combination of categories and facets. If you click on browse by department you can see all the category pages. Then on each category page you can see a faceted search. Amazon is so large that it needs both.

Indeed generates its landing pages through free-text search; for example, if we search for "IT jobs in manchester" it will generate: IT jobs in manchester.

teflSearch generates landing pages using just facets. The jobs in China landing page is simply a facet of the main search page.

Each method has its own search problems when used for generating landing pages, so let’s tackle them one by one.


Aside

Facets and free-text search will typically generate pages with parameters; e.g., a search for "dogs" would produce:

www.mysite.com?search=dogs

But to make the URLs user-friendly, sites will often rewrite them to display as folders:

www.mysite.com/results/dogs/

These are still just ordinary free-text searches and facets; the URLs are simply friendlier. (They’re a lot easier to work with in robots.txt, too!)
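
For example, using the hypothetical URLs above, the folder version is much easier to target cleanly in robots.txt than the parameter version:

    User-agent: *
    # Parameter-style search URLs need a wildcard pattern
    Disallow: /*?search=
    # Folder-style search URLs can be blocked by path
    Disallow: /results/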


Free search (& category) problems

If you’ve decided the base of your search will be a free-text search, then you’ll have two major goals:

  • Goal 1: Helping search engines find your landing pages
  • Goal 2: Giving them link equity.

Solution

Search engines won’t use search boxes and so the solution to both problems is to provide links to the valuable landing pages so search engines can find them.

There are plenty of ways to do this, but two of the most common are:

  • Category links alongside a search

    Photobucket uses a free-text search to generate pages, but if we look at an example search for photos of dogs, we can see the categories which define the landing pages along the right-hand side. (This is also an example of URL-friendly searches!)

  • Putting the main landing pages in a top-level menu

    Indeed also uses free text to generate landing pages, and they have a browse jobs section which contains the URL structure to allow search engines to find all the valuable landing pages.

Breadcrumbs are also often used in addition to the two methods above, and in both of the examples you’ll find breadcrumbs that reinforce that hierarchy.

Category (& facet) problems

Categories, because they tend to be custom pages, don’t actually have many search disadvantages. Instead it’s the other attributes that make them more or less desirable. You can create them for the purposes you want and so you typically won’t have too many problems.

However, if you also use a faceted search in each category (like Amazon) to generate additional landing pages, then you’ll run into all the problems described in the next section.

At first, facets seem great: an easy way to generate multiple strong, relevant landing pages without doing much at all. The problems appear because people don’t put limits on facets.

Let’s take the job page on teflSearch. We can see it has 18 facets, each with many options. Some of these options will generate useful landing pages:

The China facet in countries will generate "Jobs in China", and that’s a useful landing page.

On the other hand, the “Conditional Bonus” facet will generate “Jobs with a conditional bonus,” and that’s not so great.

We can also see that the options within a single facet aren’t always useful. As of writing, I have a single job available in Serbia. That’s not a useful search result, and the poor user engagement combined with the tiny amount of content will be a strong signal to Google that it’s thin content. Depending on the scale of your site it’s very easy to generate a mass of poor-quality landing pages.

Facets generate other problems too. The primary one is that they can create a huge amount of duplicate content and pages for search engines to get lost in. This is caused by two things: the first is the sheer number of possibilities they generate, and the second is that selecting facets in different orders creates identical pages with different URLs.

We end up with four goals for our facet-generated landing pages:

  • Goal 1: Make sure our searchable landing pages are actually worth landing on, and that we’re not handing a mass of low-value pages to the search engines.
  • Goal 2: Make sure we don’t generate multiple copies of our automatically generated landing pages.
  • Goal 3: Make sure search engines don’t get caught in the metaphorical plastic six-pack rings of our facets.
  • Goal 4: Make sure our landing pages have strong internal linking.

The first goal needs to be set internally; you’re always going to be the best judge of the number of results that need to be present on a page in order for it to be useful to a user. I’d argue you can rarely ever go below three, but it depends both on your business and on how much content fluctuates on your site, as the useful landing pages might also change over time.

We can solve the next three problems as a group. There are several possible solutions depending on what skills and resources you have access to; here are two possible solutions:

Category/facet solution 1: Blocking the majority of facets and providing external links
  • Easiest method
  • Good if your valuable category pages rarely change and you don’t have too many of them.
  • Can be problematic if your valuable facet pages change a lot

Nofollow all your facet links, and noindex and block (via robots.txt) the category pages which aren’t valuable or which sit deeper than x facet/folder levels into your search.

You set x by looking at where your useful facet pages with search volume exist. So, for example, if you have three facets for televisions: manufacturer, size, and resolution, and even combinations of all three have multiple results and search volume, then you could index everything up to three levels.

On the other hand, if people are searching for three levels (e.g. “Samsung 42″ Full HD TV”) but you only have one or two results for three-level facets, then you’d be better off indexing two levels and letting the product pages themselves pick up long-tail traffic for the third level.

If you have valuable facet pages that exist deeper than one facet or folder into your search, then this creates some duplicate content problems, which are dealt with in the aside "Indexing more than one level of facets" below.
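
As a rough sketch of the robots.txt half of this (assuming folder-style facet URLs that live under a /search/ path and end in a trailing slash, which is an invented structure for the example), indexing up to two facet levels and blocking anything deeper could look like:

    User-agent: *
    # /search/facet1/facet2/ stays crawlable; three or more facet levels are blocked
    Disallow: /search/*/*/*/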

The immediate problem with this set-up, however, is that in one stroke we’ve removed most of the internal links to our category pages, and because all the facet links are nofollowed, search engines won’t be able to find your valuable category pages.

In order to re-create the linking, you can add a top-level drop-down menu to your site containing the most valuable category pages, add category links elsewhere on the page, or create a separate part of the site with links to the valuable category pages.

You can see the top-level drop-down menu on teflSearch (it’s the "search jobs" menu); the other two approaches are demonstrated by Photobucket and Indeed, respectively, in the previous section.

The big advantage of this method is how quick it is to implement: it doesn’t require any fiddly internal logic, and adding an extra menu option is usually minimal effort.

Category/facet solution 2: Creating internal logic to work with the facets

  • Requires new internal logic
  • Works for large numbers of category pages with value that can change rapidly

There are four parts to the second solution:

  1. Select valuable facet categories and allow those links to be followed. Nofollow the rest.
  2. Noindex all pages that return a number of items below the threshold for a useful landing page.
  3. Nofollow all facets on pages with a search depth greater than x.
  4. Block all facet pages deeper than x levels in robots.txt.

As with the last solution, x is set by looking at where your useful facet pages exist that have search volume (full explanation in the first solution), and if you’re indexing more than one level you’ll need to check out the aside below to see how to deal with the duplicate content it generates.
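
Here is a rough sketch of what that internal logic might look like. The thresholds, facet names, and function shapes below are invented for illustration; this is not teflSearch’s actual code, just one way of wiring up steps 1 to 3 (step 4 lives in robots.txt, as in the first solution):

    MIN_RESULTS = 3        # below this, a facet page is too thin to be worth indexing
    MAX_FOLLOW_DEPTH = 2   # "x": the deepest facet level whose links should still be followed
    VALUABLE_FACETS = {"country", "city"}  # facet categories with real search volume

    def facet_link_rel(facet_category):
        """Step 1: follow links into valuable facet categories, nofollow the rest."""
        return "" if facet_category in VALUABLE_FACETS else "nofollow"

    def meta_robots(result_count, facet_depth):
        """Steps 2 and 3: noindex thin pages, and stop following links on pages
        deeper than x (a page-level approximation of nofollowing their facets)."""
        index = "index" if result_count >= MIN_RESULTS else "noindex"
        follow = "follow" if facet_depth <= MAX_FOLLOW_DEPTH else "nofollow"
        return index + ", " + follow

    # A depth-2 facet page with a single result gets "noindex, follow"
    print(meta_robots(result_count=1, facet_depth=2))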


Aside: Indexing more than one level of facets

If you want more than one level of facets to be indexable, then this will create certain problems.

Suppose you have a facet for size:

  • Televisions: Size: 46″, 44″, 42″

And want to add a brand facet:

  • Televisions: Brand: Samsung, Panasonic, Sony

This will create duplicate content because the search engines will be able to follow your facets in both orders, generating:

  • Television – 46″ – Samsung
  • Television – Samsung – 46″

You’ll have to either rel canonical your duplicate pages with another rule or set up your facets so they create a single unique URL.
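
One way to guarantee a single unique URL is to sort the selected facets into a fixed order before building the path. Here is a minimal sketch with invented facet names and URL structure:

    FACET_ORDER = ["brand", "size"]  # the one canonical ordering of URL segments

    def facet_url(selected):
        """Build the same URL regardless of the order the facets were clicked in."""
        segments = [selected[facet] for facet in FACET_ORDER if facet in selected]
        return "/televisions/" + "/".join(segments) + "/"

    # Both click orders collapse to /televisions/samsung/46-inch/
    print(facet_url({"size": "46-inch", "brand": "samsung"}))
    print(facet_url({"brand": "samsung", "size": "46-inch"}))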

You also need to be aware that each followable facet you add will multiply with every other followable facet, and it’s very easy to generate a mass of pages for search engines to get stuck in. Depending on your setup, you might need to block more paths in robots.txt or set up more logic to prevent them from being followed.

Letting search engines index more than one level of facets adds a lot of possible problems; make sure you’re keeping track of them.


2. User-generated content cannibalization

This is a common problem for listings sites (assuming they allow user-generated content). If you’re reading this as an e-commerce site that only lists its own products, you can skip this one.

As we covered in the first area, category pages on listings sites are usually the landing pages aiming for the valuable search terms, but as your users start generating pages they can often create titles and content that cannibalise your landing pages.

Suppose you’re a job site with a category page for PHP Jobs in Greater Manchester. If a recruiter then creates a job advert for PHP Jobs in Greater Manchester for the 4 positions they currently have, you’ve got a duplicate content problem.

This is less of a problem when your site is large and your categories are mature; it will be obvious to any search engine which pages are your high-value category pages. But at the start, when you’re lacking authority and individual listings might contain more relevant content than your own search pages, this can be a problem.

Solution 1: Create structured titles

Set the <title> differently from the on-page title. Depending on the variables you have available, you can set the title tag programmatically, without changing the on-page title, using other information given by the user.

For example, on our imaginary job site, suppose the recruiter also provided the following information in other fields:

  • The no. of positions: 4
  • The primary area: PHP Developer
  • The name of the recruiting company: ABC Recruitment
  • Location: Manchester

We could set the <title> pattern to be: *No of positions* *The primary area* with *recruiter name* in *Location*, which would give us:

4 PHP Developers with ABC Recruitment in Manchester
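
Programmatically, that title pattern might be as simple as the following sketch (the field names are the imaginary ones above, and the pluralisation is deliberately naive):

    def listing_title(positions, role, recruiter, location):
        """*No of positions* *primary area* with *recruiter name* in *location*."""
        return "{0} {1}s with {2} in {3}".format(positions, role, recruiter, location)

    print(listing_title(4, "PHP Developer", "ABC Recruitment", "Manchester"))
    # -> 4 PHP Developers with ABC Recruitment in Manchester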

Setting a <title> tag allows you to target long-tail traffic by constructing detailed descriptive titles. In our above example, imagine the recruiter had specified “Castlefield, Manchester” as the location.

All of a sudden, you’ve got a perfect opportunity to pick up long-tail traffic for people searching in Castlefield in Manchester.

On the downside, you lose the ability to pick up long-tail traffic where your users have chosen keywords you wouldn’t have used.

For example, suppose Manchester has a jobs program called “Green Highway.” A job advert title containing “Green Highway” might pick up valuable long-tail traffic. Being able to discover this, however, and find a way to fit it into a dynamic title is very hard.

Solution 2: Use regex to noindex the offending pages

Perform a regex (or string contains) search on your listings’ titles and noindex the ones which cannibalise your main category pages.

If it’s not possible to construct titles with variables, or your users provide a lot of additional long-tail traffic with their own titles, then this is a great option. On the downside, you miss out on possible structured long-tail traffic that you might’ve been able to aim for.
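
As a sketch (the category phrase and listing title below are invented), the check can be as simple as a list of patterns matched against each listing’s title:

    import re

    # Target phrases of the category landing pages we don't want cannibalised
    CATEGORY_PATTERNS = [
        re.compile(r"php jobs? in greater manchester", re.IGNORECASE),
    ]

    def should_noindex(listing_title):
        """True if a user-written title collides with a category landing page."""
        return any(pattern.search(listing_title) for pattern in CATEGORY_PATTERNS)

    print(should_noindex("PHP Jobs in Greater Manchester - 4 positions"))  # True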

Solution 3: De-index all your listings

It may seem rash, but if you’re a large site with a huge number of very similar or low-content listings, you might want to consider this. There is no common standard: some sites like Indeed choose to noindex all their job adverts, whereas other sites like Craigslist index all their individual listings because they’ll drive long-tail traffic.

Don’t de-index them all lightly!

3. Constantly expiring content

Our third and final problem is that user-generated content doesn’t last forever. Particularly on listings sites, it’s constantly expiring and changing.

For most use cases I’d recommend 301’ing expired content to a relevant category page, with a message triggered by the redirect notifying the user of why they’ve been redirected. It typically comes out as the best combination of search and UX.
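
Here is a minimal sketch of that pattern, using Flask purely as an example framework; the route, the in-memory "database," and the ?expired=1 convention are all invented for illustration:

    from flask import Flask, redirect

    app = Flask(__name__)

    # Stand-in for a real datastore
    JOBS = {1: {"title": "PHP Developer", "category": "php", "expired": True}}

    @app.route("/jobs/<int:job_id>")
    def job_page(job_id):
        job = JOBS.get(job_id)
        if job is None or job["expired"]:
            # 301 to the relevant category page; the category template can read
            # ?expired=1 and show a note explaining why the user was redirected.
            category = job["category"] if job else ""
            return redirect("/jobs/" + category + "?expired=1", code=301)
        return job["title"]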

For more information or advice on how to deal with the edge cases, there’s a previous Moz blog post on how to deal with expired content which I think does an excellent job of covering this area.

Summary

In summary, if you’re working with listings sites, all three of the following need to be kept in mind:

  • How are the landing pages generated? If they’re generated using free text or facets, have the potential problems been solved?
  • Is user-generated content cannibalising the main landing pages?
  • How has constantly expiring content been dealt with?

Good luck listing, and if you’ve had any other tricky problems or solutions you’ve come across working on listings sites, let’s chat about them in the comments below!


Reblogged 4 years ago from tracking.feedpress.it

Spam Score: Moz’s New Metric to Measure Penalization Risk

Posted by randfish

Today, I’m very excited to announce that Moz’s Spam Score, an R&D project we’ve worked on for nearly a year, is finally going live. In this post, you can learn more about how we’re calculating spam score, what it means, and how you can potentially use it in your SEO work.

How does Spam Score work?

Over the last year, our data science team, led by Dr. Matt Peters, examined a great number of potential factors that predicted that a site might be penalized or banned by Google. We found strong correlations with 17 unique factors we call "spam flags," and turned them into a score.

Almost every subdomain in Mozscape (our web index) now has a Spam Score attached to it, and this score is viewable inside Open Site Explorer (and soon, the MozBar and other tools). The score is simple; it just records the quantity of spam flags the subdomain triggers. Our correlations showed that no particular flag was more likely than others to mean a domain was penalized/banned in Google, but firing many flags had a very strong correlation (you can see the math below).

Spam Score currently operates only on the subdomain level—we don’t have it for pages or root domains. It’s been my experience and the experience of many other SEOs in the field that a great deal of link spam is tied to the subdomain-level. There are plenty of exceptions—manipulative links can and do live on plenty of high-quality sites—but as we’ve tested, we found that subdomain-level Spam Score was the best solution we could create at web scale. It does a solid job with the most obvious, nastiest spam, and a decent job highlighting risk in other areas, too.

How to access Spam Score

Right now, you can find Spam Score inside Open Site Explorer, both in the top metrics (just below domain/page authority) and in its own tab labeled "Spam Analysis." Spam Score is only available for Pro subscribers right now, though in the future, we may make the score in the metrics section available to everyone (if you’re not a subscriber, you can check it out with a free trial).

The current Spam Analysis page includes a list of subdomains or pages linking to your site. You can toggle the target to look at all links to a given subdomain on your site, given pages, or the entire root domain. You can further toggle source tier to look at the Spam Score for incoming linking pages or subdomains (but in the case of pages, we’re still showing the Spam Score for the subdomain on which that page is hosted).

You can click on any Spam Score row and see the details about which flags were triggered. We’ll bring you to a page like this:

Back on the original Spam Analysis page, at the very bottom of the rows, you’ll find an option to export a disavow file, which is compatible with Google Webmaster Tools. You can choose to filter the file to contain only those sites with a given spam flag count or higher:

Disavow exports usually take less than 3 hours to finish. We can send you an email when it’s ready, too.

WARNING: Please do not export this file and simply upload it to Google! You can really, really hurt your site’s ranking and there may be no way to recover. Instead, carefully sort through the links therein and make sure you really do want to disavow what’s in there. You can easily remove/edit the file to take out links you feel are not spam. When Moz’s Cyrus Shepard disavowed every link to his own site, it took more than a year for his rankings to return!

We’ve actually made the file not-wholly-ready for upload to Google in order to be sure folks aren’t too cavalier with this particular step. You’ll need to open it up and make some edits (specifically to lines at the top of the file) in order to ready it for Webmaster Tools.

In the near future, we hope to have Spam Score in the Mozbar as well, which might look like this: 

Sweet, right? 🙂

Potential use cases for Spam Analysis

This list probably isn’t exhaustive, but these are a few of the ways we’ve been playing around with the data:

  1. Checking for spammy links to your own site: Almost every site has at least a few bad links pointing to it, but it’s been hard to know how much or how many potentially harmful links you might have until now. Run a quick spam analysis and see if there’s enough there to cause concern.
  2. Evaluating potential links: This is a big one where we think Spam Score can be helpful. It’s not going to catch every potentially bad link, and you should certainly still use your brain for evaluation too, but as you’re scanning a list of link opportunities or surfing to various sites, having the ability to see if they fire a lot of flags is a great warning sign.
  3. Link cleanup: Link cleanup projects can be messy, involved, precarious, and massively tedious. Spam Score might not catch everything, but sorting links by it can be hugely helpful in identifying potentially nasty stuff, and in filtering out the links that are probably clean.
  4. Disavow Files: Again, because Spam Score won’t perfectly catch everything, you will likely need to do some additional work here (especially if the site you’re working on has done some link buying on more generally trustworthy domains), but it can save you a heap of time evaluating and listing the worst and most obvious junk.

Over time, we’re also excited about using Spam Score to help improve the PA and DA calculations (it’s not currently in there), as well as adding it to other tools and data sources. We’d love your feedback and insight about where you’d most want to see Spam Score get involved.

Details about Spam Score’s calculation

This section comes courtesy of Moz’s head of data science, Dr. Matt Peters, who created the metric and deserves (at least in my humble opinion) a big round of applause. – Rand

Definition of “spam”

Before diving into the details of the individual spam flags and their calculation, it’s important to first describe our data gathering process and “spam” definition.

For our purposes, we followed Google’s definition of spam and gathered labels for a large number of sites as follows.

  • First, we randomly selected a large number of subdomains from the Mozscape index stratified by mozRank.
  • Then we crawled the subdomains and threw out any that didn’t return a “200 OK” (redirects, errors, etc).
  • Finally, we collected the top 10 de-personalized, geo-agnostic Google-US search results using the full subdomain name as the keyword and checked whether any of those results matched the original keyword. If they did not, we called the subdomain “spam,” otherwise we called it “ham.”

We performed the most recent data collection in November 2014 (after the Penguin 3.0 update) for about 500,000 subdomains.

Relationship between number of flags and spam

The overall Spam Score is currently an aggregate of 17 different "flags." You can think of each flag as a potential "warning sign" that signals that a site may be spammy. The overall likelihood of spam increases as a site accumulates more and more flags, so the total number of flags is a strong predictor of spam. Accordingly, the flags are designed to be used together—no single flag, or even a few flags, is cause for concern (and indeed most sites will trigger at least a few flags).

The following table shows the relationship between the number of flags and percent of sites with those flags that we found Google had penalized or banned:

ABOVE: The overall probability of spam vs. the number of spam flags. Data collected in Nov. 2014 for approximately 500K subdomains. The table also highlights the three overall danger levels: low/green (<10%), moderate/yellow (10-50%), and high/red (>50%).

The overall spam percent, averaged across a large number of sites, increases in lockstep with the number of flags; however, there are outliers in every category. For example, there are a small number of sites with very few flags that are tagged as spam by Google, and conversely a small number of sites with many flags that are not spam.

Spam flag details

The individual spam flags capture a wide range of spam signals: link profiles, anchor text, on-page signals, and properties of the domain name. At a high level, the process to determine the spam flags for each subdomain is:

  • Collect link metrics from Mozscape (mozRank, mozTrust, number of linking domains, etc).
  • Collect anchor text metrics from Mozscape (top anchor text phrases sorted by number of links)
  • Collect the top five pages by Page Authority on the subdomain from Mozscape
  • Crawl the top five pages plus the home page and process to extract on page signals
  • Provide the output for Mozscape to include in the next index release cycle

Since the spam flags are incorporated into the Mozscape index, fresh data is released with each new index. Right now, we crawl and process the spam flags for each subdomain every two to three months, although this may change in the future.

Link flags

The following table lists the link and anchor text related flags with the odds ratio for each flag. For each flag, we can compute two percents: the percent of sites with that flag that are penalized by Google and the percent of sites with that flag that were not penalized. The odds ratio is the ratio of these percents and gives the increase in likelihood that a site is spam if it has the flag. For example, the first row says that a site with this flag is 12.4 times more likely to be spam than one without the flag.

ABOVE: Description and odds ratio of link and anchor text related spam flags. The odds ratio gives the overall increase in spam likelihood if the flag is present.
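
As an illustration of that calculation, here is a sketch using the conventional odds-ratio formula; the counts are invented (chosen only so the result lands near the 12.4 in the first row) and this is not Moz’s exact code:

    def odds_ratio(spam_with_flag, ham_with_flag, spam_without_flag, ham_without_flag):
        """Odds of spam among flagged sites divided by odds of spam among unflagged sites."""
        odds_with_flag = spam_with_flag / float(ham_with_flag)
        odds_without_flag = spam_without_flag / float(ham_without_flag)
        return odds_with_flag / odds_without_flag

    print(round(odds_ratio(500, 250, 4500, 28000), 1))  # -> 12.4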

Working down the table, the flags are:

  • Low mozTrust to mozRank ratio: Sites with low mozTrust compared to mozRank are likely to be spam.
  • Large site with few links: Large sites with many pages tend to also have many links and large sites without a corresponding large number of links are likely to be spam.
  • Site link diversity is low: If a large percentage of links to a site are from a few domains it is likely to be spam.
  • Ratio of followed to nofollowed subdomains/domains (two separate flags): Sites with a large number of followed links relative to nofollowed are likely to be spam.
  • Small proportion of branded links (anchor text): Organically occurring links tend to contain a disproportionate amount of branded keywords. If a site does not have a lot of branded anchor text, it’s a signal the links are not organic.

On-page flags

Similar to the link flags, the following table lists the on-page and domain name related flags:

ABOVE: Description and odds ratio of on-page and domain name related spam flags. The odds ratio gives the overall increase in spam likelihood if the flag is present.

  • Thin content: If a site has a relatively small ratio of content to navigation chrome, it’s likely to be spam.
  • Site mark-up is abnormally small: Non-spam sites tend to invest in rich user experiences with CSS, Javascript and extensive mark-up. Accordingly, a large ratio of text to mark-up is a spam signal.
  • Large number of external links: A site with a large number of external links may look spammy.
  • Low number of internal links: Real sites tend to link heavily to themselves via internal navigation and a relative lack of internal links is a spam signal.
  • Anchor text-heavy page: Sites with a lot of anchor text are more likely to be spam than those with more content and fewer links.
  • External links in navigation: Spam sites may hide external links in the sidebar or footer.
  • No contact info: Real sites prominently display their social and other contact information.
  • Low number of pages found: A site with only one or a few pages is more likely to be spam than one with many pages.
  • TLD correlated with spam domains: Certain TLDs are more spammy than others (e.g. pw).
  • Domain name length: A long subdomain name like “bycheapviagra.freeshipping.onlinepharmacy.com” may indicate keyword stuffing.
  • Domain name contains numerals: Domain names with numerals may be automatically generated and are therefore more likely to be spam.

If you’d like some more details on the technical aspects of the spam score, check out the video of Matt’s 2012 MozCon talk about Algorithmic Spam Detection, or the slides (many of the details have evolved, but the overall ideas are the same):

We’d love your feedback

As with all metrics, Spam Score won’t be perfect. We’d love to hear your feedback and ideas for improving the score, as well as what you’d like to see from its in-product application in the future. Feel free to leave comments on this post, or to email Matt (matt at moz dot com) and me (rand at moz dot com) privately with any suggestions.

Good luck cleaning up and preventing link spam!




Reblogged 4 years ago from tracking.feedpress.it

How You Can Build a Meaningful Brand

Posted by Hannah_Smith

Earlier this year I wrote a post about the future of marketing. In it, I made a handful of predictions; arguably the most ‘out there’ of which was this: in the future, only brands which ‘mean something’ to consumers will survive.

In today’s post I’ll be exploring what it takes to become a meaningful brand, and how you might go about building one.

SEO is not dead

Just so we’re clear, I’m not saying SEO is dead, or that organic search is not an important channel 🙂 These stats speak for themselves:

However, what we’re being asked to do as SEOs is constantly evolving. 

It used to be that you could build a very successful business online just by being great at SEO. But today, the SERPs are changing, and ranking first doesn’t mean what it used to:

The BBC still rank first organically for ‘weather’—but their listing is pushed beneath the fold. Plus, given that the information the searcher is seeking is displayed right there in the SERP, I’m guessing they’re not receiving as much traffic from this term as they once were.

But it’s not just informational queries:

Skyscanner still rank first for the term ‘flights to paris’, but again here their organic listing is pushed beneath the fold thanks to paid search listings and the proprietary Google flight product.

Google is even going so far as to show its proprietary products against branded searches (hat-tip to Barry Adams for pointing this out):

MoneySuperMarket’s organic listing is above the fold, but Google is nonetheless being very aggressive.

As a consequence of these changes, as SEOs, we’re being asked to do different things. Clients of yesteryear used to say things like:

Get us links!

But today they’re saying things like:

Get us press coverage, social shares and exposure [links] on sites our target audience reads.

Whilst they may not explicitly be asking us to build a brand, nonetheless much of what we do today looks a lot like brand building. But where do we start?

What does ‘brand’ mean?

Before we kick off I think it’s worth exploring what brand really means. We have a tendency to use ‘brand’ and ‘company’ or ‘organisation’ interchangeably, but in reality they are two distinctly different things.

Here’s a definition:

brand – to impress firmly; fix ineradicably; place indelibly

Therefore a brand is not a brand unless it leaves a lasting impression, and of course, it needs to be a favourable impression. Essentially companies or organisations need to build brands that mean something to people.

However, right now companies and organisations are struggling to do this effectively:

“In Europe and the US, consumers would not care if 92% of brands ceased to exist” (source)

That means that consumers would only miss 8% of brands. 

Clearly we have a mountain to climb. How do we go about building meaningful brands? Particularly on SEO retainer budgets?

You can learn a lot by deconstructing the success of others

Like many in the search industry, I’m a fan of taking stuff apart to figure out how it works. So, when trying to figure out how to go about building a meaningful brand, I started by looking at what meaningful brands are doing right now.

I uncovered three core principles—some meaningful brands do all three; some just do one or two—I’ll deal with them each in turn.

1) Meaningful brands find opportunities to delight customers

Most people’s interactions with brands suck. But great interactions stand out and are shared. Let’s take a look at some examples:

@smartcarusa

Here’s how @smartcarusa responded when someone suggested that a single bird dropping would total one of their cars:

Now, the takeaway here is not to rush out and make a bunch of infographics on disparate topics. Out of context, the infographic is neither remarkable nor particularly interesting, and I don’t think it would have garnered coverage had it not been created in response to this tweet. But I think a lesson we can take from this is that going the extra mile to respond in a novel way can yield out-sized returns.

@ArgosHelpers

This is how @ArgosHelpers responded to a customer asking when PS4s would be back in stock:

The takeaway here is not that people love brands who use slang—I think this is actually a very artfully worded response. See how the brand has taken care to use the same language as their customer without being in any way condescending? That’s what you need to shoot for.

@TescoMobile

This is how @TescoMobile responded when someone described their network as a ‘turn off’:

Whoa! 

The lesson here is definitely not ‘be a dick to people who are dicks to you’; I think the lesson here is that a well-judged, cheeky response can travel.

Ultimately you need to tread carefully if you want to use this type of tactic. I think @TescoMobile got away with this one—but it is really close to the line. To do this sort of thing you need to have a deep understanding of your audience—what’s considered funny and what’s just plain rude? This can vary hugely depending on the niche you’re working in and the public perception of your brand.

Moreover, if you’re a brand engaging in this sort of activity, you need to consider not only your own response, but the potential response from your audience, too. Some brands have an army of loyal advocates. But if brands aren’t careful, they may unwittingly encourage said army to attack an individual with a response like this.

Of course it’s not just interactions that have the capacity to delight—sometimes being nimble is enough:

@Arbys

When Pharrell turned up to the GRAMMYs wearing *that hat* here’s how @Arbys responded:

The takeaway here is not that you need a bit of luck, instead it’s that you need to be ready, willing and able to take advantage of opportunities as and when they arise. I think that if @Arbys hadn’t tweeted that, then someone else would have done and their brand wouldn’t have benefited.

Hopefully you can see where I’m going with this; let’s move on to principle two:

2) Meaningful brands give people the ability to define themselves to others

Have you ever thought about why you share what you share on social media? Most of us don’t think about it too much, but The New York Times did a study on the psychology of sharing in which 68% of respondents said they share things via social media to give others a better sense of who they are and what they care about.

For example, I might share an article from hbr.org because I want you to think I’m the sort of person who reads Harvard Business Review. Or I might share an Oatmeal comic because I want you to think I have an excellent sense of humour. I might share something about the Lean In movement because I want to let you know where I stand on important issues.

If you’re seeking to create a meaningful brand, this can be an excellent space to play in, because brands can give people the ability to define themselves to others. Now, I don’t necessarily mean by creating content like this which literally allows people to define themselves:

Brands can also help people define themselves by creating things people ‘look good’ sharing—let’s take a look at some examples:

GE’s #6SecondScience

The takeaway here is to create things which are tangentially related to your brand, that people ‘look good’ sharing. When people shared this content they were sharing stuff that was more than just ‘cool’—by sharing this content they were also able to express their enthusiasm for science.

In a similar vein meaningful brands create commercials that don’t feel like commercials—again, these are things that people ‘look good’ sharing:

Wren’s First Kiss

This film definitely got people talking. To date it’s received over 94 million YouTube views and coverage on over 1300 sites. But this isn’t just a video content play…

Oreo

When Oreo turned 100, they created 100 pieces of content over 100 days:

This campaign got over 1m Facebook ‘likes’ and thousands of pieces of press coverage. 

But actually, I think the smartest thing about this campaign was that it was highly topical content which put the cookie right in the centre of people’s conversations without being self-serving.

Still with me? Let’s move on to principle three:

3) Meaningful brands stand for something above and beyond their products or services

This is difficult to explain in the abstract, so I’m going to shoot straight to some examples.

BrewDog

BrewDog is a craft beer company. Their brand values are drawn from punk subculture—they’re anti-establishment and believe in individual freedom. So when Dead Pony Club ale was ‘banned’ because the phrase “rip it up down empty streets” was printed on the label, their response was to issue a press release apologising for ‘not giving a shit’ over the marketing rules breach. Their fans loved their response:

The takeaway here isn’t that sweary press releases get attention (although they undoubtedly do)—by refusing to take the ruling lying down BrewDog showed people they were a brand which stood for something beyond great beer.

Nike

A core value for Nike is “if you have a body, you are an athlete”, and these values have inspired incredible creative like this:

I think that this advert is powerful because Nike isn’t talking about how their trainers enhance your performance, they’re talking about celebrating everyone’s athletic endeavours. It’s about more than just their products. 

OKCupid

I think that taking the decision to stand for something is perhaps most potent when it could actually cost a brand customers. When Mozilla appointed a new CEO, OKCupid showed this message to Firefox users:

They went on to say:

The takeaway here is not ‘align your brand with a cause and win the Internet’, but rather that taking a bold stance on a relevant issue, even if it could actually hurt your business, can create a lasting impression.

What do I mean by ‘actually hurt your business’? Sadly, not everyone believes in equal rights for gay couples; as such, taking this stance could cost OKCupid customers.

Using these principles day-to-day

The reality for me is that right now, much of this I can’t affect—sadly no clients have dropped several million into my lap and asked me to create them an ad like Nike’s 🙂

That said, I do think that it’s helped me to clarify my thinking on what it means to be a meaningful brand and how to figure out how to get there. At Distilled (the company that is good enough to employ me), the place we play most frequently is principle two—we create content which allows people to define themselves to others; things that people ‘look good’ sharing.

Perhaps more importantly, we’re taking the time to understand the companies we’re working with better so that our creative work is better aligned with their brand values. 

And so, dear reader, over to you—I’d love to hear what you think it takes to build a meaningful brand, and what’s working (and not working) for you. Do let me know via the comments.

This post is based on a session I presented at SearchLove; those who are interested can view the full deck below:


Reblogged 5 years ago from moz.com