Why Effective, Modern SEO Requires Technical, Creative, and Strategic Thinking – Whiteboard Friday

Posted by randfish

There’s no doubt that quite a bit has changed about SEO, and that the field is far more integrated with other aspects of online marketing than it once was. In today’s Whiteboard Friday, Rand pushes back against the idea that effective modern SEO doesn’t require any technical expertise, outlining a fantastic list of technical elements that today’s SEOs need to know about in order to be truly effective.

For reference, here’s a still of this week’s whiteboard. Click on it to open a high resolution image in a new tab!

Video transcription

Howdy, Moz fans, and welcome to another edition of Whiteboard Friday. This week I’m going to do something unusual. I don’t usually point out these inconsistencies or sort of take issue with other folks’ content on the web, because I generally find that that’s not all that valuable and useful. But I’m going to make an exception here.

There is an article by Jayson DeMers, who I think might actually be here in Seattle — maybe he and I can hang out at some point — called “Why Modern SEO Requires Almost No Technical Expertise.” It was an article that got a shocking amount of traction and attention. On Facebook, it has thousands of shares. On LinkedIn, it did really well. On Twitter, it got a bunch of attention.

Some folks in the SEO world have already pointed out some issues around this. But because of the increasing popularity of this article, and because there’s this hopefulness from folks outside of the hardcore SEO world who are looking to this piece and going, “Look, this is great. We don’t have to be technical. We don’t have to worry about technical things in order to do SEO,” I wanted to respond to it here.

Look, I completely get the appeal of that. I did want to point out some of the reasons why this is not so accurate. At the same time, I don’t want to rain on Jayson, because I think that it’s very possible he’s writing an article for Entrepreneur, maybe he has sort of a commitment to them. Maybe he had no idea that this article was going to spark so much attention and investment. He does make some good points. I think it’s just really the title and then some of the messages inside there that I take strong issue with, and so I wanted to bring those up.

First off, some of the good points he did bring up.

One, he wisely says, “You don’t need to know how to code or to write and read algorithms in order to do SEO.” I totally agree with that. If today you’re looking at SEO and you’re thinking, “Well, am I going to get more into this subject? Am I going to try investing in SEO? But I don’t even know HTML and CSS yet.”

Those are good skills to have, and they will help you in SEO, but you don’t need them. Jayson’s totally right. You don’t have to have them, and you can learn and pick up some of these things, and do searches, watch some Whiteboard Fridays, check out some guides, and pick up a lot of that stuff later on as you need it in your career. SEO doesn’t have that hard requirement.

And secondly, he makes an intelligent point that we’ve made many times here at Moz, which is that, broadly speaking, a better user experience is well correlated with better rankings.

You make a great website that delivers great user experience, that provides the answers to searchers’ questions and gives them extraordinarily good content, way better than what’s out there already in the search results, generally speaking you’re going to see happy searchers, and that’s going to lead to higher rankings.

But not entirely. There are a lot of other elements that go in here. So I’ll bring up some frustrating points around the piece as well.

First off, there’s no acknowledgment — and I find this a little disturbing — that the ability to read and write code, or even HTML and CSS, which I think are the basic place to start, is helpful or can take your SEO efforts to the next level. I think both of those things are true.

So being able to look at a web page, view source on it, or pull up Firebug in Firefox or something and diagnose what’s going on and then go, “Oh, that’s why Google is not able to see this content. That’s why we’re not ranking for this keyword or term, or why even when I enter this exact sentence in quotes into Google, which is on our page, this is why it’s not bringing it up. It’s because it’s loading it after the page from a remote file that Google can’t access.” These are technical things, and being able to see how that code is built, how it’s structured, and what’s going on there, very, very helpful.
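That kind of diagnosis is easy to script, too. Here is a minimal sketch, assuming the requests library is installed and using a made-up URL and phrase, that checks whether a sentence actually appears in the raw HTML a crawler fetches, or only shows up after JavaScript runs in the browser:

import requests

def phrase_in_source(url, phrase):
    """Fetch the raw HTML (no JavaScript execution) and check for the phrase."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return phrase.lower() in response.text.lower()

# Hypothetical example values; swap in your own page and sentence.
url = "https://www.example.com/widgets"
phrase = "our widgets ship in two days"

if phrase_in_source(url, phrase):
    print("Phrase is in the raw HTML, so crawlers can see it without running JavaScript.")
else:
    print("Phrase is missing from the raw HTML; it may be loaded later from a remote file.")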

Some coding knowledge also can take your SEO efforts even further. I mean, so many times, SEOs are stymied by the conversations that we have with our programmers and our developers and the technical staff on our teams. When we can have those conversations intelligently, because at least we understand the principles of how an if-then statement works, or what software engineering best practices are being used, or they can upload something into a GitHub repository, and we can take a look at it there, that kind of stuff is really helpful.

Secondly, I don’t like that the article overly reduces all of the information we have about what we’ve learned about Google. So he mentions two sources. One is things that Google tells us, and the other is SEO experiments. I think both of those are true. Although I’d add that there’s sort of a sixth sense of knowledge that we gain over time from looking at many, many search results and kind of having this feel for why things rank, and what might be wrong with a site, and getting really good at that using tools and data as well. There are people who can look at Open Site Explorer and then go, “Aha, I bet this is going to happen.” They can look, and 90% of the time they’re right.

So he boils this down to, one, write quality content, and two, reduce your bounce rate. Neither of those things are wrong. You should write quality content, although I’d argue there are lots of other forms of quality content that aren’t necessarily written — video, images and graphics, podcasts, lots of other stuff.

And secondly, that just doing those two things is not always enough. So you can see, like many, many folks look and go, “I have quality content. It has a low bounce rate. How come I don’t rank better?” Well, your competitors, they’re also going to have quality content with a low bounce rate. That’s not a very high bar.

Also, frustratingly, this really gets in my craw. I don’t think “write quality content” means anything. You tell me. When you hear that, to me that is a totally non-actionable, non-useful phrase that’s a piece of advice that is so generic as to be discardable. So I really wish that there was more substance behind that.

The article also makes, in my opinion, the totally inaccurate claim that modern SEO really is reduced to “the happier your users are when they visit your site, the higher you’re going to rank.”

Wow. Okay. Again, I think broadly these things are correlated. User happiness and rank is broadly correlated, but it’s not a one to one. This is not like a, “Oh, well, that’s a 1.0 correlation.”

I would guess that the correlation is probably closer to like the page authority range. I bet it’s like 0.35 or something correlation. If you were to actually measure this broadly across the web and say like, “Hey, were you happier with result one, two, three, four, or five,” the ordering would not be perfect at all. It probably wouldn’t even be close.

There’s a ton of reasons why sometimes someone who ranks on Page 2 or Page 3 or doesn’t rank at all for a query is doing a better piece of content than the person who does rank well or ranks on Page 1, Position 1.

Then the article suggests five and sort of a half steps to successful modern SEO, which I think is a really incomplete list. So Jayson gives us:

  • Good on-site experience
  • Writing good content
  • Getting others to acknowledge you as an authority
  • Rising in social popularity
  • Earning local relevance
  • Dealing with modern CMS systems (he notes that most modern CMSes are SEO-friendly)

The thing is there’s nothing actually wrong with any of these. They’re all, generally speaking, correct, either directly or indirectly related to SEO. The one about local relevance, I have some issue with, because he doesn’t note that there’s a separate algorithm for sort of how local SEO is done and how Google ranks local sites in maps and in their local search results. Also not noted is that rising in social popularity won’t necessarily directly help your SEO, although it can have indirect and positive benefits.

I feel like this list is super incomplete. Okay, in the 10 minutes before we filmed this video, I brainstormed a list just off the top of my head. The list was so long that, as you can see, I filled up the whole whiteboard and then didn’t have any more room. I’m not going to bother to erase and go try and be absolutely complete.

But there’s a huge, huge number of things that are important, critically important for technical SEO. If you don’t know how to do these things, you are sunk in many cases. You can’t be an effective SEO analyst, or consultant, or in-house team member, because you simply can’t diagnose the potential problems, rectify those potential problems, identify strategies that your competitors are using, be able to diagnose a traffic gain or loss. You have to have these skills in order to do that.

I’ll run through these quickly, but really the idea is just that this list is so huge and so long that I think it’s very, very, very wrong to say technical SEO is behind us. I almost feel like the opposite is true.

We have to be able to understand things like:

  • Content rendering and indexability
  • Crawl structure, internal links, JavaScript, Ajax. If something’s post-loading after the page and Google’s not able to index it, or there are links that are accessible via JavaScript or Ajax, maybe Google can’t necessarily see those or isn’t crawling them as effectively, or is crawling them, but isn’t assigning them as much link weight as they might be assigning other stuff, and you’ve made it tough to link to them externally, and so they can’t crawl it.
  • Disabling crawling and/or indexing of thin or incomplete or non-search-targeted content. We have a bunch of search results pages. Should we use rel=prev/next? Should we robots.txt those out? Should we keep them out of the index with meta robots? Should we rel=canonical them to other pages? Should we exclude them via the protocols inside Google Webmaster Tools, which is now Google Search Console?
  • Managing redirects, domain migrations, content updates. A new piece of content comes out, replacing an old piece of content, what do we do with that old piece of content? What’s the best practice? It varies by different things. We have a whole Whiteboard Friday about the different things that you could do with that. What about a big redirect or a domain migration? You buy another company and you’re redirecting their site to your site. You have to understand things about subdomain structures versus subfolders, which, again, we’ve done another Whiteboard Friday about that.
  • Proper error codes, downtime procedures, and not found pages. If your 404 pages turn out to all be 200 pages, well, now you’ve made a big error there, and Google could be crawling tons of 404 pages that they think are real pages, because you’ve returned a status code 200. Or you’ve used a 404 code when you should have used a 410, which signals permanent removal, to get the page completely out of the indexes, as opposed to having Google revisit it and keep it in the index.

Downtime procedures. There’s a specific 5xx code you can use for this, the 503 (Service Unavailable), which says, “Revisit later. We’re having some downtime right now.” Google urges you to use that specific code rather than using a 404, which tells them, “This page is now an error.”

Disney had that problem a while ago, if you guys remember, where they 404ed all their pages during an hour of downtime, and then their homepage, when you searched for Disney World, was, like, “Not found.” Oh, jeez, Disney World, not so good.
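As a quick illustration of how you might catch the “404s that return 200” problem, here is a minimal sketch, assuming the requests library and using placeholder URLs, that checks what status code your server actually returns for a page that should not exist and for a page you have deliberately removed:

import requests

# Placeholder URLs; substitute pages from your own site.
checks = {
    "https://www.example.com/this-page-should-not-exist-12345": 404,  # a made-up URL
    "https://www.example.com/old-product-we-removed": 410,            # a permanently removed page
}

for url, expected in checks.items():
    status = requests.get(url, allow_redirects=False, timeout=10).status_code
    verdict = "OK" if status == expected else "CHECK THIS"
    print(f"{url} returned {status}, expected {expected} -> {verdict}")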

  • International and multi-language targeting issues. I won’t go into that. But you have to know the protocols there. Duplicate content, syndication, scrapers. How do we handle all that? Somebody else wants to take our content, put it on their site, what should we do? Someone’s scraping our content. What can we do? We have duplicate content on our own site. What should we do?
  • Diagnosing traffic drops via analytics and metrics. Being able to look at a rankings report, being able to look at analytics, connecting those up and trying to see: Why did we go up or down? Do we have fewer pages indexed or more? Are more or fewer pages getting traffic? Are more or fewer keywords sending visits?
  • Understanding advanced search parameters. Today, just today, I was checking out the related: operator in Google, which is fascinating for most sites. Well, for Moz, weirdly, related:oursite.com shows nothing. But for virtually every other site, well, most other sites on the web, it does show some really interesting data, and you can see how Google is connecting up, essentially, intentions and topics from different sites and pages, which can be fascinating, could expose opportunities for links, could expose understanding of how they view your site versus your competition or who they think your competition is.

Then there are tons of other operators, like inurl: and inanchor:, and so on. Inanchor: doesn’t work anymore, never mind about that one.

I have to go faster, because we’re just going to run out of time. Like, come on.

  • Interpreting and leveraging data in Google Search Console. If you don’t know how to use that, Google could be telling you that you have all sorts of errors, and you don’t know what they are.

  • Leveraging topic modeling and extraction. Using all these cool tools that are coming out for better keyword research and better on-page targeting. I talked about a couple of those at MozCon, like MonkeyLearn. There’s the new Moz Context API, which will be coming out soon, around that. There’s the Alchemy API, which a lot of folks really like and use.
  • Identifying and extracting opportunities based on site crawls. You run a Screaming Frog crawl on your site and you’re going, “Oh, here’s all these problems and issues.” If you don’t have these technical skills, you can’t diagnose that. You can’t figure out what’s wrong. You can’t figure out what needs fixing, what needs addressing.
  • Using rich snippet format to stand out in the SERPs. This is just getting a better click-through rate, which can seriously help your site and obviously your traffic.
  • Applying Google-supported protocols like rel=canonical, meta description, rel=prev/next, hreflang, robots.txt, meta robots, X-Robots-Tag, NOODP, XML sitemaps, rel=nofollow. The list goes on and on and on. If you’re not technical and you don’t know what those are, if you think you just need to write good content and lower your bounce rate, it’s not going to work.
  • Using APIs from services like AdWords or Mozscape, or Ahrefs or Majestic, or SEMrush or Searchmetrics, or the Alchemy API (there’s a minimal sketch of this after the list). Those APIs can do powerful things for your site. There are some powerful problems they could help you solve if you know how to use them. It’s actually not that hard to write something, even inside a Google Doc or Excel, to pull from an API and get some data in there. There’s a bunch of good tutorials out there. Richard Baxter has one, Annie Cushing has one, I think Distilled has some. So really cool stuff there.
  • Diagnosing page load speed issues, which goes right to what Jayson was talking about. You need that fast-loading page. Well, if you don’t have any technical skills, you can’t figure out why your page might not be loading quickly.
  • Diagnosing mobile friendliness issues
  • Advising app developers on the new protocols around App deep linking, so that you can get the content from your mobile apps into the web search results on mobile devices. Awesome. Super powerful. Potentially crazy powerful, as mobile search is becoming bigger than desktop.
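To make the API point concrete, here is a minimal sketch of the pattern: call a JSON API with the requests library and dump the response into a CSV you can open in Excel or Google Sheets. The endpoint, parameters, and key below are placeholders, not any particular service’s real API; swap in whichever one you actually use and its documented authentication.

import csv
import requests

# Placeholder endpoint and key; use your chosen API's real URL and auth scheme.
API_URL = "https://api.example.com/v1/link-metrics"
API_KEY = "YOUR_API_KEY"

response = requests.get(API_URL, params={"target": "example.com", "key": API_KEY}, timeout=30)
response.raise_for_status()
rows = response.json()  # assumes the API returns a list of flat JSON objects

with open("api_export.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=sorted(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)

print(f"Wrote {len(rows)} rows to api_export.csv")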

Okay, I’m going to take a deep breath and relax. I don’t know Jayson’s intention, and in fact, if he were in this room, he’d be like, “No, I totally agree with all those things. I wrote the article in a rush. I had no idea it was going to be big. I was just trying to make the broader points around you don’t have to be a coder in order to do SEO.” That’s completely fine.

So I’m not going to try and rain criticism down on him. But I think if you’re reading that article, or you’re seeing it in your feed, or your clients are, or your boss is, or other folks are in your world, maybe you can point them to this Whiteboard Friday and let them know, no, that’s not quite right. There’s a ton of technical SEO that is required in 2015 and will be for years to come, I think, that SEOs have to have in order to be effective at their jobs.

All right, everyone. Look forward to some great comments, and we’ll see you again next time for another edition of Whiteboard Friday. Take care.

Video transcription by Speechpad.com


Distance from Perfect

Posted by wrttnwrd

In spite of all the advice, the strategic discussions and the conference talks, we Internet marketers are still algorithmic thinkers. That’s obvious when you think of SEO.

Even when we talk about content, we’re algorithmic thinkers. Ask yourself: How many times has a client asked you, “How much content do we need?” How often do you still hear “How unique does this page need to be?”

That’s 100% algorithmic thinking: Produce a certain amount of content, move up a certain number of spaces.

But you and I know it’s complete bullshit.

I’m not suggesting you ignore the algorithm. You should definitely chase it. Understanding a little bit about what goes on in Google’s pointy little head helps. But it’s not enough.

A tale of SEO woe that makes you go “whoa”

I have this friend.

He ranked #10 for “flibbergibbet.” He wanted to rank #1.

He compared his site to the #1 site and realized the #1 site had five hundred blog posts.

“That site has five hundred blog posts,” he said, “I must have more.”

So he hired a few writers and cranked out five thousand blog posts that melted Microsoft Word’s grammar check. He didn’t move up in the rankings. I’m shocked.

“That guy’s spamming,” he decided, “I’ll just report him to Google and hope for the best.”

What happened? Why didn’t adding five thousand blog posts work?

It’s pretty obvious: My, uh, friend added nothing but crap content to a site that was already outranked. Bulk is no longer a ranking tactic. Google’s very aware of that tactic. Lots of smart engineers have put time into updates like Panda to compensate.

He started like this:

And ended up like this:
more posts, no rankings

Alright, yeah, I was Mr. Flood The Site With Content, way back in 2003. Don’t judge me, whippersnappers.

Reality’s never that obvious. You’re scratching and clawing to move up two spots, you’ve got an overtasked IT team pushing back on changes, and you’ve got a boss who needs to know the implications of every recommendation.

Why fix duplication if rel=canonical can address it? Fixing duplication will take more time and cost more money. It’s easier to paste in one line of code. You and I know it’s better to fix the duplication. But it’s a hard sell.

Why deal with 302 versus 404 response codes and home page redirection? The basic user experience remains the same. Again, we just know that a server should return one home page without any redirects and that it should send a ‘not found’ 404 response if a page is missing. If it’s going to take 3 developer hours to reconfigure the server, though, how do we justify it? There’s no flashing sign reading “Your site has a problem!”
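There may be no flashing sign, but a few lines of scripting will surface both issues. A minimal sketch, assuming the requests library and using example.com as a stand-in for your own domain:

import requests

site = "https://www.example.com"  # stand-in for your own domain

# 1. The home page should answer with a single 200, not bounce through redirects.
home = requests.get(site + "/", allow_redirects=False, timeout=10)
if home.is_redirect:
    print(f"Home page redirects ({home.status_code}) to {home.headers.get('Location')}")
else:
    print(f"Home page status: {home.status_code} (a plain 200 is what you want)")

# 2. A missing page should return a real 404, not a 200 or a redirect to the home page.
missing = requests.get(site + "/definitely-not-a-real-page-9x7", allow_redirects=False, timeout=10)
print(f"Missing page status: {missing.status_code} (should be 404)")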

Why change this thing and not that thing?

At the same time, our boss/client sees that the site above theirs has five hundred blog posts and thousands of links from sites selling correspondence MBAs. So they want five thousand blog posts and cheap links as quickly as possible.

Cue crazy music.

SEO lacks clarity

SEO is, in some ways, for the insane. It’s an absurd collection of technical tweaks, content thinking, link building and other little tactics that may or may not work. A novice gets exposed to one piece of crappy information after another, with an occasional bit of useful stuff mixed in. They create sites that repel search engines and piss off users. They get more awful advice. The cycle repeats. Every time it does, best practices get more muddled.

SEO lacks clarity. We can’t easily weigh the value of one change or tactic over another. But we can look at our changes and tactics in context. When we examine the potential of several changes or tactics before we flip the switch, we get a closer balance between algorithm-thinking and actual strategy.

Distance from perfect brings clarity to tactics and strategy

At some point you have to turn that knowledge into practice. You have to take action based on recommendations, your knowledge of SEO, and business considerations.

That’s hard when we can’t even agree on subdomains vs. subfolders.

I know subfolders work better. Sorry, couldn’t resist. Let the flaming comments commence.

To get clarity, take a deep breath and ask yourself:

“All other things being equal, will this change, tactic, or strategy move my site closer to perfect than my competitors?”

Breaking it down:

“Change, tactic, or strategy”

A change takes an existing component or policy and makes it something else. Replatforming is a massive change. Adding a new page is a smaller one. Adding ALT attributes to your images is another example. Changing the way your shopping cart works is yet another.

A tactic is a specific, executable practice. In SEO, that might be fixing broken links, optimizing ALT attributes, optimizing title tags or producing a specific piece of content.

A strategy is a broader decision that’ll cause change or drive tactics. A long-term content policy is the easiest example. Shifting away from asynchronous content and moving to server-generated content is another example.

“Perfect”

No one knows exactly what Google considers “perfect,” and “perfect” can’t really exist, but you can bet a perfect web page/site would have all of the following:

  1. Completely visible content that’s perfectly relevant to the audience and query
  2. A flawless user experience
  3. Instant load time
  4. Zero duplicate content
  5. Every page easily indexed and classified
  6. No mistakes, broken links, redirects or anything else generally yucky
  7. Zero reported problems or suggestions in each search engine’s webmaster tools, sorry, “Search Console”
  8. Complete authority through immaculate, organically-generated links

These 8 categories (and any of the other bazillion that probably exist) give you a way to break down “perfect” and help you focus on what’s really going to move you forward. These different areas may involve different facets of your organization.

Your IT team can work on load time and creating an error-free front- and back-end. Link building requires the time and effort of content and outreach teams.

Tactics for relevant, visible content and current best practices in UX are going to be more involved, requiring research and real study of your audience.

What you need and what resources you have are going to impact which tactics are most realistic for you.

But there’s a basic rule: If a website would make Googlebot swoon and present zero obstacles to users, it’s close to perfect.

“All other things being equal”

Assume every competing website is optimized exactly as well as yours.

Now ask: Will this [tactic, change or strategy] move you closer to perfect?

That’s the “all other things being equal” rule. And it’s an incredibly powerful rubric for evaluating potential changes before you act. Pretend you’re in a tie with your competitors. Will this one thing be the tiebreaker? Will it put you ahead? Or will it cause you to fall behind?

“Closer to perfect than my competitors”

Perfect is great, but unattainable. What you really need is to be just a little perfect-er.

Chasing perfect can be dangerous. Perfect is the enemy of the good (I love that quote. Hated Voltaire. But I love that quote). If you wait for the opportunity/resources to reach perfection, you’ll never do anything. And the only way to reduce distance from perfect is to execute.

Instead of aiming for pure perfection, aim for more perfect than your competitors. Beat them feature-by-feature, tactic-by-tactic. Implement strategy that supports long-term superiority.

Don’t slack off. But set priorities and measure your effort. If fixing server response codes will take one hour and fixing duplication will take ten, fix the response codes first. Both move you closer to perfect. Fixing response codes may not move the needle as much, but it’s a lot easier to do. Then move on to fixing duplicates.

Do the 60% that gets you a 90% improvement. Then move on to the next thing and do it again. When you’re done, get to work on that last 40%. Repeat as necessary.

Take advantage of quick wins. That gives you more time to focus on your bigger solutions.

Sites that are “fine” are pretty far from perfect

Google has lots of tweaks, tools and workarounds to help us mitigate sub-optimal sites:

  • Rel=canonical lets us guide Google past duplicate content rather than fix it
  • HTML snapshots let us reveal content that’s delivered using asynchronous content and JavaScript frameworks
  • We can use rel=next and prev to guide search bots through outrageously long pagination tunnels
  • And we can use rel=nofollow to hide spammy links and banners

Easy, right? All of these solutions may reduce distance from perfect (the search engines don’t guarantee it). But they don’t reduce it as much as fixing the problems.
Just fine does not equal fixed

The next time you set up rel=canonical, ask yourself:

“All other things being equal, will using rel=canonical to make up for duplication move my site closer to perfect than my competitors?”

Answer: Not if they’re using rel=canonical, too. You’re both using imperfect solutions that force search engines to crawl every page of your site, duplicates included. If you want to pass them on your way to perfect, you need to fix the duplicate content.

When you use Angular.js to deliver regular content pages, ask yourself:

“All other things being equal, will using HTML snapshots instead of actual, visible content move my site closer to perfect than my competitors?”

Answer: No. Just no. Not in your wildest, code-addled dreams. If I’m Google, which site will I prefer? The one that renders for me the same way it renders for users? Or the one that has to deliver two separate versions of every page?

When you spill banner ads all over your site, ask yourself…

You get the idea. Nofollow is better than follow, but banner pollution is still pretty dang far from perfect.

Mitigating SEO issues with search engine-specific tools is “fine.” But it’s far, far from perfect. If search engines are forced to choose, they’ll favor the site that just works.

Not just SEO

By the way, distance from perfect absolutely applies to other channels.

I’m focusing on SEO, but think of other Internet marketing disciplines. I hear stuff like “How fast should my site be?” (Faster than it is right now.) Or “I’ve heard you shouldn’t have any content below the fold.” (Maybe in 2001.) Or “I need background video on my home page!” (Why? Do you have a reason?) Or, my favorite: “What’s a good bounce rate?” (Zero is pretty awesome.)

And Internet marketing venues are working to measure distance from perfect. Pay-per-click marketing has the quality score: a codified financial reward for reducing distance from perfect across as many elements of your advertising program as possible.

Social media venues are aggressively building their own forms of graphing, scoring and ranking systems designed to separate the good from the bad.

Really, all marketing includes some measure of distance from perfect. But no channel is more influenced by it than SEO. Instead of arguing one rule at a time, ask yourself and your boss or client: Will this move us closer to perfect?

Hell, you might even please a customer or two.

One last note for all of the SEOs in the crowd. Before you start pointing out edge cases, consider this: We spend our days combing Google for embarrassing rankings issues. Every now and then, we find one, point, and start yelling “SEE! SEE!!!! THE GOOGLES MADE MISTAKES!!!!” Google’s got lots of issues. Screwing up the rankings isn’t one of them.


Controlling Search Engine Crawlers for Better Indexation and Rankings – Whiteboard Friday

Posted by randfish

When should you disallow search engines in your robots.txt file, and when should you use meta robots tags in a page header? What about nofollowing links? In today’s Whiteboard Friday, Rand covers these tools and their appropriate use in four situations that SEOs commonly find themselves facing.

For reference, here’s a still of this week’s whiteboard. Click on it to open a high resolution image in a new tab!

Video transcription

Howdy Moz fans, and welcome to another edition of Whiteboard Friday. This week we’re going to talk about controlling search engine crawlers, blocking bots, sending bots where we want, restricting them from where we don’t want them to go. We’re going to talk a little bit about crawl budget and what you should and shouldn’t have indexed.

As a start, what I want to do is discuss the ways in which we can control robots. Those include the three primary ones: robots.txt, meta robots, and—well, the nofollow tag is a little bit less about controlling bots.

There are a few others that we’re going to discuss as well, including Webmaster Tools (Search Console) and URL status codes. But let’s dive into those first few first.

Robots.txt lives at yoursite.com/robots.txt. It tells crawlers what they should and shouldn’t access, but it doesn’t always get respected by Google and Bing. So a lot of folks, when you say, “hey, disallow this,” and then you suddenly see those URLs popping up and you’re wondering what’s going on, look—Google and Bing oftentimes think that they just know better. They think that maybe you’ve made a mistake, they think “hey, there’s a lot of links pointing to this content, there’s a lot of people who are visiting and caring about this content, maybe you didn’t intend for us to block it.” The more specific you get about an individual URL, the better they usually are about respecting it. The less specific, meaning the more you use wildcards or say “everything behind this entire big directory,” the worse they are about necessarily believing you.

Meta robots—a little different—that lives in the headers of individual pages, so you can only control a single page with a meta robots tag. That tells the engines whether or not they should keep a page in the index, and whether they should follow the links on that page, and it’s usually a lot more respected, because it’s at an individual-page level; Google and Bing tend to believe you about the meta robots tag.

And then the nofollow tag, that lives on an individual link on a page. It doesn’t tell engines where to crawl or not to crawl. All it’s saying is whether you editorially vouch for a page that is being linked to, and whether you want to pass the PageRank and link equity metrics to that page.

Interesting point about meta robots and robots.txt working together (or not working together so well)—many, many folks in the SEO world do this and then get frustrated.

What if, for example, we take a page like “blogtest.html” on our domain and we say “all user agents, you are not allowed to crawl blogtest.html.” Okay—that’s a good way to keep that page away from being crawled, but just because something is not crawled doesn’t necessarily mean it won’t be in the search results.

So then we have our SEO folks go, “you know what, let’s make doubly sure that doesn’t show up in search results; we’ll put in the meta robots tag:”

<meta name="robots" content="noindex, follow">

So, “noindex, follow” tells the search engine crawler they can follow the links on the page, but they shouldn’t index this particular one.

Then, you go and run a search for “blog test” in this case, and everybody on the team’s like “What the heck!? WTF? Why am I seeing this page show up in search results?”

The answer is, you told the engines that they couldn’t crawl the page, so they didn’t. But they are still putting it in the results. They’re actually probably not going to include a meta description; they might have something like “we can’t include a meta description because of this site’s robots.txt file.” The reason it’s showing up is because they can’t see the noindex; all they see is the disallow.

So, if you want something truly removed, unable to be seen in search results, you can’t just disallow a crawler. You have to say meta “noindex” and you have to let them crawl it.
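To make that concrete, here is a hypothetical version of the conflicting setup. In robots.txt:

User-agent: *
Disallow: /blogtest.html

And in the header of blogtest.html:

<meta name="robots" content="noindex, follow">

With that disallow in place, Google never fetches the page, so it never sees the noindex tag. Removing the Disallow line while keeping the meta noindex is what actually gets the page out of the results, because the crawler can then fetch the page and read the tag.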

So this creates some complications. Robots.txt can be great if we’re trying to save crawl bandwidth, but it isn’t necessarily ideal for preventing a page from being shown in the search results. I would not recommend, by the way, that you do what we think Twitter recently tried to do, where they tried to canonicalize www and non-www by saying “Google, don’t crawl the www version of twitter.com.” What you should be doing is rel canonical-ing or using a 301.

Meta robots—that can allow crawling and link-following while disallowing indexation, which is great, but it still requires crawl budget (the page has to be crawled for the tag to be seen), even though it does let you conserve what gets indexed.

The nofollow tag, generally speaking, is not particularly useful for controlling bots or conserving indexation.

Webmaster Tools (now Google Search Console) has some special things that allow you to restrict access or remove a result from the search results. For example, if you have 404’d something or if you’ve told them not to crawl something but it’s still showing up in there, you can manually say “don’t do that.” There are a few other crawl protocol things that you can do.

And then URL status codes—these are a valid way to do things, but they’re going to obviously change what’s going on on your pages, too.

If you’re not having a lot of luck using a 404 to remove something, you can use a 410 to permanently remove something from the index. Just be aware that once you use a 410, it can take a long time if you want to get that page re-crawled or re-indexed, and you want to tell the search engines “it’s back!” 410 is permanent removal.

301—permanent redirect, we’ve talked about those here—and 302, temporary redirect.

Now let’s jump into a few specific use cases of “what kinds of content should and shouldn’t I allow engines to crawl and index” in this next version…

[Rand moves at superhuman speed to erase the board and draw part two of this Whiteboard Friday. Seriously, we showed Roger how fast it was, and even he was impressed.]

Four crawling/indexing problems to solve

So we’ve got these four big problems that I want to talk about as they relate to crawling and indexing.

1. Content that isn’t ready yet

The first one here is around, “If I have content of quality I’m still trying to improve—it’s not yet ready for primetime, it’s not ready for Google, maybe I have a bunch of products and I only have the descriptions from the manufacturer and I need people to be able to access them, so I’m rewriting the content and creating unique value on those pages… they’re just not ready yet—what should I do with those?”

My options around crawling and indexing? If I have a large quantity of those—maybe thousands, tens of thousands, hundreds of thousands—I would probably go the robots.txt route. I’d disallow those pages from being crawled, and then eventually as I get (folder by folder) those sets of URLs ready, I can then allow crawling and maybe even submit them to Google via an XML sitemap.

If I’m talking about a small quantity—a few dozen, a few hundred pages—well, I’d probably just use the meta robots noindex, and then I’d pull that noindex off of those pages as they are made ready for Google’s consumption. And then again, I would probably use the XML sitemap and start submitting those once they’re ready.
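For the large-quantity case, the robots.txt entry is as simple as disallowing the folder those not-ready pages live in (the path here is hypothetical):

User-agent: *
Disallow: /new-products/

As each batch of pages gets its rewritten content, you remove the relevant Disallow line and submit those URLs in your XML sitemap.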

2. Dealing with duplicate or thin content

What about, “Should I noindex, nofollow, or potentially disallow crawling on largely duplicate URLs or thin content?” I’ve got an example. Let’s say I’m an ecommerce shop, I’m selling this nice Star Wars t-shirt which I think is kind of hilarious, so I’ve got starwarsshirt.html, and it links out to a larger version of an image, and that’s an individual HTML page. It links out to different colors, which change the URL of the page, so I have a gray, blue, and black version. Well, these four pages are really all part of this same one, so I wouldn’t recommend disallowing crawling on these, and I wouldn’t recommend noindexing them. What I would do there is a rel canonical.

Remember, rel canonical is one of those things that can be precluded by disallowing. So, if I were to disallow these from being crawled, Google couldn’t see the rel canonical back, so if someone linked to the blue version instead of the default version, now I potentially don’t get link credit for that. So what I really want to do is use the rel canonical, allow the indexing, and allow it to be crawled. If you really feel like it, you could also put a meta “noindex, follow” on these pages, but I don’t really think that’s necessary, and again that might interfere with the rel canonical.
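With made-up URLs for that t-shirt example, each color variant page would carry a canonical tag in its <head> pointing at the default version, something like:

<link rel="canonical" href="https://www.example.com/starwarsshirt.html">

The gray, blue, and black variant URLs all point the tag at the default page, and the default page can point it at itself.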

3. Passing link equity without appearing in search results

Number three: “If I want to pass link equity (or at least crawling) through a set of pages without those pages actually appearing in search results—so maybe I have navigational stuff, ways that humans are going to navigate through my pages, but I don’t need those appearing in search results—what should I use then?”

What I would say here is, you can use the meta robots to say “don’t index the page, but do follow the links that are on that page.” That’s a pretty nice, handy use case for that.

Do NOT, however, disallow those in robots.txt—many, many folks make this mistake. What happens if you disallow crawling on those, Google can’t see the noindex. They don’t know that they can follow it. Granted, as we talked about before, sometimes Google doesn’t obey the robots.txt, but you can’t rely on that behavior. Trust that the disallow in robots.txt will prevent them from crawling. So I would say, the meta robots “noindex, follow” is the way to do this.

4. Search results-type pages

Finally, fourth, “What should I do with search results-type pages?” Google has said many times that they don’t like your search results from your own internal engine appearing in their search results, and so this can be a tricky use case.

Sometimes a search result page—a page that lists many types of results that might come from a database of types of content that you’ve got on your site—could actually be a very good result for a searcher who is looking for a wide variety of content, or who wants to see what you have on offer. Yelp does this: When you say, “I’m looking for restaurants in Seattle, WA,” they’ll give you what is essentially a list of search results, and Google does want those to appear because that page provides a great result. But you should be doing what Yelp does there, and make the most common or popular individual sets of those search results into category-style pages. A page that provides real, unique value, that’s not just a list of search results, that is more of a landing page than a search results page.

However, that being said, if you’ve got a long tail of these, or if you’d say “hey, our internal search engine, that’s really for internal visitors only—it’s not useful to have those pages show up in search results, and we don’t think we need to make the effort to make those into category landing pages.” Then you can use the disallow in robots.txt to prevent those.

Just be cautious here, because I have sometimes seen an over-swinging of the pendulum toward blocking all types of search results, and sometimes that can actually hurt your SEO and your traffic. Sometimes those pages can be really useful to people. So check your analytics, and make sure those aren’t valuable pages that should be served up and turned into landing pages. If you’re sure, then go ahead and disallow all your search results-style pages. You’ll see a lot of sites doing this in their robots.txt file.

That being said, I hope you have some great questions about crawling and indexing, controlling robots, blocking robots, allowing robots, and I’ll try and tackle those in the comments below.

We’ll look forward to seeing you again next week for another edition of Whiteboard Friday. Take care!


An Open-Source Tool for Checking rel-alternate-hreflang Annotations

Posted by Tom-Anthony

In the Distilled R&D department we have been ramping up the amount of automated monitoring and analysis we do, with an internal system monitoring our clients’ sites both directly and via various data sources to ensure they remain healthy and we are alerted to any problems that may arise.

Recently we started work to add in functionality for including the rel-alternate-hreflang annotations in this system. In this blog post I’m going to share an open-source Python library we’ve just started work on for the purpose, which makes it easy to read the hreflang entries from a page and identify errors with them.

If you’re not a Python aficionado then don’t despair, as I have also built a ready-to-go tool for you to use, which will quickly do some checks on the hreflang entries for any URL you specify. 🙂

Google’s Search Console (formerly Webmaster Tools) does have some basic rel-alternate-hreflang checking built in, but it is limited in how you can use it and you are restricted to using it for verified sites.

rel-alternate-hreflang checklist

Before we introduce the code, I wanted to quickly review a list of five easy and common mistakes that we will want to check for when looking at rel-alternate-hreflang annotations:

  • return tag errors – Every alternate language/locale URL of a page should, itself, include a link back to the first page. This makes sense but I’ve seen people make mistakes with it fairly often.
  • indirect / broken links – Links to alternate language/region versions of the page should not go via redirects, and should not point to missing or broken pages.
  • multiple entries – There should never be multiple entries for a single language/region combo.
  • multiple defaults – You should never have more than one x-default entry.
  • conflicting modes – rel-alternate-hreflang entries can be implemented via inline HTML, XML sitemaps, or HTTP headers. For any one set of pages only one implementation mode should be used.
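For reference, a correct set of annotations (with made-up URLs) looks like the block below, and the same complete block appears on every one of the listed pages so each alternate links back to the others:

<link rel="alternate" hreflang="en-us" href="https://www.example.com/en-us/">
<link rel="alternate" hreflang="de-de" href="https://www.example.com/de-de/">
<link rel="alternate" hreflang="x-default" href="https://www.example.com/">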

So now imagine that we want to simply automate these checks quickly and simply…

Introducing: polly – the hreflang checker library

polly is the name for the library we have developed to help us solve this problem, and we are releasing it as open source so the SEO community can use it freely to build upon. We only started work on it last week, but we plan to continue developing it, and will also accept contributions to the code from the community, so we expect its feature set to grow rapidly.

If you are not comfortable tinkering with Python, then feel free to skip down to the next section of the post, where there is a tool that is built with polly which you can use right away.

Still here? Ok, great. You can install polly easily via pip:

pip install polly

You can then create a PollyPage() object which will do all our work and store the data simply by instantiating the class with the desired URL:

my_page = PollyPage("http://www.facebook.com/")

You can quickly see the hreflang entries on the page by running:

print(my_page.alternate_urls_map)

You can list all the hreflang values encountered on a page, and which countries and languages they cover:

print(my_page.hreflang_values)
print(my_page.languages)
print(my_page.regions)

You can also check various aspects of a page, such as whether the pages it includes in its rel-alternate-hreflang entries point back, or whether there are entries that are not retrievable (due to 404 or 500 etc. errors):

print(my_page.is_default)
print(my_page.no_return_tag_pages())
print(my_page.non_retrievable_pages())
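As a rough sketch of how you might string those checks together across a few pages, here is a tiny report loop. The import path and URLs are assumptions on my part; only the attributes and methods shown above are used:

from polly import PollyPage  # import path is an assumption; check the GitHub README

# Hypothetical URLs to audit; substitute your own.
urls = [
    "http://www.example.com/en/",
    "http://www.example.com/de/",
]

for url in urls:
    page = PollyPage(url)
    print(url)
    print("  hreflang values:", page.hreflang_values)
    print("  missing return tags:", page.no_return_tag_pages())
    print("  non-retrievable alternates:", page.non_retrievable_pages())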

Get more instructions and grab the code at the polly github page. Hit me up in the comments with any questions.

Free tool: hreflang.ninja

I have put together a very simple tool that uses polly to run some of the checks we highlighted above as being common mistakes with rel-alternate-hreflang, which you can visit right now and start using:

http://hreflang.ninja

Simply enter a URL and hit enter, and you should see something like:

Example output from the ninja!

The tool shows you the rel-alternate-hreflang entries found on the page, the language and region of those entries, the alternate URLs, and any errors identified with the entry. It is perfect for doing quick’n’dirty checks of a URL to identify any errors.

As we add additional functionality to polly we will be updating hreflang.ninja as well, so please tweet me with feature ideas or suggestions.

To-do list!

This is the first release of polly and currently we only handle annotations that are in the HTML of the page, not those in the XML sitemap or HTTP headers. However, we are going to be updating polly (and hreflang.ninja) over the coming weeks, so watch this space! 🙂


Got suggestions?

With the increasing number of SEO directives and annotations available, and the ever-changing guidelines around how to deploy them, it is important to automate whatever areas possible. Hopefully polly is helpful to the community in this regard, and we want to hear what ideas you have for making these tools more useful – here in the comments or via Twitter.


The Meta Referrer Tag: An Advancement for SEO and the Internet

Posted by Cyrus-Shepard

The movement to make the Internet more secure through HTTPS brings several useful advancements for webmasters. In addition to security improvements, HTTPS promises future technological advances and potential SEO benefits for marketers.

HTTPS in search results is rising. Recent MozCast data from Dr. Pete shows nearly 20% of first page Google results are now HTTPS.

Sadly, HTTPS also has its downsides.

Marketers run into their first challenge when they switch regular HTTP sites over to HTTPS. The switch is technically challenging and typically involves routing your site through a series of 301 redirects. Historically, these types of redirects are associated with a loss of link equity (thought to be around 15%), which can lead to a loss in rankings. This can offset any SEO advantage Google claims for switching.

Ross Hudgens perfectly summed it up in this tweet:

Many SEOs have anecdotally shared stories of HTTPS sites performing well in Google search results (and our soon-to-be-published Ranking Factors data seems to support this.) However, the short term effect of a large migration can be hard to take. When Moz recently switched to HTTPS to provide better security to our logged-in users, we saw an 8-9% dip in our organic search traffic.

Problem number two is the subject of this post. It involves the loss of referral data. Typically, when one site sends traffic to another, information is sent that identifies the originating site as the source of traffic. This invaluable data allows people to see where their traffic is coming from, and helps spread the flow of information across the web.

SEOs have long used referrer data for a number of beneficial purposes. Oftentimes, people will link back or check out the site sending traffic when they see the referrer in their analytics data. Spammers know this works, as evidenced by the recent increase in referrer spam:

This process stops when traffic flows from an HTTPS site to a non-secure HTTP site. In this case, no referrer data is sent. Webmasters can’t know where their traffic is coming from.

Here’s how referral data to my personal site looked when Moz switched to HTTPS. I lost all visibility into where my traffic came from.

It’s (not provided) all over again!

Enter the meta referrer tag

While we can’t solve the ranking challenges imposed by switching a site to HTTPS, we can solve the loss of referral data, and it’s actually super-simple.

Almost completely unknown to most marketers, the relatively new meta referrer tag (it’s actually been around for a few years) was designed to help out in these situations.

Better yet, the tag allows you to control how your referrer information is passed.

The meta referrer tag works with most browsers to pass referrer information in a manner defined by the user. Traffic remains encrypted and all the benefits of using HTTPS remain in place, but now you can pass referrer data to all websites, even those that use HTTP.

How to use the meta referrer tag

What follows are extremely simplified instructions for using the meta referrer tag. For more in-depth understanding, we highly recommend referring to the W3C working draft of the spec.

The meta referrer tag is placed in the <head> section of your HTML, and references one of five states, which control how browsers send referrer information from your site. The five states are:

  1. None: Never pass referral data
    <meta name="referrer" content="none">
    
  2. None When Downgrade: Sends referrer information to secure HTTPS sites, but not insecure HTTP sites
    <meta name="referrer" content="none-when-downgrade">
    
  3. Origin Only: Sends the scheme, host, and port (basically, the subdomain) stripped of the full URL as a referrer, i.e. https://moz.com/example.html would simply send https://moz.com
    <meta name="referrer" content="origin">
    

  4. Origin When Cross-Origin: Sends the full URL as the referrer when the target has the same scheme, host, and port (i.e. subdomain) regardless if it’s HTTP or HTTPS, while sending origin-only referral information to external sites. (note: There is a typo in the official spec. Future versions should be “origin-when-cross-origin”)
    <meta name="referrer" content="origin-when-crossorigin">
    
  5. Unsafe URL: Always passes the URL string as a referrer. Note if you have any sensitive information contained in your URL, this isn’t the safest option. By default, URL fragments, username, and password are automatically stripped out.
    <meta name="referrer" content="unsafe-url">
    

The meta referrer tag in action

By clicking the link below, you can get a sense of how the meta referrer tag works.

Check Referrer

Boom!

We’ve set the meta referrer tag for Moz to “origin”, which means when we link out to another site, we pass our scheme, host, and port. The end result is you see https://moz.com as the referrer, stripped of the full URL path (/meta-referrer-tag).
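If you want to see exactly what gets stripped, here is a tiny sketch using only Python’s standard library that reduces a full URL to the origin-style value a browser sends under this policy:

from urllib.parse import urlsplit, urlunsplit

def origin_referrer(full_url):
    """Keep only scheme://host[:port]; drop the path, query string, and fragment."""
    parts = urlsplit(full_url)
    return urlunsplit((parts.scheme, parts.netloc, "", "", ""))

print(origin_referrer("https://moz.com/meta-referrer-tag?utm_source=feed"))
# -> https://moz.com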

My personal site typically receives several visits per day from Moz. Here’s what my analytics data looked like before and after we implemented the meta referrer tag.

For simplicity and security, most sites may want to implement the “origin” state, but there are drawbacks.

One negative side effect was that as soon as we implemented the meta referrer tag, our AdRoll analytics, which we use for retargeting, stopped working. It turns out that AdRoll uses our referrer information for analytics, but the meta referrer tag “origin” state meant that the only URL they ever saw reported was https://moz.com.

Conclusion

We love the meta referrer tag because it keeps information flowing on the Internet. It’s the way the web is supposed to work!

It helps marketers and webmasters see exactly where their traffic is coming from. It encourages engagement, communication, and even linking, which can lead to improvements in SEO.



Eliminate Duplicate Content in Faceted Navigation with Ajax/JSON/JQuery

Posted by EricEnge

One of the classic problems in SEO is that while complex navigation schemes may be useful to users, they create problems for search engines. Many publishers rely on tags such as rel=canonical, or the parameters settings in Webmaster Tools to try and solve these types of issues. However, each of the potential solutions has limitations. In today’s post, I am going to outline how you can use JavaScript solutions to more completely eliminate the problem altogether.

Note that I am not going to provide code examples in this post, but I am going to outline how it works on a conceptual level. If you are interested in learning more about Ajax/JSON/jQuery here are some resources you can check out:

  1. Ajax Tutorial
  2. Learning Ajax/jQuery

Defining the problem with faceted navigation

Having a page of products and then allowing users to sort those products the way they want (sorted from highest to lowest price), or to use a filter to pick a subset of the products (only those over $60) makes good sense for users. We typically refer to these types of navigation options as “faceted navigation.”

However, faceted navigation can cause problems for search engines because they don’t want to crawl and index all of your different sort orders or all your different filtered versions of your pages. They would end up with many different variants of your pages that are not significantly different from a search engine user experience perspective.

Solutions such as rel=canonical tags and parameters settings in Webmaster Tools have some limitations. For example, rel=canonical tags are considered “hints” by the search engines, and they may not choose to accept them, and even if they are accepted, they do not necessarily keep the search engines from continuing to crawl those pages.

A better solution might be to use JSON and jQuery to implement your faceted navigation so that a new page is not created when a user picks a filter or a sort order. Let’s take a look at how it works.

Using JSON and jQuery to filter on the client side

The main benefit of the implementation discussed below is that a new URL is not created when a user is on a page of yours and applies a filter or sort order. When you use JSON and jQuery, the entire process happens on the client device without involving your web server at all.

When a user initially requests one of the product pages on your web site, the interaction looks like this:

using json on faceted navigation

This delivers the requested page to the user’s browser.

jquery and faceted navigation diagram

When the user picks one of those options, a jQuery request is made to the JSON data object. Translation: the entire interaction happens within the client’s browser and the sort or filter is applied there. Simply put, the smarts to handle that sort or filter resides entirely within the code on the client device that was transferred with the initial request for the page.

As a result, there is no new page created and no new URL for Google or Bing to crawl. Any concerns about crawl budget or inefficient use of PageRank are completely eliminated. This is great stuff! However, there remain limitations in this implementation.

Specifically, if your list of products spans multiple pages on your site, the sorting and filtering will only be applied to the data set already transferred to the user’s browser with the initial request. In short, you may only be sorting the first page of products, and not across the entire set of products. It’s possible to have the initial JSON data object contain the full set of pages, but this may not be a good idea if the page size ends up being large. In that event, we will need to do a bit more.

What Ajax does for you

Now we are going to dig in slightly deeper and outline how Ajax will allow us to handle sorting, filtering, AND pagination. Warning: There is some tech talk in this section, but I will try to follow each technical explanation with a layman’s explanation about what’s happening.

The conceptual Ajax implementation looks like this:

[Image: Ajax and faceted navigation diagram]

In this structure, we are using an Ajax layer to manage the communications with the web server. Imagine that we have a set of 10 pages; the user has received the first of those 10 pages on their device and then requests a change to the sort order. The Ajax layer requests a fresh set of data from your web server, similar to a normal HTML transaction, except that it runs asynchronously in the background.

If you don’t know what that means, the benefit is that the rest of the page (your main menu, your footer links to related products, and other page elements) can finish loading while the data for the Ajax call is being fetched in parallel. This can improve the perceived performance of the page.

To handle this, the code registers an event handler on a given object (e.g., an HTML element or another DOM object); when the user selects a different sort order, that handler fires and the Ajax request is sent. The browser handles the request in the background and triggers the callback in the main thread when the response arrives. All of this happens without a full page refresh; only the content controlled by the Ajax call is refreshed.

To translate this for the non-technical reader, it just means that we can update the sort order of the page, without needing to redraw the entire page, or change the URL, even in the case of a paginated sequence of pages. This is a benefit because it can be faster than reloading the entire page, and it should make it clear to search engines that you are not trying to get some new page into their index.

Effectively, it does this within the existing Document Object Model (DOM), which you can think of as the basic structure of the document and the specification for how the document is accessed and manipulated.
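
For readers who want to see roughly what this looks like in code, here is a rough jQuery sketch. It assumes the server exposes an endpoint (called /products/data here, a made-up name) that returns the requested slice of products as JSON; the element IDs and parameter names are likewise assumptions for illustration only.

    // Fetch a fresh slice of products (sorted, filtered, and/or paginated)
    // from a hypothetical /products/data endpoint and redraw only the
    // product list. The URL in the address bar never changes.
    function loadProducts(options) {
      $.getJSON("/products/data", {
        sort: options.sort,     // e.g. "price-desc"
        filter: options.filter, // e.g. "over-60"
        page: options.page      // e.g. 2
      })
      .done(function (data) {
        // Only the content controlled by the Ajax call is redrawn; the main
        // menu, footer links, and other page elements are left untouched.
        var $container = $("#product-list").empty();
        $.each(data.products, function (i, product) {
          $container.append(
            $("<li>").text(product.name + ": $" + product.price)
          );
        });
      })
      .fail(function () {
        $("#product-list").text("Sorry, the product list could not be loaded.");
      });
    }

    // Event handlers registered on the relevant DOM elements.
    $("#sort-order").on("change", function () {
      loadProducts({ sort: $(this).val(), page: 1 });
    });

    // Delegated handler for pagination links; preventDefault keeps the
    // browser on the same URL instead of navigating to a new page.
    $(".pagination").on("click", "a", function (event) {
      event.preventDefault();
      loadProducts({ page: $(this).data("page") });
    });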

How will Google handle this type of implementation?

For those of you who read Adam Audette’s excellent recent post on the tests his team performed on how Google reads JavaScript, you may be wondering whether Google will render all of these page variants on the same URL anyway, and whether that will cause problems.

I had the same question, so I reached out to Google’s Gary Illyes to get an answer. Here is the dialog that transpired:

Eric Enge: I’d like to ask you about using JSON and jQuery to render different sort orders and filters within the same URL. I.e., the user selects a sort order or a filter, and the content is reordered and redrawn on the page on the client side. Hence no new URL would be created. It’s effectively a way of canonicalizing the content, since each variant is a strict subset.

Then there is a second level consideration with this approach, which involves doing the same thing with pagination. I.e. you have 10 pages of products, and users still have sorting and filtering options. In order to support sorting and filtering across the entire 10 page set, you use an Ajax solution, so all of that still renders on one URL.

So, if you are on page 1, and a user executes a sort, they get that all back in that one page. However, to do this right, going to page 2 would also render on the same URL. Effectively, you are taking the 10 page set and rendering it all within one URL. This allows sorting, filtering, and pagination without needing to use canonical, noindex, prev/next, or robots.txt.

If this was not problematic for Google, the only downside is that it makes the pagination not visible to Google. Does that make sense, or is it a bad idea?

Gary Illyes: If you have one URL only, and people have to click on stuff to see different sort orders or filters for the exact same content under that URL, then typically we would only see the default content.

If you don’t have pagination information, that’s not a problem, except we might not see the content on the other pages that are not contained in the HTML within the initial page load. The meaning of rel-prev/next is to funnel the signals from child pages (page 2, 3, 4, etc.) to the group of pages as a collection, or to the view-all page if you have one. If you simply choose to render those paginated versions on a single URL, that will have the same impact from a signals point of view, meaning that all signals will go to a single entity, rather than distributed to several URLs.

Summary

Keep in mind that the reason Google implemented tags like rel=canonical, noindex, rel=prev/next, and others is to reduce its crawling burden and overall index bloat, and to help focus incoming signals on the right pages in the best way possible. The use of Ajax/JSON/jQuery as outlined above accomplishes this simply and elegantly.

On most e-commerce sites, there are many different “facets” of how a user might want to sort and filter a list of products. With the Ajax-style implementation, this can be done without creating new pages. The end users get the control they are looking for, the search engines don’t have to deal with excess pages they don’t want to see, and signals into the site (such as links) are focused on the main pages where they should be.

The one downside is that Google may not see all the content when it is paginated. If your site has lots of very similar products in a paginated list, that isn’t much of a concern, since the incremental pages mostly contain more of what is already on the first page. Sites whose content is materially different on the additional pages, however, might not want to use this approach.

These solutions do require JavaScript coding expertise, but they are not really that complex. If you have the ability to consider a path like this, you can free yourself from trying to understand the various tags, their limitations, and whether or not they truly accomplish what you are looking for.

Credit: Thanks to Clark Lefavour for providing a review of the above for technical correctness.



5 Spreadsheet Tips for Manual Link Audits

Posted by MarieHaynes

Link auditing is the part of my job that I love the most. I have audited a LOT of links over the last few years. While there are some programs out there that can be quite helpful to the avid link auditor, I still prefer to create a spreadsheet of my links in Excel and then to audit those links one-by-one from within Google Spreadsheets. Over the years I have learned a few tricks and formulas that have helped me in this process. In this article, I will share several of these with you.

Please know that while I am quite comfortable being labelled a link auditing expert, I am not an Excel wizard. I am betting that some of the things that I am doing could be improved upon if you’re an advanced user. As such, if you have any suggestions or tips of your own I’d love to hear them in the comments section!

1. Extract the domain or subdomain from a URL

OK. You’ve downloaded links from as many sources as possible and now you want to manually visit and evaluate one link from every domain. But, holy moly, some of these domains can have THOUSANDS of links pointing to the site. So, let’s break these down so that you are just seeing one link from each domain. The first step is to extract the domain or subdomain from each URL.

I am going to show you examples from a Google spreadsheet as I find that these display nicer for demonstration purposes. However, if you’ve got a fairly large site, you’ll find that the spreadsheets are easier to create in Excel. If you’re confused about any of these steps, check out the animated gif at the end of each step to see the process in action.

Here is how you extract a domain or subdomain from a URL:

  • Create a new column to the left of your URL column.
  • Use this formula:

    =LEFT(B1,FIND("/",B1,9)-1)

    What this does is keep everything up to (but not including) the first slash after the domain name; starting the FIND at character 9 skips past the slashes in “http://” and “https://”. So http://www.example.com/article.html will now become http://www.example.com, and http://www.subdomain.example.com/article.html will now become http://www.subdomain.example.com.

  • Copy our new column A and paste it right back where it was using the “paste as values” function. If you don’t do this, you won’t be able to use the Find and Replace feature.
  • Use Find and Replace to replace each of the following with a blank (i.e. nothing):
    http://
    https://
    www.

And BOOM! We are left with a column that contains just domain names and subdomain names. This animated gif shows each of the steps we just outlined:

2. Just show one link from each domain

The next step is to filter this list so that we are just seeing one link from each domain. If you are manually reviewing links, there’s usually no point in reviewing every single link from every domain. I will throw in a word of caution here though. Sometimes a domain can have both a good link and a bad link pointing to you. Or in some cases, you may find that links from one page are followed and from another page on the same site they are nofollowed. You can miss some of these by just looking at one link from each domain. Personally, I have some checks built in to my process where I use Scrapebox and some internal tools that I have created to make sure that I’m not missing the odd link by just looking at one link from each domain. For most link audits, however, you are not going to miss very much by assessing one link from each domain.

Here’s how we do it:

  • Highlight our domains column and sort the column in alphabetical order.
  • Create a column to the left of our domains, so that the domains are in column B.
  • Use this formula:

    =IF(B1=B2,"duplicate","unique")

  • Copy that formula down the column.
  • Use the filter function so that you are just seeing the duplicates.
  • Delete those rows. Note: If you have tens of thousands of rows to delete, the spreadsheet may crash. A workaround here is to use “Clear Rows” instead of “Delete Rows” and then sort your domains column from A-Z once you are finished.

We’ve now got a list of one link from every domain linking to us.

Here’s the gif that shows each of these steps:

You may wonder why I didn’t use Excel’s dedupe function to simply deduplicate these entries. I have found that it doesn’t take much deduplication to crash Excel, which is why I do this step manually.

3. Finding patterns FTW!

Sometimes when you are auditing links, you’ll find that unnatural links have patterns. I LOVE when I see these, because sometimes I can quickly go through hundreds of links without having to check each one manually. Here is an example. Let’s say that your website has a bunch of spammy directory links. As you’re auditing you notice patterns such as one of these:

  • All of these directory links come from a url that contains …/computers/internet/item40682/
  • A whole bunch of spammy links that all come from a particular free subdomain like blogspot, wordpress, weebly, etc.
  • A lot of links that all contain a particular keyword for anchor text (this is assuming you’ve included anchor text in your spreadsheet when making it.)

You can quickly find all of these links and mark them as “disavow” or “keep” by doing the following:

  • Create a new column. In my example, I am going to create a new column in Column C and look for patterns in URLs that are in Column B.
  • Use this formula:

    =FIND("/item40682",B1)
    (You would replace “item40682” with the phrase that you are looking for.)

  • Copy this formula down the column.
  • Filter your new column so that you are seeing any rows that have a number in this column. If the phrase doesn’t exist in that URL, the FIND formula returns an error value (#VALUE!) instead of a number, and we can ignore those rows.
  • Now you can mark all of these as “disavow.”

4. Check your disavow file

This next tip is one that you can use to check your disavow file across your list of domains that you want to audit. The goal here is to see which links you have disavowed so that you don’t waste time reassessing them. This particular tip only works for checking links that you have disavowed on the domain level.

The first thing you’ll want to do is download your current disavow file from Google. For some strange reason, Google gives you the disavow file in CSV format. I have never understood this because they want you to upload the file in .txt. Still, I guess this is what works best for Google. All of your entries will be in column A of the CSV:

What we are going to do now is add these to a new sheet on our current spreadsheet and use a VLOOKUP function to mark which of our domains we have disavowed.

Here are the steps:

  • Create a new sheet on your current spreadsheet workbook.
  • Copy and paste column A from your disavow spreadsheet onto this new sheet. Or, alternatively, use the import function to import the entire CSV onto this sheet.
  • In B1, write “previously disavowed” and copy this down the entire column.
  • Remove the “domain:” from each of the entries by doing a Find and Replace to replace domain: with a blank.
  • Now go back to your link audit spreadsheet. If your domains are in column A and if you had, say, 1500 domains in your disavow file, your formula would look like this:

    =VLOOKUP(A1,Sheet2!$A$1:$B$1500,2,FALSE)

When you copy this formula down the spreadsheet, it will check each of your domains, and if it finds the domain in Sheet 2, it will write “previously disavowed” on our link audit spreadsheet.

Here is a gif that shows the process:

5. Make monthly or quarterly disavow work easier

That same formula described above is a great one to use if you are doing regular repeated link audits. In this case, your second sheet on your spreadsheet would contain domains that you have previously audited, and column B of this spreadsheet would say, “previously audited” rather than “previously disavowed”.

Your tips?

These are just a few of the formulas that you can use to help make link auditing work easier. But there are lots of other things you can do with Excel or Google Sheets to help speed up the process as well. If you have some tips to add, leave a comment below. Also, if you need clarification on any of these tips, I’m happy to answer questions in the comments section.

