Controlling Search Engine Crawlers for Better Indexation and Rankings – Whiteboard Friday

Posted by randfish

When should you disallow search engines in your robots.txt file, and when should you use meta robots tags in a page header? What about nofollowing links? In today’s Whiteboard Friday, Rand covers these tools and their appropriate use in four situations that SEOs commonly find themselves facing.

For reference, here’s a still of this week’s whiteboard. Click on it to open a high resolution image in a new tab!

Video transcription

Howdy Moz fans, and welcome to another edition of Whiteboard Friday. This week we’re going to talk about controlling search engine crawlers, blocking bots, sending bots where we want, restricting them from where we don’t want them to go. We’re going to talk a little bit about crawl budget and what you should and shouldn’t have indexed.

As a start, what I want to do is discuss the ways in which we can control robots. Those include the three primary ones: robots.txt, meta robots, and—well, the nofollow tag is a little bit less about controlling bots.

There are a few others that we’re going to discuss as well, including Webmaster Tools (Search Console) and URL status codes. But let’s dive into those first few first.

Robots.txt lives at yoursite.com/robots.txt, it tells crawlers what they should and shouldn’t access, it doesn’t always get respected by Google and Bing. So a lot of folks when you say, “hey, disallow this,” and then you suddenly see those URLs popping up and you’re wondering what’s going on, look—Google and Bing oftentimes think that they just know better. They think that maybe you’ve made a mistake, they think “hey, there’s a lot of links pointing to this content, there’s a lot of people who are visiting and caring about this content, maybe you didn’t intend for us to block it.” The more specific you get about an individual URL, the better they usually are about respecting it. The less specific, meaning the more you use wildcards or say “everything behind this entire big directory,” the worse they are about necessarily believing you.

Meta robots—a little different—that lives in the headers of individual pages, so you can only control a single page with a meta robots tag. That tells the engines whether or not they should keep a page in the index, and whether they should follow the links on that page, and it’s usually a lot more respected, because it’s at an individual-page level; Google and Bing tend to believe you about the meta robots tag.

And then the nofollow tag, that lives on an individual link on a page. It doesn’t tell engines where to crawl or not to crawl. All it’s saying is whether you editorially vouch for a page that is being linked to, and whether you want to pass the PageRank and link equity metrics to that page.

Interesting point about meta robots and robots.txt working together (or not working together so well)—many, many folks in the SEO world do this and then get frustrated.

What if, for example, we take a page like “blogtest.html” on our domain and we say “all user agents, you are not allowed to crawl blogtest.html. Okay—that’s a good way to keep that page away from being crawled, but just because something is not crawled doesn’t necessarily mean it won’t be in the search results.

So then we have our SEO folks go, “you know what, let’s make doubly sure that doesn’t show up in search results; we’ll put in the meta robots tag:”

<meta name="robots" content="noindex, follow">

So, “noindex, follow” tells the search engine crawler they can follow the links on the page, but they shouldn’t index this particular one.

Then, you go and run a search for “blog test” in this case, and everybody on the team’s like “What the heck!? WTF? Why am I seeing this page show up in search results?”

The answer is, you told the engines that they couldn’t crawl the page, so they didn’t. But they are still putting it in the results. They’re actually probably not going to include a meta description; they might have something like “we can’t include a meta description because of this site’s robots.txt file.” The reason it’s showing up is because they can’t see the noindex; all they see is the disallow.

So, if you want something truly removed, unable to be seen in search results, you can’t just disallow a crawler. You have to say meta “noindex” and you have to let them crawl it.

So this creates some complications. Robots.txt can be great if we’re trying to save crawl bandwidth, but it isn’t necessarily ideal for preventing a page from being shown in the search results. I would not recommend, by the way, that you do what we think Twitter recently tried to do, where they tried to canonicalize www and non-www by saying “Google, don’t crawl the www version of twitter.com.” What you should be doing is rel canonical-ing or using a 301.

Meta robots—that can allow crawling and link-following while disallowing indexation, which is great, but it requires crawl budget and you can still conserve indexing.

The nofollow tag, generally speaking, is not particularly useful for controlling bots or conserving indexation.

Webmaster Tools (now Google Search Console) has some special things that allow you to restrict access or remove a result from the search results. For example, if you have 404’d something or if you’ve told them not to crawl something but it’s still showing up in there, you can manually say “don’t do that.” There are a few other crawl protocol things that you can do.

And then URL status codes—these are a valid way to do things, but they’re going to obviously change what’s going on on your pages, too.

If you’re not having a lot of luck using a 404 to remove something, you can use a 410 to permanently remove something from the index. Just be aware that once you use a 410, it can take a long time if you want to get that page re-crawled or re-indexed, and you want to tell the search engines “it’s back!” 410 is permanent removal.

301—permanent redirect, we’ve talked about those here—and 302, temporary redirect.

Now let’s jump into a few specific use cases of “what kinds of content should and shouldn’t I allow engines to crawl and index” in this next version…

[Rand moves at superhuman speed to erase the board and draw part two of this Whiteboard Friday. Seriously, we showed Roger how fast it was, and even he was impressed.]

Four crawling/indexing problems to solve

So we’ve got these four big problems that I want to talk about as they relate to crawling and indexing.

1. Content that isn’t ready yet

The first one here is around, “If I have content of quality I’m still trying to improve—it’s not yet ready for primetime, it’s not ready for Google, maybe I have a bunch of products and I only have the descriptions from the manufacturer and I need people to be able to access them, so I’m rewriting the content and creating unique value on those pages… they’re just not ready yet—what should I do with those?”

My options around crawling and indexing? If I have a large quantity of those—maybe thousands, tens of thousands, hundreds of thousands—I would probably go the robots.txt route. I’d disallow those pages from being crawled, and then eventually as I get (folder by folder) those sets of URLs ready, I can then allow crawling and maybe even submit them to Google via an XML sitemap.

If I’m talking about a small quantity—a few dozen, a few hundred pages—well, I’d probably just use the meta robots noindex, and then I’d pull that noindex off of those pages as they are made ready for Google’s consumption. And then again, I would probably use the XML sitemap and start submitting those once they’re ready.

2. Dealing with duplicate or thin content

What about, “Should I noindex, nofollow, or potentially disallow crawling on largely duplicate URLs or thin content?” I’ve got an example. Let’s say I’m an ecommerce shop, I’m selling this nice Star Wars t-shirt which I think is kind of hilarious, so I’ve got starwarsshirt.html, and it links out to a larger version of an image, and that’s an individual HTML page. It links out to different colors, which change the URL of the page, so I have a gray, blue, and black version. Well, these four pages are really all part of this same one, so I wouldn’t recommend disallowing crawling on these, and I wouldn’t recommend noindexing them. What I would do there is a rel canonical.

Remember, rel canonical is one of those things that can be precluded by disallowing. So, if I were to disallow these from being crawled, Google couldn’t see the rel canonical back, so if someone linked to the blue version instead of the default version, now I potentially don’t get link credit for that. So what I really want to do is use the rel canonical, allow the indexing, and allow it to be crawled. If you really feel like it, you could also put a meta “noindex, follow” on these pages, but I don’t really think that’s necessary, and again that might interfere with the rel canonical.

3. Passing link equity without appearing in search results

Number three: “If I want to pass link equity (or at least crawling) through a set of pages without those pages actually appearing in search results—so maybe I have navigational stuff, ways that humans are going to navigate through my pages, but I don’t need those appearing in search results—what should I use then?”

What I would say here is, you can use the meta robots to say “don’t index the page, but do follow the links that are on that page.” That’s a pretty nice, handy use case for that.

Do NOT, however, disallow those in robots.txt—many, many folks make this mistake. What happens if you disallow crawling on those, Google can’t see the noindex. They don’t know that they can follow it. Granted, as we talked about before, sometimes Google doesn’t obey the robots.txt, but you can’t rely on that behavior. Trust that the disallow in robots.txt will prevent them from crawling. So I would say, the meta robots “noindex, follow” is the way to do this.

4. Search results-type pages

Finally, fourth, “What should I do with search results-type pages?” Google has said many times that they don’t like your search results from your own internal engine appearing in their search results, and so this can be a tricky use case.

Sometimes a search result page—a page that lists many types of results that might come from a database of types of content that you’ve got on your site—could actually be a very good result for a searcher who is looking for a wide variety of content, or who wants to see what you have on offer. Yelp does this: When you say, “I’m looking for restaurants in Seattle, WA,” they’ll give you what is essentially a list of search results, and Google does want those to appear because that page provides a great result. But you should be doing what Yelp does there, and make the most common or popular individual sets of those search results into category-style pages. A page that provides real, unique value, that’s not just a list of search results, that is more of a landing page than a search results page.

However, that being said, if you’ve got a long tail of these, or if you’d say “hey, our internal search engine, that’s really for internal visitors only—it’s not useful to have those pages show up in search results, and we don’t think we need to make the effort to make those into category landing pages.” Then you can use the disallow in robots.txt to prevent those.

Just be cautious here, because I have sometimes seen an over-swinging of the pendulum toward blocking all types of search results, and sometimes that can actually hurt your SEO and your traffic. Sometimes those pages can be really useful to people. So check your analytics, and make sure those aren’t valuable pages that should be served up and turned into landing pages. If you’re sure, then go ahead and disallow all your search results-style pages. You’ll see a lot of sites doing this in their robots.txt file.

That being said, I hope you have some great questions about crawling and indexing, controlling robots, blocking robots, allowing robots, and I’ll try and tackle those in the comments below.

We’ll look forward to seeing you again next week for another edition of Whiteboard Friday. Take care!

Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don’t have time to hunt down but want to read!

Reblogged 4 years ago from tracking.feedpress.it

How Much Has Link Building Changed in Recent Years?

Posted by Paddy_Moogan

I get asked this question a lot. It’s mainly asked by people who are considering buying my link building book and want to know whether it’s still up to date. This is understandable given that the first edition was published in February 2013 and our industry has a deserved reputation for always changing.

I find myself giving the same answer, even though I’ve been asked it probably dozens of times in the last two years—”not that much”. I don’t think this is solely due to the book itself standing the test of time, although I’ll happily take a bit of credit for that 🙂 I think it’s more a sign of our industry as a whole not changing as much as we’d like to think.

I started to question myself and if I was right and honestly, it’s one of the reasons it has taken me over two years to release the second edition of the book.

So I posed this question to a group of friends not so long ago, some via email and some via a Facebook group. I was expecting to be called out by many of them because my position was that in reality, it hasn’t actually changed that much. The thing is, many of them agreed and the conversations ended with a pretty long thread with lots of insights. In this post, I’d like to share some of them, share what my position is and talk about what actually has changed.

My personal view

Link building hasn’t changed as much we think it has.

The core principles of link building haven’t changed. The signals around link building have changed, but mainly around new machine learning developments that have indirectly affected what we do. One thing that has definitely changed is the mindset of SEOs (and now clients) towards link building.

I think the last big change to link building came in April 2012 when Penguin rolled out. This genuinely did change our industry and put to bed a few techniques that should never have worked so well in the first place.

Since then, we’ve seen some things change, but the core principles haven’t changed if you want to build a business that will be around for years to come and not run the risk of being hit by a link related Google update. For me, these principles are quite simple:

  • You need to deserve links – either an asset you create or your product
  • You need to put this asset in front of a relevant audience who have the ability to share it
  • You need consistency – one new asset every year is unlikely to cut it
  • Anything that scales is at risk

For me, the move towards user data driving search results + machine learning has been the biggest change we’ve seen in recent years and it’s still going.

Let’s dive a bit deeper into all of this and I’ll talk about how this relates to link building.

The typical mindset for building links has changed

I think that most SEOs are coming round to the idea that you can’t get away with building low quality links any more, not if you want to build a sustainable, long-term business. Spammy link building still works in the short-term and I think it always will, but it’s much harder than it used to be to sustain websites that are built on spam. The approach is more “churn and burn” and spammers are happy to churn through lots of domains and just make a small profit on each one before moving onto another.

For everyone else, it’s all about the long-term and not putting client websites at risk.

This has led to many SEOs embracing different forms of link building and generally starting to use content as an asset when it comes to attracting links. A big part of me feels that it was actually Penguin in 2012 that drove the rise of content marketing amongst SEOs, but that’s a post for another day…! For today though, this goes some way towards explain the trend we see below.

Slowly but surely, I’m seeing clients come to my company already knowing that low quality link building isn’t what they want. It’s taken a few years after Penguin for it to filter down to client / business owner level, but it’s definitely happening. This is a good thing but unfortunately, the main reason for this is that most of them have been burnt in the past by SEO companies who have built low quality links without giving thought to building good quality ones too.

I have no doubt that it’s this change in mindset which has led to trends like this:

The thing is, I don’t think this was by choice.

Let’s be honest. A lot of us used the kind of link building tactics that Google no longer like because they worked. I don’t think many SEOs were under the illusion that it was genuinely high quality stuff, but it worked and it was far less risky to do than it is today. Unless you were super-spammy, the low-quality links just worked.

Fast forward to a post-Penguin world, things are far more risky. For me, it’s because of this that we see the trends like the above. As an industry, we had the easiest link building methods taken away from us and we’re left with fewer options. One of the main options is content marketing which, if you do it right, can lead to good quality links and importantly, the types of links you won’t be removing in the future. Get it wrong and you’ll lose budget and lose the trust if your boss or client in the power of content when it comes to link building.

There are still plenty of other methods to build links and sometimes we can forget this. Just look at this epic list from Jon Cooper. Even with this many tactics still available to us, it’s hard work. Way harder than it used to be.

My summary here is that as an industry, our mindset has shifted but it certainly wasn’t a voluntary shift. If the tactics that Penguin targeted still worked today, we’d still be using them.

A few other opinions…

I definitely think too many people want the next easy win. As someone surfing the edge of what Google is bringing our way, here’s my general take—SEO, in broad strokes, is changing a lot, *but* any given change is more and more niche and impacts fewer people. What we’re seeing isn’t radical, sweeping changes that impact everyone, but a sort of modularization of SEO, where we each have to be aware of what impacts our given industries, verticals, etc.”

Dr. Pete

 

I don’t feel that techniques for acquiring links have changed that much. You can either earn them through content and outreach or you can just buy them. What has changed is the awareness of “link building” outside of the SEO community. This makes link building / content marketing much harder when pitching to journalists and even more difficult when pitching to bloggers.

“Link building has to be more integrated with other channels and struggles to work in its own environment unless supported by brand, PR and social. Having other channels supporting your link development efforts also creates greater search signals and more opportunity to reach a bigger audience which will drive a greater ROI.

Carl Hendy

 

SEO has grown up in terms of more mature staff and SEOs becoming more ingrained into businesses so there is a smarter (less pressure) approach. At the same time, SEO has become more integrated into marketing and has made marketing teams and decision makers more intelligent in strategies and not pushing for the quick win. I’m also seeing that companies who used to rely on SEO and building links have gone through IPOs and the need to build 1000s of links per quarter has rightly reduced.

Danny Denhard

Signals that surround link building have changed

There is no question about this one in my mind. I actually wrote about this last year in my previous blog post where I talked about signals such as anchor text and deep links changing over time.

Many of the people I asked felt the same, here are some quotes from them, split out by the types of signal.

Domain level link metrics

I think domain level links have become increasingly important compared with page level factors, i.e. you can get a whole site ranking well off the back of one insanely strong page, even with sub-optimal PageRank flow from that page to the rest of the site.

Phil Nottingham

I’d agree with Phil here and this is what I was getting at in my previous post on how I feel “deep links” will matter less over time. It’s not just about domain level links here, it’s just as much about the additional signals available for Google to use (more on that later).

Anchor text

I’ve never liked anchor text as a link signal. I mean, who actually uses exact match commercial keywords as anchor text on the web?

SEOs. 🙂

Sure there will be natural links like this, but honestly, I struggle with the idea that it took Google so long to start turning down the dial on commercial anchor text as a ranking signal. They are starting to turn it down though, slowly but surely. Don’t get me wrong, it still matters and it still works. But like pure link spam, the barrier is a lot more lower now in terms what of constitutes too much.

Rand feels that they matter more than we’d expect and I’d mostly agree with this statement:

Exact match anchor text links still have more power than you’d expect—I think Google still hasn’t perfectly sorted what is “brand” or “branded query” from generics (i.e. they want to start ranking a new startup like meldhome.com for “Meld” if the site/brand gets popular, but they can’t quite tell the difference between that and https://moz.com/learn/seo/redirection getting a few manipulative links that say “redirect”)

Rand Fishkin

What I do struggle with though, is that Google still haven’t figured this out and that short-term, commercial anchor text spam is still so effective. Even for a short burst of time.

I don’t think link building as a concept has changed loads—but I think links as a signal have, mainly because of filters and penalties but I don’t see anywhere near the same level of impact from coverage anymore, even against 18 months ago.

Paul Rogers

New signals have been introduced

It isn’t just about established signals changing though, there are new signals too and I personally feel that this is where we’ve seen the most change in Google algorithms in recent years—going all the way back to Panda in 2011.

With Panda, we saw a new level of machine learning where it almost felt like Google had found a way of incorporating human reaction / feelings into their algorithms. They could then run this against a website and answer questions like the ones included in this post. Things such as:

  • “Would you be comfortable giving your credit card information to this site?”
  • “Does this article contain insightful analysis or interesting information that is beyond obvious?”
  • “Are the pages produced with great care and attention to detail vs. less attention to detail?”

It is a touch scary that Google was able to run machine learning against answers to questions like this and write an algorithm to predict the answers for any given page on the web. They have though and this was four years ago now.

Since then, they’ve made various moves to utilize machine learning and AI to build out new products and improve their search results. For me, this was one of the biggest and went pretty unnoticed by our industry. Well, until Hummingbird came along I feel pretty sure that we have Ray Kurzweil to thank for at least some of that.

There seems to be more weight on theme/topic related to sites, though it’s hard to tell if this is mostly link based or more user/usage data based. Google is doing a good job of ranking sites and pages that don’t earn the most links but do provide the most relevant/best answer. I have a feeling they use some combination of signals to say “people who perform searches like this seem to eventually wind up on this website—let’s rank it.” One of my favorite examples is the Audubon Society ranking for all sorts of birding-related searches with very poor keyword targeting, not great links, etc. I think user behavior patterns are stronger in the algo than they’ve ever been.

– Rand Fishkin

Leading on from what Rand has said, it’s becoming more and more common to see search results that just don’t make sense if you look at the link metrics—but are a good result.

For me, the move towards user data driving search results + machine learning advanced has been the biggest change we’ve seen in recent years and it’s still going.

Edit: since drafting this post, Tom Anthony released this excellent blog post on his views on the future of search and the shift to data-driven results. I’d recommend reading that as it approaches this whole area from a different perspective and I feel that an off-shoot of what Tom is talking about is the impact on link building.

You may be asking at this point, what does machine learning have to do with link building?

Everything. Because as strong as links are as a ranking signal, Google want more signals and user signals are far, far harder to manipulate than established link signals. Yes it can be done—I’ve seen it happen. There have even been a few public tests done. But it’s very hard to scale and I’d venture a guess that only the top 1% of spammers are capable of doing it, let alone maintaining it for a long period of time. When I think about the process for manipulation here, I actually think we go a step beyond spammers towards hackers and more cut and dry illegal activity.

For link building, this means that traditional methods of manipulating signals are going to become less and less effective as these user signals become stronger. For us as link builders, it means we can’t keep searching for that silver bullet or the next method of scaling link building just for an easy win. The fact is that scalable link building is always going to be at risk from penalization from Google—I don’t really want to live a life where I’m always worried about my clients being hit by the next update. Even if Google doesn’t catch up with a certain method, machine learning and user data mean that these methods may naturally become less effective and cost efficient over time.

There are of course other things such as social signals that have come into play. I certainly don’t feel like these are a strong ranking factor yet, but with deals like this one between Google and Twitter being signed, I wouldn’t be surprised if that ever-growing dataset is used at some point in organic results. The one advantage that Twitter has over Google is it’s breaking news freshness. Twitter is still way quicker at breaking news than Google is—140 characters in a tweet is far quicker than Google News! Google know this which is why I feel they’ve pulled this partnership back into existence after a couple of years apart.

There is another important point to remember here and it’s nicely summarised by Dr. Pete:

At the same time, as new signals are introduced, these are layers not replacements. People hear social signals or user signals or authorship and want it to be the link-killer, because they already fucked up link-building, but these are just layers on top of on-page and links and all of the other layers. As each layer is added, it can verify the layers that came before it and what you need isn’t the magic signal but a combination of signals that generally matches what Google expects to see from real, strong entities. So, links still matter, but they matter in concert with other things, which basically means it’s getting more complicated and, frankly, a bit harder. Of course, on one wants to hear that.”

– Dr. Pete

The core principles have not changed

This is the crux of everything for me. With all the changes listed above, the key is that the core principles around link building haven’t changed. I could even argue that Penguin didn’t change the core principles because the techniques that Penguin targeted should never have worked in the first place. I won’t argue this too much though because even Google advised website owners to build directory links at one time.

You need an asset

You need to give someone a reason to link to you. Many won’t do it out of the goodness of their heart! One of the most effective ways to do this is to develop a content asset and use this as your reason to make people care. Once you’ve made someone care, they’re more likely to share the content or link to it from somewhere.

You need to promote that asset to the right audience

I really dislike the stance that some marketers take when it comes to content promotion—build great content and links will come.

No. Sorry but for the vast majority of us, that’s simply not true. The exceptions are people that sky dive from space or have huge existing audiences to leverage.

You simply have to spend time promoting your content or your asset for it to get shares and links. It is hard work and sometimes you can spend a long time on it and get little return, but it’s important to keep working at until you’re at a point where you have two things:

  • A big enough audience where you can almost guarantee at least some traffic to your new content along with some shares
  • Enough strong relationships with relevant websites who you can speak to when new content is published and stand a good chance of them linking to it

Getting to this point is hard—but that’s kind of the point. There are various hacks you can use along the way but it will take time to get right.

You need consistency

Leading on from the previous point. It takes time and hard work to get links to your content—the types of links that stand the test of time and you’re not going to be removing in 12 months time anyway! This means that you need to keep pushing content out and getting better each and every time. This isn’t to say you should just churn content out for the sake of it, far from it. I am saying that with each piece of content you create, you will learn to do at least one thing better the next time. Try to give yourself the leverage to do this.

Anything scalable is at risk

Scalable link building is exactly what Google has been trying to crack down on for the last few years. Penguin was the biggest move and hit some of the most scalable tactics we had at our disposal. When you scale something, you often lose some level of quality, which is exactly what Google doesn’t want when it comes to links. If you’re still relying on tactics that could fall into the scalable category, I think you need to be very careful and just look at the trend in the types of links Google has been penalizing to understand why.

The part Google plays in this

To finish up, I want to briefly talk about the part that Google plays in all of this and shaping the future they want for the web.

I’ve always tried to steer clear of arguments involving the idea that Google is actively pushing FUD into the community. I’ve preferred to concentrate more on things I can actually influence and change with my clients rather than what Google is telling us all to do.

However, for the purposes of this post, I want to talk about it.

General paranoia has increased. My bet is there are some companies out there carrying out zero specific linkbuilding activity through worry.

Dan Barker

Dan’s point is a very fair one and just a day or two after reading this in an email, I came across a page related to a client’s target audience that said:

“We are not publishing guest posts on SITE NAME any more. All previous guest posts are now deleted. For more information, see www.mattcutts.com/blog/guest-blogging/“.

I’ve reworded this as to not reveal the name of the site, but you get the point.

This is silly. Honestly, so silly. They are a good site, publish good content, and had good editorial standards. Yet they have ignored all of their own policies, hard work, and objectives to follow a blog post from Matt. I’m 100% confident that it wasn’t sites like this one that Matt was talking about in this blog post.

This is, of course, from the publishers’ angle rather than the link builders’ angle, but it does go to show the effect that statements from Google can have. Google know this so it does make sense for them to push out messages that make their jobs easier and suit their own objectives—why wouldn’t they? In a similar way, what did they do when they were struggling to classify at scale which links are bad vs. good and they didn’t have a big enough web spam team? They got us to do it for them 🙂

I’m mostly joking here, but you see the point.

The most recent infamous mobilegeddon update, discussed here by Dr. Pete is another example of Google pushing out messages that ultimately scared a lot of people into action. Although to be fair, I think that despite the apparent small impact so far, the broad message from Google is a very serious one.

Because of this, I think we need to remember that Google does have their own agenda and many shareholders to keep happy. I’m not in the camp of believing everything that Google puts out is FUD, but I’m much more sensitive and questioning of the messages now than I’ve ever been.

What do you think? I’d love to hear your feedback and thoughts in the comments.

Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don’t have time to hunt down but want to read!

Reblogged 4 years ago from tracking.feedpress.it

What Deep Learning and Machine Learning Mean For the Future of SEO – Whiteboard Friday

Posted by randfish

Imagine a world where even the high-up Google engineers don’t know what’s in the ranking algorithm. We may be moving in that direction. In today’s Whiteboard Friday, Rand explores and explains the concepts of deep learning and machine learning, drawing us a picture of how they could impact our work as SEOs.

For reference, here’s a still of this week’s whiteboard!

Video transcription

Howdy, Moz fans, and welcome to another edition of Whiteboard Friday. This week we are going to take a peek into Google’s future and look at what it could mean as Google advances their machine learning and deep learning capabilities. I know these sound like big, fancy, important words. They’re not actually that tough of topics to understand. In fact, they’re simplistic enough that even a lot of technology firms like Moz do some level of machine learning. We don’t do anything with deep learning and a lot of neural networks. We might be going that direction.

But I found an article that was published in January, absolutely fascinating and I think really worth reading, and I wanted to extract some of the contents here for Whiteboard Friday because I do think this is tactically and strategically important to understand for SEOs and really important for us to understand so that we can explain to our bosses, our teams, our clients how SEO works and will work in the future.

The article is called “Google Search Will Be Your Next Brain.” It’s by Steve Levy. It’s over on Medium. I do encourage you to read it. It’s a relatively lengthy read, but just a fascinating one if you’re interested in search. It starts with a profile of Geoff Hinton, who was a professor in Canada and worked on neural networks for a long time and then came over to Google and is now a distinguished engineer there. As the article says, a quote from the article: “He is versed in the black art of organizing several layers of artificial neurons so that the entire system, the system of neurons, could be trained or even train itself to divine coherence from random inputs.”

This sounds complex, but basically what we’re saying is we’re trying to get machines to come up with outcomes on their own rather than us having to tell them all the inputs to consider and how to process those incomes and the outcome to spit out. So this is essentially machine learning. Google has used this, for example, to figure out when you give it a bunch of photos and it can say, “Oh, this is a landscape photo. Oh, this is an outdoor photo. Oh, this is a photo of a person.” Have you ever had that creepy experience where you upload a photo to Facebook or to Google+ and they say, “Is this your friend so and so?” And you’re like, “God, that’s a terrible shot of my friend. You can barely see most of his face, and he’s wearing glasses which he usually never wears. How in the world could Google+ or Facebook figure out that this is this person?”

That’s what they use, these neural networks, these deep machine learning processes for. So I’ll give you a simple example. Here at MOZ, we do machine learning very simplistically for page authority and domain authority. We take all the inputs — numbers of links, number of linking root domains, every single metric that you could get from MOZ on the page level, on the sub-domain level, on the root-domain level, all these metrics — and then we combine them together and we say, “Hey machine, we want you to build us the algorithm that best correlates with how Google ranks pages, and here’s a bunch of pages that Google has ranked.” I think we use a base set of 10,000, and we do it about quarterly or every 6 months, feed that back into the system and the system pumps out the little algorithm that says, “Here you go. This will give you the best correlating metric with how Google ranks pages.” That’s how you get page authority domain authority.

Cool, really useful, helpful for us to say like, “Okay, this page is probably considered a little more important than this page by Google, and this one a lot more important.” Very cool. But it’s not a particularly advanced system. The more advanced system is to have these kinds of neural nets in layers. So you have a set of networks, and these neural networks, by the way, they’re designed to replicate nodes in the human brain, which is in my opinion a little creepy, but don’t worry. The article does talk about how there’s a board of scientists who make sure Terminator 2 doesn’t happen, or Terminator 1 for that matter. Apparently, no one’s stopping Terminator 4 from happening? That’s the new one that’s coming out.

So one layer of the neural net will identify features. Another layer of the neural net might classify the types of features that are coming in. Imagine this for search results. Search results are coming in, and Google’s looking at the features of all the websites and web pages, your websites and pages, to try and consider like, “What are the elements I could pull out from there?”

Well, there’s the link data about it, and there are things that happen on the page. There are user interactions and all sorts of stuff. Then we’re going to classify types of pages, types of searches, and then we’re going to extract the features or metrics that predict the desired result, that a user gets a search result they really like. We have an algorithm that can consistently produce those, and then neural networks are hopefully designed — that’s what Geoff Hinton has been working on — to train themselves to get better. So it’s not like with PA and DA, our data scientist Matt Peters and his team looking at it and going, “I bet we could make this better by doing this.”

This is standing back and the guys at Google just going, “All right machine, you learn.” They figure it out. It’s kind of creepy, right?

In the original system, you needed those people, these individuals here to feed the inputs, to say like, “This is what you can consider, system, and the features that we want you to extract from it.”

Then unsupervised learning, which is kind of this next step, the system figures it out. So this takes us to some interesting places. Imagine the Google algorithm, circa 2005. You had basically a bunch of things in here. Maybe you’d have anchor text, PageRank and you’d have some measure of authority on a domain level. Maybe there are people who are tossing new stuff in there like, “Hey algorithm, let’s consider the location of the searcher. Hey algorithm, let’s consider some user and usage data.” They’re tossing new things into the bucket that the algorithm might consider, and then they’re measuring it, seeing if it improves.

But you get to the algorithm today, and gosh there are going to be a lot of things in there that are driven by machine learning, if not deep learning yet. So there are derivatives of all of these metrics. There are conglomerations of them. There are extracted pieces like, “Hey, we only ant to look and measure anchor text on these types of results when we also see that the anchor text matches up to the search queries that have previously been performed by people who also search for this.” What does that even mean? But that’s what the algorithm is designed to do. The machine learning system figures out things that humans would never extract, metrics that we would never even create from the inputs that they can see.

Then, over time, the idea is that in the future even the inputs aren’t given by human beings. The machine is getting to figure this stuff out itself. That’s weird. That means that if you were to ask a Google engineer in a world where deep learning controls the ranking algorithm, if you were to ask the people who designed the ranking system, “Hey, does it matter if I get more links,” they might be like, “Well, maybe.” But they don’t know, because they don’t know what’s in this algorithm. Only the machine knows, and the machine can’t even really explain it. You could go take a snapshot and look at it, but (a) it’s constantly evolving, and (b) a lot of these metrics are going to be weird conglomerations and derivatives of a bunch of metrics mashed together and torn apart and considered only when certain criteria are fulfilled. Yikes.

So what does that mean for SEOs. Like what do we have to care about from all of these systems and this evolution and this move towards deep learning, which by the way that’s what Jeff Dean, who is, I think, a senior fellow over at Google, he’s the dude that everyone mocks for being the world’s smartest computer scientist over there, and Jeff Dean has basically said, “Hey, we want to put this into search. It’s not there yet, but we want to take these models, these things that Hinton has built, and we want to put them into search.” That for SEOs in the future is going to mean much less distinct universal ranking inputs, ranking factors. We won’t really have ranking factors in the way that we know them today. It won’t be like, “Well, they have more anchor text and so they rank higher.” That might be something we’d still look at and we’d say, “Hey, they have this anchor text. Maybe that’s correlated with what the machine is finding, the system is finding to be useful, and that’s still something I want to care about to a certain extent.”

But we’re going to have to consider those things a lot more seriously. We’re going to have to take another look at them and decide and determine whether the things that we thought were ranking factors still are when the neural network system takes over. It also is going to mean something that I think many, many SEOs have been predicting for a long time and have been working towards, which is more success for websites that satisfy searchers. If the output is successful searches, and that’ s what the system is looking for, and that’s what it’s trying to correlate all its metrics to, if you produce something that means more successful searches for Google searchers when they get to your site, and you ranking in the top means Google searchers are happier, well you know what? The algorithm will catch up to you. That’s kind of a nice thing. It does mean a lot less info from Google about how they rank results.

So today you might hear from someone at Google, “Well, page speed is a very small ranking factor.” In the future they might be, “Well, page speed is like all ranking factors, totally unknown to us.” Because the machine might say, “Well yeah, page speed as a distinct metric, one that a Google engineer could actually look at, looks very small.” But derivatives of things that are connected to page speed may be huge inputs. Maybe page speed is something, that across all of these, is very well connected with happier searchers and successful search results. Weird things that we never thought of before might be connected with them as the machine learning system tries to build all those correlations, and that means potentially many more inputs into the ranking algorithm, things that we would never consider today, things we might consider wholly illogical, like, “What servers do you run on?” Well, that seems ridiculous. Why would Google ever grade you on that?

If human beings are putting factors into the algorithm, they never would. But the neural network doesn’t care. It doesn’t care. It’s a honey badger. It doesn’t care what inputs it collects. It only cares about successful searches, and so if it turns out that Ubuntu is poorly correlated with successful search results, too bad.

This world is not here yet today, but certainly there are elements of it. Google has talked about how Panda and Penguin are based off of machine learning systems like this. I think, given what Geoff Hinton and Jeff Dean are working on at Google, it sounds like this will be making its way more seriously into search and therefore it’s something that we’re really going to have to consider as search marketers.

All right everyone, I hope you’ll join me again next week for another edition of Whiteboard Friday. Take care.

Video transcription by Speechpad.com

Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don’t have time to hunt down but want to read!

Reblogged 4 years ago from tracking.feedpress.it

Is It Possible to Have Good SEO Simply by Having Great Content – Whiteboard Friday

Posted by randfish

This question, posed by Alex Moravek in our Q&A section, has a somewhat complicated answer. In today’s Whiteboard Friday, Rand discusses how organizations might perform well in search rankings without doing any link building at all, relying instead on the strength of their content to be deemed relevant and important by Google.

For reference, here’s a still of this week’s whiteboard!

Video transcription

Howdy Moz fans, and welcome to another edition of Whiteboard Friday. This week we’re chatting about is it possible to have good SEO simply by focusing on great content to the exclusion of link building.

This question was posed in the Moz Q&A Forum, which I deeply love, by Alex Moravek — I might not be saying your name right, Alex, and for that I apologize — from SEO Agencias in Madrid. My Spanish is poor, but my love for churros is so strong.

Alex, I think this is a great question. In fact, we get asked this all the time by all sorts of folks, particularly people in the blogging world and people with small and medium businesses who hear about SEO and go, “Okay, I think can make my website accessible, and yes, I can produce great content, but I just either don’t feel comfortable, don’t have time and energy, don’t understand, or just don’t feel okay with doing link building.” Link acquisition through an outreach and a manual process is beyond the scope of what they can fit into their marketing activities.

In fact, it is possible kind of, sort of. It is possible, but what you desperately need in order for this strategy to be possible are really two things. One is content exposure, and two you need time. I’ll explain why you need both of these things.

I’m going to dramatically simplify Google’s ranking algorithm. In fact, I’m going to simplify it so much that those of you who are SEO professionals are going to be like, “Oh God, Rand, you’re killing me.” I apologize in advance. Just bear with me a second.

We basically have keywords and on-page stuff, topical relevance, etc. All your topic modeling stuff might go in there. There’s content quality, all the factors that Google and Bing might measure around a content’s quality. There’s domain authority. There’s link-based authority based on the links that point to all the pages on a given domain that tell Google or Bing how important pages on this particular domain are.

There are probably some topical relevance elements in there, too. There’s page level authority. These could be all the algorithms you’ve heard of like PageRank and TrustRank, etc., and all the much more modern ones of those.

I’m not specifically talking about Moz scores here, the Moz scores DA and PA. Those are rough interpretations of these much more sophisticated formulas that the engines have.

There’s user and usage data, which we know the engines are using. They’ve talked about using that. There’s spam analysis.

Super simplistic. There are these six things, six broad categories of ranking elements. If you have just these four — keywords, on-page content quality, user and usage data, spam analysis, you’re not spammy — without these, without any domain authority or any page authority, it’s next to impossible to rank for competitive terms and very challenging and very unlikely to rank even for stuff in the chunky middle and long tail. Long tail you might rank for a few things if it’s very, very long tail. But these things taken together give you a sense of ranking ability.

Here’s what some marketers, some bloggers, some folks who invest in content nearly to the exclusion of links have found. They have had success with this strategy. They’ve basically elected to entirely ignore link building and let links come to them.

Instead of focusing on link building, they’re going to focus on product quality, press and public relations, social media, offline marketing, word of mouth, content strategy, email marketing, these other channels that can potentially earn them things. Advertising as well potentially could be in here.

What they rely on is that people find them through these other channels. They find them through social, through ads, through offline, through blogs, through very long tail search, through their content, maybe their email marketing list, word of mouth, press. All of these things are discovery mechanisms that are not search.

Once people get to the site, then these websites rely on the fact that, because of the experience people have, the quality of their products, of their content, because all of that stuff is so good, they’re going to earn links naturally.

This is a leap. In fact, for many SEOs, this is kind of a crazy leap to make, because there are so many things that you can do that will nudge people in this link earning direction. We’ve talked about a number of those at Moz. Of course, if you visit the link building section of our blog, there are hundreds if not thousands of great strategies around this.

These folks have elected to ignore all that link building stuff, let the links come to them, and these signals, these people who visit via other channels eventually lead to links which lead to DA, PA ranking ability. I don’t think this strategy is for everyone, but it is possible.

I think in the utopia that Larry Page and Sergey Brin from Google imagined when they were building their first search engine this is, in fact, how they hoped that the web would work. They hoped that people wouldn’t be out actively gaming and manipulating the web’s link graph, but rather that all the links would be earned naturally and editorially.

I think that’s a very, very optimistic and almost naive way of thinking about it. Remember, they were college students at the time. Maybe they were eating their granola, and dancing around, and hoping that everyone on the web would link only for editorial reasons. Not to make fun of granola. I love granola, especially, oh man, with those acai berries. Bowls of those things are great.

This is a potential strategy if you are very uncomfortable with link building and you feel like you can optimize this process. You have all of these channels going on.

For SEOs who are thinking, “Rand, I’m never going to ignore link building,” you can still get a tremendous amount out of thinking about how you optimize the return on investment and especially the exposure that you receive from these and how that might translate naturally into links.

I find looking at websites that accomplish SEO without active link building fascinating, because they have editorially earned those links through very little intentional effort on their own. I think there’s a tremendous amount that we can take away from that process and optimize around this.

Alex, yes, this is possible. Would I recommend it? Only in a very few instances. I think that there’s a ton that SEOs can do to optimize and nudge and create intelligent, non-manipulative ways of earning links that are a little more powerful than just sitting back and waiting, but it is possible.

All right, everyone. Thanks for joining us, and we’ll see you again next week for another edition of Whiteboard Friday. Take care.

Video transcription by Speechpad.com

Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don’t have time to hunt down but want to read!

Reblogged 5 years ago from feedproxy.google.com