A new timesaver: Best New Links

We have updated the “new” tab in the Majestic toolset so that it is easier to work through the data and identify new links within the Fresh Index for a particular time frame, because whilst our New Links tool is a great way to work through a comprehensive list of the…

The post A new timesaver: Best New Links appeared first on Majestic Blog.

Reblogged 4 years ago from blog.majestic.com

Why Effective, Modern SEO Requires Technical, Creative, and Strategic Thinking – Whiteboard Friday

Posted by randfish

There’s no doubt that quite a bit has changed about SEO, and that the field is far more integrated with other aspects of online marketing than it once was. In today’s Whiteboard Friday, Rand pushes back against the idea that effective modern SEO doesn’t require any technical expertise, outlining a fantastic list of technical elements that today’s SEOs need to know about in order to be truly effective.

For reference, here’s a still of this week’s whiteboard. Click on it to open a high resolution image in a new tab!

Video transcription

Howdy, Moz fans, and welcome to another edition of Whiteboard Friday. This week I’m going to do something unusual. I don’t usually point out these inconsistencies or sort of take issue with other folks’ content on the web, because I generally find that that’s not all that valuable and useful. But I’m going to make an exception here.

There is an article by Jayson DeMers, who I think might actually be here in Seattle — maybe he and I can hang out at some point — called “Why Modern SEO Requires Almost No Technical Expertise.” It was an article that got a shocking amount of traction and attention. On Facebook, it has thousands of shares. On LinkedIn, it did really well. On Twitter, it got a bunch of attention.

Some folks in the SEO world have already pointed out some issues around this. But because of the increasing popularity of this article, and because there’s this hopefulness from worlds outside of the hardcore SEO world that are looking to this piece and going, “Look, this is great. We don’t have to be technical. We don’t have to worry about technical things in order to do SEO,” I felt it was worth addressing directly.

Look, I completely get the appeal of that. I did want to point out some of the reasons why this is not so accurate. At the same time, I don’t want to rain on Jayson, because I think that it’s very possible he’s writing an article for Entrepreneur, maybe he has sort of a commitment to them. Maybe he had no idea that this article was going to spark so much attention and investment. He does make some good points. I think it’s just really the title and then some of the messages inside there that I take strong issue with, and so I wanted to bring those up.

First off, some of the good points he did bring up.

One, he wisely says, “You don’t need to know how to code or to write and read algorithms in order to do SEO.” I totally agree with that. If today you’re looking at SEO and you’re thinking, “Well, am I going to get more into this subject? Am I going to try investing in SEO? But I don’t even know HTML and CSS yet.”

Those are good skills to have, and they will help you in SEO, but you don’t need them. Jayson’s totally right. You don’t have to have them, and you can learn and pick up some of these things, and do searches, watch some Whiteboard Fridays, check out some guides, and pick up a lot of that stuff later on as you need it in your career. SEO doesn’t have that hard requirement.

And secondly, he makes an intelligent point that we’ve made many times here at Moz, which is that, broadly speaking, a better user experience is well correlated with better rankings.

You make a great website that delivers great user experience, that provides the answers to searchers’ questions and gives them extraordinarily good content, way better than what’s out there already in the search results, generally speaking you’re going to see happy searchers, and that’s going to lead to higher rankings.

But not entirely. There are a lot of other elements that go in here. So I’ll bring up some frustrating points around the piece as well.

First off, there’s no acknowledgment — and I find this a little disturbing — that the ability to read and write code, or even HTML and CSS, which I think are the basic place to start, is helpful or can take your SEO efforts to the next level. I think both of those things are true.

So being able to look at a web page, view source on it, or pull up Firebug in Firefox or something and diagnose what’s going on and then go, “Oh, that’s why Google is not able to see this content. That’s why we’re not ranking for this keyword or term, or why even when I enter this exact sentence in quotes into Google, which is on our page, this is why it’s not bringing it up. It’s because it’s loading it after the page from a remote file that Google can’t access.” These are technical things, and being able to see how that code is built, how it’s structured, and what’s going on there is very, very helpful.

Some coding knowledge also can take your SEO efforts even further. I mean, so many times, SEOs are stymied by the conversations that we have with our programmers and our developers and the technical staff on our teams. When we can have those conversations intelligently, because at least we understand the principles of how an if-then statement works, or what software engineering best practices are being used, or they can upload something into a GitHub repository, and we can take a look at it there, that kind of stuff is really helpful.

Secondly, I don’t like that the article overly reduces all of this information that we have about what we’ve learned about Google. So he mentions two sources. One is things that Google tells us, and the other is SEO experiments. I think both of those are true. Although I’d add that there’s sort of a sixth sense of knowledge that we gain over time from looking at many, many search results and kind of having this feel for why things rank, and what might be wrong with a site, and getting really good at that using tools and data as well. There are people who can look at Open Site Explorer and then go, “Aha, I bet this is going to happen.” They can look, and 90% of the time they’re right.

So he boils this down to, one, write quality content, and two, reduce your bounce rate. Neither of those things are wrong. You should write quality content, although I’d argue there are lots of other forms of quality content that aren’t necessarily written — video, images and graphics, podcasts, lots of other stuff.

And secondly, that just doing those two things is not always enough. So you can see, like many, many folks look and go, “I have quality content. It has a low bounce rate. How come I don’t rank better?” Well, your competitors, they’re also going to have quality content with a low bounce rate. That’s not a very high bar.

Also, frustratingly, this really gets in my craw. I don’t think “write quality content” means anything. You tell me. When you hear that, to me that is a totally non-actionable, non-useful phrase that’s a piece of advice that is so generic as to be discardable. So I really wish that there was more substance behind that.

The article also makes, in my opinion, the totally inaccurate claim that modern SEO really is reduced to “the happier your users are when they visit your site, the higher you’re going to rank.”

Wow. Okay. Again, I think broadly these things are correlated. User happiness and rank are broadly correlated, but it’s not a one to one. This is not like a, “Oh, well, that’s a 1.0 correlation.”

I would guess that the correlation is probably closer to like the page authority range. I bet it’s like 0.35 or something correlation. If you were to actually measure this broadly across the web and say like, “Hey, were you happier with result one, two, three, four, or five,” the ordering would not be perfect at all. It probably wouldn’t even be close.

There’s a ton of reasons why sometimes someone who ranks on Page 2 or Page 3 or doesn’t rank at all for a query is doing a better piece of content than the person who does rank well or ranks on Page 1, Position 1.

Then the article suggests five (and sort of a half) steps to successful modern SEO, which I think is a really incomplete list. So Jayson gives us:

  • Good on-site experience
  • Writing good content
  • Getting others to acknowledge you as an authority
  • Rising in social popularity
  • Earning local relevance
  • Dealing with modern CMS systems (he notes that most modern CMS systems are SEO-friendly)

The thing is there’s nothing actually wrong with any of these. They’re all, generally speaking, correct, either directly or indirectly related to SEO. The one about local relevance, I have some issue with, because he doesn’t note that there’s a separate algorithm for sort of how local SEO is done and how Google ranks local sites in maps and in their local search results. Also not noted is that rising in social popularity won’t necessarily directly help your SEO, although it can have indirect and positive benefits.

I feel like this list is super incomplete. Okay, in the 10 minutes before we filmed this video, I brainstormed a list just off the top of my head. The list was so long that, as you can see, I filled up the whole whiteboard and then didn’t have any more room. I’m not going to bother to erase and go try and be absolutely complete.

But there’s a huge, huge number of things that are important, critically important for technical SEO. If you don’t know how to do these things, you are sunk in many cases. You can’t be an effective SEO analyst, or consultant, or in-house team member, because you simply can’t diagnose the potential problems, rectify those potential problems, identify strategies that your competitors are using, be able to diagnose a traffic gain or loss. You have to have these skills in order to do that.

I’ll run through these quickly, but really the idea is just that this list is so huge and so long that I think it’s very, very, very wrong to say technical SEO is behind us. I almost feel like the opposite is true.

We have to be able to understand things like:

  • Content rendering and indexability
  • Crawl structure, internal links, JavaScript, Ajax. If something’s loading after the initial page render and Google’s not able to index it, or there are links that are only accessible via JavaScript or Ajax, maybe Google can’t see those or isn’t crawling them as effectively, or is crawling them but isn’t assigning them as much link weight as it assigns other links, and you’ve also made it tough to link to them externally, so they don’t get crawled properly.
  • Disabling crawling and/or indexing of thin, incomplete, or non-search-targeted content. We have a bunch of search results pages. Should we use rel=prev/next? Should we robots.txt those out? Should we keep them out of the index with meta robots? Should we rel=canonical them to other pages? Should we exclude them via the protocols inside Google Webmaster Tools, which is now Google Search Console?
  • Managing redirects, domain migrations, content updates. A new piece of content comes out, replacing an old piece of content, what do we do with that old piece of content? What’s the best practice? It varies by different things. We have a whole Whiteboard Friday about the different things that you could do with that. What about a big redirect or a domain migration? You buy another company and you’re redirecting their site to your site. You have to understand things about subdomain structures versus subfolders, which, again, we’ve done another Whiteboard Friday about that.
  • Proper error codes, downtime procedures, and not found pages. If your 404 pages turn out to all be 200 pages, well, now you’ve made a big error there, and Google could be crawling tons of 404 pages that they think are real pages, because you’ve made them return a status code 200. Or maybe you’ve used a 404 code when you should have used a 410, which signals permanent removal, to get the page completely out of the indexes, as opposed to having Google revisit it and keep it in the index.

Downtime procedures. There’s a specific 5xx code you can use for this, a 503, that says, “Revisit later. We’re having some downtime right now.” Google urges you to use that specific code rather than a 404, which tells them, “This page is now an error.”

Disney had that problem a while ago, if you guys remember, where they 404ed all their pages during an hour of downtime, and then their homepage, when you searched for Disney World, was, like, “Not found.” Oh, jeez, Disney World, not so good.
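For illustration only (this is standard HTTP mechanics, not an official snippet from Google), a planned-downtime response with a Retry-After hint might look like this:

HTTP/1.1 503 Service Unavailable
Retry-After: 3600
Content-Type: text/html

<html><body>Down for maintenance. Please check back in about an hour.</body></html>

The 503 tells crawlers the outage is temporary, and the Retry-After value (in seconds) suggests when to come back.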

  • International and multi-language targeting issues. I won’t go into that, but you have to know the protocols there.
  • Duplicate content, syndication, scrapers. How do we handle all that? Somebody else wants to take our content and put it on their site: what should we do? Someone’s scraping our content: what can we do? We have duplicate content on our own site: what should we do?
  • Diagnosing traffic drops via analytics and metrics. Being able to look at a rankings report, being able to look at analytics, connecting those up and trying to see: Why did we go up or down? Did we have fewer pages indexed or more pages indexed? More pages getting less traffic? Fewer keywords sending traffic?
  • Understanding advanced search parameters. Today, just today, I was checking out the related: parameter in Google, which is fascinating for most sites. Well, for Moz, weirdly, related:oursite.com shows nothing. But for virtually every other site, well, most other sites on the web, it does show some really interesting data, and you can see how Google is connecting up, essentially, intentions and topics from different sites and pages, which can be fascinating, could expose opportunities for links, could expose understanding of how they view your site versus your competition or who they think your competition is.

Then there are tons of parameters, like inurl: and inanchor:, and so on. Inanchor doesn’t work anymore, never mind about that one.

I have to go faster, because we’re just going to run out of time. Like, come on.

  • Interpreting and leveraging data in Google Search Console. If you don’t know how to use that, Google could be telling you that you have all sorts of errors, and you don’t know what they are.

  • Leveraging topic modeling and extraction. Using all these cool tools that are coming out for better keyword research and better on-page targeting. I talked about a couple of those at MozCon, like MonkeyLearn. There’s the new Moz Context API, which will be coming out soon, around that. There’s the Alchemy API, which a lot of folks really like and use.
  • Identifying and extracting opportunities based on site crawls. You run a Screaming Frog crawl on your site and you’re going, “Oh, here’s all these problems and issues.” If you don’t have these technical skills, you can’t diagnose that. You can’t figure out what’s wrong. You can’t figure out what needs fixing, what needs addressing.
  • Using rich snippet format to stand out in the SERPs. This is just getting a better click-through rate, which can seriously help your site and obviously your traffic.
  • Applying Google-supported protocols like rel=canonical, meta description, rel=prev/next, hreflang, robots.txt, meta robots, X-Robots-Tag, NOODP, XML sitemaps, rel=nofollow; the list goes on and on and on (a few of these are shown after this list). If you’re not technical and you don’t know what those are, and you think you just need to write good content and lower your bounce rate, it’s not going to work.
  • Using APIs from services like AdWords, Mozscape, Ahrefs, Majestic, SEMrush, or the Alchemy API. Those APIs can do powerful things for your site. There are some powerful problems they could help you solve if you know how to use them. It’s actually not that hard to write something, even inside a Google Doc or Excel, to pull from an API and get some data in there. There’s a bunch of good tutorials out there. Richard Baxter has one, Annie Cushing has one, I think Distilled has some. So really cool stuff there.
  • Diagnosing page load speed issues, which goes right to what Jayson was talking about. You need that fast-loading page. Well, if you don’t have any technical skills, you can’t figure out why your page might not be loading quickly.
  • Diagnosing mobile friendliness issues
  • Advising app developers on the new protocols around App deep linking, so that you can get the content from your mobile apps into the web search results on mobile devices. Awesome. Super powerful. Potentially crazy powerful, as mobile search is becoming bigger than desktop.
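As promised above, here is roughly what a few of those Google-supported protocols look like inside a page’s head section. The URLs are placeholders, and which tags you actually need depends entirely on your site:

<link rel="canonical" href="https://www.example.com/widgets/" />
<link rel="alternate" hreflang="de" href="https://www.example.com/de/widgets/" />
<link rel="prev" href="https://www.example.com/widgets/?page=1" />
<link rel="next" href="https://www.example.com/widgets/?page=3" />
<meta name="robots" content="noindex, follow" />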

Okay, I’m going to take a deep breath and relax. I don’t know Jayson’s intention, and in fact, if he were in this room, he’d be like, “No, I totally agree with all those things. I wrote the article in a rush. I had no idea it was going to be big. I was just trying to make the broader points around you don’t have to be a coder in order to do SEO.” That’s completely fine.

So I’m not going to try and rain criticism down on him. But I think if you’re reading that article, or you’re seeing it in your feed, or your clients are, or your boss is, or other folks are in your world, maybe you can point them to this Whiteboard Friday and let them know, no, that’s not quite right. There’s a ton of technical SEO that is required in 2015 and will be for years to come, I think, that SEOs have to have in order to be effective at their jobs.

All right, everyone. Look forward to some great comments, and we’ll see you again next time for another edition of Whiteboard Friday. Take care.

Video transcription by Speechpad.com

Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don’t have time to hunt down but want to read!

Reblogged 4 years ago from tracking.feedpress.it

Controlling Search Engine Crawlers for Better Indexation and Rankings – Whiteboard Friday

Posted by randfish

When should you disallow search engines in your robots.txt file, and when should you use meta robots tags in a page header? What about nofollowing links? In today’s Whiteboard Friday, Rand covers these tools and their appropriate use in four situations that SEOs commonly find themselves facing.

For reference, here’s a still of this week’s whiteboard. Click on it to open a high resolution image in a new tab!

Video transcription

Howdy Moz fans, and welcome to another edition of Whiteboard Friday. This week we’re going to talk about controlling search engine crawlers, blocking bots, sending bots where we want, restricting them from where we don’t want them to go. We’re going to talk a little bit about crawl budget and what you should and shouldn’t have indexed.

As a start, what I want to do is discuss the ways in which we can control robots. Those include the three primary ones: robots.txt, meta robots, and—well, the nofollow tag is a little bit less about controlling bots.

There are a few others that we’re going to discuss as well, including Webmaster Tools (Search Console) and URL status codes. But let’s dive into those first few first.

Robots.txt lives at yoursite.com/robots.txt. It tells crawlers what they should and shouldn’t access, but it doesn’t always get respected by Google and Bing. So a lot of folks say, “Hey, disallow this,” then suddenly see those URLs popping up in the results and wonder what’s going on. Look, Google and Bing oftentimes think that they just know better. They think that maybe you’ve made a mistake. They think, “Hey, there’s a lot of links pointing to this content, there’s a lot of people visiting and caring about this content, maybe you didn’t intend for us to block it.” The more specific you get about an individual URL, the better they usually are about respecting it. The less specific, meaning the more you use wildcards or say “everything behind this entire big directory,” the worse they are about necessarily believing you.
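A hypothetical robots.txt illustrating that difference (the paths are made up):

User-agent: *
# A very specific URL: engines are usually good about respecting this
Disallow: /reports/q3-draft.html
# A broad directory plus a wildcard pattern: more likely to be second-guessed
Disallow: /archive/
Disallow: /*?sessionid=

Google and Bing support the * wildcard shown here, although it isn’t part of the original robots.txt standard.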

Meta robots—a little different—that lives in the headers of individual pages, so you can only control a single page with a meta robots tag. That tells the engines whether or not they should keep a page in the index, and whether they should follow the links on that page, and it’s usually a lot more respected, because it’s at an individual-page level; Google and Bing tend to believe you about the meta robots tag.

And then the nofollow tag, that lives on an individual link on a page. It doesn’t tell engines where to crawl or not to crawl. All it’s saying is whether you editorially vouch for a page that is being linked to, and whether you want to pass the PageRank and link equity metrics to that page.

Interesting point about meta robots and robots.txt working together (or not working together so well)—many, many folks in the SEO world do this and then get frustrated.

What if, for example, we take a page like “blogtest.html” on our domain and we say, “All user agents, you are not allowed to crawl blogtest.html”? Okay, that’s a good way to keep that page away from being crawled, but just because something is not crawled doesn’t necessarily mean it won’t be in the search results.
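In robots.txt terms, assuming blogtest.html sits at the site root, that rule would be something like:

User-agent: *
Disallow: /blogtest.html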

So then we have our SEO folks go, “you know what, let’s make doubly sure that doesn’t show up in search results; we’ll put in the meta robots tag:”

<meta name="robots" content="noindex, follow">

So, “noindex, follow” tells the search engine crawler they can follow the links on the page, but they shouldn’t index this particular one.

Then, you go and run a search for “blog test” in this case, and everybody on the team’s like “What the heck!? WTF? Why am I seeing this page show up in search results?”

The answer is, you told the engines that they couldn’t crawl the page, so they didn’t. But they are still putting it in the results. They’re actually probably not going to include a meta description; they might have something like “we can’t include a meta description because of this site’s robots.txt file.” The reason it’s showing up is because they can’t see the noindex; all they see is the disallow.

So, if you want something truly removed, unable to be seen in search results, you can’t just disallow a crawler. You have to say meta “noindex” and you have to let them crawl it.

So this creates some complications. Robots.txt can be great if we’re trying to save crawl bandwidth, but it isn’t necessarily ideal for preventing a page from being shown in the search results. I would not recommend, by the way, that you do what we think Twitter recently tried to do, where they tried to canonicalize www and non-www by saying “Google, don’t crawl the www version of twitter.com.” What you should be doing is rel canonical-ing or using a 301.

Meta robots—that can allow crawling and link-following while disallowing indexation, which is great. The trade-off is that it still consumes crawl budget (the page has to be crawled for the tag to be seen), even though you do conserve indexation.

The nofollow tag, generally speaking, is not particularly useful for controlling bots or conserving indexation.

Webmaster Tools (now Google Search Console) has some special things that allow you to restrict access or remove a result from the search results. For example, if you have 404’d something or if you’ve told them not to crawl something but it’s still showing up in there, you can manually say “don’t do that.” There are a few other crawl protocol things that you can do.

And then URL status codes—these are a valid way to do things, but they’re going to obviously change what’s going on on your pages, too.

If you’re not having a lot of luck using a 404 to remove something, you can use a 410 to permanently remove something from the index. Just be aware that once you use a 410, it can take a long time if you want to get that page re-crawled or re-indexed, and you want to tell the search engines “it’s back!” 410 is permanent removal.

301—permanent redirect, we’ve talked about those here—and 302, temporary redirect.

Now let’s jump into a few specific use cases of “what kinds of content should and shouldn’t I allow engines to crawl and index” in this next version…

[Rand moves at superhuman speed to erase the board and draw part two of this Whiteboard Friday. Seriously, we showed Roger how fast it was, and even he was impressed.]

Four crawling/indexing problems to solve

So we’ve got these four big problems that I want to talk about as they relate to crawling and indexing.

1. Content that isn’t ready yet

The first one here is around, “If I have content of quality I’m still trying to improve—it’s not yet ready for primetime, it’s not ready for Google, maybe I have a bunch of products and I only have the descriptions from the manufacturer and I need people to be able to access them, so I’m rewriting the content and creating unique value on those pages… they’re just not ready yet—what should I do with those?”

My options around crawling and indexing? If I have a large quantity of those—maybe thousands, tens of thousands, hundreds of thousands—I would probably go the robots.txt route. I’d disallow those pages from being crawled, and then eventually as I get (folder by folder) those sets of URLs ready, I can then allow crawling and maybe even submit them to Google via an XML sitemap.
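A minimal sketch, assuming the unfinished products live in a made-up /products-staging/ folder: keep the folder disallowed, then remove the rule (and submit an XML sitemap) as each batch becomes ready.

User-agent: *
Disallow: /products-staging/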

If I’m talking about a small quantity—a few dozen, a few hundred pages—well, I’d probably just use the meta robots noindex, and then I’d pull that noindex off of those pages as they are made ready for Google’s consumption. And then again, I would probably use the XML sitemap and start submitting those once they’re ready.

2. Dealing with duplicate or thin content

What about, “Should I noindex, nofollow, or potentially disallow crawling on largely duplicate URLs or thin content?” I’ve got an example. Let’s say I’m an ecommerce shop, I’m selling this nice Star Wars t-shirt which I think is kind of hilarious, so I’ve got starwarsshirt.html, and it links out to a larger version of an image, and that’s an individual HTML page. It links out to different colors, which change the URL of the page, so I have a gray, blue, and black version. Well, these four pages are really all part of this same one, so I wouldn’t recommend disallowing crawling on these, and I wouldn’t recommend noindexing them. What I would do there is a rel canonical.

Remember, rel canonical is one of those things that can be precluded by disallowing. So, if I were to disallow these from being crawled, Google couldn’t see the rel canonical pointing back, so if someone linked to the blue version instead of the default version, now I potentially don’t get link credit for that. So what I really want to do is use the rel canonical, allow the indexing, and allow it to be crawled. If you really feel like it, you could also put a meta “noindex, follow” on these pages, but I don’t really think that’s necessary, and again, that might interfere with the rel canonical.
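For example, the blue variant page (the URL pattern here is hypothetical) would carry a canonical tag pointing back at the main product page:

<link rel="canonical" href="https://www.example.com/starwarsshirt.html" />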

3. Passing link equity without appearing in search results

Number three: “If I want to pass link equity (or at least crawling) through a set of pages without those pages actually appearing in search results—so maybe I have navigational stuff, ways that humans are going to navigate through my pages, but I don’t need those appearing in search results—what should I use then?”

What I would say here is, you can use the meta robots to say “don’t index the page, but do follow the links that are on that page.” That’s a pretty nice, handy use case for that.

Do NOT, however, disallow those in robots.txt—many, many folks make this mistake. If you disallow crawling on those pages, Google can’t see the noindex. They don’t know that they can follow the links. Granted, as we talked about before, sometimes Google doesn’t obey the robots.txt, but you can’t rely on that behavior; assume that the disallow in robots.txt will prevent them from crawling, and therefore from ever seeing the noindex. So I would say the meta robots “noindex, follow” is the way to do this.

4. Search results-type pages

Finally, fourth, “What should I do with search results-type pages?” Google has said many times that they don’t like your search results from your own internal engine appearing in their search results, and so this can be a tricky use case.

Sometimes a search result page—a page that lists many types of results that might come from a database of types of content that you’ve got on your site—could actually be a very good result for a searcher who is looking for a wide variety of content, or who wants to see what you have on offer. Yelp does this: When you say, “I’m looking for restaurants in Seattle, WA,” they’ll give you what is essentially a list of search results, and Google does want those to appear because that page provides a great result. But you should be doing what Yelp does there, and make the most common or popular individual sets of those search results into category-style pages. A page that provides real, unique value, that’s not just a list of search results, that is more of a landing page than a search results page.

However, that being said, if you’ve got a long tail of these, or if you’d say “hey, our internal search engine, that’s really for internal visitors only—it’s not useful to have those pages show up in search results, and we don’t think we need to make the effort to make those into category landing pages.” Then you can use the disallow in robots.txt to prevent those.
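Assuming your internal search results live under a /search/ path or use a ?q= parameter (adjust to your own URL structure), that disallow might look like:

User-agent: *
Disallow: /search/
Disallow: /*?q=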

Just be cautious here, because I have sometimes seen an over-swinging of the pendulum toward blocking all types of search results, and sometimes that can actually hurt your SEO and your traffic. Sometimes those pages can be really useful to people. So check your analytics, and make sure those aren’t valuable pages that should be served up and turned into landing pages. If you’re sure, then go ahead and disallow all your search results-style pages. You’ll see a lot of sites doing this in their robots.txt file.

That being said, I hope you have some great questions about crawling and indexing, controlling robots, blocking robots, allowing robots, and I’ll try and tackle those in the comments below.

We’ll look forward to seeing you again next week for another edition of Whiteboard Friday. Take care!


Reblogged 4 years ago from tracking.feedpress.it

Big Data, Big Problems: 4 Major Link Indexes Compared

Posted by russangular

Given this blog’s readership, chances are good you will spend some time this week looking at backlinks in one of the growing number of link data tools. We know backlinks continue to be one of, if not the most important, parts of Google’s ranking algorithm. We tend to take these link data sets at face value, though, in part because they are all we have. But when your rankings are on the line, is there a better way to get at which data set is the best? How should we go about assessing different link indexes like Moz, Majestic, Ahrefs and SEMrush for quality? Historically, there have been four common approaches to this question of index quality…

  • Breadth: We might choose to look at the number of linking root domains any given service reports. We know that referring domains correlate strongly with search rankings, so it makes sense to judge a link index by how many unique domains it has discovered and indexed.
  • Depth: We also might choose to look at how deep the web has been crawled, looking more at the total number of URLs in the index rather than the diversity of referring domains.
  • Link Overlap: A more sophisticated approach might count the number of links an index has in common with Google Webmaster Tools.
  • Freshness: Finally, we might choose to look at the freshness of the index. What percentage of links in the index are still live?

There are a number of really good studies (some newer than others) using these techniques that are worth checking out when you get a chance:

  • BuiltVisible analysis of Moz, Majestic, GWT, Ahrefs and Search Metrics
  • SEOBook comparison of Moz, Majestic, Ahrefs, and Ayima
  • MatthewWoodward study of Ahrefs, Majestic, Moz, Raven and SEO Spyglass
  • Marketing Signals analysis of Moz, Majestic, Ahrefs, and GWT
  • RankAbove comparison of Moz, Majestic, Ahrefs and Link Research Tools
  • StoneTemple study of Moz and Majestic

While these are all excellent at addressing the methodologies above, there is a particular limitation with all of them. They miss one of the most important metrics we need to determine the value of a link index: proportional representation to Google’s link graph. So here at Angular Marketing, we decided to take a closer look.

Proportional representation to Google Search Console data

So, why is it important to determine proportional representation? Many of the most important and valued metrics we use are built on proportional models. PageRank, MozRank, CitationFlow and Ahrefs Rank are proportional in nature. The score of any one URL in the data set is relative to the other URLs in the data set. If the data set is biased, the results are biased.

A Visualization

Link graphs are biased by their crawl prioritization. Because there is no full representation of the Internet, every link graph, even Google’s, is a biased sample of the web. Imagine for a second that the picture below is of the web. Each dot represents a page on the Internet, and the dots surrounded by green represent a fictitious index by Google of certain sections of the web.

Of course, Google isn’t the only organization that crawls the web. Other organizations like Moz, Majestic, Ahrefs, and SEMrush have their own crawl prioritizations, which result in different link indexes.

In the example above, you can see different link providers trying to index the web like Google. Link data provider 1 (purple) does a good job of building a model that is similar to Google. It isn’t very big, but it is proportional. Link data provider 2 (blue) has a much larger index, and likely has more links in common with Google than link data provider 1, but it is highly disproportional. So, how would we go about measuring this proportionality? And which data set is the most proportional to Google?

Methodology

The first step is to determine a measurement of relativity for analysis. Google doesn’t give us very much information about their link graph. All we have is what is in Google Search Console. The best source we can use is referring domain counts. In particular, we want to look at what we call referring domain link pairs. A referring domain link pair would be something like ask.com->mlb.com: 9,444, which means that ask.com links to mlb.com 9,444 times.

Steps

  1. Determine the root linking domain pairs and values for 100+ sites in Google Search Console
  2. Determine the same for Ahrefs, Moz, Majestic Fresh, Majestic Historic, and SEMrush
  3. Compare the referring domain link pairs of each data set to Google, assuming a Poisson distribution (a rough sketch of this step appears after the list)
  4. Run simulations of each data set’s performance against the others (e.g. Moz vs. Majestic, Ahrefs vs. SEMrush, Moz vs. SEMrush, et al.)
  5. Analyze the results
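Here is a rough, illustrative sketch of what step 3 might look like. This is my own guess at one way to apply a Poisson assumption, not the authors’ actual code; the pair keys and counts below are invented, and the idea is simply that an index whose counts rise and fall in step with Google’s (after adjusting for overall index size) earns a higher score.

// Score how "proportional" a third-party link index is to Google Search Console
// data, treating each index count as a Poisson sample whose expected value is
// Google's count scaled by the index's overall size.

function logFactorial(n) {
  var total = 0;
  for (var i = 2; i <= n; i++) { total += Math.log(i); }
  return total;
}

// gscPairs: referring domain link pairs reported in Google Search Console
// indexPairs: the same pairs as reported by a third-party link index
function proportionalityScore(gscPairs, indexPairs) {
  var gscTotal = 0, indexTotal = 0, key;
  for (key in gscPairs) { gscTotal += gscPairs[key]; }
  for (key in indexPairs) { indexTotal += indexPairs[key]; }
  var scale = indexTotal / gscTotal; // how much bigger or smaller the index is overall

  // Sum the Poisson log-likelihood of each observed index count given the
  // expectation implied by Google's count; higher means more proportional.
  var logLikelihood = 0;
  for (key in gscPairs) {
    var expected = Math.max(scale * gscPairs[key], 1e-9);
    var observed = indexPairs[key] || 0;
    logLikelihood += observed * Math.log(expected) - expected - logFactorial(observed);
  }
  return logLikelihood;
}

// Toy usage with invented numbers:
var gsc   = { "ask.com->mlb.com": 9444, "espn.com->mlb.com": 1200 };
var index = { "ask.com->mlb.com": 4000, "espn.com->mlb.com": 700 };
console.log(proportionalityScore(gsc, index));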

Results

When placed head-to-head, there seem to be some clear winners at first glance. Moz edges out Ahrefs head-to-head, but across the board the two fare quite evenly. Moz, Ahrefs and SEMrush seem to be far better than Majestic Fresh and Majestic Historic. Is that really the case? And why?

It turns out there is an inversely proportional relationship between index size and proportional relevancy. This might seem counterintuitive: shouldn’t the bigger indexes be closer to Google? Not exactly.

What does this mean?

Each organization has to create a crawl prioritization strategy. When you discover millions of links, you have to prioritize which ones you might crawl next. Google has a crawl prioritization, and so do Moz, Majestic, Ahrefs and SEMrush. There are lots of different things you might choose to prioritize…

  • You might prioritize link discovery. If you want to build a very large index, you could prioritize crawling pages on sites that have historically provided new links.
  • You might prioritize content uniqueness. If you want to build a search engine, you might prioritize finding pages that are unlike any you have seen before. You could choose to crawl domains that historically provide unique data and little duplicate content.
  • You might prioritize content freshness. If you want to keep your search engine recent, you might prioritize crawling pages that change frequently.
  • You might prioritize content value, crawling the most important URLs first based on the number of inbound links to that page.

Chances are, an organization’s crawl priority will blend some of these features, but it’s difficult to design one exactly like Google. Imagine for a moment that instead of crawling the web, you want to climb a tree. You have to come up with a tree climbing strategy.

  • You decide to climb the longest branch you see at each intersection.
  • One friend of yours decides to climb the first new branch he reaches, regardless of how long it is.
  • Your other friend decides to climb the first new branch she reaches only if she sees another branch coming off of it.

Despite having different climb strategies, everyone chooses the same first branch, and everyone chooses the same second branch. There are only so many different options early on.

But as the climbers go further and further along, their choices eventually produce differing results. This is exactly the same for web crawlers like Google, Moz, Majestic, Ahrefs and SEMrush. The bigger the crawl, the more the crawl prioritization will cause disparities. This is not a deficiency; this is just the nature of the beast. However, we aren’t completely lost. Once we know how index size is related to disparity, we can make some inferences about how similar a crawl priority may be to Google.

Unfortunately, we have to be careful in our conclusions. We only have a few data points with which to work, so it is very difficult to be certain regarding this part of the analysis. In particular, it seems strange that Majestic would get better relative to its index size as it grows, unless Google holds on to old data (which might be an important discovery in and of itself). Most likely, we simply can’t draw that level of conclusion at this point.

So what do we do?

Let’s say you have a list of domains or URLs for which you would like to know their relative values. Your process might look something like this…

  • Check Open Site Explorer to see if all URLs are in their index. If so, you are looking at metrics most likely to be proportional to Google’s link graph.
  • If any of the links do not occur in the index, move to Ahrefs and use their Ahrefs ranking if all you need is a single PageRank-like metric.
  • If any of the links are missing from Ahrefs’s index, or you need something related to trust, move on to Majestic Fresh.
  • Finally, use Majestic Historic for (by leaps and bounds) the largest coverage available.

It is important to point out that the likelihood that all the URLs you want to check are in a single index increases as the accuracy of the metric decreases. Considering the size of Majestic’s data, you can’t ignore them because you are less likely to get null value answers from their data than the others. If anything rings true, it is that once again it makes sense to get data from as many sources as possible. You won’t get the most proportional data without Moz, the broadest data without Majestic, or everything in-between without Ahrefs.

What about SEMrush? They are making progress, but they don’t publish any relative statistics that would be useful in this particular case. Maybe we can hope to see more from them soon given their already promising index!

Recommendations for the link graphing industry

All we hear about these days is big data; we almost never hear about good data. I know that the teams at Moz, Majestic, Ahrefs, SEMrush and others are interested in mimicking Google, but I would love to see some organization stand up against the allure of more data in favor of better data—data more like Google’s. It could begin with testing various crawl strategies to see if they produce a result more similar to that of data shared in Google Search Console. Having the most Google-like data is certainly a crown worth winning.

Credits

Thanks to Diana Carter at Angular for assistance with data acquisition and Andrew Cron with statistical analysis. Thanks also to the representatives from Moz, Majestic, Ahrefs, and SEMrush for answering questions about their indices.


Reblogged 4 years ago from tracking.feedpress.it

The Colossus Update: Waking The Giant

Posted by Dr-Pete

Yesterday morning, we woke up to a historically massive temperature spike on MozCast, after an unusually quiet weekend. The 10-day weather looked like this:

That’s 101.8°F, one of the hottest verified days on record, second only to a series of unconfirmed spikes in June of 2013. For reference, the first Penguin update clocked in at 93.1°.

Unfortunately, trying to determine how the algorithm changed from looking at individual keywords (even thousands of them) is more art than science, and even the art is more often Ms. Johnson’s Kindergarten class than Picasso. Sometimes, though, we catch a break and spot something.

The First Clue: HTTPS

When you watch enough SERPs, you start to realize that change is normal. So, the trick is to find the queries that changed a lot on the day in question but are historically quiet. Looking at a few of these, I noticed some apparent shake-ups in HTTP vs. HTTPS (secure) URLs. So, the question becomes: are these anecdotes, or do they represent a pattern?

I dove in and looked at how many URLs for our 10,000 page-1 SERPs were HTTPS over the past few days, and I saw this:

On the morning of June 17, HTTPS URLs on page 1 jumped from 16.9% to 18.4% (a 9.9% day-over-day increase), after trending up for a few days. This represents the total real-estate occupied by HTTPS URLs, but how did rankings fare? Here are the average rankings across all HTTPS results:

HTTPS URLs also seem to have gotten a rankings boost – dropping (with “dropping” being a positive thing) from an average of 2.96 to 2.79 in the space of 24 hours.

Seems pretty convincing, right? Here’s the problem: rankings don’t just change because Google changes the algorithm. We are, collectively, changing the web every minute of the day. Often, those changes are just background noise (and there’s a lot of noise), but sometimes a giant awakens.

The Second Clue: Wikipedia

Anecdotally, I noticed that some Wikipedia URLs seemed to be flipping from HTTP to HTTPS. I ran a quick count, and this wasn’t just a fluke. It turns out that Wikipedia started switching their entire site to HTTPS around June 12 (hat tip to Jan Dunlop). This change is expected to take a couple of weeks.

It’s just one site, though, right? Well, historically, this one site is the #1 largest land-holder across the SERP real-estate we track, with over 5% of the total page-1 URLs in our tracking data (5.19% as of June 17). Wikipedia is a giant, and its movements can shake the entire web.

So, how do we tease this apart? If Wikipedia’s URLs had simply flipped from HTTP to HTTPS, we should see a pretty standard pattern of shake-up. Those URLs would look to have changed, but the SERPS around them would be quiet. So, I ran an analysis of what the temperature would’ve been if we ignored the protocol (treating HTTP/HTTPS as the same). While slightly lower, that temperature was still a scorching 96.6°F.

Is it possible that Wikipedia moving to HTTPS also made the site eligible for a rankings boost from previous algorithm updates, thus disrupting page 1 without any code changes on Google’s end? Yes, it is possible – even a relatively small rankings boost for Wikipedia from the original HTTPS algorithm update could have a broad impact.

The Third Clue: Google?

So far, Google has only said that this was not a Panda update. There have been rumors that the HTTPS ranking signal would get a boost, as recently as SMX Advanced earlier this month, but no timeline was given for when that might happen.

Is it possible that Wikipedia’s publicly announced switch finally gave Google the confidence to boost the HTTPS signal? Again, yes, it’s possible, but we can only speculate at this point.

My gut feeling is that this was more than just a waking giant, even as powerful of a SERP force as Wikipedia has become. We should know more as their HTTPS roll-out continues and their index settles down. In the meantime, I think we can expect Google to become increasingly serious about HTTPS, even if what we saw yesterday turns out not to have been an algorithm update.

In the meantime, I’m going to melodramatically name this “The Colossus Update” because, well, it sounds cool. If this indeed was an algorithm update, I’m sure Google would prefer something sensible, like “HTTPS Update 2” or “Securageddon” (sorry, Gary).

Update from Google: Gary Illyes said that he’s not aware of an HTTPS update (via Twitter).

No comment on other updates, or the potential impact of a Wikipedia change. I feel strongly that there is an HTTPS connection in the data, but as I said – that doesn’t necessarily mean the algorithm changed.


Reblogged 4 years ago from tracking.feedpress.it

Now we have over 3 Trillion URLs!

We have just launched a new Historic Index and broken the 3 Trillion mark!

  • Unique URLs crawled: 800,654,991,863
  • Unique URLs found: 3,088,860,810,721
  • Date range: 01 Oct 2009 to 04 May 2015
  • Last updated: 15 Jun 2015

This means we have crossed a milestone of 3 Trillion URLs found.

The post Now we have over 3 Trillion URLS! appeared first on Majestic Blog.

Reblogged 4 years ago from blog.majestic.com

Eliminate Duplicate Content in Faceted Navigation with Ajax/JSON/JQuery

Posted by EricEnge

One of the classic problems in SEO is that while complex navigation schemes may be useful to users, they create problems for search engines. Many publishers rely on tags such as rel=canonical, or the parameter settings in Webmaster Tools, to try and solve these types of issues. However, each of the potential solutions has limitations. In today’s post, I am going to outline how you can use JavaScript solutions to eliminate the problem altogether.

Note that I am not going to provide code examples in this post, but I am going to outline how it works on a conceptual level. If you are interested in learning more about Ajax/JSON/jQuery here are some resources you can check out:

  1. Ajax Tutorial
  2. Learning Ajax/jQuery

Defining the problem with faceted navigation

Having a page of products and then allowing users to sort those products the way they want (sorted from highest to lowest price), or to use a filter to pick a subset of the products (only those over $60) makes good sense for users. We typically refer to these types of navigation options as “faceted navigation.”

However, faceted navigation can cause problems for search engines because they don’t want to crawl and index all of your different sort orders or all your different filtered versions of your pages. They would end up with many different variants of your pages that are not significantly different from a search engine user experience perspective.

Solutions such as rel=canonical tags and parameter settings in Webmaster Tools have some limitations. For example, rel=canonical tags are considered “hints” by the search engines, which may not choose to accept them; and even if they are accepted, they do not necessarily keep the search engines from continuing to crawl those pages.

A better solution might be to use JSON and jQuery to implement your faceted navigation so that a new page is not created when a user picks a filter or a sort order. Let’s take a look at how it works.

Using JSON and jQuery to filter on the client side

The main benefit of the implementation discussed below is that a new URL is not created when a user is on a page of yours and applies a filter or sort order. When you use JSON and jQuery, the entire process happens on the client device without involving your web server at all.

When a user initially requests one of the product pages on your web site, the interaction looks like this:

[Diagram: using JSON on faceted navigation]

The web server transfers the page to the browser that the user used to request it. Now when a user picks a sort order (or filter) on that page, here is what happens:

[Diagram: jQuery and faceted navigation]

When the user picks one of those options, a jQuery request is made to the JSON data object. Translation: the entire interaction happens within the client’s browser and the sort or filter is applied there. Simply put, the smarts to handle that sort or filter resides entirely within the code on the client device that was transferred with the initial request for the page.
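Here is a minimal sketch of that idea; the data, element IDs, and markup are made up for illustration. The product data ships with the initial page as a JSON object, and jQuery re-renders the list in place, so the URL never changes and the server is never contacted again.

// Product data embedded in the initial page load as a JSON object
var products = [
  { name: "Star Wars shirt (gray)", price: 24.99 },
  { name: "Star Wars shirt (blue)", price: 27.99 },
  { name: "Star Wars shirt (black)", price: 19.99 }
];

// Re-draw the product list from an array of products
function renderProducts(list) {
  var $target = $("#product-list").empty();
  $.each(list, function (i, p) {
    $target.append("<li>" + p.name + " - $" + p.price.toFixed(2) + "</li>");
  });
}

// Sort control: everything happens in the browser, no new URL is requested
$("#sort-by-price").on("click", function () {
  var sorted = products.slice().sort(function (a, b) { return a.price - b.price; });
  renderProducts(sorted);
});

// Filter control: only show products over $20, again without touching the server
$("#filter-over-20").on("click", function () {
  renderProducts($.grep(products, function (p) { return p.price > 20; }));
});

renderProducts(products);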

As a result, there is no new page created and no new URL for Google or Bing to crawl. Any concerns about crawl budget or inefficient use of PageRank are completely eliminated. This is great stuff! However, there remain limitations in this implementation.

Specifically, if your list of products spans multiple pages on your site, the sorting and filtering will only be applied to the data set already transferred to the user’s browser with the initial request. In short, you may only be sorting the first page of products, and not across the entire set of products. It’s possible to have the initial JSON data object contain the full set of pages, but this may not be a good idea if the page size ends up being large. In that event, we will need to do a bit more.

What Ajax does for you

Now we are going to dig in slightly deeper and outline how Ajax will allow us to handle sorting, filtering, AND pagination. Warning: There is some tech talk in this section, but I will try to follow each technical explanation with a layman’s explanation about what’s happening.

The conceptual Ajax implementation looks like this:

[Diagram: Ajax and faceted navigation]

In this structure, we are using an Ajax layer to manage the communications with the web server. Imagine that we have a set of 10 pages; the user has gotten the first page of those 10 on their device and then requests a change to the sort order. The Ajax layer requests a fresh set of data from the web server for your site, similar to a normal HTML transaction, except that it runs asynchronously in a separate thread.

If you don’t know what that means, the benefit is that the rest of the page can load completely while the process to capture the data that the Ajax will display is running in parallel. This will be things like your main menu, your footer links to related products, and other page elements. This can improve the perceived performance of the page.

When a user selects a different sort order, the code registers an event handler for a given object (e.g. HTML Element or other DOM objects) and then executes an action. The browser will perform the action in a different thread to trigger the event in the main thread when appropriate. This happens without needing to execute a full page refresh, only the content controlled by the Ajax refreshes.

To translate this for the non-technical reader, it just means that we can update the sort order of the page, without needing to redraw the entire page, or change the URL, even in the case of a paginated sequence of pages. This is a benefit because it can be faster than reloading the entire page, and it should make it clear to search engines that you are not trying to get some new page into their index.
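A rough sketch of that Ajax layer might look like the following (the /products endpoint, parameter names, and element IDs are hypothetical). Each sort, filter, or page change asks the server for fresh JSON and redraws only the product list, while the URL the user is on stays the same.

function loadProducts(options) {
  // Ask the server for the requested slice of products; this runs
  // asynchronously, so the rest of the page keeps working while we wait.
  $.getJSON("/products", options, function (data) {
    var $target = $("#product-list").empty();
    $.each(data.products, function (i, p) {
      $target.append("<li>" + p.name + " - $" + p.price + "</li>");
    });
  });
}

// Event handlers update only the Ajax-controlled content, not the whole page
$("#sort-by-price").on("click", function () {
  loadProducts({ sort: "price", page: 1 });
});

$("#next-page").on("click", function () {
  loadProducts({ sort: "price", page: 2 });
});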

Effectively, it does this within the existing Document Object Model (DOM), which you can think of as the basic structure of the documents and a spec for the way the document is accessed and manipulated.

How will Google handle this type of implementation?

For those of you who read Adam Audette’s excellent recent post on the tests his team performed on how Google reads Javascript, you may be wondering if Google will still load all these page variants on the same URL anyway, and if they will not like it.

I had the same question, so I reached out to Google’s Gary Illyes to get an answer. Here is the dialog that transpired:

Eric Enge: I’d like to ask you about using JSON and jQuery to render different sort orders and filters within the same URL. I.e. the user selects a sort order or a filter, and the content is reordered and redrawn on the page on the client site. Hence no new URL would be created. It’s effectively a way of canonicalizing the content, since each variant is a strict subset.

Then there is a second level consideration with this approach, which involves doing the same thing with pagination. I.e. you have 10 pages of products, and users still have sorting and filtering options. In order to support sorting and filtering across the entire 10 page set, you use an Ajax solution, so all of that still renders on one URL.

So, if you are on page 1, and a user executes a sort, they get that all back in that one page. However, to do this right, going to page 2 would also render on the same URL. Effectively, you are taking the 10 page set and rendering it all within one URL. This allows sorting, filtering, and pagination without needing to use canonical, noindex, prev/next, or robots.txt.

If this was not problematic for Google, the only downside is that it makes the pagination not visible to Google. Does that make sense, or is it a bad idea?

Gary Illyes: If you have one URL only, and people have to click on stuff to see different sort orders or filters for the exact same content under that URL, then typically we would only see the default content.

If you don’t have pagination information, that’s not a problem, except we might not see the content on the other pages that are not contained in the HTML within the initial page load. The meaning of rel-prev/next is to funnel the signals from child pages (page 2, 3, 4, etc.) to the group of pages as a collection, or to the view-all page if you have one. If you simply choose to render those paginated versions on a single URL, that will have the same impact from a signals point of view, meaning that all signals will go to a single entity, rather than distributed to several URLs.

Summary

Keep in mind, the reason why Google implemented tags like rel=canonical, NoIndex, rel=prev/next, and others is to reduce their crawling burden and overall page bloat and to help focus signals to incoming pages in the best way possible. The use of Ajax/JSON/jQuery as outlined above does this simply and elegantly.

On most e-commerce sites, there are many different “facets” of how a user might want to sort and filter a list of products. With the Ajax-style implementation, this can be done without creating new pages. The end users get the control they are looking for, the search engines don’t have to deal with excess pages they don’t want to see, and signals in to the site (such as links) are focused on the main pages where they should be.

The one downside is that Google may not see all the content when it is paginated. A site that has lots of very similar products in a paginated list does not have to worry too much about Google seeing all the additional content, so this isn’t much of a concern if your incremental pages contain more of what’s on the first page. Sites that have content that is materially different on the additional pages, however, might not want to use this approach.

These solutions do require JavaScript coding expertise but are not really that complex. If you have the ability to consider a path like this, you can free yourself from trying to understand the various tags, their limitations, and whether or not they truly accomplish what you are looking for.

Credit: Thanks to Clark Lefavour for providing a review of the above for technical correctness.


Reblogged 4 years ago from tracking.feedpress.it

Should I Use Relative or Absolute URLs? – Whiteboard Friday

Posted by RuthBurrReedy

It was once commonplace for developers to code relative URLs into a site. There are a number of reasons why that might not be the best idea for SEO, and in today’s Whiteboard Friday, Ruth Burr Reedy is here to tell you all about why.

For reference, here’s a still of this week’s whiteboard. Click on it to open a high resolution image in a new tab!

Let’s discuss some non-philosophical absolutes and relatives

Howdy, Moz fans. My name is Ruth Burr Reedy. You may recognize me from such projects as when I used to be the Head of SEO at Moz. I’m now the Senior SEO Manager at BigWing Interactive in Oklahoma City. Today we’re going to talk about relative versus absolute URLs and why they are important.

At any given time, your website can have several different configurations that might be causing duplicate content issues. You could have just a standard http://www.example.com, which is a pretty typical format for a website.

But the main sources that we see of domain level duplicate content are when the non-www.example.com does not redirect to the www or vice-versa, and when the HTTPS versions of your URLs are not forced to resolve to HTTP versions or, again, vice-versa. What this can mean is if all of these scenarios are true, if all four of these URLs resolve without being forced to resolve to a canonical version, you can, in essence, have four versions of your website out on the Internet. This may or may not be a problem.

It’s not ideal for a couple of reasons. Number one, people often worry that duplicate content is going to give you a penalty. It won’t. Duplicate content is not going to get your website penalized in the same way that you might see a spammy link penalty from Penguin. There’s no actual penalty involved. You won’t be punished for having duplicate content.

The problem with duplicate content is that you’re basically relying on Google to figure out what the real version of your website is. Google is seeing the same content at all four versions of your website, and they’re going to try to figure out which URL is the real URL and just rank that one. The problem is that you’re leaving that decision up to Google when it’s something you could take control of yourself.

There are a couple of other reasons that we’ll go into a little bit later for why duplicate content can be a problem. But in short, duplicate content is no good.

However, just having these URLs not resolve to each other may or may not be a huge problem. When it really becomes a serious issue is when that problem is combined with injudicious use of relative URLs in internal links. So let’s talk a little bit about the difference between a relative URL and an absolute URL when it comes to internal linking.

With an absolute URL, you are putting the entire web address of the page that you are linking to in the link. You’re putting your full domain, everything in the link, including /page. That’s an absolute URL.

However, when coding a website, it’s a fairly common web development practice to instead code internal links with what’s called a relative URL. A relative URL is just /page. Basically what that does is it relies on your browser to understand, “Okay, this link is pointing to a page that’s on the same domain that we’re already on. I’m just going to assume that that is the case and go there.”
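You can see that resolution at work with the browser’s built-in URL API: the same relative href ends up pointing at whichever host the page (or a copy of it) happens to be served from.

```javascript
// How a browser resolves the relative href "/page" against the page it sits on.
new URL('/page', 'https://www.example.com/category/').href;
//=> "https://www.example.com/page"

new URL('/page', 'http://example.com/category/').href;
//=> "http://example.com/page"   (same link, different site)
```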

There are a couple of really good reasons to code relative URLs

1) It is much easier and faster to code.

When you are a web developer and you’re building a site with thousands of pages, coding relative instead of absolute URLs is a way to be more efficient. You’ll see it happen a lot.

2) Staging environments

Another reason why you might see relative versus absolute URLs is that some content management systems — and SharePoint is a great example of this — have a staging environment on its own domain. Instead of being example.com, it will be examplestaging.com. The entire website will basically be replicated on that staging domain. Having relative rather than absolute URLs means that the same website can exist on staging and on production (the live, publicly accessible version of your website) without having to go back in and recode all of those URLs. Again, it’s more efficient for your web development team. Those are perfectly valid reasons to do these things, so don’t yell at your web dev team if they’ve coded relative URLs, because from their perspective it is a better solution.

Relative URLs will also cause your page to load slightly faster. However, in my experience, the SEO benefits of having absolute versus relative URLs in your website far outweigh the teeny-tiny bit longer that it will take the page to load. It’s very negligible. If you have a really, really long page load time, there’s going to be a whole boatload of things that you can change that will make a bigger difference than coding your URLs as relative versus absolute.

Page load time, in my opinion, is not a concern here. However, it is something that your web dev team may bring up when you try to explain to them that, from an SEO perspective, coding your website with relative rather than absolute URLs, especially in the nav, is not a good solution.

There are even better reasons to use absolute URLs

1) Scrapers

If you have all of your internal links as relative URLs, it would be very, very, very easy for a scraper to simply scrape your whole website and put it up on a new domain, and the whole website would just work. That sucks for you, and it’s great for that scraper. But unless you are out there doing public services for scrapers, for some reason, that’s probably not something that you want happening with your beautiful, hardworking, handcrafted website. That’s one reason. There is a scraper risk.

2) Preventing duplicate content issues

But the other reason why it’s very important to have absolute rather than relative URLs is that it really mitigates the duplicate content risk that arises when you don’t have all of these versions of your website resolving to one version. Google could potentially enter your site on any one of these four versions. To you, they’re the same page on the same domain; to Google, they’re four different pages on four different domains.

But they could enter your site, and if all of your URLs are relative, they can then crawl and index your entire domain using whichever format they entered on. Whereas if you have absolute links coded, say without the www., then even if Google enters your site on the www. version and that resolves, once they crawl to another page, every internal link points to the non-www. version, so Google is not going to assume that those pages live at the www. version. That really cuts down on different versions of each page of your website. If you have relative URLs throughout and you haven’t fixed this problem, you basically have four different websites.

Again, it’s not always a huge issue. Duplicate content, it’s not ideal. However, Google has gotten pretty good at figuring out what the real version of your website is.

You do want to think about internal linking when you’re thinking about this. If there are basically four different versions of any URL that anybody could copy and paste when they want to link to you or share something you’ve built, you’re diluting the value of those links across four URLs, which is not great. You’d basically have to earn four times as many links to get the same authority. So that’s one reason.

3) Crawl budget

The other reason why it’s pretty important not to do this is crawl budget. I’m going to point it out like this instead.

When we talk about crawl budget, basically what that is, is that every time Google crawls your website, there is a finite number of URLs they will crawl before they decide, “Okay, I’m done.” That’s based on a few different things. Your site authority is one of them. Your actual PageRank, not toolbar PageRank, but how good Google actually thinks your website is, is a big part of that. But how complex your site is, how often it’s updated, and things like that also contribute to how often and how deep Google will crawl your site.

It’s important to remember, when we think about crawl budget, that for Google crawling costs actual dollars. One of Google’s biggest expenditures as a company is the money and the bandwidth it takes to crawl and index the Web. All of that crawling and indexing runs on servers, and that bandwidth costs Google actual, real dollars.

So Google is incentivized to crawl as efficiently as possible, because when they crawl inefficiently, it costs them money. If your site is not efficient to crawl, Google is going to save itself some money by crawling it less frequently and crawling fewer pages per crawl. That can mean that if you have a site that’s updated frequently, your site may not be updated in the index as frequently as you’re updating it. It may also mean that Google, while it’s crawling and indexing, may be crawling and indexing a version of your website that isn’t the version you really want it to crawl and index.

So having four different versions of your website, all of which are completely crawlable to the last page, because you’ve got relative URLs and you haven’t fixed this duplicate content problem, means that Google has to spend four times as much money in order to really crawl and understand your website. Over time they’re going to do that less and less frequently, especially if you don’t have a really high authority website. If you’re a small website, if you’re just starting out, if you’ve only got a medium number of inbound links, over time you’re going to see your crawl rate and frequency impacted, and that’s bad. We don’t want that. We want Google to come back all the time, see all our pages. They’re beautiful. Put them up in the index. Rank them well. That’s what we want. So that’s what we should do.

There are a couple of ways to fix your relative versus absolute URLs problem

1) Fix what is happening on the server side of your website

You have to make sure that you are forcing all of these different versions of your domain to resolve to one version of your domain. For me, I’m pretty agnostic as to which version you pick. You should probably already have a pretty good idea of which version of your website is the real version, whether that’s www, non-www, HTTPS, or HTTP. From my view, what’s most important is that all four of these versions resolve to one version.

From an SEO standpoint, there is evidence to suggest, and Google has certainly said, that HTTPS is a little bit better than HTTP. From a URL length perspective, I like to leave the www. out because it doesn’t really do anything; it just makes your URLs four characters longer. If you don’t know which one to pick, I would pick this one: HTTPS, no Ws. But whichever one you pick, what’s really most important is that all of them resolve to one version. You can do that on the server side, and that’s usually pretty easy for your dev team to fix once you tell them that it needs to happen.
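What that looks like depends on your server. Most sites will handle it with an Apache or nginx rewrite rule, but as a rough sketch of the same idea for a site that happens to run on Node with Express, the redirect logic is just a few lines:

```javascript
var express = require('express');
var app = express();

// Force every request to the one preferred version: HTTPS, no www.
app.use(function (req, res, next) {
  var host = req.headers.host || '';
  var isHttps = req.secure || req.headers['x-forwarded-proto'] === 'https';

  if (!isHttps || host.indexOf('www.') === 0) {
    var canonicalHost = host.replace(/^www\./, '');
    // 301 tells Google this is a permanent move to the canonical version.
    return res.redirect(301, 'https://' + canonicalHost + req.originalUrl);
  }
  next();
});
```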

2) Fix your internal links

Great. So you’ve fixed it on the server side. Now you need to fix your internal links and recode them from relative to absolute. This is something that your dev team is not going to want to do, because it is time consuming and, from a web dev perspective, not that important. However, you should use resources like this Whiteboard Friday to explain to them that, from an SEO perspective, both because of the scraper risk and from a duplicate content standpoint, having those absolute URLs is a high priority and something that should get done.

You’ll need to fix those, especially in your navigational elements. But once you’ve got your nav fixed, also pull out your database or run a Screaming Frog crawl or however you want to discover internal links that aren’t part of your nav, and make sure you’re updating those to be absolute as well.
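If it helps that audit, here is a rough helper that flags internal links in a page’s HTML that are not absolute. The function name is mine, and the simple regex is illustrative rather than a real HTML parser:

```javascript
// Flag internal links that are not absolute. Pass it the HTML of a page
// (from a crawl export, a database dump, etc.).
function findRelativeLinks(html) {
  var links = html.match(/href="([^"]*)"/g) || [];
  return links
    .map(function (attr) { return attr.slice(6, -1); })      // strip href=" and the closing "
    .filter(function (href) {
      return !/^https?:\/\//i.test(href) &&                   // not already absolute
             !/^(mailto:|tel:|#)/i.test(href);                 // ignore non-page links
    });
}

// findRelativeLinks('<a href="/page">text</a> <a href="https://example.com/">home</a>')
//=> ['/page']
```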

Then you’ll do some education with everybody who touches your website saying, “Hey, when you link internally, make sure you’re using the absolute URL and make sure it’s in our preferred format,” because that’s really going to give you the most bang for your buck per internal link. So do some education. Fix your internal links.

Sometimes your dev team is going to say, “No, we can’t do that. We’re not going to recode the whole nav. It’s not a good use of our time,” and sometimes they are right. The dev team has more important things to do. That’s okay.

3) Canonicalize it!

If you can’t get your internal links fixed, or they’re not going to get fixed anytime in the near future, a stopgap or Band-Aid that you can put on this problem is to canonicalize all of your pages. As you’re changing your server to force all of these different versions of your domain to resolve to one, you should at the same time be implementing the canonical tag on all of the pages of your website so that they self-canonicalize. On every page, you have a canonical tag saying, “This page right here, the one they’re already on, is the canonical version of this page.” Or, if there’s another page that’s the canonical version, then obviously you point to that instead.

But having each page self-canonicalize will mitigate both the risk of duplicate content internally and some of the risk posed by scrapers, because if they scrape your website and slap it up somewhere else, those canonical tags will often stay in place, and that lets Google know this is not the real version of the website.
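A small sketch of the idea: build the tag from your one preferred origin rather than from whatever host the request came in on, so that a scraped copy still points back at you. The origin and function name below are placeholders:

```javascript
// Build a self-referencing canonical tag from the preferred origin,
// not from the request's own hostname.
function canonicalTag(path) {
  var PREFERRED_ORIGIN = 'https://example.com';   // your one true version
  return '<link rel="canonical" href="' + PREFERRED_ORIGIN + path + '">';
}

// canonicalTag('/blog/relative-vs-absolute-urls/')
//=> '<link rel="canonical" href="https://example.com/blog/relative-vs-absolute-urls/">'
```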

In conclusion, relative links, not as good. Absolute links, those are the way to go. Make sure that you’re fixing these very common domain level duplicate content problems. If your dev team tries to tell you that they don’t want to do this, just tell them I sent you. Thanks guys.

Video transcription by Speechpad.com


Reblogged 4 years ago from tracking.feedpress.it

How to Use Server Log Analysis for Technical SEO

Posted by SamuelScott

It’s ten o’clock. Do you know where your logs are?

I’m introducing this guide with a pun on a common public-service announcement that has run on late-night TV news broadcasts in the United States because log analysis is something that is extremely newsworthy and important.

If your technical and on-page SEO is poor, then nothing else that you do will matter. Technical SEO is the key to helping search engines to crawl, parse, and index websites, and thereby rank them appropriately long before any marketing work begins.

The important thing to remember: Your log files contain the only data that is 100% accurate in terms of how search engines are crawling your website. By helping Google to do its job, you will set the stage for your future SEO work and make your job easier. Log analysis is one facet of technical SEO, and correcting the problems found in your logs will help lead to higher rankings, more traffic, and more conversions and sales.

Here are just a few reasons why:

  • Too many response code errors may cause Google to reduce its crawling of your website and perhaps even your rankings.
  • You want to make sure that search engines are crawling everything, new and old, that you want to appear and rank in the SERPs (and nothing else).
  • It’s crucial to ensure that all URL redirections will pass along any incoming “link juice.”

However, log analysis is something that is unfortunately discussed all too rarely in SEO circles. So, here, I wanted to give the Moz community an introductory guide to log analytics that I hope will help. If you have any questions, feel free to ask in the comments!

What is a log file?

Computer servers, operating systems, network devices, and computer applications automatically generate something called a log entry whenever they perform an action. In an SEO and digital marketing context, one type of action is a page being requested by a visiting bot or human.

Server log entries are typically output in the standardized Common Log Format. Here is one example from Wikipedia with my accompanying explanations:

127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
  • 127.0.0.1 — The remote hostname. An IP address is shown, like in this example, whenever the DNS hostname is not available or DNSLookup is turned off.
  • user-identifier — The remote logname / RFC 1413 identity of the user. (It’s not that important.)
  • frank — The user ID of the person requesting the page. Based on what I see in my Moz profile, Moz’s log entries would probably show either “SamuelScott” or “392388” whenever I visit a page after having logged in.
  • [10/Oct/2000:13:55:36 -0700] — The date, time, and timezone of the action in question in strftime format.
  • GET /apache_pb.gif HTTP/1.0 — “GET” is one of the two most common commands (the other is “POST”). “GET” fetches a URL, while “POST” submits something (such as a forum comment). The second part is the URL that is being accessed, and the last part is the version of HTTP that is being used.
  • 200 — The status code of the document that was returned.
  • 2326 — The size, in bytes, of the document that was returned.

Note: A hyphen is shown in a field when that information is unavailable.
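If you ever need to work with these entries programmatically, here is a minimal parsing sketch (Node.js flavor; the field names are my own) that turns one Common Log Format line into the fields described above:

```javascript
var CLF_PATTERN = /^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+|-)/;

function parseLogLine(line) {
  var m = line.match(CLF_PATTERN);
  if (!m) return null;              // not a well-formed entry
  return {
    remoteHost: m[1],               // 127.0.0.1
    identity:   m[2],               // user-identifier
    userId:     m[3],               // frank
    timestamp:  m[4],               // 10/Oct/2000:13:55:36 -0700
    method:     m[5],               // GET or POST
    url:        m[6],               // /apache_pb.gif
    protocol:   m[7],               // HTTP/1.0
    status:     parseInt(m[8], 10), // 200
    bytes:      m[9] === '-' ? 0 : parseInt(m[9], 10)
  };
}

// Note: the widely used Combined Log Format appends two more quoted fields,
// "referer" and "user-agent"; extend the pattern (and add a userAgent field)
// if your server writes that format.
```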

Every single time that you — or the Googlebot — visit a page on a website, a line with this information is output, recorded, and stored by the server.

Log entries are generated continuously, and anywhere from several to thousands can be created every second, depending on the activity level of a given server, network, or application. A collection of log entries is called a log file (or often, in slang, “the log” or “the logs”), and it is displayed with the most recent log entry at the bottom. Individual log files often contain a calendar day’s worth of log entries.

Accessing your log files

Different types of servers store and manage their log files differently. Consult the documentation for your server software (Apache, Nginx, and IIS are the most common, for example) to find out where log data is kept and how to export it.

What is log analysis?

Log analysis (or log analytics) is the process of going through log files to learn something from the data. Some common reasons include:

  • Development and quality assurance (QA) — Creating a program or application and checking for problematic bugs to make sure that it functions properly
  • Network troubleshooting — Responding to and fixing system errors in a network
  • Customer service — Determining what happened when a customer had a problem with a technical product
  • Security issues — Investigating incidents of hacking and other intrusions
  • Compliance matters — Gathering information in response to corporate or government policies
  • Technical SEO — This is my favorite! More on that in a bit.

Log analysis is rarely performed regularly. Usually, people go into log files only in response to something — a bug, a hack, a subpoena, an error, or a malfunction. It’s not something that anyone wants to do on an ongoing basis.

Why? This is a screenshot of just a very small part of one of our own original (unstructured) log files:

Ouch. If a website gets 10,000 visitors who each go to ten pages per day, then the server will create a log file every day that will consist of 100,000 log entries. No one has the time to go through all of that manually.

How to do log analysis

There are three general ways to make log analysis easier in SEO or any other context:

  • Do-it-yourself in Excel
  • Proprietary software such as Splunk or Sumo Logic
  • The ELK Stack open-source software

Tim Resnik’s Moz essay from a few years ago walks you through the process of exporting a batch of log files into Excel. This is a (relatively) quick and easy way to do simple log analysis, but the downside is that one will see only a snapshot in time and not any overall trends. To obtain the best data, it’s crucial to use either proprietary tools or the ELK Stack.
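If you would rather script it than spreadsheet it, a rough Node.js sketch can stream a day’s access log into an array of parsed entries, using a parseLogLine helper like the one sketched earlier (the access.log file name is purely illustrative), ready for whatever aggregation you need:

```javascript
var fs = require('fs');
var readline = require('readline');

var entries = [];

readline.createInterface({ input: fs.createReadStream('access.log') })
  .on('line', function (line) {
    var entry = parseLogLine(line);   // the Common Log Format parser sketched earlier
    if (entry) entries.push(entry);
  })
  .on('close', function () {
    console.log('Parsed ' + entries.length + ' log entries');
    // Run whichever of the checks below you need against `entries` here.
  });
```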

Splunk and Sumo Logic are proprietary log analysis tools that are primarily used by enterprise companies. The ELK Stack is a free and open-source batch of three platforms (Elasticsearch, Logstash, and Kibana) that is owned by Elastic and used more often by smaller businesses. (Disclosure: We at Logz.io use the ELK Stack to monitor our own internal systems as well as for the basis of our own log management software.)

For those who are interested in using this process to do technical SEO analysis, monitor system or application performance, or for any other reason, our CEO, Tomer Levy, has written a guide to deploying the ELK Stack.

Technical SEO insights in log data

However you choose to access and understand your log data, there are many important technical SEO issues to address as needed. I’ve included screenshots of our technical SEO dashboard with our own website’s data to demonstrate what to examine in your logs.

Bot crawl volume

It’s important to know the number of requests made by Baidu, Bingbot, Googlebot, Yahoo, Yandex, and others over a given period of time. If, for example, you want to get found in search in Russia but Yandex is not crawling your website, that is a problem. (You’d want to consult Yandex Webmaster and see this article on Search Engine Land.)
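As a rough sketch of that check, assuming each parsed entry carries a userAgent string (which requires the Combined Log Format rather than the plain Common Log Format), you can total requests per crawler like this. Note that properly verifying bots also means checking reverse DNS, since anyone can fake a user agent:

```javascript
// Count requests per search engine crawler, matched on user-agent substrings.
function botCrawlVolume(entries) {
  var bots = ['Googlebot', 'bingbot', 'Baiduspider', 'YandexBot', 'Slurp'];  // Slurp = Yahoo
  var counts = {};
  entries.forEach(function (entry) {
    bots.forEach(function (bot) {
      if ((entry.userAgent || '').indexOf(bot) !== -1) {
        counts[bot] = (counts[bot] || 0) + 1;
      }
    });
  });
  return counts;   // e.g. { Googlebot: 1540, bingbot: 212 }
}
```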

Response code errors

Moz has a great primer on the meanings of the different status codes. I have an alert system set up that tells me about 4XX and 5XX errors immediately, because those are very significant.
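A sketch of the same check over parsed log entries: group every 4XX and 5XX response by URL so you can see exactly what needs fixing or redirecting (the function name is mine):

```javascript
// Tally error responses per URL from an array of parsed log entries.
function errorResponses(entries) {
  var errors = {};
  entries.forEach(function (entry) {
    if (entry.status >= 400) {
      var key = entry.status + ' ' + entry.url;   // e.g. "404 /old-page"
      errors[key] = (errors[key] || 0) + 1;
    }
  });
  return errors;   // e.g. { '404 /old-page': 37, '500 /checkout': 2 }
}
```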

Temporary redirects

Temporary 302 redirects do not pass along the “link juice” of external links from the old URL to the new one. Almost all of the time, they should be changed to permanent 301 redirects.

Crawl budget waste

Google assigns a crawl budget to each website based on numerous factors. If your crawl budget is, say, 100 pages per day (or the equivalent amount of data), then you want to be sure that all 100 are things that you want to appear in the SERPs. No matter what you write in your robots.txt file and meta-robots tags, you might still be wasting your crawl budget on advertising landing pages, internal scripts, and more. The logs will tell you — I’ve outlined two script-based examples in red above.

If you hit your crawl limit but still have new content that should be indexed to appear in search results, Google may abandon your site before finding it.
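One rough way to surface this waste in a script, again assuming parsed entries with a userAgent field, is to list every URL Googlebot requested that falls outside the sections you actually want indexed. The prefixes in the example call are hypothetical:

```javascript
// Googlebot requests for URLs outside the sections you care about.
function wastedCrawl(entries, wantedPrefixes) {
  var wasted = {};
  entries.forEach(function (entry) {
    if ((entry.userAgent || '').indexOf('Googlebot') === -1) return;
    var wanted = wantedPrefixes.some(function (prefix) {
      return entry.url.indexOf(prefix) === 0;
    });
    if (!wanted) {
      wasted[entry.url] = (wasted[entry.url] || 0) + 1;
    }
  });
  return wasted;
}

// wastedCrawl(entries, ['/products/', '/blog/', '/category/'])
//=> e.g. { '/internal-script.php?id=12': 60, '/lp/ppc-test': 18 }
```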

Duplicate URL crawling

The addition of URL parameters — typically used in tracking for marketing purposes — often results in search engines wasting crawl budget by crawling different URLs with the same content. To learn how to address this issue, I recommend the resources that Google and Search Engine Land have published on handling URL parameters.
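As a sketch, you can spot this pattern in parsed log entries by stripping the query string from each Googlebot request and counting how many distinct parameterized variants of the same path were crawled:

```javascript
// Paths that Googlebot crawled under more than one distinct query string.
function parameterVariants(entries) {
  var byPath = {};
  entries.forEach(function (entry) {
    if ((entry.userAgent || '').indexOf('Googlebot') === -1) return;
    var path = entry.url.split('?')[0];
    byPath[path] = byPath[path] || {};
    byPath[path][entry.url] = true;
  });

  var duplicates = {};
  Object.keys(byPath).forEach(function (path) {
    var variants = Object.keys(byPath[path]).length;
    if (variants > 1) duplicates[path] = variants;   // crawled under 2+ distinct URLs
  });
  return duplicates;   // e.g. { '/shoes': 48 }
}
```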

Crawl priority

Google might be ignoring (and not crawling or indexing) a crucial page or section of your website. The logs will reveal what URLs and/or directories are getting the most and least attention. If, for example, you have published an e-book that attempts to rank for targeted search queries but it sits in a directory that Google only visits once every six months, then you won’t get any organic search traffic from the e-book for up to six months.

If a part of your website is not being crawled very often — and it is updated often enough that it should be — then you might need to check your internal-linking structure and the crawl-priority settings in your XML sitemap.

Last crawl date

Have you uploaded something that you hope will be indexed quickly? The log files will tell you when Google has crawled it.
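A sketch of that lookup over parsed entries follows; the Common Log Format timestamp (e.g. 10/Oct/2000:13:55:36 -0700) is converted by hand because Date() does not reliably parse it:

```javascript
var MONTHS = { Jan: 0, Feb: 1, Mar: 2, Apr: 3, May: 4,  Jun: 5,
               Jul: 6, Aug: 7, Sep: 8, Oct: 9, Nov: 10, Dec: 11 };

function clfDate(stamp) {
  // "10/Oct/2000:13:55:36 -0700" -> Date (the timezone offset is ignored for brevity)
  var m = stamp.match(/^(\d{2})\/(\w{3})\/(\d{4}):(\d{2}):(\d{2}):(\d{2})/);
  return new Date(+m[3], MONTHS[m[2]], +m[1], +m[4], +m[5], +m[6]);
}

// Most recent Googlebot request for a given URL, or null if it has never been crawled.
function lastGooglebotCrawl(entries, url) {
  var latest = null;
  entries.forEach(function (entry) {
    if (entry.url !== url) return;
    if ((entry.userAgent || '').indexOf('Googlebot') === -1) return;
    var when = clfDate(entry.timestamp);
    if (!latest || when > latest) latest = when;
  });
  return latest;
}
```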

Crawl budget

One thing I personally like to check and see is Googlebot’s real-time activity on our site because the crawl budget that the search engine assigns to a website is a rough indicator — a very rough one — of how much it “likes” your site. Google ideally does not want to waste valuable crawling time on a bad website. Here, I had seen that Googlebot had made 154 requests of our new startup’s website over the prior twenty-four hours. Hopefully, that number will go up!

As I hope you can see, log analysis is critically important in technical SEO. It’s eleven o’clock — do you know where your logs are now?



Reblogged 4 years ago from tracking.feedpress.it