Stop Ghost Spam in Google Analytics with One Filter

Posted by CarloSeo

The spam in Google Analytics (GA) is becoming a serious issue. Due to a deluge of referral spam from social buttons, adult sites, and many, many other sources, people are starting to become overwhelmed by all the filters they are setting up to manage the useless data they are receiving.

The good news is, there is no need to panic. In this post, I’m going to focus on the most common mistakes people make when fighting spam in GA, and explain an efficient way to prevent it.

But first, let’s make sure we understand how spam works. A couple of months ago, Jared Gardner wrote an excellent article explaining what referral spam is, including its intended purpose. He also pointed out some great examples of referral spam.

Types of spam

The spam in Google Analytics can be divided into two types: ghosts and crawlers.

Ghosts

The vast majority of spam is this type. They are called ghosts because they never access your site. It is important to keep this in mind, as it’s key to creating a more efficient solution for managing spam.

As unusual as it sounds, this type of spam doesn’t have any interaction with your site at all. You may wonder how that is possible since one of the main purposes of GA is to track visits to our sites.

They do it by using the Measurement Protocol, which allows people to send data directly to Google Analytics’ servers. Using this method, and probably randomly generated tracking codes (UA-XXXXX-1) as well, the spammers leave a “visit” with fake data, without even knowing who they are hitting.
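To make the mechanism concrete, here is a minimal sketch of what such a hit might look like. It only builds the request body and sends nothing. The parameter names (v, tid, cid, t, dh, dp, dr) come from the Universal Analytics Measurement Protocol; the client ID, hostname, path, and referrer values are made up for illustration.

```python
from urllib.parse import urlencode, parse_qs

# Universal Analytics Measurement Protocol endpoint (hits are normally POSTed here).
COLLECT_URL = "https://www.google-analytics.com/collect"


def build_ghost_hit(tracking_id: str) -> str:
    """Return the form-encoded body of a pageview hit that never touches the site."""
    params = {
        "v": "1",                   # protocol version
        "tid": tracking_id,         # property ID, possibly guessed at random
        "cid": "555",               # arbitrary client ID (hypothetical value)
        "t": "pageview",            # hit type
        "dh": "fake-host.example",  # hostname: invented, the sender does not know the real one
        "dp": "/not-a-real-page",   # page path that does not exist on the target site
        "dr": "http://spam-referrer.example/",  # the referral the spammer wants you to see
    }
    return urlencode(params)


body = build_ghost_hit("UA-XXXXX-1")
fields = parse_qs(body)
print(COLLECT_URL)
print(body)
print("hostname reported to GA:", fields["dh"][0])
```

Notice that nothing in the payload requires a request to your server, which is why the hostname field ends up empty or fake.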

Crawlers

This type of spam, the opposite of ghost spam, does access your site. As the name implies, these spam bots crawl your pages, ignoring rules like those found in robots.txt that are supposed to stop them from reading your site. When they exit your site, they leave a record on your reports that appears similar to a legitimate visit.

Crawlers are harder to identify because they know their targets and use real data. But it is also true that new ones seldom appear. So if you detect a referral in your analytics that looks suspicious, researching it on Google or checking it against this list might help you answer the question of whether or not it is spammy.

Most common mistakes made when dealing with spam in GA

I’ve been following this issue closely for the last few months. According to the comments people have made on my articles and conversations I’ve found in discussion forums, there are primarily three mistakes people make when dealing with spam in Google Analytics.

Mistake #1. Blocking ghost spam from the .htaccess file

One of the biggest mistakes people make is trying to block Ghost Spam from the .htaccess file.

For those who are not familiar with this file, one of its main functions is to allow/block access to your site. Now we know that ghosts never reach your site, so adding them here won’t have any effect and will only add useless lines to your .htaccess file.

Ghost spam usually shows up for a few days and then disappears. As a result, sometimes people think that they successfully blocked it from here when really it’s just a coincidence of timing.

Then when the spammers later return, they get worried because the solution is not working anymore, and they think the spammer somehow bypassed the barriers they set up.

The truth is, the .htaccess file can only effectively block crawlers such as buttons-for-website.com and a few others since these access your site. Most of the spam can’t be blocked using this method, so there is no other option than using filters to exclude them.

Mistake #2. Using the referral exclusion list to stop spam

Another error is trying to use the referral exclusion list to stop the spam. The name may confuse you, but this list is not intended to exclude referrals in the way we want to for the spam. It has other purposes.

For example, when a customer buys something, sometimes they get redirected to a third-party page for payment. After making a payment, they’re redirected back to your website, and GA records that as a new referral. It is appropriate to use the referral exclusion list to prevent this from happening.

If you try to use the referral exclusion list to manage spam, however, the referral part will be stripped since there is no preexisting record. As a result, a direct visit will be recorded, and you will have a bigger problem than the one you started with, since you will still have spam, and direct visits are harder to track.

Mistake #3. Worrying that bounce rate changes will affect rankings

When people see that the bounce rate changes drastically because of the spam, they start worrying about the impact that it will have on their rankings in the SERPs.

This is another mistake commonly made. With or without spam, Google doesn’t take into consideration Google Analytics metrics as a ranking factor. Here is an explanation about this from Matt Cutts, the former head of Google’s web spam team.

And if you think about it, Cutts’ explanation makes sense, because although many people have GA, not everyone uses it.

Assuming your site has been hacked

Another common concern when people see strange landing pages coming from spam on their reports is that they have been hacked.

The page that the spam shows on the reports doesn’t exist, and if you try to open it, you will get a 404 page. Your site hasn’t been compromised.

But you have to make sure the page doesn’t exist, because there are cases (not spam) where some sites have a security breach and get injected with pages full of bad keywords to defame the website.

What should you worry about?

Now that we’ve discarded security issues and their effects on rankings, the only thing left to worry about is your data. The fake trail that the spam leaves behind pollutes your reports.

It might have greater or lesser impact depending on your site traffic, but everyone is susceptible to the spam.

Small and midsize sites are the most easily impacted – not only because a big part of their traffic can be spam, but also because usually these sites are self-managed and sometimes don’t have the support of an analyst or a webmaster.

Big sites with a lot of traffic can also be impacted by spam, and although the impact can be insignificant, invalid traffic means inaccurate reports no matter the size of the website. As an analyst, you should be able to explain what’s going on in even the most granular reports.

You only need one filter to deal with ghost spam

Usually it is recommended to add the referral to an exclusion filter after it is spotted. Although this is useful for a quick action against the spam, it has three big disadvantages.

  • Making filters every week for every new spam detected is tedious and time-consuming, especially if you manage many sites. Plus, by the time you apply the filter, and it starts working, you already have some affected data.
  • Some of the spammers use direct visits along with the referrals.
  • These direct hits won’t be stopped by the filter, so even if you are excluding the referral you will still be receiving invalid traffic, which explains why some people have seen an unusual spike in direct traffic.

Luckily, there is a good way to prevent all these problems. Most of the spam (the ghost type) works by hitting random GA tracking IDs, meaning the offender doesn’t really know who the target is, and for that reason either the hostname is not set or a fake one is used. (See report below)

Ghost-Spam.png

You can see that they use some weird names or don’t even bother to set one. Although there are some known names in the list, these can be easily added by the spammer.

On the other hand, valid traffic will always use a real hostname. In most cases, this will be the domain. But it can also result from paid services, translation services, or any other place where you’ve inserted GA tracking code.

Valid-Referral.png

Based on this, we can make a filter that will include only hits that use real hostnames. This will automatically exclude all hits from ghost spam, whether it shows up as a referral, keyword, or pageview; or even as a direct visit.

To create this filter, you will need to find the report of hostnames. Here’s how:

  1. Go to the Reporting tab in GA
  2. Click on Audience in the left-hand panel
  3. Expand Technology and select Network
  4. At the top of the report, click on Hostname

You will see a list of all hostnames, including the ones that the spam uses. Make a list of all the valid hostnames you find, as follows:

  • yourmaindomain.com
  • blog.yourmaindomain.com
  • es.yourmaindomain.com
  • payingservice.com
  • translatetool.com
  • anotheruseddomain.com

For small to medium sites, this list of hostnames will likely consist of the main domain and a couple of subdomains. After you are sure you got all of them, create a regular expression similar to this one:

yourmaindomain\.com|anotheruseddomain\.com|payingservice\.com|translatetool\.com

You don’t need to put all of your subdomains in the regular expression. The main domain will match all of them. If you don’t have a view set up without filters, create one now.
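Before pasting the expression into GA, it can help to try it against a few hostnames. The sketch below assumes the filter pattern is treated as a partial match (which re.search mirrors); the sample hostnames other than those from the list above are invented.

```python
import re

# Expression from the example above. It is unanchored, so it matches
# anywhere in the hostname, subdomains included.
VALID_HOSTNAMES = re.compile(
    r"yourmaindomain\.com|anotheruseddomain\.com|payingservice\.com|translatetool\.com"
)


def is_valid_hostname(hostname: str) -> bool:
    """True when the hit's hostname would pass the include filter."""
    return VALID_HOSTNAMES.search(hostname) is not None


samples = [
    "yourmaindomain.com",
    "blog.yourmaindomain.com",
    "es.yourmaindomain.com",
    "payingservice.com",
    "(not set)",                # ghost hit with no hostname
    "some-ghost-host.example",  # ghost hit with a fake hostname
]
for host in samples:
    print(host, is_valid_hostname(host))
```

Because the expression is unanchored, a hostname such as notyourmaindomain.com would also pass. If that matters for your setup, a stricter pattern is worth considering.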

Then create a Custom Filter.

Make sure you select INCLUDE, then select “Hostname” on the filter field, and copy your expression into the Filter Pattern box.

You might want to verify the filter before saving to check that everything is okay. Once you’re ready, set it to save, and apply the filter to all the views you want (except the view without filters).

This single filter will get rid of future occurrences of ghost spam that use invalid hostnames, and it doesn’t require much maintenance. But it’s important that every time you add your tracking code to any service, you add it to the end of the filter.

Now you should only need to take care of the crawler spam. Since crawlers access your site, you can block them by adding these lines to the .htaccess file:

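## STOP REFERRER SPAM
# Rewrite rules only run when the rewrite engine is on
RewriteEngine On
# Match the Referer header case-insensitively against known crawler spam
RewriteCond %{HTTP_REFERER} semalt\.com [NC,OR]
RewriteCond %{HTTP_REFERER} buttons-for-website\.com [NC]
# Answer matching requests with 403 Forbidden
RewriteRule .* - [F]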

It is important to note that this file is very sensitive, and misplacing a single character in it can bring down your entire site. Therefore, make sure you create a backup copy of your .htaccess file prior to editing it.
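If you want to reason about what those rules do before touching the live file, the logic can be mimicked offline. This is only a rough sketch of the matching behavior, not a replacement for testing on a staging server; the function name and sample referrers are hypothetical.

```python
import re
from typing import Optional

# Same two referrers as the .htaccess rules above; [NC] means case-insensitive.
BLOCKED_REFERRERS = re.compile(r"semalt\.com|buttons-for-website\.com", re.IGNORECASE)


def status_for_request(referer: Optional[str]) -> int:
    """Return 403 when the Referer header matches a blocked crawler, else 200."""
    if BLOCKED_REFERRERS.search(referer or ""):
        return 403  # what the [F] flag answers with: Forbidden
    return 200


for referer in [
    "http://semalt.com/crawler.php",
    "http://Buttons-For-Website.com/",
    "https://www.google.com/",
    None,  # request without a Referer header
]:
    print(referer, status_for_request(referer))
```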

If you don’t feel comfortable messing around with your .htaccess file, you can alternatively make an expression with all the crawlers, and then add it to an exclude filter by Campaign Source.

Implement these combined solutions, and you will worry much less about spam contaminating your analytics data. This will have the added benefit of freeing up more time for you to spend actually analyzing your valid data.

After stopping spam, you can also get clean reports from the historical data by using the same expressions in an Advanced Segment to exclude all the spam.
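The same two expressions can also be applied to data you have already exported. The sketch below assumes each exported row carries a hostname and a source; the rows themselves are invented for illustration.

```python
import re

# Include expression for valid hostnames and exclude expression for crawler sources,
# both taken from the examples earlier in this post.
VALID_HOSTNAMES = re.compile(
    r"yourmaindomain\.com|anotheruseddomain\.com|payingservice\.com|translatetool\.com"
)
CRAWLER_SOURCES = re.compile(r"semalt\.com|buttons-for-website\.com", re.IGNORECASE)


def is_clean(row: dict) -> bool:
    """Keep a row only if its hostname is valid and its source is not a known crawler."""
    valid_host = VALID_HOSTNAMES.search(row["hostname"]) is not None
    crawler = CRAWLER_SOURCES.search(row["source"]) is not None
    return valid_host and not crawler


# Hypothetical exported rows.
rows = [
    {"hostname": "yourmaindomain.com", "source": "google", "sessions": 120},
    {"hostname": "(not set)", "source": "ghost-referrer.example", "sessions": 40},
    {"hostname": "yourmaindomain.com", "source": "semalt.com", "sessions": 15},
    {"hostname": "blog.yourmaindomain.com", "source": "(direct)", "sessions": 30},
]

clean_rows = [row for row in rows if is_clean(row)]
print(sum(row["sessions"] for row in clean_rows), "sessions remain after cleaning")
```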

Bonus resources to help you manage spam

If you still need more information to help you understand and deal with the spam on your GA reports, you can read my main article on the subject here: http://www.ohow.co/what-is-referrer-spam-how-stop-it-guide/.

Additional information on how to stop spam can be found at these URLs:

In closing, I am eager to hear your ideas on this serious issue. Please share them in the comments below.

(Editor’s Note: All images featured in this post were created by the author.)

Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don’t have time to hunt down but want to read!

Reblogged 4 years ago from tracking.feedpress.it

Distance from Perfect

Posted by wrttnwrd

In spite of all the advice, the strategic discussions and the conference talks, we Internet marketers are still algorithmic thinkers. That’s obvious when you think of SEO.

Even when we talk about content, we’re algorithmic thinkers. Ask yourself: How many times has a client asked you, “How much content do we need?” How often do you still hear “How unique does this page need to be?”

That’s 100% algorithmic thinking: Produce a certain amount of content, move up a certain number of spaces.

But you and I know it’s complete bullshit.

I’m not suggesting you ignore the algorithm. You should definitely chase it. Understanding a little bit about what goes on in Google’s pointy little head helps. But it’s not enough.

A tale of SEO woe that makes you go “whoa”

I have this friend.

He ranked #10 for “flibbergibbet.” He wanted to rank #1.

He compared his site to the #1 site and realized the #1 site had five hundred blog posts.

“That site has five hundred blog posts,” he said, “I must have more.”

So he hired a few writers and cranked out five thousand blog posts that melted Microsoft Word’s grammar check. He didn’t move up in the rankings. I’m shocked.

“That guy’s spamming,” he decided, “I’ll just report him to Google and hope for the best.”

What happened? Why didn’t adding five thousand blog posts work?

It’s pretty obvious: My, uh, friend added nothing but crap content to a site that was already outranked. Bulk is no longer a ranking tactic. Google’s very aware of that tactic. Lots of smart engineers have put time into updates like Panda to compensate.

He started like this:

And ended up like this:
more posts, no rankings

Alright, yeah, I was Mr. Flood The Site With Content, way back in 2003. Don’t judge me, whippersnappers.

Reality’s never that obvious. You’re scratching and clawing to move up two spots, you’ve got an overtasked IT team pushing back on changes, and you’ve got a boss who needs to know the implications of every recommendation.

Why fix duplication if rel=canonical can address it? Fixing duplication will take more time and cost more money. It’s easier to paste in one line of code. You and I know it’s better to fix the duplication. But it’s a hard sell.

Why deal with 302 versus 404 response codes and home page redirection? The basic user experience remains the same. Again, we just know that a server should return one home page without any redirects and that it should send a ‘not found’ 404 response if a page is missing. If it’s going to take 3 developer hours to reconfigure the server, though, how do we justify it? There’s no flashing sign reading “Your site has a problem!”

Why change this thing and not that thing?

At the same time, our boss/client sees that the site above theirs has five hundred blog posts and thousands of links from sites selling correspondence MBAs. So they want five thousand blog posts and cheap links as quickly as possible.

Cue crazy music.

SEO lacks clarity

SEO is, in some ways, for the insane. It’s an absurd collection of technical tweaks, content thinking, link building and other little tactics that may or may not work. A novice gets exposed to one piece of crappy information after another, with an occasional bit of useful stuff mixed in. They create sites that repel search engines and piss off users. They get more awful advice. The cycle repeats. Every time it does, best practices get more muddled.

SEO lacks clarity. We can’t easily weigh the value of one change or tactic over another. But we can look at our changes and tactics in context. When we examine the potential of several changes or tactics before we flip the switch, we get a closer balance between algorithm-thinking and actual strategy.

Distance from perfect brings clarity to tactics and strategy

At some point you have to turn that knowledge into practice. You have to take action based on recommendations, your knowledge of SEO, and business considerations.

That’s hard when we can’t even agree on subdomains vs. subfolders.

I know subfolders work better. Sorry, couldn’t resist. Let the flaming comments commence.

To get clarity, take a deep breath and ask yourself:

“All other things being equal, will this change, tactic, or strategy move my site closer to perfect than my competitors?”

Breaking it down:

“Change, tactic, or strategy”

A change takes an existing component or policy and makes it something else. Replatforming is a massive change. Adding a new page is a smaller one. Adding ALT attributes to your images is another example. Changing the way your shopping cart works is yet another.

A tactic is a specific, executable practice. In SEO, that might be fixing broken links, optimizing ALT attributes, optimizing title tags or producing a specific piece of content.

A strategy is a broader decision that’ll cause change or drive tactics. A long-term content policy is the easiest example. Shifting away from asynchronous content and moving to server-generated content is another example.

“Perfect”

No one knows exactly what Google considers “perfect,” and “perfect” can’t really exist, but you can bet a perfect web page/site would have all of the following:

  1. Completely visible content that’s perfectly relevant to the audience and query
  2. A flawless user experience
  3. Instant load time
  4. Zero duplicate content
  5. Every page easily indexed and classified
  6. No mistakes, broken links, redirects or anything else generally yucky
  7. Zero reported problems or suggestions in each search engine’s webmaster tools, sorry, “Search Consoles”
  8. Complete authority through immaculate, organically-generated links

These 8 categories (and any of the other bazillion that probably exist) give you a way to break down “perfect” and help you focus on what’s really going to move you forward. These different areas may involve different facets of your organization.

Your IT team can work on load time and creating an error-free front- and back-end. Link building requires the time and effort of content and outreach teams.

Tactics for relevant, visible content and current best practices in UX are going to be more involved, requiring research and real study of your audience.

What you need and what resources you have are going to impact which tactics are most realistic for you.

But there’s a basic rule: If a website would make Googlebot swoon and present zero obstacles to users, it’s close to perfect.

“All other things being equal”

Assume every competing website is optimized exactly as well as yours.

Now ask: Will this [tactic, change or strategy] move you closer to perfect?

That’s the “all other things being equal” rule. And it’s an incredibly powerful rubric for evaluating potential changes before you act. Pretend you’re in a tie with your competitors. Will this one thing be the tiebreaker? Will it put you ahead? Or will it cause you to fall behind?

“Closer to perfect than my competitors”

Perfect is great, but unattainable. What you really need is to be just a little perfect-er.

Chasing perfect can be dangerous. Perfect is the enemy of the good (I love that quote. Hated Voltaire. But I love that quote). If you wait for the opportunity/resources to reach perfection, you’ll never do anything. And the only way to reduce distance from perfect is to execute.

Instead of aiming for pure perfection, aim for more perfect than your competitors. Beat them feature-by-feature, tactic-by-tactic. Implement strategy that supports long-term superiority.

Don’t slack off. But set priorities and measure your effort. If fixing server response codes will take one hour and fixing duplication will take ten, fix the response codes first. Both move you closer to perfect. Fixing response codes may not move the needle as much, but it’s a lot easier to do. Then move on to fixing duplicates.

Do the 60% that gets you a 90% improvement. Then move on to the next thing and do it again. When you’re done, get to work on that last 40%. Repeat as necessary.

Take advantage of quick wins. That gives you more time to focus on your bigger solutions.

Sites that are “fine” are pretty far from perfect

Google has lots of tweaks, tools and workarounds to help us mitigate sub-optimal sites:

  • Rel=canonical lets us guide Google past duplicate content rather than fix it
  • HTML snapshots let us reveal content that’s delivered using asynchronous content and JavaScript frameworks
  • We can use rel=next and prev to guide search bots through outrageously long pagination tunnels
  • And we can use rel=nofollow to hide spammy links and banners

Easy, right? All of these solutions may reduce distance from perfect (the search engines don’t guarantee it). But they don’t reduce it as much as fixing the problems.
Just fine does not equal fixed

The next time you set up rel=canonical, ask yourself:

“All other things being equal, will using rel=canonical to make up for duplication move my site closer to perfect than my competitors?”

Answer: Not if they’re using rel=canonical, too. You’re both using imperfect solutions that force search engines to crawl every page of your site, duplicates included. If you want to pass them on your way to perfect, you need to fix the duplicate content.

When you use Angular.js to deliver regular content pages, ask yourself:

“All other things being equal, will using HTML snapshots instead of actual, visible content move my site closer to perfect than my competitors?”

Answer: No. Just no. Not in your wildest, code-addled dreams. If I’m Google, which site will I prefer? The one that renders for me the same way it renders for users? Or the one that has to deliver two separate versions of every page?

When you spill banner ads all over your site, ask yourself…

You get the idea. Nofollow is better than follow, but banner pollution is still pretty dang far from perfect.

Mitigating SEO issues with search engine-specific tools is “fine.” But it’s far, far from perfect. If search engines are forced to choose, they’ll favor the site that just works.

Not just SEO

By the way, distance from perfect absolutely applies to other channels.

I’m focusing on SEO, but think of other Internet marketing disciplines. I hear stuff like “How fast should my site be?” (Faster than it is right now.) Or “I’ve heard you shouldn’t have any content below the fold.” (Maybe in 2001.) Or “I need background video on my home page!” (Why? Do you have a reason?) Or, my favorite: “What’s a good bounce rate?” (Zero is pretty awesome.)

And Internet marketing venues are working to measure distance from perfect. Pay-per-click marketing has the quality score: a codified financial reward applied for reducing distance from perfect in as many elements as possible of your advertising program.

Social media venues are aggressively building their own forms of graphing, scoring and ranking systems designed to separate the good from the bad.

Really, all marketing includes some measure of distance from perfect. But no channel is more influenced by it than SEO. Instead of arguing one rule at a time, ask yourself and your boss or client: Will this move us closer to perfect?

Hell, you might even please a customer or two.

One last note for all of the SEOs in the crowd. Before you start pointing out edge cases, consider this: We spend our days combing Google for embarrassing rankings issues. Every now and then, we find one, point, and start yelling “SEE! SEE!!!! THE GOOGLES MADE MISTAKES!!!!” Google’s got lots of issues. Screwing up the rankings isn’t one of them.


Controlling Search Engine Crawlers for Better Indexation and Rankings – Whiteboard Friday

Posted by randfish

When should you disallow search engines in your robots.txt file, and when should you use meta robots tags in a page header? What about nofollowing links? In today’s Whiteboard Friday, Rand covers these tools and their appropriate use in four situations that SEOs commonly find themselves facing.

Video transcription

Howdy Moz fans, and welcome to another edition of Whiteboard Friday. This week we’re going to talk about controlling search engine crawlers, blocking bots, sending bots where we want, restricting them from where we don’t want them to go. We’re going to talk a little bit about crawl budget and what you should and shouldn’t have indexed.

As a start, what I want to do is discuss the ways in which we can control robots. Those include the three primary ones: robots.txt, meta robots, and—well, the nofollow tag is a little bit less about controlling bots.

There are a few others that we’re going to discuss as well, including Webmaster Tools (Search Console) and URL status codes. But let’s dive into those first few first.

Robots.txt lives at yoursite.com/robots.txt. It tells crawlers what they should and shouldn’t access, but it doesn’t always get respected by Google and Bing. So a lot of folks say, “hey, disallow this,” and then suddenly see those URLs popping up and wonder what’s going on. Look, Google and Bing oftentimes think that they just know better. They think that maybe you’ve made a mistake; they think “hey, there’s a lot of links pointing to this content, there’s a lot of people who are visiting and caring about this content, maybe you didn’t intend for us to block it.” The more specific you get about an individual URL, the better they usually are about respecting it. The less specific, meaning the more you use wildcards or say “everything behind this entire big directory,” the worse they are about necessarily believing you.

Meta robots—a little different—that lives in the headers of individual pages, so you can only control a single page with a meta robots tag. That tells the engines whether or not they should keep a page in the index, and whether they should follow the links on that page, and it’s usually a lot more respected, because it’s at an individual-page level; Google and Bing tend to believe you about the meta robots tag.

And then the nofollow tag, that lives on an individual link on a page. It doesn’t tell engines where to crawl or not to crawl. All it’s saying is whether you editorially vouch for a page that is being linked to, and whether you want to pass the PageRank and link equity metrics to that page.

Interesting point about meta robots and robots.txt working together (or not working together so well)—many, many folks in the SEO world do this and then get frustrated.

What if, for example, we take a page like “blogtest.html” on our domain and we say “all user agents, you are not allowed to crawl blogtest.html”? Okay, that’s a good way to keep that page away from being crawled, but just because something is not crawled doesn’t necessarily mean it won’t be in the search results.

So then we have our SEO folks go, “you know what, let’s make doubly sure that doesn’t show up in search results; we’ll put in the meta robots tag:”

<meta name="robots" content="noindex, follow">

So, “noindex, follow” tells the search engine crawler they can follow the links on the page, but they shouldn’t index this particular one.

Then, you go and run a search for “blog test” in this case, and everybody on the team’s like “What the heck!? WTF? Why am I seeing this page show up in search results?”

The answer is, you told the engines that they couldn’t crawl the page, so they didn’t. But they are still putting it in the results. They’re actually probably not going to include a meta description; they might have something like “we can’t include a meta description because of this site’s robots.txt file.” The reason it’s showing up is because they can’t see the noindex; all they see is the disallow.

So, if you want something truly removed, unable to be seen in search results, you can’t just disallow a crawler. You have to say meta “noindex” and you have to let them crawl it.
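The conflict is easy to reproduce offline. The sketch below uses Python's standard robots.txt parser as a stand-in for a compliant crawler (real search engines may behave differently), and the helper names are hypothetical. It shows that once the page is disallowed, the noindex directive is never read.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /blogtest.html
"""

PAGE_HTML = '<html><head><meta name="robots" content="noindex, follow"></head></html>'


class MetaRobotsParser(HTMLParser):
    """Collects the directives from a <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives = [part.strip().lower() for part in content.split(",")]


def visible_directives(robots_txt, url, html):
    """Return the meta robots directives a compliant crawler would actually see."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    if not rules.can_fetch("*", url):
        return None  # the page is never fetched, so its noindex is never read
    parser = MetaRobotsParser()
    parser.feed(html)
    return parser.directives


url = "https://yoursite.com/blogtest.html"
blocked = visible_directives(ROBOTS_TXT, url, PAGE_HTML)
crawlable = visible_directives("", url, PAGE_HTML)
print("with disallow:", blocked)
print("without disallow:", crawlable)
```

With the disallow in place the function returns nothing at all; only when crawling is allowed does the noindex become visible.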

So this creates some complications. Robots.txt can be great if we’re trying to save crawl bandwidth, but it isn’t necessarily ideal for preventing a page from being shown in the search results. I would not recommend, by the way, that you do what we think Twitter recently tried to do, where they tried to canonicalize www and non-www by saying “Google, don’t crawl the www version of twitter.com.” What you should be doing is rel canonical-ing or using a 301.

Meta robots—that can allow crawling and link-following while disallowing indexation, which is great, but it requires crawl budget and you can still conserve indexing.

The nofollow tag, generally speaking, is not particularly useful for controlling bots or conserving indexation.

Webmaster Tools (now Google Search Console) has some special things that allow you to restrict access or remove a result from the search results. For example, if you have 404’d something or if you’ve told them not to crawl something but it’s still showing up in there, you can manually say “don’t do that.” There are a few other crawl protocol things that you can do.

And then URL status codes—these are a valid way to do things, but they’re going to obviously change what’s going on on your pages, too.

If you’re not having a lot of luck using a 404 to remove something, you can use a 410 to permanently remove something from the index. Just be aware that once you use a 410, it can take a long time if you want to get that page re-crawled or re-indexed, and you want to tell the search engines “it’s back!” 410 is permanent removal.

301—permanent redirect, we’ve talked about those here—and 302, temporary redirect.
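As a quick reference, the four codes mentioned here can be listed with their standard names. The short descriptions are paraphrased from this transcript, not from any official crawler documentation.

```python
from http import HTTPStatus

# How the transcript describes each code, keyed by the standard status.
CRAWLER_SIGNALS = {
    HTTPStatus.MOVED_PERMANENTLY: "permanent redirect",
    HTTPStatus.FOUND: "temporary redirect",
    HTTPStatus.NOT_FOUND: "page is missing",
    HTTPStatus.GONE: "permanent removal from the index",
}

for status, meaning in CRAWLER_SIGNALS.items():
    print(status.value, status.phrase, "-", meaning)
```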

Now let’s jump into a few specific use cases of “what kinds of content should and shouldn’t I allow engines to crawl and index” in this next version…

[Rand moves at superhuman speed to erase the board and draw part two of this Whiteboard Friday. Seriously, we showed Roger how fast it was, and even he was impressed.]

Four crawling/indexing problems to solve

So we’ve got these four big problems that I want to talk about as they relate to crawling and indexing.

1. Content that isn’t ready yet

The first one here is around, “If I have content of quality I’m still trying to improve—it’s not yet ready for primetime, it’s not ready for Google, maybe I have a bunch of products and I only have the descriptions from the manufacturer and I need people to be able to access them, so I’m rewriting the content and creating unique value on those pages… they’re just not ready yet—what should I do with those?”

My options around crawling and indexing? If I have a large quantity of those—maybe thousands, tens of thousands, hundreds of thousands—I would probably go the robots.txt route. I’d disallow those pages from being crawled, and then eventually as I get (folder by folder) those sets of URLs ready, I can then allow crawling and maybe even submit them to Google via an XML sitemap.
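A folder-by-folder rollout like that could be scripted so the robots.txt file always reflects which sections are still being rewritten. This is a minimal sketch; the folder names are hypothetical, and Python's standard robots.txt parser is used only to sanity-check the result.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(unready_folders):
    """Return robots.txt text that disallows every folder still being rewritten."""
    lines = ["User-agent: *"]
    lines += [f"Disallow: {folder}" for folder in unready_folders]
    return "\n".join(lines) + "\n"


def can_crawl(robots_txt, url):
    """Check a URL against the generated rules the way a compliant crawler would."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch("*", url)


# Hypothetical folders: /products/shirts/ is ready, the other two are not.
robots_txt = build_robots_txt(["/products/mugs/", "/products/posters/"])
print(robots_txt)
print(can_crawl(robots_txt, "https://yoursite.com/products/mugs/blue.html"))
print(can_crawl(robots_txt, "https://yoursite.com/products/shirts/starwars.html"))
```

As each folder is finished, drop it from the list and regenerate the file.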

If I’m talking about a small quantity—a few dozen, a few hundred pages—well, I’d probably just use the meta robots noindex, and then I’d pull that noindex off of those pages as they are made ready for Google’s consumption. And then again, I would probably use the XML sitemap and start submitting those once they’re ready.
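For the sitemap step, a small script can emit the XML for whichever pages are ready. The sketch below follows the sitemaps.org format with only the required loc element; the page URLs are made up.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a minimal XML sitemap listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages whose noindex has just been removed.
ready_pages = [
    "https://yoursite.com/products/shirts/starwars.html",
    "https://yoursite.com/products/shirts/plain.html",
]
sitemap_xml = build_sitemap(ready_pages)
print(sitemap_xml)
```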

2. Dealing with duplicate or thin content

What about, “Should I noindex, nofollow, or potentially disallow crawling on largely duplicate URLs or thin content?” I’ve got an example. Let’s say I’m an ecommerce shop, I’m selling this nice Star Wars t-shirt which I think is kind of hilarious, so I’ve got starwarsshirt.html, and it links out to a larger version of an image, and that’s an individual HTML page. It links out to different colors, which change the URL of the page, so I have a gray, blue, and black version. Well, these four pages are really all part of this same one, so I wouldn’t recommend disallowing crawling on these, and I wouldn’t recommend noindexing them. What I would do there is a rel canonical.

Remember, rel canonical is one of those things that can be precluded by disallowing. So, if I were to disallow these from being crawled, Google couldn’t see the rel canonical pointing back to the default version, so if someone linked to the blue version instead of the default version, now I potentially don’t get link credit for that. So what I really want to do is use the rel canonical, allow the indexing, and allow it to be crawled. If you really feel like it, you could also put a meta “noindex, follow” on these pages, but I don’t really think that’s necessary, and again that might interfere with the rel canonical.
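As a rough sketch of the t-shirt example, each variant page would carry the same canonical tag in its head. The URLs below (including the way color and image variants are addressed) are assumptions made up for illustration.

```python
from html import escape

# Hypothetical URLs: one default page plus its color and image variants.
CANONICAL_URL = "https://example.com/starwarsshirt.html"
VARIANT_URLS = [
    "https://example.com/starwarsshirt.html?color=gray",
    "https://example.com/starwarsshirt.html?color=blue",
    "https://example.com/starwarsshirt.html?color=black",
    "https://example.com/starwarsshirt-large-image.html",
]


def canonical_link_tag(canonical_url):
    """Build the <link> element that belongs in a variant page's <head>."""
    return f'<link rel="canonical" href="{escape(canonical_url, quote=True)}">'


# Every variant stays crawlable and simply points back at the default page.
tags_by_variant = {url: canonical_link_tag(CANONICAL_URL) for url in VARIANT_URLS}
```

The point of the sketch is that nothing is blocked: all four variants remain crawlable, and the tag is what tells Google which page should get the credit.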

3. Passing link equity without appearing in search results

Number three: “If I want to pass link equity (or at least crawling) through a set of pages without those pages actually appearing in search results—so maybe I have navigational stuff, ways that humans are going to navigate through my pages, but I don’t need those appearing in search results—what should I use then?”

What I would say here is, you can use the meta robots to say “don’t index the page, but do follow the links that are on that page.” That’s a pretty nice, handy use case for that.

Do NOT, however, disallow those in robots.txt. Many, many folks make this mistake. If you disallow crawling on those pages, Google can’t see the noindex, and it doesn’t know that it can follow the links. Granted, as we talked about before, sometimes Google doesn’t obey the robots.txt, but you can’t rely on that behavior. Trust that the disallow in robots.txt will prevent them from crawling. So I would say, the meta robots “noindex, follow” is the way to do this.
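One way to catch this mistake is to compare your noindex pages against your robots.txt. The sketch below assumes a hypothetical site layout and uses Python's standard urllib.robotparser; the meta tag string is the standard robots meta tag.

```python
from urllib.robotparser import RobotFileParser

# The tag that navigational pages would carry in their <head>.
META_NOINDEX_FOLLOW = '<meta name="robots" content="noindex, follow">'


def blocked_noindex_pages(robots_txt, noindex_urls, user_agent="Googlebot"):
    """Return noindex pages that robots.txt also blocks.

    A crawler that is not allowed to fetch a page never sees its meta
    robots tag, so these pages carry conflicting instructions.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in noindex_urls if not parser.can_fetch(user_agent, url)]


# Hypothetical site: navigation pages carry the noindex, follow tag,
# but one of the folders is also disallowed by mistake.
robots_txt = "User-agent: *\nDisallow: /nav/\n"
noindex_urls = [
    "https://example.com/nav/by-brand.html",
    "https://example.com/browse/by-color.html",
]
conflicts = blocked_noindex_pages(robots_txt, noindex_urls)
```

Any URL that ends up in conflicts should have its disallow rule removed so the noindex, follow tag can actually be read.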

4. Search results-type pages

Finally, fourth, “What should I do with search results-type pages?” Google has said many times that they don’t like your search results from your own internal engine appearing in their search results, and so this can be a tricky use case.

Sometimes a search result page—a page that lists many types of results that might come from a database of types of content that you’ve got on your site—could actually be a very good result for a searcher who is looking for a wide variety of content, or who wants to see what you have on offer. Yelp does this: When you say, “I’m looking for restaurants in Seattle, WA,” they’ll give you what is essentially a list of search results, and Google does want those to appear because that page provides a great result. But you should be doing what Yelp does there, and make the most common or popular individual sets of those search results into category-style pages. A page that provides real, unique value, that’s not just a list of search results, that is more of a landing page than a search results page.

However, that being said, if you’ve got a long tail of these, or if you’d say “hey, our internal search engine, that’s really for internal visitors only—it’s not useful to have those pages show up in search results, and we don’t think we need to make the effort to make those into category landing pages.” Then you can use the disallow in robots.txt to prevent those.

Just be cautious here, because I have sometimes seen an over-swinging of the pendulum toward blocking all types of search results, and sometimes that can actually hurt your SEO and your traffic. Sometimes those pages can be really useful to people. So check your analytics, and make sure those aren’t valuable pages that should be served up and turned into landing pages. If you’re sure, then go ahead and disallow all your search results-style pages. You’ll see a lot of sites doing this in their robots.txt file.
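The "check your analytics" step could look something like this. It is a sketch under stated assumptions: the /search path prefix, the cutoff of 50 organic entrances, and the sample numbers are all made up, and the input is assumed to be a simple export of landing pages with their organic entrances.

```python
from urllib.parse import urlparse

# Hypothetical values: the internal search path and a traffic cutoff.
SEARCH_PATH_PREFIX = "/search"
MIN_ORGANIC_ENTRANCES = 50


def split_search_pages(landing_page_entrances):
    """Separate search result URLs worth keeping from ones safe to disallow.

    landing_page_entrances maps a URL to its organic entrances for a period.
    """
    keep, disallow = [], []
    for url, entrances in landing_page_entrances.items():
        if not urlparse(url).path.startswith(SEARCH_PATH_PREFIX):
            continue  # not a search results page
        (keep if entrances >= MIN_ORGANIC_ENTRANCES else disallow).append(url)
    return keep, disallow


# Made-up numbers purely to show the shape of the input.
sample = {
    "https://example.com/search?q=restaurants+seattle": 420,
    "https://example.com/search?q=asdf": 1,
    "https://example.com/about": 900,
}
keep, disallow = split_search_pages(sample)
```

Pages that land in keep are candidates for proper category-style landing pages; only the remainder would be covered by a disallow rule.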

That being said, I hope you have some great questions about crawling and indexing, controlling robots, blocking robots, allowing robots, and I’ll try and tackle those in the comments below.

We’ll look forward to seeing you again next week for another edition of Whiteboard Friday. Take care!

Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don’t have time to hunt down but want to read!

Reblogged 4 years ago from tracking.feedpress.it

Is that Mind-Blowing Title Blowing Your Credibility? You Decide

Posted by Isla_McKetta


Image of Tantalus courtesy of Clayton Cusak

What if I told you I could teach you to write the perfect headline? One that is so irresistible every person who sees it will click on it. You’d sign up immediately and maybe even promise me your firstborn.

But what if I then told you not one single person out of all the millions who will click on that headline will convert? And that you might lose all your credibility in the process. Would all the traffic generated by that “perfect” headline be worth it?

Help us solve a dispute

It isn’t really that bad, but with all the emphasis lately on headline science and the curiosity gap, Trevor (your faithful editor) and I (a recovering copywriter) started talking about the importance of headlines and what their role should be with regard to content. I’m for clickability (as long as there is strong content to back the headline) and, if he has to choose, Trevor is for credibility (with an equal emphasis on quality of the eventual content).

[Image: credible vs clickable headlines]

What’s the purpose of a headline?

Back in the good ol’ days, headlines were created to sell newspapers. Newsboys stood on street corners shouting the headlines in an attempt to hawk those newspapers. Headlines had to be enough of a tease to get readers interested but they had to be trustworthy enough to get a reader to buy again tomorrow. Competition for eyeballs was less fierce because a town only had so many newspapers, but paper cost money and editors were always happy to get a repeat customer.

Nowadays the competition for eyeballs feels even stiffer because it’s hard to get noticed in the vast sea of the internet. It’s easy to feel a little desperate. And it seems like the opportunity cost of turning away a customer is much lower than it was before. But aren’t we doing content as a product? Does the quality of that product matter?

The forbidden secrets of clickable headlines

There’s no arguing that headlines are important. In fact, at MozCon this year, Nathalie Nahai reminded us that many copywriters recommend an 80:20 ratio of energy spent on headline to copy. That might be taking things a bit far, but a bad (or even just boring) headline will tank your traffic. Here is some expert advice on writing headlines that convert:

  • Nahai advises that you take advantage of psychological trigger words like, “weird,” “free,” “incredible,” and “secret” to create a sense of urgency in the reader. Can you possibly wait to read “Secret Ways Butter can Save Your Life”?
  • Use question headlines like “Can You Increase Your Sales by 45% in Only 5 Minutes a Day?” that get a reader asking themselves, “I dunno, can I?” and clicking to read more.
  • Key into the curiosity gap with a headline like “What Mother Should Have Told You about Banking. (And How Not Knowing is Costing You Friends.)” Ridiculous claim? Maybe, but this kind of headline gets a reader hooked on narrative and they have to click through to see how the story comes together.
  • And if you’re looking for a formula for the best headlines ever, Nahai proposes the following:
    Number/Trigger word + Adjective + Keyword + Promise = Killer Headline.

Many readers still (consciously or not) consider headlines a promise. So remember, as you fill the headline with hyperbole and only write eleven of the twelve tips you set out to write, there is a reader on the other end hoping butter really is good for them.

The headline danger zone

This is where headline science can get ugly. Because a lot of “perfect” titles simply do not have the quality or depth of content to back them.

Those types of headlines remind me of the Greek myth of Tantalus. For sharing the secrets of the gods with the common folk, Tantalus was condemned to spend eternity surrounded by food and drink that were forever out of his reach. Now, content is hardly the secrets of the gods, but are we tantalizing our customers with teasing headlines that will never satisfy?

[Image: BuzzFeed headlines]

For me, reading headlines on BuzzFeed and Upworthy and their ilk is like talking to the guy at the party with all those super wild anecdotes. He’s entertaining, but I don’t believe a word he says, soon wish he would shut up, and can’t remember his name five seconds later. Maybe I don’t believe in clickability as much as I thought…

So I turn to credible news sources for credible headlines.

[Image: Washington Post headlines]

I’m having trouble deciding at this point if I’m more bothered by the headline at The Washington Post, the fact that they’re covering that topic at all, or that they didn’t really go for true clickbait with something like “You Won’t Believe the Bizarre Reasons Girls Scream at Boy Band Concerts.” But one (or all) of those things makes me very sad.

Are we developing an immunity to clickbait headlines?

Even Upworthy is shifting their headline creation tactics a little. But that doesn’t mean they are switching from clickbait; it just means they’ve seen their audience get tired of the same old tactics. So they’re looking for new and better tactics to keep you engaged and clicking.

The importance of traffic

I think many of us would sell a little of our soul if it would increase our traffic, and of course those clickbaity curiosity gap headlines are designed to do that (and are mostly working, for now).

But we also want good traffic. The kind of people who are going to engage with our brand and build relationships with us over the long haul, right? Back to what we were discussing in the intro, we want the kind of traffic that’s likely to convert. Don’t we?

As much as I advocate for clickable headlines, the riskier the headline I write, the more closely I compare overall traffic (especially returning visitors) to click-throughs, time on page, and bounce rate to see if I’ve pushed it too far and am alienating our most loyal fans. Because new visitors are awesome, but loyal customers are priceless.

Headline science at Moz

At Moz, we’re trying to find the delicate balance between attracting all the customers and attracting the right customers. In my first week here when Trevor and Cyrus were polling readers on what headline they’d prefer to read, I advocated for a more clickable version. See if you can pick out which is mine…

[Image: headline poll]

Yep, you guessed it. I suggested “Your Google Algorithm Cheat Sheet: Panda, Penguin, and Hummingbird” because it contained a trigger word and a keyword, plus it was punchy. I actually liked “A Layman’s Explanation of the Panda Algorithm, the Penguin Algorithm, and Hummingbird,” but I was pretty sure no one would click on it.

Last time I checked, that has more traffic than any other post for the month of June. I won’t say that’s all because of the headline—it’s a really strong and useful post—but I think the headline helped a lot.

But that’s just one data point. I’ve also been spicing up the subject lines on the Moz Top 10 newsletter to see what gets the most traffic.

[Image: most-read subject lines]

And the results here are more mixed. Titles I felt were much more clickbaity, like “Did Google Kill Spam?…” and “Are You Using Robots.txt the Right Way?…”, underperformed compared to the straight-up “Moz Top 10.”

Meanwhile, the most clickbaity “Groupon Did What?…” and the two about Google selling domains (which was accurate but suggested that Google was selling its own domains, which worried me a bit) have the most opens overall.

Help us resolve the dispute

As you can tell, I have some unresolved feelings about this whole clickbait versus credibility thing. While Trevor and I have strong opinions, we also have a lot of questions that we hope you can help us with. Blow my mind with your headline logic in the comments by sharing your opinion on any of the following:

  • Do clickbait titles erode trust? If yes, do you ever worry about that affecting your bottom line?
  • Would you sacrifice credibility for clickability? Does it have to be a choice?
  • Is there such a thing as a formula for a perfect headline? What standards do you use when writing headlines?
  • Does a clickbait title affect how likely you are to read an article? What about sharing one? Do you ever feel duped by the content? Does that affect your behavior the next time?  
  • How much of your soul would you sell for more traffic?

Reblogged 5 years ago from feedproxy.google.com

Experiment: We Removed a Major Website from Google Search, for Science!

Posted by Cyrus-Shepard

The folks at Groupon surprised us earlier this summer when they reported the results of an experiment that showed that up to 60% of direct traffic is organic.

In order to accomplish this, Groupon de-indexed their site, effectively removing themselves from Google search results. That’s crazy talk!

Of course, we knew we had to try this ourselves.

We rolled up our sleeves and chose to de-index Followerwonk, both for its consistent Google traffic and its good analytics setup, so that we could properly measure everything. We were also confident we could quickly bring the site back into Google’s results, which minimized the business risks.

(We discussed de-indexing our main site moz.com, but… no soup for you!)

We wanted to measure and test several things:

  1. How quickly will Google remove a site from its index?
  2. How much of our organic traffic is actually attributed as direct traffic?
  3. How quickly can you bring a site back into search results using the URL removal tool?

Here’s what happened.

How to completely remove a site from Google

The fastest, simplest, and most direct method to completely remove an entire site from Google search results is by using the URL removal tool.

We also understood, via statements from Google engineers, that using this method gave us the biggest chance of bringing the site back, with little risk. Other methods of de-indexing, such as using meta robots NOINDEX, might have taken weeks and caused recovery to take months.

CAUTION: Removing any URLs from a search index is potentially very dangerous, and should be taken very seriously. Do not try this at home; you will not pass go, and will not collect $200!

After submitting the request, Followerwonk URLs started disappearing from Google search results within 2-3 hours.

The information needs to propagate across different data centers around the globe, so the effect can be delayed in some areas. In fact, for the entire duration of the test, organic Google traffic continued to trickle in and never dropped to zero.

The effect on direct vs. organic traffic

In the Groupon experiment, they found that when they lost organic traffic, they actually lost a bunch of direct traffic as well. The Groupon conclusion was that a large amount of their direct traffic was actually organic: up to 60% on “long URLs”.

At first glance, the overall amount of direct traffic to Followerwonk didn’t change significantly, even when organic traffic dropped.

In fact, we could find no discrepancy in direct traffic outside the expected range.

I ran this by our contacts at Groupon, who said this wasn’t totally unexpected. You see, in their experiment they saw the biggest drop in direct traffic on long URLs, defined as a URL that is at least long enough to be in a subfolder, like https://followerwonk.com/bio/?q=content+marketer.

For Followerwonk, the vast majority of traffic goes to the homepage and a handful of other URLs. This means we didn’t have a statistically significant sample size of long URLs to judge the effect. For the long URLs we were able to measure, the results were nebulous. 
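To make the "long URL" idea concrete, here is one possible reading of it in code. The rule used (a path containing at least one folder segment) is an assumption based on the example URL above, not Groupon's published definition, and the helper names are hypothetical.

```python
from urllib.parse import urlparse


def is_long_url(url):
    """Return True when the URL's path sits inside at least one folder.

    This is one reading of "long enough to be in a subfolder": the path
    needs a folder segment, so it contains at least two slashes.
    """
    return urlparse(url).path.count("/") >= 2


def long_url_share(direct_landing_urls):
    """Fraction of direct visits that landed on long URLs."""
    if not direct_landing_urls:
        return 0.0
    long_hits = sum(1 for url in direct_landing_urls if is_long_url(url))
    return long_hits / len(direct_landing_urls)
```

A site like Followerwonk, where most visits land on the homepage, would show a small share here, which is why the sample of long URLs was too thin to judge.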

Conclusion: While we can’t confirm the Groupon results with our outcome, we can’t discount them either.

It’s quite likely that a portion of your organic traffic is attributed as direct. This is because different browsers, operating systems, and user privacy settings can potentially block referral information from reaching your website.
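A simplified sketch shows why a missing referrer turns an organic visit into a "direct" one. This is not how Google Analytics is implemented; the short list of search engine hostnames and the function name are assumptions for illustration only.

```python
from urllib.parse import urlparse

# Hypothetical, deliberately short list of search engine hostname prefixes.
SEARCH_ENGINE_HOSTS = ("google.", "bing.", "duckduckgo.")


def classify_visit(referrer):
    """Label a visit the way a simple analytics tool might."""
    if not referrer:
        # No referrer arrived, whatever the visitor's real source was.
        return "direct"
    host = urlparse(referrer).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    if host.startswith(SEARCH_ENGINE_HOSTS):
        return "organic"
    return "referral"
```

For example, a click from a Google results page normally arrives with a Google referrer and is labeled organic, but if the browser or a privacy setting strips that referrer, the same click reaches classify_visit as an empty value and is counted as direct.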

Bringing your site back from death

After waiting 2 hours, we deleted the request. Within a few hours all traffic returned to normal. Whew!

Does Google need to recrawl the pages?

If the time period is short enough, and you used the URL removal tool, apparently not.

In the case of Followerwonk, Google removed over 300,000 URLs from its search results, and made them all reappear in mere hours. This suggests that the domain wasn’t completely removed from Google’s index, but only “masked” from appearing for a short period of time.

What about longer periods of de-indexation?

In both the Groupon and Followerwonk experiments, the sites were only de-indexed for a short period of time, and bounced back quickly.

We wanted to find out what would happen if you de-indexed a site for a longer period, like two and a half days.

I couldn’t convince the team to remove any of our sites from Google search results for a few days, so I chose a smaller personal site that I often subject to merciless SEO experiments.

In this case, I de-indexed the site and didn’t remove the request until three days later. Even with this longer period, all URLs returned within just a few hours of cancelling the URL removal request.

In the chart below, we revoked the URL removal request on Friday the 25th. The next two days were Saturday and Sunday, both lower traffic days.

Test #2: De-index a personal site for 3 days

Likely, the URLs were still in Google’s index, so we didn’t have to wait for them to be recrawled. 

Here’s another shot of organic traffic before and after the second experiment.

For longer removal periods, a few weeks for example, I speculate Google might drop these semi-permanently from the index, and re-inclusion would take much longer.

What we learned

  1. While a portion of your organic traffic may be attributed as direct (due to browsers, privacy settings, etc.), in our case the effect on direct traffic was negligible.
  2. If you accidentally de-index your site using Google Webmaster Tools, in most cases you can quickly bring it back to life by deleting the request.
  3. Reinclusion happened quickly even after we removed a site for over 2 days. Longer than this, the result is unknown, and you could have problems getting all the pages of your site indexed again.

Further reading

Moz community member Adina Toma wrote an excellent YouMoz post on the re-inclusion process using the same technique, with some excellent tips for other, more extreme situations.

Big thanks to Peter Bray for volunteering Followerwonk for testing. You are a brave man!

Reblogged 5 years ago from feedproxy.google.com