Using Term Frequency Analysis to Measure Your Content Quality

Posted by EricEnge

It’s time to look at your content differently—time to start understanding just how good it really is. I am not simply talking about titles, keyword usage, and meta descriptions. I am talking about the entire page experience. In today’s post, I am going to introduce the general concept of content quality analysis, why it should matter to you, and how to use term frequency (TF) analysis to gather ideas on how to improve your content.

TF analysis is usually combined with inverse document frequency analysis (collectively, TF-IDF analysis). TF-IDF has been a staple of information retrieval science for a long time. You can read more about TF-IDF and other search science concepts in Cyrus Shepard’s excellent article here.

For purposes of today’s post, I am going to show you how you can use TF analysis to get clues as to what Google is valuing in the content of sites that currently outrank you. But first, let’s get oriented.

Conceptualizing page quality

Start by asking yourself if your page provides a quality experience to people who visit it. For example, if a search engine sends 100 people to your page, how many of them will be happy? Seventy percent? Thirty percent? Less? What if your competitor’s page gets a higher percentage of happy users than yours does? Does that feel like an “uh-oh”?

Let’s think about this with a specific example in mind. What if you ran a golf club site, and 100 people came to your page after searching on a phrase like “golf clubs”? What are the kinds of things they may be looking for?

Here are some things they might want:

  1. A way to buy golf clubs on your site (they would need to see a shopping cart of some sort).
  2. The ability to select specific brands, perhaps by links to other pages about those brands of golf clubs.
  3. Information on how to pick the club that is best for them.
  4. The ability to select specific types of clubs (drivers, putters, irons, etc.). Again, this may be via links to other pages.
  5. A site search box.
  6. Pricing info.
  7. Info on shipping costs.
  8. Expert analysis comparing different golf club brands.
  9. End user reviews of your company so they can determine if they want to do business with you.
  10. How your return policy works.
  11. How they can file a complaint.
  12. Information about your company. Perhaps an “about us” page.
  13. A link to a privacy policy page.
  14. Whether or not you have been “in the news” recently.
  15. Trust symbols that show that you are a reputable organization.
  16. A way to access pages to buy different products, such as golf balls or tees.
  17. Information about specific golf courses.
  18. Tips on how to improve their golf game.

This is really only a partial list, and the specifics of your site can certainly vary for any number of reasons from what I laid out above. So how do you figure out what it is that people really want? You could pull in data from a number of sources. For example, using data from your site search box can be invaluable. You can do user testing on your site. You can conduct surveys. These are all good sources of data.

You can also look at your analytics data to see what pages get visited the most. Just be careful how you use that data. For example, if most of your traffic is from search, this data will be biased by incoming search traffic, and hence what Google chooses to rank. In addition, you may only have a small percentage of the visitors to your site going to your privacy policy, but chances are good that there are significantly more users than that who notice whether or not you have a privacy policy. Many of these will be satisfied just to see that you have one and won’t actually go check it out.

Whatever you do, it’s worth using many of these methods to determine what users want from the pages of your site and then using the resulting information to improve your overall site experience.

Is Google using this type of info as a ranking factor?

At some level, they clearly are. Google and Bing have evolved far beyond the initial TF-IDF concepts, but we can still use those concepts to better understand our own content.

The first major indication we had that Google was performing content quality analysis was the release of the Panda algorithm in February of 2011. More recently, we know that on April 21 Google will release an algorithm update that makes the mobile-friendliness of a website a ranking factor. Pure and simple, this algo is about the user experience with a page.

Exactly how Google is performing these measurements is not known, but what we do know is their intent. They want to make their search engine look good, largely because it helps them make more money. Sending users to pages that make them happy will do that. Google has every incentive to improve the quality of their search results in as many ways as they can.

Ultimately, we don’t actually know what Google is measuring and using. It may be that the only SEO impact of providing pages that satisfy a very high percentage of users is an indirect one. That is, so many people like your site that it gets written about more, linked to more, shared widely on social media, and engaged with deeply; Google then sees other signals it uses as ranking factors, and that is why your rankings improve.

But do I care whether the impact is direct or indirect? Well, NO.

Using TF analysis to evaluate your page

TF-IDF analysis is more about relevance than content quality, but we can still use various precepts from it to help us understand our own content quality. One way to do this is to compare the results of a TF analysis of all the keywords on your page with those pages that currently outrank you in the search results. In this section, I am going to outline the basic concepts for how you can do this. In the next section I will show you a process that you can use with publicly available tools and a spreadsheet.

The simplest form of TF analysis is to count the number of uses of each keyword on a page. However, the problem with that is that a page using a keyword 10 times will be seen as 10 times more valuable than a page that uses a keyword only once. For that reason, we dampen the calculations. I have seen two methods for doing this, as follows:

[Image: term frequency calculation formulas]

The first method relies on dividing the number of repetitions of a keyword by the count for the most popular word on the entire page. Basically, what this does is eliminate the inherent advantage that longer documents might otherwise have over shorter ones. The second method dampens the total impact in a different way, by taking the log base 10 for the actual keyword count. Both of these achieve the effect of still valuing incremental uses of a keyword, but dampening it substantially. I prefer to use method 1, but you can use either method for our purposes here.
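
To make the two dampening methods concrete, here is a minimal Python sketch. The formulas follow the description above (normalizing by the most frequent word on the page, and a plain log base 10 of the raw count); the function and variable names are illustrative, not the precise calculations any particular tool or search engine uses.

```python
from collections import Counter
import math
import re

def term_frequencies(text):
    """Count each word on a page, then dampen the raw counts two ways."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    max_count = max(counts.values())

    tf = {}
    for word, count in counts.items():
        tf[word] = {
            "raw": count,
            # Method 1: normalize by the count of the most frequent word on the page
            "normalized": count / max_count,
            # Method 2: log base 10 of the raw count (a common variant adds 1 so a
            # single use still scores above zero)
            "log_dampened": math.log10(count),
        }
    return tf

page_text = "Golf clubs for every golfer. Compare golf club brands, golf club reviews, and prices."
for word, scores in sorted(term_frequencies(page_text).items(), key=lambda kv: -kv[1]["raw"])[:5]:
    print(word, scores)
```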

Once you have the TF calculated for every different keyword found on your page, you can then start to do the same analysis for pages that outrank you for a given search term. If you were to do this for five competing pages, the result might look something like this:

[Image: term frequency comparison spreadsheet]

I will show you how to set up the spreadsheet later, but for now, let’s do the fun part, which is to figure out how to analyze the results. Here are some of the things to look for:

  1. Are there any highly related words that all or most of your competitors are using that you don’t use at all?
  2. Are there any such words that you use significantly less, on average, than your competitors?
  3. Are there any words that you use significantly more than your competitors?

You can then tag these words for further analysis. Once you are done, your spreadsheet may now look like this:

[Image: second-stage term frequency analysis spreadsheet]

To make everything fit into the screenshot above and keep it legible, I eliminated some columns you saw in my first spreadsheet. However, I did a sample analysis for the movie “Woman in Gold.” You can see the full spreadsheet of calculations here. Note that we used an automated approach to marking some items as “Low Ratio,” “High Ratio,” or “All Competitors Have, Client Does Not.”
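
For reference, here is a rough sketch of how that kind of automated flagging could work. The 0.5x and 2x ratio thresholds are my own illustrative assumptions, not the exact cutoffs used in the original spreadsheet.

```python
def flag_term(client_tf, competitor_tfs, low=0.5, high=2.0):
    """Compare one term's dampened TF on the client page against the competitor pages.

    The low/high thresholds are illustrative assumptions, not the ones used in
    the original spreadsheet.
    """
    in_use = [tf for tf in competitor_tfs if tf > 0]
    if client_tf == 0 and len(in_use) == len(competitor_tfs):
        return "All Competitors Have, Client Does Not"
    avg = sum(competitor_tfs) / len(competitor_tfs)
    if avg == 0:
        return ""
    ratio = client_tf / avg
    if ratio <= low:
        return "Low Ratio"
    if ratio >= high:
        return "High Ratio"
    return ""

print(flag_term(0.0, [0.12, 0.08, 0.10, 0.09, 0.11]))   # -> "All Competitors Have, Client Does Not"
print(flag_term(0.30, [0.12, 0.08, 0.10, 0.09, 0.11]))  # -> "High Ratio"
```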

None of these flags by themselves have meaning, so you now need to put all of this into context. In our example, the following words probably have no significance at all: “get”, “you”, “top”, “see”, “we”, “all”, “but”, and other words of this type. These are just very basic English language words.

But, we can see other things of note relating to the target page (a.k.a. the client page):

  1. It’s missing any mention of actor Ryan Reynolds.
  2. It’s missing any mention of actor Helen Mirren.
  3. The page has no reviews.
  4. Words like “family” and “story” are not mentioned.
  5. “Austrian” and “Maria Altmann” are not used at all.
  6. The phrase “woman in gold” and the words “billing” and “info” are used proportionally more than they are on the other pages.

Note that the last item is only visible if you open the spreadsheet. The issues above could well be significant: the lead actors, reviews, and story details are the kinds of signals that indicate a page has in-depth content. We can see that the competing pages that rank include details of the story, so that’s an indication that this is what Google (and users) are looking for. The fact that the main key phrase and the word “billing” are used to a proportionally high degree also makes the page seem a bit spammy.

If you look at the information closely, you can see that the target page is quite thin in overall content, so much so that it almost looks like a doorway page. It appears to have been put together by the movie studio itself, just not very well, as it presents little in the way of a home page experience that would cause it to rank for the name of the movie!

In the many different times I have done an analysis using these methods, I’ve been able to make many different types of observations about pages. A few of the more interesting ones include:

  1. A page that had no privacy policy, yet was taking personally identifiable info from users.
  2. A major lack of important synonyms that would indicate a real depth of available content.
  3. Comparatively low Domain Authority competitors ranking with in-depth content.

These types of observations are interesting and valuable, but it’s important to stress that you shouldn’t be overly mechanical about this. The value in this type of analysis is that it gives you a technical way to compare the content on your page with that of your competitors. This type of analysis should be used in combination with other methods that you use for evaluating that same page. I’ll address this some more in the summary section below.

How do you execute this for yourself?

The full spreadsheet contains all the formulas, so all you need to do is link in the keyword count data. I have tried this with two different keyword density tools: the one from Searchmetrics, and this one from motoricerca.info.

I am not endorsing these tools, and I have no financial interest in either one—they just seemed to work fairly well for the process I outlined above. To provide the data in the right format, please do the following:

  1. Run all the URLs you are testing through the keyword density tool.
  2. Copy and paste all the one word, two word, and three word results into a tab on the spreadsheet.
  3. Sort them all so you get total word counts aligned by position as I have shown in the linked spreadsheet.
  4. Set up the formulas as I did in the demo spreadsheet (you can just use the demo spreadsheet).
  5. Then do your analysis! (A scripted version of steps 2-4 is sketched below.)
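
If you would rather script steps 2 through 4 than copy and paste by hand, here is a rough sketch using pandas. The CSV file names and the “keyword”/“count” column names are hypothetical; adjust them to whatever your keyword density tool actually exports.

```python
import pandas as pd

# Each CSV is assumed to have "keyword" and "count" columns exported from a
# keyword density tool; the file names and column names are hypothetical.
pages = {
    "client": "client_counts.csv",
    "competitor_1": "competitor_1_counts.csv",
    "competitor_2": "competitor_2_counts.csv",
}

frames = []
for label, path in pages.items():
    df = pd.read_csv(path)
    df = df.groupby("keyword", as_index=False)["count"].sum()  # collapse any duplicate rows
    df["tf"] = df["count"] / df["count"].max()                  # method 1 dampening
    frames.append(df.set_index("keyword")["tf"].rename(label))

# An outer join aligns every keyword across all pages; terms a page never uses become 0.
comparison = pd.concat(frames, axis=1).fillna(0).sort_values("client", ascending=False)
comparison.to_csv("tf_comparison.csv")
```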

This may sound a bit tedious (and it is), but it has worked very well for us at STC.

Summary

Usability groups and a number of other methods can also help you figure out what users are really looking for on your site. What this analysis adds is a look at what Google has chosen to rank highest in its search results. Don’t treat it as some sort of magic formula where you mechanically tweak the content to get better metrics in this analysis.

Instead, use this as a method for slicing into your content to better see it the way a machine might see it. It can yield some surprising (and wonderful) insights!



Understanding and Applying Moz’s Spam Score Metric – Whiteboard Friday

Posted by randfish

This week, Moz released a new feature that we call Spam Score, which helps you analyze your link profile and weed out the spam (check out the blog post for more info). There have been some fantastic conversations about how it works and how it should (and shouldn’t) be used, and we wanted to clarify a few things to help you all make the best use of the tool.

In today’s Whiteboard Friday, Rand offers more detail on how the score is calculated, just what those spam flags are, and how we hope you’ll benefit from using it.

For reference, here’s a still of this week’s whiteboard.

Video transcription

Howdy Moz fans, and welcome to another edition of Whiteboard Friday. This week, we’re going to chat a little bit about Moz’s Spam Score. Now I don’t typically like to do Whiteboard Fridays specifically about a Moz project, especially when it’s something that’s in our toolset. But I’m making an exception because there have been so many questions and so much discussion around Spam Score and because I hope the methodology, the way we calculate things, the look at correlation and causation, when it comes to web spam, can be useful for everyone in the Moz community and everyone in the SEO community in addition to being helpful for understanding this specific tool and metric.

The 17-flag scoring system

I want to start by describing the 17-flag system. As you might know, Spam Score is shown as a score from 0 to 17. You either fire a flag or you don’t. You can see a list of those 17 flags in the blog post. Essentially, the count of flags a site fires, not which specific flags they are, correlates to the percentage of sites with that count that we found penalized or banned by Google. I’ll show you a little bit more in the methodology.

Basically, what this means is that for sites that had 0 spam flags, none of the 17 flags that we had fired, 99.5% of those sites were not penalized or banned, on average, in our analysis, and 0.5% were. At 3 flags, 4.2% of those sites were penalized or banned; that’s actually still a huge number, probably in the millions of domains or subdomains that Google has potentially still banned. All the way down here with 11 flags, 87.3% of those sites we did find penalized or banned. That seems pretty risky. But 12.7% is still a very big number, again probably in the hundreds of thousands of unique websites that are not banned but still have these flags.

If you’re looking at a specific subdomain and you’re saying, “Hey, gosh, this only has 3 flags or 4 flags on it, but it’s clearly been penalized by Google, Moz’s score must be wrong,” no, that’s pretty comfortable. That should fit right into those kinds of numbers. Same thing down here. If you see a site that is not penalized but has a number of flags, that’s potentially an indication that you’re in that percentage of sites that we found not to be penalized.

So this is an indication of percentile risk, not a “this is absolutely spam” or “this is absolutely not spam.” The only caveat is anything with, I think, more than 13 flags, we found 100% of those to have been penalized or banned. Maybe you’ll find an odd outlier or two. Probably you won’t.

Correlation ≠ causation

Correlation is not causation. This is something we repeat all the time here at Moz and in the SEO community. We do a lot of correlation studies around these things. I think people understand those very well in the fields of social media and in marketing in general. Certainly in psychology and electoral voting and election polling results, people understand those correlations. But for some reason in SEO we sometimes get hung up on this.

I want to be clear. Spam flags and the count of spam flags correlates with sites we saw Google penalize. That doesn’t mean that any of the flags or combinations of flags actually cause the penalty. It could be that the things that are flags are not actually connected to the reasons Google might penalize something at all. Those could be totally disconnected.

We are not trying to say with the 17 flags these are causes for concern or you need to fix these. We are merely saying this feature existed on this website when we crawled it, or it had this feature, maybe it still has this feature. Therefore, we saw this count of these features that correlates to this percentile number, so we’re giving you that number. That’s all that the score intends to say. That’s all it’s trying to show. It’s trying to be very transparent about that. It’s not trying to say you need to fix these.

A lot of flags and features that are measured are perfectly fine things to have on a website, like no social accounts or email links. That’s a totally reasonable thing to have, but it is a flag because we saw it correlate. A number in your domain name, I think it’s fine if you want to have a number in your domain name. There’s plenty of good domains that have a numerical character in them. That’s cool.

TLD extension that happens to be used by lots of spammers, like a .info or a .cc or a number of other ones, that’s also totally reasonable. Just because lots of spammers happen to use those TLD extensions doesn’t mean you are necessarily spam because you use one.

Or low link diversity. Maybe you’re a relatively new site. Maybe your niche is very small, so the number of folks who point to your site tends to be small, and lots of the sites that organically naturally link to you editorially happen to link to you from many of their pages, and there’s not a ton of them. That will lead to low link diversity, which is a flag, but it isn’t always necessarily a bad thing. It might still nudge you to try and get some more links because that will probably help you, but that doesn’t mean you are spammy. It just means you fired a flag that correlated with a spam percentile.

The methodology we use

The methodology that we use, for those who are curious — and I do think this is a methodology that might be interesting to potentially apply in other places — is we brainstormed a large list of potential flags, a huge number. We cut that down to the ones we could actually do, because there were some that were just unfeasible for our technology team, our engineering team to do.

Then, we got a huge list, many hundreds of thousands of sites that were penalized or banned. When we say banned or penalized, what we mean is they didn’t rank on page one for either their own domain name or their own brand name, the thing between the www and the .com or .net or .info or whatever it was. If you didn’t rank for either your full domain name (www and the .com) or your brand name (like Moz), we said, “Hey, you’re penalized or banned.”
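
As a rough illustration of that definition, here is a sketch of the penalized-or-banned check. The fetch_top_results function is a placeholder, not a real Moz or Google API; you would wire it to whatever SERP or rank-tracking data source you have.

```python
def looks_penalized(domain, brand_name, fetch_top_results):
    """Rough sketch of the check Rand describes: a site counts as penalized or
    banned if it doesn't rank on page one for its own domain name or brand name.

    `fetch_top_results(query)` is a hypothetical helper that should return the
    top 10 result URLs for a query.
    """
    for query in (domain, brand_name):
        top10 = fetch_top_results(query)
        if any(domain in url for url in top10):
            return False  # ranks on page one for at least one of the two queries
    return True
```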

Now you might say, “Hey, Rand, there are probably some sites that don’t rank on page one for their own brand name or their own domain name, but aren’t actually penalized or banned.” I agree. That’s a very small number. Statistically speaking, it probably is not going to be impactful on this data set. Therefore, we didn’t have to control for that. We ended up not controlling for that.

Then we found which of the features that we ideated and brainstormed actually correlated with the penalties and bans, and we created the 17 flags that you see in the product today. There are lots of things that I thought were going to correlate, for example spammy-looking anchor text or poison keywords on the page, like Viagra, Cialis, Texas Hold’em online, pornography. Not all of them turned out to correlate well, and so they didn’t make it into the 17-flag list. I hope over time we’ll add more flags. That’s how things worked out.

How to apply the Spam Score metric

When you’re applying Spam Score, I think there are a few important things to think about. Just like Domain Authority, or Page Authority, or a metric from Majestic, or a metric from Google, or any other kind of metric that you might come up with, you should add it to your toolbox and to your metrics where you find it useful. I think playing around with Spam Score and experimenting with it is a great thing. If you don’t find it useful, just ignore it. It doesn’t actually hurt your website. It’s not like this information goes to Google or anything like that. They have way more sophisticated stuff to figure out things on their end.

Do not just disavow everything with seven or more flags, or eight or more flags, or nine or more flags. I think that we use the color coding to indicate 0% to 10% of these flag counts were penalized or banned, 10% to 50% were penalized or banned, or 50% or above were penalized or banned. That’s why you see the green, orange, red. But you should use the count and line that up with the percentile. We do show that inside the tool as well.
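
Here is a tiny sketch of that color banding, using only the 10% and 50% cutoffs Rand describes; the two example values come from the flag counts mentioned earlier in the transcript.

```python
def spam_score_color(penalized_pct):
    """Map the observed penalization percentage for a flag count to the
    green/orange/red bands described above (0-10%, 10-50%, 50%+)."""
    if penalized_pct < 10:
        return "green"
    if penalized_pct < 50:
        return "orange"
    return "red"

print(spam_score_color(4.2))   # 3 flags in the study -> green
print(spam_score_color(87.3))  # 11 flags in the study -> red
```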

Don’t just take everything and disavow it all. That can get you into serious trouble. Remember what happened with Cyrus. Cyrus Shepard, Moz’s head of content and SEO, disavowed all the backlinks to his site. It took more than a year for him to rank for anything again. Google almost treated it like he was banned, not completely, but they seriously took away all of his link power and didn’t let him back in, even though he changed the disavow file and all that.

Be very careful submitting disavow files. You can hurt yourself tremendously. The reason we offer it in disavow format is because many of the folks in our customer testing said that’s how they wanted it, so they could copy and paste, easily review it, and put it into their already existing disavow file. But you should not submit it blindly. You’ll see a bunch of warnings if you try and generate a disavow file. You even have to edit your disavow file before you can submit it to Google, because we want to be that careful that you don’t just go and submit it.

You should also keep Spam Score’s accuracy in perspective. If you’re doing spam investigation, you’re probably looking at spammier sites. If you’re looking at a random hundred sites, you should expect that the flags would correlate with the percentages. If I look at a random hundred sites with 4 spam flags, I would expect 7.5% of those, on average, to be penalized or banned. If you’re seeing sites that don’t fit those numbers, they probably just fall into the other part of the percentile: up here the ones that were penalized, down here the ones that weren’t, that kind of thing.

Hopefully, you find Spam Score useful and interesting and you add it to your toolbox. We would love to hear from you on iterations and ideas that you’ve got for what we can do in the future, where else you’d like to see it, and where you’re finding it useful/not useful. That would be great.

Hopefully, you’ve enjoyed this edition of Whiteboard Friday and will join us again next week. Thanks so much. Take care.

Video transcription by Speechpad.com

ADDITION FROM RAND: I also urge folks to check out Marie Haynes’ excellent Start-to-Finish Guide to Using Google’s Disavow Tool. We’re going to update the feature to link to that as well.



Unraveling Panda Patterns

Posted by billslawski

This is my first official blog post at Moz.com, and I’m going to be requesting your help and expertise and imagination.

I’m going to be asking you to take over as Panda for a little while to see if you can identify the kinds of things that Google’s Navneet Panda addressed when faced with what looked like an incomplete patent created to identify sites as parked domain pages, content farm pages, and link farm pages. You’re probably better at this now than he was then.

You’re a subject matter expert.

To put things in perspective, I’m going to include some information about what appears to be the very first Panda patent, and some of Google’s effort behind what they were calling the “high-quality site algorithm.”

I’m going to then include some of the patterns they describe in the patent to identify lower-quality pages, and then describe some of the features I personally would suggest to score and rank a higher-quality site of one type.

Google’s Amit Singhal identified a number of questions about higher-quality sites that he might use, and told us, in the blog post where he listed those questions, that it was an incomplete list because they didn’t want to make it easy for people to abuse their algorithm.

In my opinion though, any discussion about improving the quality of webpages is one worth having, because it can help improve the quality of the Web for everyone, which Google should be happy to see anyway.

Warning searchers about low-quality content

In “Processing web pages based on content quality,” the original patent filing for Panda, there’s a somewhat mysterious statement that makes it sound as if Google might warn searchers before sending them to a low quality search result, and give them a choice whether or not they might actually click through to such a page.

As it notes, the types of low quality pages the patent was supposed to address included parked domain pages, content farm pages, and link farm pages (yes, link farm pages):

“The processor 260 is configured to receive from a client device (e.g., 110), a request for a web page (e.g., 206). The processor 260 is configured to determine the content quality of the requested web page based on whether the requested web page is a parked web page, a content farm web page, or a link farm web page.

Based on the content quality of the requested web page, the processor is configured to provide for display, a graphical component (e.g., a warning prompt). That is, the processor 260 is configured to provide for display a graphical component (e.g., a warning prompt) if the content quality of the requested web page is at or below a certain threshold.

The graphical component provided for display by the processor 260 includes options to proceed to the requested web page or to proceed to one or more alternate web pages relevant to the request for the web page (e.g., 206). The graphical component may also provide an option to stop proceeding to the requested web page.

The processor 260 is further configured to receive an indication of a selection of an option from the graphical component to proceed to the requested web page, or to proceed to an alternate web page. The processor 260 is further configured to provide for display, based on the received indication, the requested web page or the alternate web page.”

This did not sound like a good idea.

Recently, Google announced in a post on the Google Webmaster Central blog, Promoting modern websites for modern devices in Google search results, that they would start providing warning notices on mobile versions of sites if there were issues on those pages that visitors might go to.

I imagine that as a site owner, you might be disappointed to see such a warning notice shown to searchers about technology used on your site possibly not working correctly on a specific device. That recent blog post mentions Flash as an example of a technology that might not work correctly on some devices. For example, we know that Apple’s mobile devices and Flash don’t work well together.

That’s not a bad warning in that it provides enough information to act upon and fix to the benefit of a lot of potential visitors. 🙂

But imagine if you tried to visit your website in 2011, and instead of getting to the site, you received a Google warning that the page you were trying to visit was a content farm page or a link farm page, and it provided alternative pages to visit as well.

That “your website sucks” warning still doesn’t sound like a good idea. One of the inventors listed on the patent is described on LinkedIn as presently working on the Google Play store. The warning for mobile devices might have been something he brought to Google from his work on this Panda patent.

We know that when the Panda update was released, it was targeting specific types of pages that people at places such as The New York Times were complaining about, such as parked domains and content farm sites. A follow-up from the Times after the algorithm update was released puts it into perspective for us.

It wasn’t easy to know that your pages might have been targeted by that particular Google update either, or if your site was a false positive—and many site owners ended up posting in the Google Help forums after a Google search engineer invited them to post there if they believed that they were targeted by the update when they shouldn’t have been.

The wording of that invitation is interesting in light of the original name of the Panda algorithm. (Note that the thread was broken into multiple threads when Google did a migration of posts to new software, and many appear to have disappeared at some point.)

As we were told in the invite from the Google search engineer:

“According to our metrics, this update improves overall search quality. However, we are interested in hearing feedback from site owners and the community as we continue to refine our algorithms. If you know of a high-quality site that has been negatively affected by this change, please bring it to our attention in this thread.

Note that as this is an algorithmic change we are unable to make manual exceptions, but in cases of high quality content we can pass the examples along to the engineers who will look at them as they work on future iterations and improvements to the algorithm.

So even if you don’t see us responding, know that we’re doing a lot of listening.”

The timing for such in-SERP warnings might have been troublesome. A site that mysteriously stops appearing in search results for queries that it used to rank well for might be said to have gone astray of Google’s guidelines. Instead, such a warning might be a little like the purposefully embarrassing “Scarlet A” in Nathaniel Hawthorne’s novel The Scarlet Letter.

A page that shows up in search results with a warning to searchers stating that it was a content farm, or a link farm, or a parked domain probably shouldn’t be ranking well to begin with. Having Google continuing to display those results ranking highly, showing both a link and a warning to those pages, and then diverting searchers to alternative pages might have been more than those site owners could handle. Keep in mind that the fates of those businesses are usually tied to such detoured traffic.

I can imagine lawsuits being filed against Google based upon such tantalizing warnings, rather than site owners filling up a Google Webmaster Help forum with information about the circumstances involving their sites being impacted by the update.

In retrospect, it is probably a good idea that the warnings hinted at in the original Panda Patent were avoided.

Google seems to think that such warnings are appropriate now when it comes to multiple devices and technologies that may not work well together, like Flash and iPhones.

But there were still issues with how well or how poorly the algorithm described in the patent might work.

In the March 2011 interview with Google’s Head of Search Quality, Amit Singhal, and his team member and Head of Web Spam at Google, Matt Cutts, titled TED 2011: The “Panda” That Hates Farms: A Q&A With Google’s Top Search Engineers, we learned that Google’s internal code name for the algorithm update was “Panda,” after an engineer with that name came along and provided suggestions on patterns that could be used by the patent to identify high- and low-quality pages.

His input seems to have been pretty impactful—enough for Google to have changed the name of the update, from the “High Quality Site Algorithm” to the “Panda” update.

How the High-Quality Site Algorithm became Panda

Danny Sullivan named the update the “Farmer update” since it supposedly targeted content farm web sites. Soon afterwards the joint interview with Singhal and Cutts identified the Panda codename, and that’s what it’s been called ever since.

Google didn’t completely abandon the name found in the original patent, the “high quality sites algorithm”; it appears in the titles of several posts on Google’s own blog.

The most interesting of those is the “more guidance” post, in which Amit Singhal lists 23 questions about things Google might look for on a page to determine whether or not it was high-quality. I’ve spent a lot of time since then looking at those questions, thinking of features on a page that might convey quality.

The original patent is at:

Processing web pages based on content quality
Inventors: Brandon Bilinski and Stephen Kirkham
Assigned to: Google
US Patent 8,775,924
Granted: July 8, 2014
Filed: March 9, 2012

Abstract

“Computer-implemented methods of processing web pages based on content quality are provided. In one aspect, a method includes receiving a request for a web page.

The method includes determining the content quality of the requested web page based on whether it is a parked web page, a content farm web page, or a link farm web page. The method includes providing for display, based on the content quality of the requested web page, a graphical component providing options to proceed to the requested web page or to an alternate web page relevant to the request for the web page.

The method includes receiving an indication of a selection of an option from the graphical component to proceed to the requested web page or to an alternate web page. The method further includes providing, based on the received indication, the requested web page or an alternate web page.”

The patent expands on what are examples of low-quality web pages, including:

  • Parked web pages
  • Content farm web pages
  • Link farm web pages
  • Default pages
  • Pages that do not offer useful content, and/or pages that contain advertisements and little else

An invitation to crowdsource high-quality patterns

This is the section I mentioned above where I am asking for your help. You don’t have to publish your thoughts on how quality might be identified, but I’m going to start with some examples.

Under the patent, a content quality value score is calculated for every page on a website based upon patterns found on known low-quality pages, “such as parked web pages, content farm web pages, and/or link farm web pages.”

For each of the patterns identified on a page, the content quality value of the page might be reduced based upon the presence of that particular pattern—and each pattern might be weighted differently.
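
As a rough illustration (not the patent’s actual math), the scoring idea might look something like the sketch below; the pattern names, weights, and base score are made up for the example.

```python
def content_quality_value(page_features, pattern_weights, base_score=1.0):
    """Sketch of the scoring idea described in the patent: start from a base
    quality value and reduce it for each low-quality pattern present, with each
    pattern weighted differently.

    The weights and base score are illustrative assumptions; the patent does not
    disclose actual values.
    """
    score = base_score
    for pattern, weight in pattern_weights.items():
        if page_features.get(pattern):
            score -= weight
    return max(score, 0.0)

weights = {
    "references_parking_service": 0.6,
    "references_known_ad_network": 0.3,
    "mostly_hyperlinks": 0.2,
}
print(content_quality_value({"references_parking_service": True}, weights))  # 0.4
```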

Some simple patterns that might be applied to a low-quality web page might be one or more references to:

  • A known advertising network,
  • A web page parking service, and/or
  • A content farm provider

One of these references may be in the form of an IP address that the destination hostname resolves to, a Domain Name Server (“DNS server”) that the destination domain name is pointing to, an “a href” attribute on the destination page, and/or an “img src” attribute on the destination page.

That’s a pretty simple pattern, but a web page resolving to an IP address known to exclusively serve parked web pages provided by a particular Internet domain registrar can be deemed a parked web page, so it can be pretty effective.

A web page with a DNS server known to be associated with web pages that contain little or no content other than advertisements may very well provide little or no content other than advertising. So that one can be effective, too.
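
A minimal sketch of that resolution check is below. The parking-service IP addresses are placeholders from the documentation range; a real list would have to be curated, and a production check would also look at the DNS servers involved.

```python
import socket

# Placeholder addresses; in practice you would maintain a list of IPs and DNS
# servers known to serve only parked or ad-filled pages.
KNOWN_PARKING_IPS = {"203.0.113.10", "203.0.113.11"}

def resolves_to_parking_ip(hostname):
    """Check whether a hostname resolves to an IP associated with a parking service."""
    try:
        ip = socket.gethostbyname(hostname)
    except socket.gaierror:
        return False  # no resolution at all; a different pattern would catch this
    return ip in KNOWN_PARKING_IPS
```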

Some of the patterns listed in the patent don’t seem quite as useful or informative. For example, one states that a web page containing a common typographical error of a bona fide domain name is likely a low-quality web page, or a non-existent web page. I’ve seen more than a couple of legitimate sites with common misspellings of good domains, so I’m not too sure how helpful a pattern that is.

Of course, some textual content is a dead giveaway the patent tells us, with terms on them such as “domain is for sale,” “buy this domain,” and/or “this page is parked.”

Likewise, a web page with little or no content is probably (but not always) a low-quality web page.

This is a simple but effective pattern, even if not too imaginative:

… page providing 99% hyperlinks and 1% plain text is more likely to be a low-quality web page than a web page providing 50% hyperlinks and 50% plain text.
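
Here is a small sketch of that hyperlink-to-text ratio, assuming the page HTML is available as a string and using BeautifulSoup; the character-based ratio is a simplification of whatever the patent actually measures.

```python
from bs4 import BeautifulSoup

def link_text_ratio(html):
    """Share of a page's visible text that sits inside hyperlinks, per the
    99%-links-vs-50%-links pattern quoted above."""
    soup = BeautifulSoup(html, "html.parser")
    all_text = soup.get_text(" ", strip=True)
    link_text = " ".join(a.get_text(" ", strip=True) for a in soup.find_all("a"))
    if not all_text:
        return 1.0  # an empty page is treated as all links, no content
    return len(link_text) / len(all_text)

html = "<p>Cheap deals</p>" + "".join(f'<a href="/x{i}">offer {i}</a>' for i in range(50))
print(f"{link_text_ratio(html):.0%} of the text is link text")
```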

Another pattern is one that I often check on and address in site audits, and it involves how functional and responsive the pages on a site are.

The determination of whether a web site is fully functional may be based on an HTTP response code, information received from a DNS server (e.g., hostname records), and/or a lack of a response within a certain amount of time. As an example, an HTTP response that is anything other than 200 (e.g., “404 Not Found”) would indicate that a web site is not fully functional.

As another example, a DNS server that does not return authoritative records for a hostname would indicate that the web site is not fully functional. Similarly, a lack of a response within a certain amount of time, from the IP address of the hostname for a web site would indicate that the web site is not fully functional.
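
A hedged sketch of those functionality checks, using the requests library and a DNS lookup; the timeout value and the 200-only rule are simplifications of what the patent describes.

```python
import socket
import requests

def is_fully_functional(url, hostname, timeout=5):
    """Rough version of the functionality checks in the patent: DNS must resolve,
    the page must respond within a time limit, and the HTTP status must be 200."""
    try:
        socket.gethostbyname(hostname)                 # DNS check
        response = requests.get(url, timeout=timeout)  # response-time check
    except (socket.gaierror, requests.RequestException):
        return False
    return response.status_code == 200                 # anything else (404, 500, ...) fails
```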

As for user-data, sometimes it might play a role as well, as the patent tells us:

“A web page may be suggested for review and/or its content quality value may be adapted based on the amount of time spent on that page.

For example, if a user reaches a web page and then leaves immediately, the brief nature of the visit may cause the content quality value of that page to be reviewed and/or reduced. The amount of time spent on a particular web page may be determined through a variety of approaches. For example, web requests for web pages may be used to determine the amount of time spent on a particular web page.”
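
As a rough illustration of that last point, here is a sketch that estimates time on page from an ordered log of page requests within one session; the session format is hypothetical, and the last page in a session gets no estimate because there is no following request to measure against.

```python
def time_on_page(session_requests):
    """Estimate seconds spent on each page from an ordered list of
    (timestamp_in_seconds, url) request records for one session."""
    durations = {}
    for (t1, url), (t2, _) in zip(session_requests, session_requests[1:]):
        durations.setdefault(url, []).append(t2 - t1)
    return {url: sum(times) / len(times) for url, times in durations.items()}

session = [(0, "/landing"), (3, "/pricing"), (95, "/checkout")]
print(time_on_page(session))  # {'/landing': 3.0, '/pricing': 92.0}
```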

My example of some patterns for an e-commerce website

There are a lot of things that you might want to include on an ecommerce site that help to indicate that it’s high quality. If you look at the questions that Amit Singhal raised in the last Google blog post I mentioned above, one of his questions was “Would you be comfortable giving your credit card information to this site?” Patterns that might fit with this question could include the following (a few are sketched in code after the list):

  • Is there a privacy policy linked to on pages of the site?
  • Is there a “terms of service” page linked to on pages of the site?
  • Is there a “customer service” page or section linked to on pages of the site?
  • Do ordering forms function fully on the site? Do they return 404 pages or 500 server errors?
  • If an order is made, does a thank-you or acknowledgement page show up?
  • Does the site use an https protocol when sending data or personally identifiable data (like a credit card number)?
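
A few of these checks are easy to script. The sketch below looks for the trust-related links and the https protocol mentioned in the list; the link-text matching is deliberately naive, and the function is illustrative rather than a complete audit.

```python
import requests
from bs4 import BeautifulSoup

TRUST_LINK_TEXTS = ("privacy policy", "terms of service", "customer service")

def trust_signals(url):
    """Check a page for a few of the 'would you trust this site with a credit
    card?' patterns listed above."""
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    link_texts = [a.get_text(" ", strip=True).lower() for a in soup.find_all("a")]

    signals = {"uses_https": response.url.startswith("https://")}
    for needle in TRUST_LINK_TEXTS:
        signals[needle.replace(" ", "_") + "_link"] = any(needle in text for text in link_texts)
    return signals
```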

As I mentioned above, the patent tells us that a high-quality content score for a page might be different from one pattern to another.

The questions from Amit Singhal imply a lot of other patterns, but as SEOs who work on and build and improve a lot of websites, this is an area where we probably have more expertise than Google’s search engineers.

What other questions would you ask if you were tasked with looking at this original Panda patent? What patterns would you suggest looking for when trying to identify high- or low-quality pages? Perhaps if we share with one another the patterns or features that Google might look for algorithmically, we can build pages that won’t be interpreted by Google as low quality. I provided a few patterns for an ecommerce site above. What patterns would you suggest?

(Illustrations: Devin Holmes @DevinGoFish)

