Black Hat Web Design: Cloaking

Last updated: August 14, 2016 by: Shawn

Cloaking refers to specifically targeting GoogleBot/BingBot and delivering different content to the bot than to a user.

Google's official stance on cloaking:



That didn't really explain much, and I get why: they don't want you to know what's going on.
 
Let’s break it down:
 
If you already know how HTTP requests work, or don't care and just want to learn about cloaking, click HERE
 
When your web browser loads a webpage, it does so by sending what is called an "HTTP request". That request has several parts to it:

1. The Request-line

The Request-Line begins with a method, followed by the Request-URI and the protocol version.
 
Each parameter of the request line is separated by an SP (space) character, and the server knows it has reached the end of the line when it sees CR LF.
 
 

HTTP Request-Line format:

 
Method SP Request-URI SP HTTP-Version CRLF
 
The reason it ends with CR LF:
 
CR = US-ASCII CR, carriage return (13)
LF = US-ASCII LF, linefeed (10)
 
The request method basically tells the server what the browser wants to DO; think of the request method as the "verb" in the sentence. These are the accepted methods in HTTP/1.1:

GET - The GET method is used to retrieve information from the given server using a given URI.
HEAD - Same as GET, but it transfers the status line and the header section only.
POST - A POST request is used to send data to the server, for example from HTML forms.
PUT - Replaces all the current representations of the target resource with the uploaded content.
DELETE - Removes all the current representations of the target resource given by the URI.
CONNECT - Establishes a tunnel to the server identified by a given URI.
OPTIONS - Describes the communication options for the target resource.
TRACE - Performs a message loop-back test along the path to the target resource.
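
As an aside, when we get to the server-side PHP later in this post, the method the client used is sitting right there waiting for you. A tiny, purely illustrative sketch:

<?php
// The method from the request line (GET, POST, etc.)
// shows up in PHP's $_SERVER superglobal.
echo $_SERVER['REQUEST_METHOD'];
?>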

The Request-URI is the Uniform Resource Identifier, i.e. the location of the resource on the server you want to interact with. Users commonly refer to it as a URL because they are used to typing that kind of GET request directly into the browser bar. When you type "http://blackhatwebdesign.com" into the browser, you are actually using a GET request to make a TCP connection to port 80 and retrieve the specific resource you asked for.
 
It’s easier to say URL 😉

There are 3 main forms the Request-URI can take:

* - Means the request does not apply to a particular resource but to the server itself; it is only used with methods like OPTIONS.
ABSOLUTE URI - Used when the request is going to a proxy; this is what a CDN uses.
ABSOLUTE PATH - The catch-all: request a resource from the server, AKA load a webpage.

Note that the absolute path cannot be empty; if none is present in the original URI, it MUST be given as “/” (the server root).
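
To make those three forms concrete, here is roughly what each one looks like as a request line (illustrative examples, not pulled from a real capture; the /some-post/ path is made up):

OPTIONS * HTTP/1.1
GET http://blackhatwebdesign.com/some-post/ HTTP/1.1
GET /some-post/ HTTP/1.1

The first talks to the server itself, the second is the absolute-URI form you would send to a proxy, and the third is the everyday absolute-path form.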

2. The Request Headers

The request headers are the modifiers and additional information about the request, or about the client itself, sent to the server. If you have ever used Google Chrome's inspect feature and then set the "mobile bar" to a specific phone, you are changing the HTTP request headers. There are so many headers that I am not going to list them here (there will be a link to an exhaustive list at the end of this post), but what we are concerned with here is one in particular:
 
User-Agent
 
Thanks to sites like http://www.user-agents.org/ you can see the possible agents that a client can send to the server, or at least the common ones that we run into. This is super helpful when designing sites because you can "spoof" the user agent and see how the server reacts to the request. It's a great way to test your server security and web design, but that's another post.
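
If you want to play with that yourself, here is a quick and dirty sketch using PHP's cURL extension (assuming it's installed) that fetches a page while pretending to be Googlebot; the URL is just a placeholder:

<?php
// Hypothetical test script: request a page with a spoofed User-Agent
// and compare the response to what a normal browser gets.
$ch = curl_init('http://example.com/');   // swap in the site you want to test
curl_setopt($ch, CURLOPT_USERAGENT,
    'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body as a string
$html = curl_exec($ch);
curl_close($ch);
echo $html;
?>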

3. An empty line (i.e., a line with nothing preceding the CRLF)

Just an empty line (a bare CRLF) to mark the end of the headers before any message body.

4. Optional message-body

Too much for this post, but this allows the client to send the server the actual data for the request; think of it like the message body of an email. Loading a webpage only needs the "To:" and "Subject:" fields, but more complex requests, like submitting form data, use the body. For cloaking purposes, it's not important.


Great, we know how it works. Now what's the deal with cloaking already?


So as you can imagine, Googlebot has a specific user agent for each of the crawlers Google sends out; you can find out more about each one here: https://support.google.com/webmasters/answer/1061943?hl=en.

Now, there are 100% legitimate reasons to identify Googlebot on your page, and you may not want Googlebot to index certain pages for various reasons. For instance, there is a concept of "crawl budget" that refers to how much time Googlebot will spend on your site and how deep it will go. Having it spend that time crawling your category, archive, and calendar pages may be wasteful and not leave room for the posts you spent so much time on to get crawled.

You would control your crawl budget with a robots.txt file anyway, so crawl budget and blocking don't really have anything to do with cloaking.
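
For what it's worth, that blocking lives in robots.txt and looks something like this (the paths are made up; adjust them to your own site structure):

User-agent: Googlebot
Disallow: /category/
Disallow: /archive/
Disallow: /calendar/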

So, what does?

Simple. If you identify the user agent as a Google crawl bot and provide different content to that bot than to the user, you are risking some serious smackdown from the spam team.

Your User-Agent is:

 

This is what your client has sent to the server. You are reading this webpage because you sent an absolute-path request to the server, and because we are running Apache and WordPress, we have a .htaccess file that tells the server our permalink structure, which maps your URI to the database query that makes this post out of thin air. YaaaaY PHP.
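
If you're curious, that permalink magic is the standard WordPress rewrite block sitting in .htaccess, which funnels every pretty URL into index.php so PHP can work out which post you asked for:

# BEGIN WordPress
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>
# END WordPress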

Now, the user-agent display above is done with simple jQuery/JavaScript, but you would be really, really dumb to try and cloak with JavaScript, as it's crawlable and fully readable by Googlebot. So if you tried to identify the agent and then dynamically change the content on the page, even if you used AJAX and loaded it dynamically, it would still be there.

So you can in theory cloak your website for Googlebot with JavaScript, but you would be shooting yourself in the foot even more than the act of cloaking itself lol. Don't do it. I promise you, it will not end well for you.

If you are extrapolating some ideas about how to change content based on HTTP request headers, then good on you: you are thinking about personalization, not cloaking, which is the next level of website design and one I talk about a lot on this blog.
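
As a harmless example of that kind of personalization, you could tweak a greeting based on the Accept-Language header the client already sends; a rough sketch (the greetings and language list are obviously made up):

<?php
// Personalization, not cloaking: bots and humans get the same content,
// just a small tweak driven by a header every client sends anyway.
$lang = isset($_SERVER['HTTP_ACCEPT_LANGUAGE'])
    ? substr($_SERVER['HTTP_ACCEPT_LANGUAGE'], 0, 2)
    : 'en';
$greetings = array('en' => 'Welcome!', 'fr' => 'Bienvenue !', 'de' => 'Willkommen!');
echo isset($greetings[$lang]) ? $greetings[$lang] : $greetings['en'];
?>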

Now, back to cloaking: you can't do it client side or you will draw Google a map, so you have to do it server side, right? OK, so PHP to the rescue once again!

PHP has a built-in function for just this:
 
get_browser
http://php.net/manual/en/function.get-browser.php

 
 
So your PHP template would include:
 
 
<?php
// Echo the raw User-Agent string the client sent
echo $_SERVER['HTTP_USER_AGENT'] . "\n\n";
// get_browser() needs browscap configured in php.ini;
// it returns the browser's capabilities as an array
$browser = get_browser(null, true);
print_r($browser);
?>

You can see this in action here getuseragent.php

So you can see that you can identify and then change the content of the page server side depending on the user agent, which Google gives you a list of.
 
So in theory you could identify that the page was being crawled by Google, and then serve different content to that specific user agent.
 
That, in a nutshell, ladies and gentlemen, is cloaking: determining whether the user agent is Googlebot and then changing the content to suit it.
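
Mechanically, the whole "technique" boils down to a few lines of server-side code; here is an illustrative sketch (the template file names are hypothetical, and again, please don't actually do this):

<?php
// WARNING: this is exactly the pattern the webspam team is looking for.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (stripos($ua, 'Googlebot') !== false) {
    include 'page-for-googlebot.php';   // the over-optimized version the bot sees
} else {
    include 'page-for-humans.php';      // the version real visitors see
}
?>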
 
So you could have a page that has donkey porn on it, but Googlebot only sees "The Little Mermaid song lyrics". Obviously this extreme example will get reported and your page will be deindexed faster than you can say boo, but what about the non-extreme examples?

Can Google determine if you are cloaking by using an algorithm? Not if you are smart, no.
 
Should you provide Googlebot with different content than you provide the user? Not if you are smart, no.
 

See, the point is, we are not dealing with a perfectly logical computer; we are dealing with humans with expectations. If you don't meet those expectations, those humans are going to have an emotional response. If you are creating the best content and serving the user exactly what he wants, you are going to benefit more in the long run than trying to game Googlebot for specific keyword ratios and the other bullshit everyone wants to tell you.

There is a gray area here though, and one where you have to decide if it's for you: a LOT of less-than-perfect developers will deliver a "slightly" over-optimized version of the content to Googlebot. This is maybe a 5-7% higher ratio of LSI keywords than would normally be served to a human; to a human it would read like you are forcing keywords into the sentences and wouldn't feel natural, but to the robot reading it, it makes perfect sense.

I definitely don't recommend you try to outsmart Google on this one though; cloaking is the #1 thing the webspam team takes seriously.

As a side note, here is the HTTP request sent to load this page:

GET /black-hat-web-design-cloaking/ HTTP/1.1
Host: blackhatwebdesign.com
Connection: keep-alive
Cache-Control: max-age=0
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8
Cookie: cookie data here

Hope you learned something. If you have any questions, please hit me up!