Search engines are powerful tools, yet it is surprisingly simple to provide direction to search engines and ensure that content is shown (and hidden) appropriately in search results. Likewise, it is surprisingly easy to make mistakes and instruct a search engine to show (or hide) content unintentionally. Understanding the basic tools that search engines provide to content authors for guiding search robots enables you to have important content appear in search results, and prevents you from making costly mistakes.

Indexing Methods

Search engines perform three basic actions when discovering new content to present in search results:

Crawl
Page content is read by the search engine.
Follow
In-page links are traversed by the search engine.
Index
The page is recorded and may appear in search results.

Search engines provide content authors with tools to control behavior when new content is discovered: the robots.txt file and the robots tag.

The robots tag controls indexing and following, while the robots.txt file controls crawling (evidence of which can be seen in the Google search results: “A description for this result is not available because of this site’s robots.txt”).

Screenshot of Google Search results exhibiting content blocked by robots.txt

There are a number of directives (described by Google’s robots.txt specification as “guidelines for a crawler or group of crawlers”) that allow content authors to prevent search engines from crawling, following and indexing content. By default, there are no restrictions for crawling; pages are “treated as crawlable [and] indexable… unless permission is specifically denied”.

The primary directives for restricting search engine behavior are:

disallow
Do not crawl (i.e. access or read content from) the specified paths.
noindex
Do not index this page (i.e. do not show it in search results).
nofollow
Do not follow (i.e. attempt to crawl) links on the page.
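
As a brief illustration (the /private/ path is hypothetical), the disallow directive is declared in robots.txt, while noindex and nofollow are declared via the robots tag:

# robots.txt: tell all crawlers not to crawl anything under /private/
User-agent: *
Disallow: /private/

<!-- robots meta tag: do not index this page, do not follow its links -->
<meta name="robots" content="noindex, nofollow">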

Crawl Directives in robots.txt

The robots.txt file is a UTF-8-encoded plain text file located in the top-level directory of the host (domain).

Note: Crawlers will not check for robots.txt files in subdirectories.

In standard UNIX fashion, comments can be included at any location in the file using the “#” character; all remaining characters on that line are treated as a comment and are ignored.

Interestingly, “a maximum file size may be enforced per crawler. Content which exceeds the maximum file size may be ignored. Google currently enforces a size limit of 500 kilobytes”.

The robots.txt file is broken up into sections (called “groups”), each of which begins with, and is delimited by, a user-agent record.

Only one group of group-member records is valid for a particular crawler… All other groups of records are ignored by the crawler… The order of the groups within the robots.txt file is irrelevant.

Google’s robots.txt specification: Order of precedence for user-agents
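
For example, a file with two groups might look like the following sketch (the paths and the choice of Googlebot are purely illustrative); each group begins with a User-agent record, and “#” comments may appear anywhere:

# Group 1: applies only to Googlebot
User-agent: Googlebot
Disallow: /drafts/    # do not crawl draft pages

# Group 2: applies to every other crawler
User-agent: *
Disallow: /search/
Disallow: /tmp/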

Non-human Readable Files

Ensure that search engines can crawl CSS and JavaScript files.

Disallowing crawling of JavaScript or CSS files in your site’s robots.txt directly harms how well our algorithms render and index your content and can result in suboptimal rankings.

Updating our technical Webmaster Guidelines

If resources like JavaScript or CSS in separate files are blocked (say, with robots.txt) so that Googlebot can’t retrieve them, our indexing systems won’t be able to see your site like an average user. We recommend allowing Googlebot to retrieve JavaScript and CSS so that your content can be indexed better. This is especially important for mobile websites, where external resources like CSS and JavaScript help our algorithms understand that the pages are optimized for mobile.

Understanding web pages better
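
In practice, this means making sure no disallow rule covers CSS or JavaScript files. If a broad rule (such as the hypothetical /assets/ path below) would otherwise block them, Google’s support for Allow and wildcards can carve out an exception; a sketch:

User-agent: *
Disallow: /assets/        # blocks an assets directory...
Allow: /assets/*.css      # ...but the longer, more specific Allow rules let crawlers
Allow: /assets/*.js       # fetch the CSS and JavaScript needed for rendering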

Robots Tag

The robots tag may be defined either directly in the HTML as a meta tag, or in the HTTP X-Robots-Tag header.

<meta name="robots" content="noindex, nofollow">
X-Robots-Tag: noindex, nofollow

The quantity and order of robots directives do not matter; if multiple directives exist, the most restrictive directive will be used:

If competing directives are encountered by our crawlers we will use the most restrictive directive we find.

Robots meta tag and X-Robots-Tag HTTP header specifications
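
For example, if a page were (perhaps through a template mistake) to emit both of the tags below, the more restrictive noindex wins and the page is excluded from the index:

<!-- competing directives: noindex is more restrictive, so it is applied -->
<meta name="robots" content="index, follow">
<meta name="robots" content="noindex">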

Noindex a Section of a Site

Using the X-Robots-Tag HTTP header, it is possible to prevent search engines from indexing sections of a site, and/or particular file types:

# Prevent indexing of all pages with `/category/` in the URL
RewriteEngine On
# Flag matching requests with an environment variable...
RewriteCond %{THE_REQUEST} /category/
RewriteRule ^ - [E=NOINDEX:true]
# ...and send the noindex header when that variable is set
<IfModule mod_headers.c>
    Header set X-Robots-Tag "noindex" env=NOINDEX
</IfModule>

# Prevent indexing of image files (.png, .jpeg, .jpg, .gif)
<Files ~ "\.(png|jpe?g|gif)$">
    Header set X-Robots-Tag "noindex"
</Files>
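
To confirm the header is actually being sent, you can request an affected URL and inspect the response headers; a quick check, assuming curl is available (the URL is hypothetical):

curl -sI https://example.com/category/widgets/ | grep -i x-robots-tag
# expected output: X-Robots-Tag: noindex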

De-indexing Catch 22

Even when the robots.txt is set to disallow, and the header of every page includes a noindex, nofollow directive, pages which have already been indexed may continue to appear in the index – “stuck” in perpetuity. This happens because the robots.txt file, when set to disallow, instructs the crawler not to crawl (i.e. read the contents of) the page, so the noindex directive is never read and the search engine does not know to remove the pages from the index. As Google Search Console’s Remove URLs tool instructs, the best way to de-index the site is to allow the search engine to crawl the site.

Screenshot of Google Search Console's Remove URLs tool

To block a page from appearing in Google Search results permanently… add a NOINDEX tag to the page and allow it to be crawled by Googlebot.

Solution:

  1. Allow search engines to crawl the pages by removing the disallow rules from robots.txt (see the sketch below)
  2. Add a noindex HTTP header or meta tag to the pages
  3. Claim the domain (in Google Search Console) and submit the URLs for removal
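
A sketch of steps 1 and 2, reusing the /category/ example from earlier (paths are illustrative): the disallow rule is removed from robots.txt so crawlers can reach the pages again, while each affected response carries a noindex directive.

# robots.txt: the old rule is removed (or commented out)
User-agent: *
# Disallow: /category/

# each affected page now responds with:
X-Robots-Tag: noindex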

NOTE: To remove an entire domain from Google Search results, enter a single forward slash (“/”) into the prompt in Google Search Console’s Remove URLs tool.

Screenshot of Google Search Console's Remove URLs prompt

Once you’ve successfully removed the URLs from Google Search results, the SERP should appear empty.

Screenshot of empty Google Search results