When launching a website on a new platform or when changing the information architecture of a website, it is usually necessary to configure redirects to ensure users can locate the content they expect even when accessing the website via legacy URLs. These redirects also instruct search engines to reindex content at a new location and transfer page ranking.

Different Types of Redirects

Redirects are HTTP responses whose status code instructs the browser to request a different URL. HTTP status codes fall into several classes, and redirect statuses follow the pattern 3XX. The primary redirect status codes are 301 (permanent redirect) and 302 (temporary redirect). Two other statuses, 307 (temporary redirect) and 308 (permanent redirect), additionally specify that the user agent must not change the HTTP method used in the original request.
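As a rough illustration of the four statuses discussed above, the sketch below maps a status code to a short description. The function name describe_redirect is hypothetical, and other 3XX codes (such as 303) are deliberately left out.

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarize the four redirect statuses discussed above.
# "method may change" means a user agent is allowed to turn a POST into a GET.
describe_redirect() {
  case "$1" in
    301) echo "permanent, method may change" ;;
    302) echo "temporary, method may change" ;;
    307) echo "temporary, method preserved" ;;
    308) echo "permanent, method preserved" ;;
    *)   echo "not covered here"; return 1 ;;
  esac
}
```

For example, describe_redirect 308 prints "permanent, method preserved".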

Permanent Redirects

The 301 (and 308) status code instructs the browser that the content of a resource has been moved permanently to a new location. In other words, the browser is told that it should no longer request the original location, instead using the new location for all future requests.

When a search engine robot encounters a permanent redirect, it replaces the indexed URL with the new URL and transfers a percentage of PageRank to the new URL.

When a web browser encounters a permanent redirect it caches the redirect and will not attempt to request the original location again, instead pointing to the new location, until the cache is cleared.

It is possible to unset a cached permanent redirect by redirecting back to the original URL from the destination URL. This only works when you control the destination, so be cautious when performing permanent redirects to external targets: it will be impossible to unset the redirect from a site you do not control.

Temporary Redirects

The 302 (and 307) status code instructs the browser that the content of a resource has been moved temporarily to a new location. In other words, because this redirect may be altered occasionally, the browser should continue to request the original location for future requests.

Interestingly, temporary redirects receive some perhaps unexpected treatment by search engines: “for on-domain [temporary] redirects… search engines will usually [index] the shorter URL” because “normal users usually like short, clean URLs.”

It is a common misconception that temporary redirects do not transfer PageRank; they do. It has also been suggested that 302 redirects transfer less PageRank than 301 redirects, which may be true.

Keep in mind that search engines are constantly updating their algorithms to deliver the best results. More than anything, they want webmasters to build websites that provide a great user experience and valuable content. Scheming to acquire more “link juice” is not a valid argument for using an improper redirect; in fact, Google has taken steps in the past to reduce visibility into PageRank to “avoid confusing users and webmasters about the significance of the metric.” As Matt Cutts says, “use whatever is best for your purposes.”

Locating URLs

Before writing a redirect you must know three things:

  1. The source URL
  2. The destination (or target) URL
  3. The redirect status

Acquiring source URLs

The easiest way to acquire a list of known URLs is to locate a website’s sitemap(s). The following script searches for sitemap declarations in the robots.txt file and, if none are found, tests for popular sitemap filenames.

##
# Locate sitemap path for a given domain
#
# @param [url] Domain name to search
##
function find_sitemap() {
  local sitemap output name
  # Get sitemap entries from a `robots.txt` file
  # (strip carriage returns in case the file uses CRLF line endings)
  sitemap=$(curl -sfL "$1/robots.txt" | tr -d '\r' | awk 'tolower($1) ~ /^sitemap:/ {print $2}')
  if [[ -z $sitemap ]]; then
    # Try common variations of the sitemap filename
    for name in sitemap site_map site-map; do
      output=$(curl -sLIw "%{http_code} %{url_effective}" "$1/$name.xml" -o /dev/null)
      if [[ ${output%% *} == 200 ]]; then
        sitemap=${output#* }
        break
      fi
    done
    if [[ -z $sitemap ]]; then
      echo "No sitemap found" >&2
      return 1
    fi
  fi
  echo "$sitemap"
}

find_sitemap "$@"

Alternatively, if the site has a Google Webmaster Tools (now Google Search Console) account configured, you may check the Sitemaps section to see if any sitemaps are known to Google.

If none of the above methods have located a valid sitemap, you may need to resort to searching Google for possible sitemaps. The search queries below may assist in locating a sitemap:

site:example.com filetype:xml inurl:sitemap

site:example.com filetype:xml

If you still cannot locate a valid sitemap file, you may need to build a list based on search engine indexes.

Extracting URLs from a Sitemap

In addition to a list of URLs, sitemaps include other information that is unnecessary for the purposes of constructing redirects. To extract only the URLs from a sitemap, use the following script:

##
# Parse URLs out of a `sitemap.xml` file
# This will follow nested sitemap URLs
#
# @param [string] File path or URL of a sitemap
##
function parse_sitemap() {
  # Keep variables local so recursive calls do not overwrite each other
  local net= http_code xmlns sitemaps sitemap
  # Check if sitemap is local or remote
  if [[ "$1" =~ ^https?:// ]]; then
    http_code=$(curl -sLIw "%{http_code}" "$1" -o /dev/null)
    if [[ $http_code != 200 ]]; then
      return 1
    fi
    net='--net'
  fi
  # Must match the XML namespace to work correctly
  xmlns='http://www.sitemaps.org/schemas/sitemap/0.9'
  # Get all URLs - remove blank line if no URLs are found
  # ($net is intentionally unquoted so it disappears when empty)
  xml sel $net -N "x=${xmlns}" -t -v '//x:url/x:loc' -n "$1" | sed '/^[[:space:]]*$/d'
  # Recursively parse embedded sitemaps
  sitemaps=$(xml sel $net -N "x=${xmlns}" -t -v '//x:sitemap/x:loc' -n "$1" | sed '/^[[:space:]]*$/d')
  if [[ -n "$sitemaps" ]]; then
    while read -r sitemap; do
      parse_sitemap "$sitemap"
    done <<< "$sitemaps"
  fi
}

parse_sitemap "$@"

Note: This script requires XMLStarlet, and will descend recursively into nested sitemaps. Depending on how XMLStarlet was installed, the command may be named xmlstarlet rather than xml.
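Sitemaps list absolute URLs, while the redirect directives discussed later match against the path portion of a URL. A minimal sketch for reducing a list of URLs to unique paths might look like the following. The function name urls_to_paths is hypothetical, and the sketch assumes one URL per line on standard input.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reduce absolute URLs (one per line on stdin)
# to a sorted, de-duplicated list of paths
urls_to_paths() {
  # Strip the protocol and host; a bare domain becomes "/"
  sed -E 's%^https?://[^/]*%%; s%^$%/%' | sort -u
}
```

The output of the sitemap script above could be piped into this function to produce the list of source paths.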

Writing Redirects

Once a list of source URLs is obtained, it’s time to begin mapping to the destination URL(s).

Note: This section describes the different modules and directives that may be used to author redirects in Apache; be aware of your server type and version.

There are two Apache modules that may be used to construct redirects: mod_alias and mod_rewrite.

These modules provide three directives that can be used for redirection: mod_alias provides Redirect and RedirectMatch, while mod_rewrite provides RewriteRule. As noted by the Apache documentation for mod_alias:

mod_alias is designed to handle simple URL manipulation tasks. For more complicated tasks … use the tools provided by mod_rewrite.

The following rules may be applied when selecting the appropriate directive for a redirect:

  Redirect: Simple one-to-one redirect mapping.
  RedirectMatch: Intermediate redirect mapping supporting regular expressions.
  RewriteRule: Advanced redirect mapping supporting regular expressions, conditionals and access to the query string.

Apache has also published helpful documentation on when not to use mod_rewrite, which echoes the above:

[S]imple redirection of one URL, or a class of URLs, to somewhere else, should be accomplished using [the Redirect and RedirectMatch] directives rather than RewriteRule. RedirectMatch allows you to include a regular expression in your redirection criteria, providing many of the benefits of using RewriteRule.

The Redirect directive

The Redirect directive is extremely simple; it maps one URL to another:

Redirect 301 "/source" "/destination"

Note: In the example above (and with the other directives mentioned below), the redirect’s HTTP status code argument may be omitted. If no status code is defined, the redirect will be a 302 temporary redirect.

A few keywords (permanent, temp, seeother, and gone) may be used in place of a numeric status code, which may improve readability. These keywords are limited, however; numeric status codes offer more flexibility and allow for greater consistency when a number of different redirect statuses are defined.

To improve clarity, always include a numeric status code with all redirect directives.
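When there are many one-to-one mappings, the directives can be generated rather than typed by hand. The sketch below is one possible approach, not a prescribed workflow: it assumes a hypothetical input of "source,destination" lines on standard input, with no commas or double quotes in either value, and the function name generate_redirects is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn "source,destination" lines on stdin
# into Redirect directives with an explicit numeric status code.
# The optional first argument is the status code (defaults to 301).
generate_redirects() {
  awk -F, -v status="${1:-301}" '
    # Skip blank lines and lines missing either field
    NF >= 2 && $1 != "" && $2 != "" {
      printf "Redirect %s \"%s\" \"%s\"\n", status, $1, $2
    }'
}
```

For example, piping the line /old,/new into generate_redirects 301 prints Redirect 301 "/old" "/new".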

The RedirectMatch directive

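The RedirectMatch directive expands the utility of Redirect while maintaining its elegant simplicity. Any valid Redirect directive is also syntactically valid as a RedirectMatch, although the source is then interpreted as a regular expression rather than a URL path prefix.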

RedirectMatch 301 "/source" "/destination"

The real power of RedirectMatch derives from its ability to interpret regular expressions (commonly referred to as regex). To master RedirectMatch (or redirection in general), learn regular expressions. Understanding even the basics of regex can be extremely valuable when redirecting a large number of URLs.

# Remove an extension from all URLs
# (anchored so that only a trailing ".html" matches)
RedirectMatch 301 "^(.*)\.html$" "$1"

# Move/rename a category
RedirectMatch 301 "^/old-category/(.*)$" "/new-category/$1"

# Redirect subdirectory to subdomain
RedirectMatch 301 "^/(blog)/(.*)$" "http://$1.example.com/$2"

This illustrates the power of using RedirectMatch with regular expressions.

Nevertheless, there is still much that RedirectMatch cannot accomplish.
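Before deploying a regular expression, it can help to preview what it would do to a sample path. The sketch below approximates this with sed. It is only an approximation under stated assumptions: Apache uses PCRE while sed -E uses POSIX extended regular expressions, so the pattern must be valid in both; the pattern should be anchored with ^ and $; and neither argument may contain "%" or "&". The function name preview_redirect_match is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: preview a RedirectMatch-style rule against one path.
# Usage: preview_redirect_match PATTERN TARGET PATH
# Prints the resulting target, or nothing if the pattern does not match.
preview_redirect_match() {
  local pattern="$1" target="$2" path="$3"
  # Convert Apache-style $1..$9 backreferences to sed-style \1..\9
  target=$(printf '%s' "$target" | sed -E 's/\$([0-9])/\\\1/g')
  # Print the rewritten path only when the pattern matches
  printf '%s\n' "$path" | sed -nE "s%${pattern}%${target}%p"
}
```

For example, calling it with the pattern '^/old-category/(.*)$', the target '/new-category/$1', and the path '/old-category/shoes' prints /new-category/shoes, while a path that does not match prints nothing.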

The RewriteRule directive

The RewriteRule directive uses a more complex syntax to accomplish redirection. This added complexity affords much more flexibility and power.

RewriteRule "^/?source$" "/destination" [R=301,L]

Tip: Some servers may not have mod_rewrite enabled. To avoid 500 errors, wrap mod_rewrite directives in an IfModule section. Also, ensure that the RewriteEngine directive has been set to “On”; otherwise, the RewriteRule directives will have no effect.

<IfModule mod_rewrite.c>
  RewriteEngine On
  # …
</IfModule>

The power of RewriteRule lies in its ability to be combined with conditional statements (RewriteCond), and access server variables, environment variables, HTTP headers, and time stamps.

# Redirect insecure, non-www to secure, www
RewriteCond %{HTTPS} !on [OR]
RewriteCond %{HTTP_HOST} !^www\. [NC]
RewriteCond %{HTTP_HOST} ^(?:www\.)?(.*) [NC]
RewriteRule ^/?(.*)$ https://www.%1/$1 [R=301,L]

# Collapse multiple slashes
# Note: "$0" in the substitution string refers to
# the entire text matched by the pattern,
# regardless of grouping
RewriteCond %{THE_REQUEST} //
RewriteRule .* /$0 [R=301,NE,L]

# Parse data from a query string
# and remove the query string from the URL
# Note: In Apache 2.4 or later, the "QSD" option (qsdiscard)
# can be used to remove the query string, as opposed to
# appending a "?" to the substitution string
RewriteCond %{QUERY_STRING} (?:^|&)page=([^&]+)
RewriteRule ^ /%1? [R=301,L]

# Redirect certain devices by user agent
RewriteCond %{HTTP_USER_AGENT} "i(phone|pad|pod)" [NC]
RewriteRule ^ http://m.example.com/ [R=302,L]

# Serve pre-compressed assets
RewriteCond %{HTTP:accept-encoding} \b(gz)ip\b
RewriteCond %{REQUEST_FILENAME} !\.gz$
RewriteCond %{REQUEST_FILENAME}.%1 -f [OR]
RewriteCond %{REQUEST_FILENAME}.%1 -l
RewriteRule (.*) $1.%1 [QSA,L]

Note: Be aware that directives from different modules are not necessarily processed in the order they appear, because each module hooks into request processing at its own point.

As the Apache documentation notes:

The use of RewriteRule … may be appropriate if there are other RewriteRule directives in the same scope. [W]hen there are Redirect and RewriteRule directives in the same scope, the RewriteRule directives will run first, regardless of the order of appearance in the configuration file.

Placing a Redirect directive before a RewriteRule directive does not mean it will be executed first. Because the hooks registered by mod_rewrite run before those of mod_alias, RewriteRule directives are executed before Redirect and RedirectMatch directives.

Testing Redirects

While writing redirects, it is helpful to use a command line tool for testing. This avoids browser caching.

Because a percentage of PageRank “dissipates” with each redirect, and because each redirect contributes to a slower time to first byte, the total number of redirects between the source URL and destination URL (redirect “hops”) should be kept to a minimum. Whenever possible, each redirect should be accomplished in only one hop. The following script will display all the redirect hops of a single URL:

##
# Display all redirects (and their status code) between
# a given URL and the "effective URL"
#
# @param [url] The URL to crawl
##
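function redirects() {
  local url="$1" data http_code hops=0
  # Stop when there is no further redirect,
  # or after 20 hops to avoid looping forever
  while [[ -n $url && $hops -lt 20 ]]; do
    data=$(curl -Iso /dev/null -w '%{http_code} %{redirect_url}' "$url") || return 1
    http_code=${data%% *}
    echo "$http_code $url"
    # Empty when the response is not a redirect
    url=${data#* }
    hops=$((hops + 1))
  done
}

redirects "$@"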

While the script above is helpful in a micro-sense – allowing developers to test individual URLs – the script below is more helpful in the macro-sense – allowing developers to test a large number of URLs at once.

Given a file containing a list of URLs to test, this script will display each URL alongside the effective URL and its status code in CSV format.

##
# Test URLs from a list
# Prints HTTP status code, redirect hops, source URL, and effective URL
# in CSV format
#
# @param [file] List of URLs to test
# @param [string] Base URL to test URLs against
##
function test_urls() {
  local base_url="$2"
  if [ -n "$base_url" ]; then
    # Remove protocol and domain name and
    # replace with new base URL
    sed -E "s%^https?://[^/]*%%" "${1:-/dev/stdin}" | xargs -I {} \
      curl -o /dev/null --head -sLw \
      "%{http_code},%{num_redirects},${base_url}{},%{url_effective}\\n" "${base_url}{}"
  else
    # Read the list from the file, or from stdin if no file is given
    xargs -I {} \
      curl -o /dev/null --head -sLw \
      "%{http_code},%{num_redirects},{},%{url_effective}\\n" "{}" < "${1:-/dev/stdin}"
  fi
}

test_urls "$@"

Tip: When analyzing the data generated by the script above, filter content by the status codes. Unsuccessful requests (≠ 200) – particularly 404 – indicate an error which may need to be addressed with redirects.
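The filtering described in the tip can be sketched with awk. The example below assumes the four-column CSV printed by the script above (status code, redirect hops, source URL, effective URL) arrives on standard input. The function name flag_problem_urls and the FAILED and MULTIHOP labels are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read the CSV produced above on stdin and print
# rows that did not end in a 200, or that needed more than one hop
flag_problem_urls() {
  awk -F, '
    # Final status is not 200: likely needs a redirect or a fix
    $1 != 200 { print "FAILED," $0; next }
    # Succeeded, but took more than one redirect hop
    $2 > 1    { print "MULTIHOP," $0 }'
}
```

Rows that succeed in zero or one hop produce no output, so an empty result suggests the tested redirects behave as intended.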