The problem
URL (also known as web address) is an abbreviation of Uniform Resource Locator. It is a string that constitutes a reference to a web resource.
The following URLs may lead to the same web page:
http://www.example.com
http://example.com
http://www.example.com/index.php
http://www.example.com/index.php?goback=somewhere
This is a bad practice. Each reference to the same resource (web page) must use an identical URL.
If more than one URL leads to the same page, you may:
- Create duplicate entries in the index of Google (or other search engines), increasing the chance of split page rank
- Lose Facebook likes, Tweets and similar social media rankings
- Lose Disqus (or similar services) comments
This situation may affect the functionality of any service which uses the URL to identify a web page (resource). That’s why URL Consistency is so important.
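To make the effect concrete, here is a minimal Python sketch of a service that counts likes using the raw URL string as the key. The `liked_urls` data and the `likes` counter are hypothetical, invented for illustration; no real service's API is implied.

```python
from collections import Counter

# Hypothetical "like" events, each recorded under the URL the visitor
# happened to use. All four URLs show the same page.
liked_urls = [
    "http://www.example.com",
    "http://example.com",
    "http://www.example.com/index.php",
    "http://www.example.com/index.php?goback=somewhere",
]

# A service that identifies pages by URL string sees four separate pages.
likes = Counter(liked_urls)

for url, count in likes.items():
    print(count, url)
```

Instead of one page with four likes, the counter holds four entries with one like each.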
The solution
It is a complex problem and possible solutions vary case by case. Here are some available solutions:
- Use a single Canonical HostName
- Strip unwanted query strings from incoming URLs
Use a single Canonical HostName
Most websites respond to both hostnames, with or without www. That is fine. However, it is recommended to redirect the www hostname to the non-www one, or the opposite. Which one should you choose? There are arguments for each choice. See http://no-www.org and http://www.yes-www.org.
- redirect non-WWW to WWW: google, bing, baidu, qq, amazon, alexa, youtube, wikipedia, blogger, reddit, mozilla, facebook, linkedin, stumbleupon, microsoft, apple, tumblr, paypal, bbc
- redirect WWW to non-WWW: twitter, wordpress, vimeo, github, jquery, sourceforge, pinterest, instagram, delicious
You can choose whichever you prefer, but you have to use it consistently.
I prefer the non-WWW to WWW redirection. Here is how non-WWW is redirected to WWW using Apache configuration files (in Debian):
In addition to the main configuration file, which looks like:
<VirtualHost 95.211.47.207:80>
ServerName www.pontikis.net
DocumentRoot /var/www/pontikis.net
</VirtualHost>
another configuration file is created:
nano /etc/apache2/sites-available/pontikis.net
with the following content:
<VirtualHost 95.211.47.207:80>
ServerName pontikis.net
Redirect permanent / http://www.pontikis.net/
</VirtualHost>
If you don’t want to change the Apache configuration directly, you may use mod_rewrite. In order to redirect non-WWW to WWW, create an .htaccess file in the site's document root with the following content:
RewriteEngine On
RewriteCond %{HTTP_HOST} !^www\.example\.com [NC]
RewriteCond %{HTTP_HOST} !^$
RewriteRule ^/?(.*) http://www.example.com/$1 [L,R=301,NE]
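The decision these rewrite rules make can be sketched in Python. This is an illustration of the logic only, not how Apache implements it; the function name `canonical_redirect` and the constant `CANONICAL_HOST` are hypothetical.

```python
import re

# Assumed canonical hostname, taken from the example above.
CANONICAL_HOST = "www.example.com"


def canonical_redirect(host, path):
    """Return the redirect target, or None if no redirect is needed."""
    # Requests without a Host header are left alone (the "!^$" condition).
    if host == "":
        return None
    # Host already starts with the canonical name, ignoring case ([NC]).
    if re.match(re.escape(CANONICAL_HOST), host, re.IGNORECASE):
        return None
    # Drop at most one leading slash, like the "^/?(.*)" pattern does.
    rest = path[1:] if path.startswith("/") else path
    return "http://" + CANONICAL_HOST + "/" + rest


print(canonical_redirect("example.com", "/index.php"))
print(canonical_redirect("www.example.com", "/index.php"))
```

The first call returns the www URL to redirect to; the second returns None because the request already uses the canonical hostname.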
Strip unwanted query strings from incoming URLs
A query string is the part of a URL that contains data to be passed to web applications such as CGI programs. For example:
http://www.example.com/?id=1&category=sales
The part of the URL after the question mark (“?”) is the Query String (id=1&category=sales).
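As a quick illustration, Python's standard `urllib.parse` module can split the example URL and decode its query string. This is only a sketch for inspecting URLs; it is not part of the Apache setup described here.

```python
from urllib.parse import urlsplit, parse_qs

url = "http://www.example.com/?id=1&category=sales"

# Split the URL into scheme, host, path, query and fragment.
parts = urlsplit(url)
print(parts.query)            # id=1&category=sales

# Decode the query string into a dict of parameter name -> list of values.
print(parse_qs(parts.query))  # {'id': ['1'], 'category': ['sales']}
```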
Some websites (among them some great services such as Linkedin and Feedburner) add query strings to incoming URLs of your website for tracking purposes (like “?goback=” etc). In most cases these query strings are considered “unwanted” and can be stripped.
Here is a solution for Apache web server. Similar solutions are available for Microsoft IIS and NGINX web servers.
I use an .htaccess file in the site's document root with the following content:
RewriteEngine On
RewriteCond %{QUERY_STRING} !=""
RewriteCond %{REQUEST_URI} !^/search.*
RewriteCond %{REQUEST_URI} !^/wiki.*
RewriteCond %{REQUEST_URI} !^/bbs.*
RewriteCond %{REQUEST_URI} !^/admin.*
RewriteRule ^(.*)$ /$1? [R=301,L]
- Line 2: If query string exists
- Line 3-6: Exclude directories search, wiki, bbs, admin
- Line 7: Remove query string
Using RewriteCond you can exclude any QUERY_STRING or REQUEST_URI, according to your needs. Of course, you should never strip query strings that your own website uses.
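The same decision can be sketched in Python, which may help when testing which URLs such a rule set would redirect. This is an illustration of the logic under stated assumptions, not a replacement for the .htaccess file; the names `strip_query` and `EXCLUDED_PREFIXES` are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

# Path prefixes whose query strings must be kept, matching the
# excluded directories in the rules above.
EXCLUDED_PREFIXES = ("/search", "/wiki", "/bbs", "/admin")


def strip_query(url):
    """Return the URL to redirect to, or None if the request is left alone."""
    parts = urlsplit(url)
    # No query string: nothing to strip.
    if parts.query == "":
        return None
    # Excluded areas keep their query strings.
    if parts.path.startswith(EXCLUDED_PREFIXES):
        return None
    # Rebuild the URL without query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


print(strip_query("http://www.example.com/index.php?goback=somewhere"))
print(strip_query("http://www.example.com/search?q=url"))
```

The first call yields the clean URL that a 301 redirect would point to; the second returns None because /search is excluded.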
WARNING: There are no universally valid solutions. You should read the Apache mod_rewrite documentation carefully and create an .htaccess file according to your own environment.
For example, WordPress users might need the following lines:
RewriteCond %{QUERY_STRING} !^p=.*
RewriteCond %{REQUEST_URI} !^/wp-admin.*
- Line 1: Allow post permalinks
- Line 2: Exclude admin directory
Use simple Feedburner URLs
If you use Feedburner to track detailed statistics for your feed, the URLs pointing to your website contain query strings such as utm_source and utm_medium. A Feedburner URL looks like http://feedproxy.google.com/~r/YourFeedName/…
To make Feedburner URLs identical to your site URLs, navigate to the Analyze tab, click on Configure Stats and deselect the checkbox for Item link clicks.
Share your experience with other web servers (e.g. Microsoft IIS). Do you prefer WWW or non-WWW? Leave us a comment.
Entrepreneur | Full-stack developer | Founder of MediSign Ltd. I have over 15 years of professional experience designing and developing web applications. I am also very experienced in managing (web) projects.