What is an XML Sitemap
Almost all information you need to know about Sitemaps is available at http://www.sitemaps.org.
From this site:
Sitemaps are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling. In its simplest form, a Sitemap is an XML file that lists URLs for a site along with additional metadata about each URL (when it was last updated, how often it usually changes, and how important it is, relative to other URLs in the site) so that search engines can more intelligently crawl the site.
A typical Sitemap (usually written with a capital S, also known as a Google Sitemap) is an XML file with the following structure:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.example.com/catalog?item=74&amp;desc=vacation_newfoundland</loc>
<lastmod>2004-12-23T18:00:15+00:00</lastmod>
</url>
</urlset>
Of course, the <url></url> section is repeated for each URL you want to include in the Sitemap file.
This example is a static XML file. A dynamic site (based on a database) has to create a dynamic Sitemap (generated and updated by a script, not manually).
After creating a Sitemap, you must submit it to search engines (see below). Sitemaps are very important for Search Engine Optimization (SEO): creating and maintaining an accurate Sitemap is essential for proper indexing of your website pages. More here from Google.
Simple rules:
- Generally, the Sitemap file should be named “sitemap.xml” and should be located in the website root folder.
- If possible, use gzip to compress your Sitemaps.
- A Sitemap file must contain no more than 50,000 URLs and must be no larger than 10MB (10,485,760 bytes), whether compressed or not. If your site exceeds these limits, you can use Sitemap index files (a group of Sitemaps).
- In the <url></url> section, only the <loc></loc> (location) tag is required. The <lastmod></lastmod> tag is recommended but not required. Other optional tags are <changefreq></changefreq> and <priority></priority>. Read more here.
- Sitemap.xml must be a UTF-8 encoded file. All values (between tags) must be entity escaped.
- Date or datetime values inside <lastmod></lastmod> must be in W3C Datetime format, for example YYYY-MM-DDThh:mm:ss+02:00 (where T is the delimiter between date and time and +02:00 is the UTC offset). Including the time is optional; a simple YYYY-MM-DD value is also valid. (A short PHP sketch illustrating escaping, the datetime format and gzip compression follows this list.)
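As a quick illustration of the escaping, datetime and compression rules above, here is a minimal PHP sketch (the URL and the output file name are just examples):

<?php
// A URL containing a character (&) which must be entity escaped
$url = 'http://www.example.com/catalog?item=74&desc=vacation_newfoundland';
$loc = htmlspecialchars($url, ENT_QUOTES, 'UTF-8'); // & becomes &amp;

// <lastmod> in W3C Datetime format (a date-only value is also valid)
$lastmod = gmdate('Y-m-d\TH:i:s') . '+00:00';

$xml  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
$xml .= '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
$xml .= "<url><loc>$loc</loc><lastmod>$lastmod</lastmod></url>" . "\n";
$xml .= '</urlset>';

// Optionally compress the Sitemap with gzip
file_put_contents('sitemap.xml.gz', gzencode($xml));
?>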
When you need to create a custom Sitemap
If you use the popular WordPress platform or any similar software, there are plugins available to create the XML Sitemap for you.
So, when do you need to create a custom Sitemap? When your blogging platform does not support Sitemap creation, or when your site is more than a typical blog and contains additional sections. This is my case, and I will describe it in this post.
pontikis.net consists of:
- the Blog, based on a custom blogging engine
- the Labs section, with open source software demo and tutorials
- Forums, based on FluxBB
So, I use partial Sitemap files, one for each site section. The main sitemap.xml file is actually a sitemapindex with the following structure:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>http://www.pontikis.net/sitemap-main.xml</loc>
</sitemap>
<sitemap>
<loc>http://www.pontikis.net/blog/sitemap.php</loc>
</sitemap>
<sitemap>
<loc>http://www.pontikis.net/labs/sitemap-labs.xml</loc>
</sitemap>
<sitemap>
<loc>http://www.pontikis.net/bbs/sitemap.php</loc>
</sitemap>
</sitemapindex>
The partial Sitemaps are:
- sitemap-main.xml is a static sitemap (manually updated) containing the basic pages (home, about etc)
- blog/sitemap.php is a dynamic sitemap (see below) containing the blog home, the blog archive and of course the blog posts (from the database)
- labs/sitemap-labs.xml is a static sitemap (manually updated) containing the lab pages
- bbs/sitemap.php is a dynamic sitemap created by a FluxBB plugin.
So, let’s see how to create a dynamic Sitemap of blog posts which are stored in a database:
The code
<?php
header('Content-type: application/xml');

require_once '../common/settings.php'; // database settings
require_once PROJECT_PATH . '/lib/php_adodb_v5.18/adodb.inc.php';
require_once PROJECT_PATH . '/lib/small_blog_v0.8.0/smallblog.php'; // custom blogging engine
require_once PROJECT_PATH . '/lib/utils/utils.php'; // utility functions: date_decode, now

// configuration
$url_prefix = 'http://www.pontikis.net/blog/';
$blog_timezone = 'UTC';
$timezone_offset = '+00:00';
$W3C_datetime_format_php = 'Y-m-d\TH:i:s'; // See http://www.w3.org/TR/NOTE-datetime
$null_sitemap = '<urlset><url><loc></loc></url></urlset>'; // fallback output on error

$blog = new smallblog(); // custom blogging engine

$res = $blog->db_connect($blog_db_settings);
if($res === false) {
    echo $null_sitemap;
    exit; // Database connection error...
} else {
    // get all posts meta-data
    $posts = $blog->getPosts(0, 0, '', '', '', now($blog_timezone));
    if($posts === false) {
        echo $null_sitemap;
        exit; // Error retrieving posts...
    }
    $len = count($posts);
    for($i = 0; $i < $len; $i++) {
        // entity escape URL according to http://www.sitemaps.org/protocol.html#escaping
        $posts[$i]['url'] = $url_prefix . htmlspecialchars($posts[$i]['url']);
        // convert dates to W3C datetime format http://www.sitemaps.org/protocol.html#xmlTagDefinitions
        $posts[$i]['date_updated'] = date_decode($posts[$i]['date_updated'], $blog_timezone, $W3C_datetime_format_php) . $timezone_offset;
    }
    // the most recent post date (posts are returned newest first); fall back to now if there are no posts
    $max_date = $len > 0 ? $posts[0]['date_updated'] : gmdate($W3C_datetime_format_php) . $timezone_offset;
}

$output = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
$output .= '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
echo $output;
?>
<url>
<loc>http://www.pontikis.net/blog/</loc>
<lastmod><?php print $max_date ?></lastmod>
<changefreq>daily</changefreq>
</url>
<url>
<loc>http://www.pontikis.net/blog/archive/</loc>
<lastmod><?php print $max_date ?></lastmod>
<changefreq>daily</changefreq>
</url>
<?php for($i = 0; $i < $len; $i++) { ?>
<url>
<loc><?php print $posts[$i]['url'] ?></loc>
<lastmod><?php print $posts[$i]['date_updated'] ?></lastmod>
</url>
<?php } ?>
</urlset>
Code explanation
Inform the browser that an XML file will be created
The header('Content-type: application/xml'); call informs the browser (and the search engine crawler) that an XML document follows.
Connect to the database and get an array of post URLs and publish dates
In this example a custom blogging engine and PHP ADOdb are used to connect to the database and fetch the posts. But the same result (the $posts array) can be obtained with plain MySQL statements.
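For example, assuming a hypothetical MySQL table posts with columns url and date_updated (this is not the actual smallblog schema), plain PDO code could build the same array:

<?php
// Hypothetical credentials and schema, for illustration only
$pdo = new PDO('mysql:host=localhost;dbname=blog;charset=utf8', 'db_user', 'db_password');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Fetch published posts, newest first (the script uses the first row for $max_date)
$sql = 'SELECT url, date_updated FROM posts
        WHERE date_published <= UTC_TIMESTAMP()
        ORDER BY date_updated DESC';
$posts = $pdo->query($sql)->fetchAll(PDO::FETCH_ASSOC);
?>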
Convert URLs and dates according to the Sitemap protocol specifications
URLs are escaped using htmlspecialchars (see the specification). Dates are converted to W3C datetime format; in this example the function date_decode is used (see this post for details), but any PHP code could do the job. In my case publish dates are stored in UTC ($timezone_offset = '+00:00').
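For instance, assuming date_updated is stored as a UTC datetime string (e.g. '2012-03-15 08:30:00'), plain DateTime code could replace date_decode:

<?php
$dt = new DateTime($posts[$i]['date_updated'], new DateTimeZone('UTC'));
$posts[$i]['date_updated'] = $dt->format('Y-m-d\TH:i:s') . '+00:00';
// or simply $dt->format(DATE_W3C), which appends the UTC offset itself
?>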
Output the XML Sitemap
The blog home URL and the blog archive URL are output first, using the date of the most recent post ($max_date) as <lastmod> and a daily <changefreq>. Finally, the script iterates over the $posts array and outputs the rest of the Sitemap.
Sitemap validation
It is recommended to validate the Sitemap file before submitting it to search engines. Many online tools are available (you can also validate locally with PHP, as shown after this list):
- http://www.w3.org/2001/03/webdata/xsv
- http://validator.w3.org/#validate_by_uri+with_options
- http://www.xmlcheck.com/
- http://www.validome.org/google/validate (recommended)
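If you prefer to validate locally, a minimal PHP sketch using DOMDocument and the official XSD schema (assuming the schema URL is reachable from your server) could be:

<?php
$dom = new DOMDocument();
$dom->load('sitemap.xml');
if($dom->schemaValidate('http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd')) {
    echo 'Sitemap is valid';
} else {
    echo 'Sitemap is NOT valid'; // libxml warnings describe the errors
}
?>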
How to submit Sitemap to search engines
Once you have created the Sitemap file, you need to inform the search engines that support this protocol. You can do this by:
- using the search engine’s submission interface (known as Webmaster Tools):
- Google: Google Webmaster Tools
- Bing: Bing Webmaster Tools
- Yandex: Yandex Webmaster Tools
- Baidu: Baidu Webmaster Tools (no English interface yet; you will need a translation service such as Google Translate)
- sending an HTTP request (see more)
- specifying the location in your site’s robots.txt file (strongly recommended)
To do this, include a statement like:
Sitemap: http://www.yoursite.com/sitemap.xml
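For example, a minimal robots.txt could look like this (the domain is a placeholder):

User-agent: *
Disallow:

Sitemap: http://www.yoursite.com/sitemap.xml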
How to re-submit Sitemap when content changes
Theoretically, you have to re-submit a Sitemap when site content changes, either manually or automatically, using scripts to “ping” the search engine.
But once a Sitemap is submitted, search engines will regularly come back and reload it looking for new URLs, whether you re-submit it or not. This is even more likely if the Sitemap location is specified in the robots.txt file, as search engines look at robots.txt first.
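As a sketch of such an automated “ping” (these endpoints were documented by Google and Bing at the time of writing; check the current documentation before relying on them):

<?php
// Notify search engines that the Sitemap has been updated
$sitemap_url = urlencode('http://www.yoursite.com/sitemap.xml');

$ping_urls = array(
    'http://www.google.com/webmasters/tools/ping?sitemap=' . $sitemap_url,
    'http://www.bing.com/ping?sitemap=' . $sitemap_url
);

foreach($ping_urls as $ping_url) {
    // a simple GET request is enough; an HTTP 200 response means the ping was accepted
    file_get_contents($ping_url);
}
?>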
In conclusion:
- submit the Sitemap using the search engines’ interface the first time
- specify the Sitemap location in robots.txt
- optionally, re-submit the Sitemap at infrequent intervals (for example by pinging, as above)