Sitemaps in the robots.txt: Happy Harvesting
by Harun Yayli on Wednesday, April 11th, 2007 at 11:40 pm under Ask, Google, Hacks, Ideas, Internet, Security, Sitemaps, Web Standards, Yahoo
I’ve just read the post on the Google Webmaster blog announcing that Ask.com now supports the Sitemaps.org sitemap format.
This is really great news for everyone who wants to be crawled faster and more accurately.
For me, the more interesting part of the news is sitemaps.org’s proposal to reference sitemaps from robots.txt.
You simply add a line to your robots.txt saying:
Sitemap: <sitemap_location>
This part is really cool, but for site harvesters it’s an unbelievable tool. You are handing over the key to your site: web harvesters can crawl it really easily, because you’ve probably put all of your site’s pages into your sitemap.
Sounds like a good plan in an ideal world. With all the cloakers and content scrapers out there, you have to be really smart not to be ripped apart.
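To see how little effort this takes on the harvester’s side, here is a rough sketch that pulls every Sitemap line out of a robots.txt body. The function name extractSitemapUrls and the example.com URL are made up for illustration, and the parsing is deliberately simplistic; it is not a full robots.txt parser.

```php
<?php
// Hypothetical helper: return every URL declared on a "Sitemap:" line
// of a robots.txt body. Assumes one directive per line.
function extractSitemapUrls(string $robotsTxt): array
{
    $urls = [];
    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        // Drop trailing "# comments", then match the directive case-insensitively.
        $line = trim(preg_replace('/#.*$/', '', $line));
        if (preg_match('/^sitemap\s*:\s*(\S+)/i', $line, $m)) {
            $urls[] = $m[1];
        }
    }
    return $urls;
}

// Sample robots.txt content (made-up site).
$sample = "User-agent: *\nDisallow: /private/\nSitemap: http://www.example.com/sitemap.xml\n";
print_r(extractSitemapUrls($sample));
```

A dozen lines is all it takes to get from your robots.txt to the full list of your pages, which is exactly why the sitemap itself deserves some protection.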
My suggestion is to know who you’re serving the sitemap to. Currently Google, Yahoo and Ask support the sitemap format, and no other site has any business fetching it.
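Here is a simple check you can add at the beginning of your sitemap script:
<?php
// Return true only if $ip belongs to a whitelisted crawler, verified by a
// reverse DNS lookup followed by a confirming forward lookup.
function botIsAllowed($ip){
    // Get the reverse DNS of the IP. gethostbyaddr() returns false on
    // malformed input and the unchanged IP when no PTR record exists.
    $host = gethostbyaddr($ip);
    if ($host === false || $host === $ip){
        return false;
    }
    $host = strtolower($host);
    // Whitelisted crawler domains; the leading dot prevents partial matches.
    $botDomains = array('.inktomisearch.com', // Yahoo
                        '.googlebot.com',     // Google
                        '.ask.com',           // Ask
                        );
    // Check whether the reverse DNS name ends with a whitelisted domain.
    foreach($botDomains as $bot){
        if (substr($host, -strlen($bot)) === $bot){
            // The forward lookup must resolve back to the same IP.
            $qip = gethostbyname($host);
            return ($qip === $ip);
        }
    }
    return false;
}
if (!botIsAllowed($_SERVER['REMOTE_ADDR'])){
    echo "Banned!";
    exit;
}
?>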
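I’m sure everyone gets the idea of the reverse DNS and forward DNS check: the reverse lookup tells you which host the IP claims to be, and the forward lookup confirms that this host really resolves back to the same IP.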
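If you want to play with the logic without hitting real DNS, one option is to pass the two lookups in as callables. This is only a sketch under my own assumptions: the name isVerifiedCrawler, the lookup tables and the documentation-range IPs are all invented for the example.

```php
<?php
// Hypothetical variant of the check with injectable lookups, so it can be
// exercised offline. $reverse maps IP -> host, $forward maps host -> IP.
function isVerifiedCrawler(string $ip, array $suffixes, callable $reverse, callable $forward): bool
{
    $host = strtolower((string) $reverse($ip));
    // A failed reverse lookup yields false (cast to '') or the unchanged IP.
    if ($host === '' || $host === $ip) {
        return false;
    }
    foreach ($suffixes as $suffix) {
        // The leading dot in each suffix keeps "evilgooglebot.com" from matching.
        if (substr($host, -strlen($suffix)) === $suffix) {
            // The forward lookup must lead back to the same IP,
            // otherwise the PTR record cannot be trusted.
            return $forward($host) === $ip;
        }
    }
    return false;
}

// Fake DNS tables (made-up data): the second IP claims a crawler host name
// that does not resolve back to it.
$ptr = ['192.0.2.10' => 'crawl-1.googlebot.com', '198.51.100.7' => 'crawl-1.googlebot.com'];
$a   = ['crawl-1.googlebot.com' => '192.0.2.10'];
$reverse = function ($ip) use ($ptr) { return $ptr[$ip] ?? $ip; };
$forward = function ($host) use ($a) { return $a[$host] ?? $host; };

var_dump(isVerifiedCrawler('192.0.2.10', ['.googlebot.com'], $reverse, $forward));
var_dump(isVerifiedCrawler('198.51.100.7', ['.googlebot.com'], $reverse, $forward));
```

On a live server you would pass 'gethostbyaddr' and 'gethostbyname' as the two callables, which brings you back to the same behaviour as the check above.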
If I missed any decent site that uses sitemaps, let me know.
Note: If you’re still using static sitemaps (!), you can just include the XML after the code.
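For the static case, a minimal sketch could look like the following. The helper staticSitemapResponse, the sitemap.php and sitemap.xml file names, and the 403/404 status codes are my own assumptions; the check above simply prints "Banned!" and exits.

```php
<?php
// Hypothetical helper: decide what to send for a static sitemap request.
// $allowed would come from a check such as botIsAllowed($_SERVER['REMOTE_ADDR']).
function staticSitemapResponse(bool $allowed, string $path): array
{
    if (!$allowed) {
        // Assumption: answer unverified clients with 403 instead of a plain 200.
        return ['status' => 403, 'type' => 'text/plain', 'body' => 'Banned!'];
    }
    $xml = is_readable($path) ? file_get_contents($path) : false;
    if ($xml === false) {
        return ['status' => 404, 'type' => 'text/plain', 'body' => 'Sitemap not found'];
    }
    return ['status' => 200, 'type' => 'application/xml', 'body' => $xml];
}

// Possible usage inside a sitemap.php (assumed file name):
// $r = staticSitemapResponse(botIsAllowed($_SERVER['REMOTE_ADDR']), __DIR__ . '/sitemap.xml');
// http_response_code($r['status']);
// header('Content-Type: ' . $r['type']);
// echo $r['body'];
```

Keep in mind that this only helps if the raw sitemap.xml is not reachable directly, for example by storing it outside the web root.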