More sitemap issues!
on Thursday, April 19th, 2007
It’s a jungle out there!
I just realized that Yahoo is indexing the sitemap itself and listing it directly in the search results!
This is ridiculous.
Check this link: the last entry is from my sitemap.
web, money and etc.
on Wednesday, April 18th, 2007
After the sitemaps.org initiative to put sitemaps into robots.txt, I recently posted about how to identify robots before deciding whether to serve them the sitemap. I believe it’s extremely important for webmasters to protect themselves from site scrapers. A sitemap listed in robots.txt is like a highway sign pointing to the easiest way to scrape a site.
With this idea in mind, I also added a small snippet to my sitemap that checks whether the bot comes from a company I’d like to serve the sitemap to.
on Wednesday, April 11th, 2007
I’ve just read the post on the Google Webmaster blog about Ask.com now supporting the sitemaps.org sitemap format.
This is really great news for everyone who would like to be crawled faster and more accurately.
For me, the more interesting part of this news is the sitemaps.org proposal to reference sitemaps from robots.txt.
You simply add a line to your robots.txt saying:
Sitemap: <sitemap_location>
This part is really cool, but for site harvesters it’s an unbelievable tool. You are handing over the key to your site: web harvesters can crawl it really easily, because you’ve probably put all of your site’s pages into your sitemap.
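To show how little effort this takes on the harvester’s side, here is a minimal sketch that pulls every Sitemap line out of the text of a robots.txt file. The function name sitemapUrlsFromRobots and the example.com content are made up for illustration; a real scraper would first download the file, which is left out here.

```php
<?php
// Illustrative helper (not part of any library): collect every "Sitemap:" URL
// found in the text of a robots.txt file.
function sitemapUrlsFromRobots(string $robotsTxt): array
{
    $urls = [];
    // \R matches any line ending (\n, \r\n or \r).
    foreach (preg_split('/\R/', $robotsTxt) as $line) {
        // The field name is matched case-insensitively; anything after the
        // first whitespace following the URL is ignored.
        if (preg_match('/^\s*sitemap\s*:\s*(\S+)/i', $line, $m)) {
            $urls[] = $m[1];
        }
    }
    return $urls;
}

// Made-up robots.txt content for the example.
$robots = "User-agent: *\nDisallow: /admin/\nSitemap: http://www.example.com/sitemap.xml\n";
print_r(sitemapUrlsFromRobots($robots));
```

A dozen lines are enough to find the full list of pages, which is why I think the sitemap itself should only be served to crawlers you can verify.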
Sounds like a good plan in an ideal world. With all the cloakers and content scrapers out there, you must be really smart not to be ripped apart.
My suggestion is to know who you’re serving the sitemap to. Currently Google, Yahoo and Ask support the sitemaps.org format, and no other site has any use for it.
Here is a simple check you can add at the beginning of your sitemap script:
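<?php
// Returns true only if $ip belongs to a whitelisted crawler, verified by a
// reverse DNS lookup followed by a confirming forward lookup.
function botIsAllowed($ip) {
    // Reverse DNS: gethostbyaddr() returns the IP unchanged on failure
    // and false on malformed input.
    $host = gethostbyaddr($ip);
    if ($host === false || $host === $ip) {
        return false;
    }
    $host = strtolower($host);
    $botDomains = array(
        '.inktomisearch.com', // Yahoo's crawler
        '.googlebot.com',
        '.ask.com',
    );
    // Check whether the reverse DNS name ends with a whitelisted domain.
    foreach ($botDomains as $bot) {
        if (substr($host, -strlen($bot)) === $bot) {
            // Forward DNS: the host name must resolve back to the same IP.
            $qip = gethostbyname($host);
            return ($qip === $ip);
        }
    }
    return false;
}

if (!botIsAllowed($_SERVER['REMOTE_ADDR'])) {
    // Answer with a 403 so the refusal is not cached as a valid page.
    header('HTTP/1.1 403 Forbidden');
    echo "Banned!";
    exit;
}
?>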
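I’m sure everyone gets the idea of reverse DNS and forward DNS checking: the reverse lookup tells you which host name the IP claims to be, and the forward lookup confirms that this host name really points back to the same IP, so a scraper can’t get in just by faking its reverse record.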
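If you want to try the logic without waiting for a real crawler, the same check can be sketched with the DNS lookups passed in as parameters. Everything named here (isVerifiedCrawler, the $reverse and $forward parameters, the fake resolvers and the 192.0.2.x / 203.0.113.x documentation addresses) is an assumption for illustration only; by default the function falls back to PHP’s gethostbyaddr() and gethostbynamel(), which handle IPv4 only on the forward side.

```php
<?php
// Sketch of forward-confirmed reverse DNS. $reverse and $forward are
// hypothetical injection points so the logic can be exercised without real
// DNS; when omitted, PHP's own gethostbyaddr() and gethostbynamel() are used.
function isVerifiedCrawler(string $ip, array $allowedSuffixes, ?callable $reverse = null, ?callable $forward = null): bool
{
    $reverse = $reverse ?? 'gethostbyaddr';
    $forward = $forward ?? 'gethostbynamel';

    // Step 1: reverse lookup (IP -> host name). A failed lookup yields
    // false or the unchanged IP, and both are rejected.
    $host = $reverse($ip);
    if (!is_string($host) || $host === $ip) {
        return false;
    }
    $host = strtolower(rtrim($host, '.'));

    // Step 2: the host name must end with one of the allowed suffixes.
    foreach ($allowedSuffixes as $suffix) {
        if (substr($host, -strlen($suffix)) === $suffix) {
            // Step 3: forward lookup (host name -> IPs) must contain the IP.
            $ips = $forward($host);
            return is_array($ips) && in_array($ip, $ips, true);
        }
    }
    return false;
}

// Fake resolvers standing in for DNS, using made-up records.
$fakeReverse = function ($ip) {
    $ptr = ['192.0.2.10' => 'crawl-1.googlebot.com', '203.0.113.5' => 'crawl-1.googlebot.com'];
    return $ptr[$ip] ?? $ip;
};
$fakeForward = function ($host) {
    return $host === 'crawl-1.googlebot.com' ? ['192.0.2.10'] : false;
};

var_dump(isVerifiedCrawler('192.0.2.10', ['.googlebot.com'], $fakeReverse, $fakeForward));  // genuine record
var_dump(isVerifiedCrawler('203.0.113.5', ['.googlebot.com'], $fakeReverse, $fakeForward)); // spoofed reverse record
```

The second call is the interesting one: the address claims a googlebot.com name in its reverse record, but the forward lookup doesn’t list it, so it is refused.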
If I missed any decent site that uses sitemaps, let me know.
Note: If you’re still using static sitemaps (!), you can just include the XML after the code.