<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Mental notes to myself &#187; Ask</title>
	<atom:link href="http://livebookmark.net/journal/category/ask/feed/" rel="self" type="application/rss+xml" />
	<link>http://livebookmark.net/journal</link>
	<description>web, money and etc.</description>
	<lastBuildDate>Tue, 08 Feb 2011 21:41:51 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Ask.com and Their bot!</title>
		<link>http://livebookmark.net/journal/2007/04/18/askcom-and-their-bot/</link>
		<comments>http://livebookmark.net/journal/2007/04/18/askcom-and-their-bot/#comments</comments>
		<pubDate>Wed, 18 Apr 2007 23:25:11 +0000</pubDate>
		<dc:creator>Harun Yayli</dc:creator>
				<category><![CDATA[Ask]]></category>
		<category><![CDATA[Internet]]></category>
		<category><![CDATA[Sitemaps]]></category>

		<guid isPermaLink="false">http://livebookmark.net/journal/2007/04/18/askcom-and-their-bot/</guid>
		<description><![CDATA[After their initiative to put sitemaps into robots.txt, I recently posted about how to identify robots to decide whether to serve them the sitemap. I believe it&#8217;s extremely important for webmasters to protect themselves from site scrapers. A sitemap in robots.txt is like a highway sign pointing out the easy way to scrape a site.
With [...]]]></description>
			<content:encoded><![CDATA[<p>After their initiative to put sitemaps into robots.txt, I recently posted about how to identify robots to decide whether to serve them the sitemap. I believe it&#8217;s extremely important for webmasters to protect themselves from site scrapers. A sitemap in robots.txt is like a highway sign pointing out the easy way to scrape a site.<br />
With this idea in mind, I also added a small snippet to my sitemap that checks whether the bot comes from a company I want to serve the sitemap to.<br />
<span id="more-145"></span><br />
It&#8217;s been 3-4 days, and I received a warning from my web server that someone I don&#8217;t know tried to get my sitemap.<br />
The IP was 65.119.214.9. Its reverse DNS resolves to ext9.eds.jeeves.ask.info, but no IP is defined for that host name!<br />
So I started to dig around the IP to see who it belongs to.<br />
Well, it actually was from Ask.com, or from someone who looks related to Ask.com.<br />
I was surprised by the misconfiguration, and I filled out the <a href="http://about.ask.com/en/docs/about/customer_service.shtml">crawler feedback form at Ask.com</a> to report it.<br />
I received a response from them on Sunday claiming the bot, but I got hit by the same bot again today. Guess what! There is still no IP defined for the rDNS name of the same IP.<br />
They show how to identify the bot on their site <a href="http://about.ask.com/en/docs/about/webmasters.shtml#21">in the webmaster&#8217;s guide</a>, but their own setup is not even properly configured! What a shame.<br />
I&#8217;ve just sent an email to their DNS admin about this. Let&#8217;s see what their response will be, or whether I ever get a response at all.<br />
I&#8217;ll post the updates&#8230;</p>
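<p>For readers who want to reproduce the diagnosis, here is a minimal sketch of the reverse-then-forward lookup described above. The function name <code>dnsStatus</code> and its two resolver parameters are made up for illustration; they default to PHP&#8217;s <code>gethostbyaddr()</code> and <code>gethostbynamel()</code> and can be swapped for stubs to try the logic without live DNS.</p>
<pre><code>
&lt;?php
// Hypothetical helper: describe the DNS setup of a client IP.
// $reverse and $forward default to PHP's resolver functions and
// can be replaced by stubs to try the logic without live DNS.
function dnsStatus($ip, $reverse = 'gethostbyaddr', $forward = 'gethostbynamel'){
    $host = call_user_func($reverse, $ip);
    // gethostbyaddr() returns false or the unchanged IP when the lookup fails.
    if ($host === false || $host === $ip){
        return 'no reverse DNS';
    }
    $ips = call_user_func($forward, $host);
    // gethostbynamel() returns false when the host name does not resolve.
    if (!is_array($ips)){
        return 'reverse DNS only: ' . $host . ' has no IP defined';
    }
    if (in_array($ip, $ips, true)){
        return 'forward-confirmed: ' . $host;
    }
    return 'mismatch: ' . $host . ' resolves to other IPs';
}
</code></pre>
<p>Calling <code>dnsStatus($_SERVER['REMOTE_ADDR'])</code> returns a short description of which case applies; the situation described above corresponds to the &#8220;reverse DNS only&#8221; case.</p>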
]]></content:encoded>
			<wfw:commentRss>http://livebookmark.net/journal/2007/04/18/askcom-and-their-bot/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Sitemaps in the robots.txt Happy Harvesting</title>
		<link>http://livebookmark.net/journal/2007/04/11/sitemaps-in-the-robotstxt-happy-harvesting/</link>
		<comments>http://livebookmark.net/journal/2007/04/11/sitemaps-in-the-robotstxt-happy-harvesting/#comments</comments>
		<pubDate>Thu, 12 Apr 2007 04:40:53 +0000</pubDate>
		<dc:creator>Harun Yayli</dc:creator>
				<category><![CDATA[Ask]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Hacks]]></category>
		<category><![CDATA[Ideas]]></category>
		<category><![CDATA[Internet]]></category>
		<category><![CDATA[Security]]></category>
		<category><![CDATA[Sitemaps]]></category>
		<category><![CDATA[Web Standards]]></category>
		<category><![CDATA[Yahoo]]></category>

		<guid isPermaLink="false">http://livebookmark.net/journal/2007/04/11/sitemaps-in-the-robotstxt-happy-harvesting/</guid>
		<description><![CDATA[I&#8217;ve just read on the Google Webmaster blog the news that Ask.com now supports the Sitemaps.org sitemap format.
This is really great news for everyone who would like to be crawled faster and more accurately.
For me, the more interesting part of this news is sitemaps.org&#8217;s proposal to include sitemaps in robots.txt.
You simply add a line [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve just read on the Google Webmaster blog the news that Ask.com now supports the Sitemaps.org sitemap format.<br />
This is really great news for everyone who would like to be crawled faster and more accurately.<br />
For me, the more interesting part of this news is sitemaps.org&#8217;s proposal to include sitemaps in robots.txt.</p>
<p>You simply add a line to your robots.txt saying</p>
<blockquote><p>
Sitemap: &lt;sitemap_location&gt;
</p></blockquote>
<p>This part is really cool, but for site harvesters it&#8217;s an unbelievable tool. You are handing over the key to your site: web harvesters can crawl it really easily, because you have probably put all your site&#8217;s pages into your sitemap.</p>
<p>Sounds like a good plan in an ideal world. With all the cloakers and content scrapers out there, you must be really smart not to be ripped apart.</p>
<p>My suggestion is to know who you&#8217;re serving the sitemap to. Currently Google, Yahoo and Ask support this sitemaps.xml, and no other site has anything to do with it.<br />
Here is a simple check you can add at the beginning of your sitemap script:</p>
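<pre><code>
&lt;?php
    // Returns true only when $ip belongs to a whitelisted crawler.
    function botIsAllowed($ip){
        // Get the reverse DNS of the IP. gethostbyaddr() returns false on
        // malformed input and the unchanged IP when the lookup fails.
        $host = gethostbyaddr($ip);
        if ($host === false || $host === $ip){
            return false;
        }
        $host = strtolower($host);
        $botDomains = array('.inktomisearch.com',
                            '.googlebot.com',
                            '.ask.com',
                      );

        // Check whether the reverse DNS name ends with a whitelisted domain.
        foreach($botDomains as $bot){
            if (substr($host, -strlen($bot)) === $bot){
                // Forward check: the host name must resolve back to the same IP.
                $ips = gethostbynamel($host);
                return is_array($ips) ? in_array($ip, $ips, true) : false;
            }
        }
        return false;
    }

    if (!botIsAllowed($_SERVER['REMOTE_ADDR'])){
        echo "Banned!";
        exit;
    }
?&gt;
</code></pre>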
<p>I&#8217;m sure everyone gets the idea of reverse DNS and forward DNS checking.<br />
If I missed any decent site that uses sitemaps, let me know.</p>
<p>Note: If you&#8217;re still using static sitemaps (!), you can just include the XML after the code.</p>
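<p>As an alternative to pasting the XML after the closing PHP tag, the static file can be read from disk once the check has passed. The sketch below is only an illustration: <code>sitemapBody</code> is a made-up helper, and the file name <code>sitemap.xml</code> is an assumption.</p>
<pre><code>
&lt;?php
// Hypothetical helper: return the sitemap body for allowed clients,
// or null when the client is not allowed or the file cannot be read.
// $allowed is the result of the bot check; $path is the static file.
function sitemapBody($allowed, $path){
    if (!$allowed || !is_readable($path)){
        return null;
    }
    return file_get_contents($path);
}

// Usage sketch, placed after the bot check in the sitemap script:
// $xml = sitemapBody(botIsAllowed($_SERVER['REMOTE_ADDR']), 'sitemap.xml');
// if ($xml === null){ header('HTTP/1.1 403 Forbidden'); exit; }
// header('Content-Type: application/xml; charset=utf-8');
// echo $xml;
</code></pre>
<p>Keeping the XML in its own file means the static sitemap can be regenerated without touching the PHP check.</p>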
]]></content:encoded>
			<wfw:commentRss>http://livebookmark.net/journal/2007/04/11/sitemaps-in-the-robotstxt-happy-harvesting/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

