Archive for April, 2007

Full text index and other complex indexes together

on Monday, April 30th, 2007

Always remember, if you’re using MyISAM tables, KEEP THE INDEX AS SMALL AS POSSIBLE.

That means your full-text index will be huge, and for faster queries you don’t want to keep it in the same table as your other indexes.

Simply create another table to hold the text fields and the full-text indexes, and LEFT JOIN it when you need them. It works way faster for me on a table with 230K rows.

The data takes about 500MB and the indexes take about 700MB. I don’t think I over-indexed the table; the delicate business logic needs those indexes. Even a simple one-field query (with a normal index on that field) was taking more than 30 seconds. I moved the full-text indexes out of the table and now it takes 1 second.
Wow, a good thing to remember. A mental note to myself again.
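
A minimal sketch of that split, for illustration only. The post does not show its schema, so the table and column names (articles, article_text, category_id, title, body) are assumptions, and the statements are kept as PHP strings so they can be passed to whatever database layer is in use.

```php
<?php
// Hypothetical schema: none of these names come from the original post.

// Main table: only the small, regular indexes live here.
$createMain = "
CREATE TABLE articles (
    id          INT UNSIGNED NOT NULL AUTO_INCREMENT,
    category_id INT UNSIGNED NOT NULL,
    created_at  DATETIME NOT NULL,
    PRIMARY KEY (id),
    KEY idx_category (category_id)
) ENGINE=MyISAM";

// Side table: the text columns and the full-text index, nothing else.
$createText = "
CREATE TABLE article_text (
    article_id INT UNSIGNED NOT NULL,
    title      VARCHAR(255) NOT NULL,
    body       TEXT NOT NULL,
    PRIMARY KEY (article_id),
    FULLTEXT KEY ft_title_body (title, body)
) ENGINE=MyISAM";

// Plain lookups touch only the main table and its smaller index file.
$simpleQuery = "SELECT id FROM articles WHERE category_id = ?";

// The text is pulled in with a LEFT JOIN only when it is actually needed.
$joinedQuery = "
SELECT a.id, t.title, t.body
FROM articles a
LEFT JOIN article_text t ON t.article_id = a.id
WHERE a.category_id = ?";

// Full-text searches start from the side table and join back.
$searchQuery = "
SELECT a.id, t.title
FROM article_text t
JOIN articles a ON a.id = t.article_id
WHERE MATCH(t.title, t.body) AGAINST (?)";
```

MyISAM keeps all of a table's indexes in one index file, so moving the full-text index to its own table keeps the main table's index file small. Whether that gives the same speedup elsewhere depends on the data and the queries.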

More sitemap issues!

on Thursday, April 19th, 2007

It’s a jungle out there!
I just realized that Yahoo is indexing the sitemap itself and showing it directly in the search results!
This is ridiculous.
Check this link: the last entry is from my sitemap.

Ask.com and Their bot!

on Wednesday, April 18th, 2007

After their initiative to put sitemaps into robots.txt, I recently posted about how to identify robots and decide whether to serve them the sitemap or not. I believe it’s extremely important for webmasters to protect themselves from site scrapers. A sitemap in robots.txt is like a highway sign pointing to the easy way to scrape a site.
With this idea in mind, I also added a small snippet to my sitemap that checks whether the bot comes from a company I want to serve the sitemap to.
Read more…

Sitemaps in the robots.txt Happy Harvesting

on Wednesday, April 11th, 2007

I’ve just read the news on the Google Webmaster blog about ask.com supporting Sitemaps.org’s sitemap format.
This is really great news for everyone who wants to be crawled faster and more accurately.
For me, the more interesting part of this news is sitemaps.org’s proposal to include sitemaps in robots.txt.

You simply add a line to your robots.txt saying

Sitemap: <sitemap_location>

This part is really cool, but for site harvesters it’s an unbelievable tool. You are handing over the key to your site: web harvesters can crawl it really easily, because you’ve probably put all of your site’s pages into your sitemap.

Sounds like a good plan in an ideal world. With all the cloakers and content scrapers you must be really smart not to be ripped apart.

My suggestion is to know who you’re serving the sitemap to. Currently Google, Yahoo and Ask support the sitemap format, and no other site has any business with it.
Here is a simple check you can add at the beginning of your sitemap script:

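<?php
    // Return true only if $ip belongs to a whitelisted crawler, verified by
    // a reverse DNS lookup followed by a confirming forward DNS lookup.
    function botIsAllowed($ip){
        // Get the reverse DNS of the IP. gethostbyaddr() returns the IP
        // unchanged on failure, which matches none of the domains below.
        $host = strtolower(gethostbyaddr($ip));
        $botDomains = array('.inktomisearch.com',
                            '.googlebot.com',
                            '.ask.com',
                      );

        // Check whether the reverse DNS name ends with a whitelisted domain.
        foreach($botDomains as $bot){
            if (strpos(strrev($host),strrev($bot))===0){
                // Forward-confirm: the host name must resolve back to the same IP.
                $qip = gethostbyname($host);
                return ($qip==$ip);
            }
        }
        return false;
    }

    if (!botIsAllowed($_SERVER['REMOTE_ADDR'])){
        // Send a real error status so the response is not treated as a valid sitemap.
        header('HTTP/1.1 403 Forbidden');
        echo "Banned!";
        exit;
    }
?>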

I’m sure everyone can get the idea of the reverse DNS and forward DNS checking.
If I missed any decent site that uses sitemaps, let me know.
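
To spell out why both lookups matter, here is a small sketch of the same check with the DNS lookups passed in as callables, so it can be run without touching the network. The function name verifyBot(), the fake lookup tables and the IP addresses (documentation-only ranges) are all hypothetical and not part of the snippet above.

```php
<?php
// Reverse-then-forward verification with injectable lookups.
// verifyBot() and the fake data below are illustrative assumptions.
function verifyBot($ip, array $allowedSuffixes, $reverse, $forward) {
    $host = strtolower((string) call_user_func($reverse, $ip));
    foreach ($allowedSuffixes as $suffix) {
        // Suffix match: the host name must END with the whitelisted domain.
        if (strlen($host) > strlen($suffix)
            && substr($host, -strlen($suffix)) === $suffix) {
            // Forward-confirm: the claimed host must resolve back to the caller's IP.
            return call_user_func($forward, $host) === $ip;
        }
    }
    return false;
}

// Fake PTR (reverse) records. Anyone controlling their own reverse zone
// can claim any host name, which is why the forward check is needed.
$ptr = array(
    '192.0.2.10'   => 'crawl-1.googlebot.com',     // genuine-looking crawler
    '198.51.100.7' => 'crawl-1.googlebot.com',     // scraper with a forged PTR record
    '203.0.113.5'  => 'googlebot.com.example.net', // look-alike host name
);
// Fake A (forward) records, controlled by the real domain owner.
$a = array('crawl-1.googlebot.com' => '192.0.2.10');

// Like gethostbyaddr() and gethostbyname(), return the input unchanged on failure.
$reverse = function ($ip) use ($ptr) { return isset($ptr[$ip]) ? $ptr[$ip] : $ip; };
$forward = function ($host) use ($a) { return isset($a[$host]) ? $a[$host] : $host; };

$suffixes = array('.googlebot.com');
$genuine   = verifyBot('192.0.2.10', $suffixes, $reverse, $forward);   // true
$forged    = verifyBot('198.51.100.7', $suffixes, $reverse, $forward); // false
$lookalike = verifyBot('203.0.113.5', $suffixes, $reverse, $forward);  // false
$unknown   = verifyBot('192.0.2.99', $suffixes, $reverse, $forward);   // false
```

The forged case passes the reverse lookup but fails the forward one, and the look-alike name fails the suffix match because the whitelisted domain is not at the end of the host name.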

Note: If you’re still using static sitemaps (!) you can just include the xml after the code.

Goatse Can Get You Jailtime in the US, ouch!

on Monday, April 9th, 2007

I’ve just read a post about how the goatse man can get you jail time in the US if you post it on a board or site under a fake title. I won’t link the infamous photo here, sorry :)

Here is the related US code:
Read more…