Full text index and other complex indexes together

April 30th, 2007 by Harun Yayli | 1 Comment »

Always remember: if you’re using MyISAM tables, KEEP THE INDEX AS SMALL AS POSSIBLE.

A full text index will be huge, so if you want fast queries you don’t want to keep it in the same table as your other types of index.

Simply create another table to hold the text fields and their full text indexes, and left join it when you need it. It works way faster for me on a table with 230K rows.

The data takes about 500MB and the indexes take about 700MB. I don’t think I over-indexed the table; the delicate business logic needs those indexes. Even a simple one-field query (with a normal index on that field) was taking more than 30 seconds. I moved the full text indexes out of the table and now it takes 1 second.
Wow, a good thing to remember. A mental note to myself again.
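
A minimal sketch of the two-table layout described above. It uses Python's built-in sqlite3 module only to show the schema split and the left join; the table and column names (`articles`, `article_text`, `category_id`, `body`) are made up for illustration, and in MySQL the second table would be the one carrying the FULLTEXT index.

```python
import sqlite3

# Hypothetical schema: "articles" keeps the small, frequently filtered columns;
# "article_text" keeps the large text column in a table of its own.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE articles (
        id INTEGER PRIMARY KEY,
        category_id INTEGER NOT NULL
    );
    CREATE INDEX idx_articles_category ON articles (category_id);

    -- In MySQL this is the table that would carry the FULLTEXT index.
    CREATE TABLE article_text (
        article_id INTEGER PRIMARY KEY REFERENCES articles (id),
        body TEXT NOT NULL
    );
""")
conn.executemany("INSERT INTO articles VALUES (?, ?)",
                 [(1, 10), (2, 20), (3, 10)])
conn.executemany("INSERT INTO article_text VALUES (?, ?)",
                 [(1, "full text search notes"), (3, "sitemap notes")])


def articles_in_category(category_id):
    """Filter on the narrow table, then LEFT JOIN the text rows in."""
    return conn.execute(
        """SELECT a.id, t.body
           FROM articles AS a
           LEFT JOIN article_text AS t ON t.article_id = a.id
           WHERE a.category_id = ?
           ORDER BY a.id""",
        (category_id,),
    ).fetchall()
```

Queries that only filter on the normal indexes never have to touch the text table; the join is added only when the text itself is needed. A row without a matching text row still comes back, with `None` in place of the body.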


More sitemap issues!

April 19th, 2007 by Harun Yayli | No Comments »

It’s a jungle out there!
I just realized that Yahoo is indexing the sitemap itself and showing it directly in the search results!
This is ridiculous.
Check this link: the last entry is from my sitemap.

Ask.com and Their bot!

April 18th, 2007 by Harun Yayli | No Comments »

After their initiative to put sitemaps into robots.txt, I recently posted about how to identify robots in order to decide whether to serve them the sitemap or not. I believe it’s extremely important for webmasters to protect themselves from site scrapers. A sitemap in robots.txt is like a highway sign pointing to the easy way to scrape a site.
With this idea in mind, I also added a small snippet to my sitemap that checks whether the bot is from a company I’d like to serve the sitemap to.
Read more…

Sitemaps in the robots.txt Happy Harvesting

April 11th, 2007 by Harun Yayli | No Comments »

I’ve just read on the Google Webmaster blog the news about ask.com supporting Sitemaps.org’s sitemap format.
This is really great news for everyone who would like to be crawled faster and more accurately.
For me the more interesting part of this news is sitemaps.org’s proposal to include sitemaps in robots.txt.

You simply add a line to your robots.txt saying:

Sitemap: <sitemap_location>

This part is really cool, but for site harvesters it’s an unbelievable tool. You are handing over the key to your site: web harvesters can crawl it really easily, because you’ve probably put all of your site’s pages into your sitemap.

Sounds like a good plan in an ideal world. With all the cloakers and content scrapers out there, you must be really smart not to be ripped apart.
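
To show how little work this takes on the harvester's side, here is a rough Python sketch that pulls every sitemap location out of a robots.txt body. The function name `sitemap_urls` and the example.com URL are made up for illustration.

```python
def sitemap_urls(robots_txt):
    """Return every URL declared on a 'Sitemap:' line of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop trailing comments, then split "Field: value" on the first
        # colon only, because the URL itself contains colons.
        line = line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


# Hypothetical robots.txt content.
example = """User-agent: *
Disallow: /admin/
Sitemap: http://www.example.com/sitemap.xml
"""
```

A few lines of parsing are enough to go from robots.txt to the full list of pages, which is why it matters who gets to read the sitemap.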

My suggestion is to know who you’re serving the sitemap to. Currently Google, Yahoo and Ask support sitemaps.xml, and no other site has any business with it.
Here is a simple check you can add at the beginning of your sitemap script:

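<?php
    // Returns true only if $ip belongs to one of the whitelisted crawlers.
    function botIsAllowed($ip){
        // Get the reverse DNS (PTR) host name of the IP.
        $host = strtolower(gethostbyaddr($ip));
        // Host name suffixes of the crawlers allowed to read the sitemap.
        $botDomains = array('.inktomisearch.com',
                            '.googlebot.com',
                            '.ask.com',
                            );

        // Check whether the host name ends with a whitelisted suffix.
        foreach($botDomains as $bot){
            if (strpos(strrev($host),strrev($bot))===0){
                // Forward-confirm: the host name must resolve back to the same IP.
                $qip = gethostbyname($host);
                return ($qip==$ip);
            }
        }
        return false;
    }

    if (!botIsAllowed($_SERVER['REMOTE_ADDR'])){
        echo "Banned!";
        exit;
    }
?>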

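I’m sure everyone gets the idea of reverse DNS and forward DNS checking: the reverse lookup gives the host name the IP claims to have, and the forward lookup confirms that this host name really points back to the same IP.
If I missed any decent site that uses sitemaps, let me know.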
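
For anyone who wants to see why the forward lookup matters, here is a small Python sketch of the same reverse-then-forward check. It is not a port of the PHP above. The resolvers are passed in as parameters (an assumption made for this example) so the logic can be tried against fake DNS tables. The host names and IPs in those tables are invented. Unlike the PHP version, it accepts a match against any address the host name resolves to, not only the first one.

```python
import socket

ALLOWED_SUFFIXES = (".inktomisearch.com", ".googlebot.com", ".ask.com")


def _reverse(ip):
    # PTR lookup: IP -> host name.
    return socket.gethostbyaddr(ip)[0]


def _forward(host):
    # A lookup: host name -> list of IPv4 addresses.
    return socket.gethostbyname_ex(host)[2]


def bot_is_allowed(ip, reverse=_reverse, forward=_forward):
    """Forward-confirmed reverse DNS check against a suffix whitelist."""
    try:
        host = reverse(ip).lower().rstrip(".")
    except OSError:
        return False
    if not host.endswith(ALLOWED_SUFFIXES):
        return False
    try:
        # Whoever controls an IP also controls its PTR record, so the
        # claimed host name must resolve back to that same IP.
        return ip in forward(host)
    except OSError:
        return False


# Fake DNS tables for illustration only (invented names and addresses).
fake_ptr = {
    "66.249.66.1": "crawl-66-249-66-1.googlebot.com",
    "203.0.113.5": "fake.googlebot.com",   # scraper with a spoofed PTR
    "198.51.100.7": "host.example.net",
}
fake_a = {
    "crawl-66-249-66-1.googlebot.com": ["66.249.66.1"],
    "fake.googlebot.com": ["192.0.2.1"],   # does not point back to the scraper
}
```

With these tables, the first IP passes both lookups. The second has a host name that looks right but fails the forward confirmation. The third is rejected on the suffix alone.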

Note: If you’re still using static sitemaps (!), you can just include the XML after the code.

Goatse Can Get You Jailtime in the US, ouch!

April 9th, 2007 by Harun Yayli | No Comments »

I’ve just read a post saying that the goatse man can get you jail time in the US if you post it on a board or site under a fake title. I won’t link to the infamous photo here, sorry :)

Here is the related US code:
Read more…

Google Base Banned Keywords

March 11th, 2007 by Harun Yayli | No Comments »

I started submitting some items to Google Base using the API.
It was all going fine until I realized some keywords were banned from the description, title or URL.
I was posting the free articles I have on various subjects, and my purpose was not spamming.
It’s kind of weird to get banned for wanting to post a news item against online gambling.

It’s a stupid keyword match, not a context match. This means you can’t really post anything containing those keywords, whether you’re writing for or against them.
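
How Google Base actually filters submissions is not known here, so the following Python sketch only illustrates what a plain keyword match does. The keyword list is an assumption: "iraqi dinar" is mentioned in this post, and "online gambling" is guessed from the example above.

```python
# Hypothetical banned list; not Google's real one.
BANNED_KEYWORDS = ("online gambling", "iraqi dinar")


def is_rejected(text):
    """Plain substring match: the surrounding context is never looked at."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in BANNED_KEYWORDS)
```

A title such as "New law passed against online gambling" is rejected just like a title promoting it, because the filter only sees the substring.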

Here is the list of banned keywords I’ve found. Maybe there are more.

Update: I’ve found this one which is really weird: iraqi dinar

Read more…

Nintendo Wii Lawsuits

December 23rd, 2006 by Harun Yayli | No Comments »

It didn’t surprise me to read that some guys are suing Nintendo for failing to include a non-defective remote control. As far as I understand, it has something to do with the breaking remote straps, which were recently replaced.
Here is a pic of the replacement straps:
Read more…

New Ad Presentation from Adsense

December 21st, 2006 by Harun Yayli | No Comments »

I noticed a new presentation of the “Ads by Google” label on AdSense advertisements.
It looks really nice.

Find Articles

November 9th, 2006 by Harun Yayli | No Comments »

Hello,
I’d like to announce the new website that I’m building right now: FindInArticles.com.
This site is about articles: it crawls all known article directories and article publishing sites and gathers articles from them.
It has a really extensive categorical index. Currently I have about 1000 articles, but the number is growing daily.
I used PHP / MySQL / Apache to build the site. I’m planning to add some Ajax toys over time.
You can also submit articles.

Visit me and tell me what you think, thanks…

Google Wins By Losing

November 1st, 2006 by Harun Yayli | No Comments »

An interesting article on Google’s non-search products and how they’re expanding their business.

Some of Google’s non-search projects are really extensions of its search monetization, and are likely to succeed. But others projects mean entering areas where Google doesn’t have much experience, and is taking a risk. With regard to those riskier areas, the key question for Google’s future is whether it can realize that losing is really one of the best assets the company has.

Read more on:
mediapost