April 23, 2003

Grub, Google and the Semantic Web

Dr. Elwyn Jenkins penned a good overview discussing Grub, Google and search engine convergence (or as extropians might extol: ‘the Singularity’).

Here’s how the numbers break down:

As of this writing, Google has 3,083,324,652 web pages indexed.

Google crawls about 150 million pages each day, so it takes them about 20 days to crawl what they claim is the web. They arguably have the largest crawling farm in existence, with access to over 100,000 processors and 263,000 hard drives (though much of what is stored on them belongs to their other services, like Newsgroups, Images, Cached sites, etc.). Compare that to Grub:

2,193 clients running Grub crawled 124,456,219 URLs in the last 24 hours.

The Grub client on my computer alone has crawled a total of 150,000 URLs in the past day and a half; I have Grub running on another computer and it has racked up 85,000 URLs in the same amount of time (it was disconnected for several hours – the monkeys mashed too many bananas on the keyboard… you know how that is).
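
A quick Python sketch puts those figures side by side. It uses only the numbers quoted above; the function and variable names are illustrative, and it assumes "a day and a half" means exactly 1.5 days.

```python
# Figures quoted in the post; all names here are illustrative only.
GOOGLE_INDEX = 3_083_324_652    # pages Google reports as indexed
GOOGLE_PER_DAY = 150_000_000    # pages Google crawls per day (approximate)
GRUB_CLIENTS = 2_193            # Grub clients active in the last 24 hours
GRUB_PER_DAY = 124_456_219      # URLs those clients crawled in 24 hours


def days_for_full_pass(pages: int, pages_per_day: float) -> float:
    """Days needed to crawl `pages` once at a constant daily rate."""
    return pages / pages_per_day


def per_client_per_day(urls: int, clients: int, days: float = 1.0) -> float:
    """Average URLs crawled per client per day."""
    return urls / clients / days


if __name__ == "__main__":
    google_days = days_for_full_pass(GOOGLE_INDEX, GOOGLE_PER_DAY)
    print(f"Google full pass: {google_days:.1f} days")

    average = per_client_per_day(GRUB_PER_DAY, GRUB_CLIENTS)
    print(f"Average Grub client: {average:,.0f} URLs/day")

    # Assumption: both of the post's clients ran for exactly 1.5 days.
    print(f"First client: {per_client_per_day(150_000, 1, 1.5):,.0f} URLs/day")
    print(f"Second client: {per_client_per_day(85_000, 1, 1.5):,.0f} URLs/day")
```

By this rough measure the first client runs well above the pool average, and the second sits about on it despite its downtime.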

Additionally, Grub’s FAQ section states that there are around 10 billion web pages in existence, with about 2 million more added each day.

If we want real-time indexing, let's look at the numbers necessary to do so:

* ~30 days in a solar month
* 24 hours in a Romantic day
* 60 minutes in an Olmecian/Sumerian hour
* 60 seconds in a non-metric minute.

Now let’s assume a few things. First, everything stays the same (ceteris paribus). Second, that only 3 billion pages need to be crawled.

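125,000,000 / 24 hours = 5,208,333.33… pages crawled in an hour
5,208,333.33… / 60 minutes = 86,805.55… pages crawled in a minute
86,805.55… / 60 seconds = 1,446.759259259… pages crawled in a second

3,000,000,000 pages in existence / 1,446.759259259… pages per second = 2,073,600 seconds, or 24 days.

That 2,073,600 is a time, not a head count: it is how long today’s 2,193 clients need for one pass over the web, close to Google’s 20 days. How many clients “real-time” takes depends on how fresh the index has to be. A full pass every day needs 24 times the current pool, or about 52,600 clients; a full pass every hour needs 24 × 24 = 576 times the pool, or about 1.26 million clients.

If 10 billion is the real number of pages, just multiply by 3.33: about 175,000 clients for a daily pass, or about 4.2 million for an hourly one.

So, for hourly freshness we’re talking about numbers in the seti@home and Kazaa ball-park, which is assuming that nothing changes (like more efficient code or bandwidth allocation).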
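
The arithmetic above can be redone with the units kept explicit. The Python sketch below is a back-of-the-envelope model, not anything Grub publishes: it assumes throughput scales linearly with the number of clients, it uses the rounded 125,000,000 pages/day figure, and the names `seconds_for_full_pass`, `clients_needed` and `passes_per_day` are hypothetical. Note that dividing pages by pages-per-second yields seconds, so 2,073,600 comes out as a crawl time, and the client count follows from how often the web has to be re-crawled.

```python
CLIENTS_NOW = 2_193            # Grub clients active in the last 24 hours
PAGES_PER_DAY = 125_000_000    # rounded daily crawl total for those clients
SECONDS_PER_DAY = 24 * 60 * 60


def seconds_for_full_pass(pages: int) -> float:
    """Seconds the current pool needs to crawl `pages` once."""
    return pages * SECONDS_PER_DAY / PAGES_PER_DAY


def clients_needed(pages: int, passes_per_day: int) -> int:
    """Clients needed to crawl `pages` in full `passes_per_day` times a day.

    Assumes every extra client adds the same average throughput
    (ceteris paribus), which is a simplification.
    """
    total = pages * passes_per_day * CLIENTS_NOW
    # Integer ceiling division, so a fractional client rounds up.
    return -(-total // PAGES_PER_DAY)


if __name__ == "__main__":
    for pages in (3_000_000_000, 10_000_000_000):
        days = seconds_for_full_pass(pages) / SECONDS_PER_DAY
        print(f"{pages:,} pages: {days:.0f} days with today's pool")
        print(f"  daily pass:  {clients_needed(pages, 1):,} clients")
        print(f"  hourly pass: {clients_needed(pages, 24):,} clients")
```

With these inputs, `clients_needed(3_000_000_000, 1)` returns 52,632 and `clients_needed(3_000_000_000, 24)` returns 1,263,168; the 10 billion case gives 175,440 and 4,210,560.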

Here are a couple of ideas (some of these may already be planned by Looksmart; if so, I am unaware of it).

- First, decentralize the servers using a supernode/shard-based system like Kazaa has (or DNS does). One benefit is resilience: no ‘one’ entity is in control of it, so if some servers go down, there are alternatives. Another is speed: search times could be minimized by routing queries through the nearest node.

This decentralized system could be protocol/standards based, so other companies, organizations and mimes could attach their own results (so an RDF-only crawler could merge with the system – which assists the growth of the Semantic Web).

- Second, more APIs. I’m sure this is being worked on currently but having the ability to search with Grub as you do with the Google Toolbar would be gravy. Actually, all Google has to do is adopt the same sort of system Grub has (it’s open-sourced so they should at least try it out and explain why it’s an inferior solution to their system). If Google did evolve to use this distributed crawling method, they already have a large library of useful APIs to use (instead of having Looksmart or others reinvent the wheel).

- Third, tell the world, or at least geekdom, what you plan on doing with the project in the long-term. So far many geeks think you’re just trying to use individuals like myself as a tool - I certainly hope that is not the case (I doubt it is).
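
The first idea above, spreading the load across supernodes with a fallback when one goes down, can be sketched in a few lines of Python. This is purely hypothetical and is not Grub's or Looksmart's design: it swaps in rendezvous (highest-random-weight) hashing as one plausible way to assign each host to a supernode, the node names are invented, and it ignores the "nearest node" part entirely.

```python
import hashlib
from typing import List, Set
from urllib.parse import urlsplit


def _score(node: str, key: str) -> int:
    """Deterministic pseudo-random weight for a (node, key) pair."""
    digest = hashlib.sha256(f"{node}|{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def ranked_supernodes(url: str, nodes: List[str]) -> List[str]:
    """Order supernodes from most to least preferred for the host of `url`."""
    host = urlsplit(url).hostname or url
    return sorted(nodes, key=lambda node: _score(node, host), reverse=True)


def pick_supernode(url: str, nodes: List[str], down: Set[str]) -> str:
    """Return the most preferred supernode that is not marked as down."""
    for node in ranked_supernodes(url, nodes):
        if node not in down:
            return node
    raise RuntimeError("no supernode available")


if __name__ == "__main__":
    # Hypothetical supernode names, for illustration only.
    nodes = ["node-a.example", "node-b.example", "node-c.example"]
    url = "http://example.com/some/page.html"
    primary = pick_supernode(url, nodes, down=set())
    backup = pick_supernode(url, nodes, down={primary})
    print(f"primary: {primary}, backup if it fails: {backup}")
```

Every client computes the same ranking without a central coordinator, and losing a supernode only moves the hosts that node was responsible for, which is the "there are alternatives" property described above.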

Yup, the distributed computing approach utilized by Grub is innovative and even exciting; I look forward to seeing where it will evolve.

One last note: Google just acquired Applied Semantics, whose patented CIRCA Technology understands, organizes, and extracts knowledge from websites and information repositories in a way that mimics human thought and enables more effective information retrieval. A key application of the CIRCA technology is Applied Semantics' AdSense product that enables web publishers to understand the key themes on web pages in order to deliver highly relevant and targeted advertisements.

Even if it is just for ads, this buyout shows there is a market for AI agents that can effectively understand what is being discussed on a website – making RDF and OWL that much more important.

Posted by Tim at April 23, 2003 06:50 PM | TrackBack
Comments

well if we can't use you as a tool then what can we use you for? :P

Posted by: gnome-girl at April 23, 2003 07:59 PM

Hmm, well I suppose if I was a tool I'd be a Stud Detector because, yes, I'm a red-headed stud.

Posted by: Tim at April 23, 2003 08:05 PM