Building a vector space indexing engine
I've always wanted to code a search engine from scratch and discovered that it's pretty simple. Here is an example indexer I coded using Python. The first thing we need to do is have a way to take the documents we want to search on.
Thiago Saraiva

I've always wanted to code a search engine from scratch and discovered that it's pretty simple. Here is an example indexer I coded using Python. The first thing we need to do is have a way to take the documents we want to search on and turn them into a concordance. A concordance for those not in the know is a count of every word that occurs in a document.
The only other thing we need is a vector space. A vector space for those not in the know is a way of calculating the distances between two points. Essentially it works the same way as calculating the 3rd side of a triangle. Except that instead of 2 planes (x and y) or even 3 planes (x,y,z) you can have as many planes as you want. The actual idea takes a while to understand but you can read about it there, vector space search engine theory (pdf). Thankfully I already have implemented the algorithm in another project and I will just copy and paste it from here. I have modified it a little bit to avoid dividing by zero issues, and check types, and to add the above concordance method since it does belong together.
To use it just supply two concordances (one the document and the other the query) and it returns a number from 0 to 1 of how related they are. The higher the number the more relevant the search terms are to the document. So now all we need do, is take every document, build a concordance for it, then compare each one to our search terms, sort the results by number returned and we are set. The documents I decided to use are the titles and the first of some posts.
Now running it and trying some search.
Results are pretty decent I think! Now there are some problems with this technique. Firstly it doesn't support boolean searches which can be an issue, although most people tend to just type some terms. Secondly, it needs help with larger documents. The way the vector space works is based on smaller documents since they are closer to the search term space.
You can get around this by breaking larger documents up into smaller ones though. The final and biggest issue though is that it is pretty CPU intensive. I have tested a search like this with 50,000 documents and it was OK but you wouldn't want to go much further than that. It is a naive implementation though. With some caching and checking which documents are worth comparing you could take this up to millions of documents.
I remember reading somewhere (no source) that Altavista and some of the other early search engines used a technique similar to the above for calculating rankings, sop it seems the idea really can be taken to a large scale. By now I am sure someone is thinking, “Hang on, if it's that simple then why is it so hard to make the next Google?.”Well, the answer is that it's pretty easy to index, 10,000 to 100,000,000 pages it gets considerably more difficult to index 1,000,000,000+ pages. A few people have expressed difficulty getting the above and running. To do so just copy it all into a single file and run it.