Mon, 6 August 2007
Intro: You may think Google and Yahoo have a lock on search, but it may be time to start thinking a little differently. In this podcast we take a look at some niche search sites.
Mike: Gordon, we love Google products and services - is there a problem?
Maybe Google does too good a job! Have you ever tried Google
searching on a person's name? A simple Google search on my first and
last name gives over 1.9 million results!
Today, three companies control almost 90% of online search:
- 50% of all searches are done using Google
- 25% on Yahoo
- and over 13% using Microsoft
There are some problems though – these search engines primarily rank results based on the number of sites linking to a page and the prominence of the search terms on that page. Because they work this way, there is room for niche search sites.
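To make that ranking idea concrete, here is a toy sketch in Python. It is not how Google or Yahoo actually score pages; the `score` function, the weighting of 10, the example URLs and the link counts are all made up for illustration. It simply combines an inbound link count with how prominent the search terms are on the page.

```python
def score(page, query_terms, inbound_links):
    """Toy relevance score: inbound link count plus term prominence.

    The factor of 10 is an arbitrary, hypothetical weighting.
    """
    words = page["text"].lower().split()
    term_hits = sum(words.count(term.lower()) for term in query_terms)
    prominence = term_hits / len(words) if words else 0.0
    return inbound_links.get(page["url"], 0) + 10 * prominence


# Hypothetical pages and link counts, invented for this example.
pages = [
    {"url": "b.example", "text": "Michael Jackson football statistics"},
    {"url": "a.example", "text": "Michael Jackson the king of pop"},
]
inbound_links = {"a.example": 500, "b.example": 12}

ranked = sorted(
    pages,
    key=lambda p: score(p, ["michael", "jackson"], inbound_links),
    reverse=True,
)
top_url = ranked[0]["url"]
```

With numbers like these, the heavily linked page wins even though both pages mention the name, which is why a search on a common name tends to surface the most famous person first.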
Mike: With this kind of lock on search, it would be almost impossible for a
startup to launch a successful general search product - right?
Yes - it would be almost impossible, but we are seeing some activity in the niche areas. Areas like travel and finance are niches that have already been filled, but today there seems to be some room in the people search area.
Mike: Are there companies in this market we should be looking at?
One of the startups to watch is Spock at www.spock.com.
Spock is scheduled for their public launch the first week of August.
Among other places on the web, Spock scans social networking websites
like Facebook and LinkedIn. Search results give summary information
(age, address, etc) about the person along with a list of website links
that refer to the person.
According to Spock 30% of the 7 billion searches done on the web every month are related to individuals. Spock says about half of those searches concern celebrities with the other half including business and personal lookups. According to Spock, a common problem that we face is that there are many people with the same name. Given that, how do we distinguish a document about Michael Jackson the singer from Michael Jackson the football player?
With billions of documents and people on the web, we need to identify and cluster web documents accurately to the people they are related to. Mapping these named entities from documents to the correct person is what Spock is all about and they’re coming at the problem in an interesting way.
Mike: I've looked at Spock - what is the Spock Challenge?
They’ve launched what they call the Spock Challenge – more formally referred to as the SPOCK Entity Resolution Problem linked here: http://challenge.spock.com/pages/learn_more
If you go to the site you can download a couple of data sets – one called a training set (approx 25,000 documents) and the other called a test set (approx 75,000 documents).
Along with the document sets they include a set of target names. You assume that each document contains only one of the target names (even though most documents contain many names). The challenge is to partition all the documents relevant to a target name by their referent.
Mike: When does the contest begin and end?
It began on 4/16/07 and will end on 11/16/07. On 11/16/07, Spock will run the final round of the competition and announce the winner.
Here are the dates off the website:
4/16 Registration started
5/1 - 8/15 Proposal submissions accepted
7/1 Leader board live
11/1 Finalists announced
11/16 Final round at Spock, winner announced
Mike: What languages and tools can be used?
You can use any language and any non-commercial libraries, tools and data to develop the solution. There is one catch - the winner grants Spock a non-exclusive right to use the software and data. As an FYI, much of Google is actually written in Python, with the search engine core written in C++. Python provides scripting support for the search engine, and some apps like Google Code are done in Python.
Mike: Can you give us an example of how this works?
From their website: Consider the following two documents with the target name "Michael Jackson":
Michael Jackson - The King of Pop or Wacko Jacko?
Michael Jackson statistics - pro-football-reference.com
The referents of these articles are the pop star and football player, respectively. They’ve also included the ground truth for the training set so you have something to compare against.
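Here is a minimal sketch of what "partitioning documents by referent" could look like in Python. This is not Spock's method or a contest-grade solution. It is a naive greedy clustering on shared context words, and the function names, the 0.1 threshold, and documents d3 and d4 are assumptions added for illustration (d1 and d2 are the two titles quoted above).

```python
def context_words(text, target_name):
    """Lowercased words in the text, minus the target name itself."""
    name_parts = set(target_name.lower().split())
    words = {w.strip(".,-?!").lower() for w in text.split()}
    return {w for w in words if w and w not in name_parts}


def jaccard(a, b):
    """Share of words two sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_referent(docs, target_name, threshold=0.1):
    """Greedily group documents whose context words overlap enough."""
    clusters = []  # each cluster: {"words": set, "doc_ids": list}
    for doc_id, text in docs.items():
        words = context_words(text, target_name)
        best = max(clusters, key=lambda c: jaccard(words, c["words"]), default=None)
        if best is not None and jaccard(words, best["words"]) >= threshold:
            best["doc_ids"].append(doc_id)
            best["words"] |= words
        else:
            clusters.append({"words": set(words), "doc_ids": [doc_id]})
    return [c["doc_ids"] for c in clusters]


docs = {
    "d1": "Michael Jackson - The King of Pop or Wacko Jacko?",
    "d2": "Michael Jackson statistics - pro-football-reference.com",
    "d3": "Michael Jackson pop album reviews",           # hypothetical
    "d4": "Michael Jackson career football statistics",  # hypothetical
}
groups = cluster_by_referent(docs, "Michael Jackson")
```

On this tiny example the pop star documents end up in one group and the football documents in another. A real entry would need far better features than raw word overlap, since common words like "the" can easily merge unrelated people.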
Once you're done training, you can run your algorithm on the test set and submit your results on this site. Spock will provide instant feedback in the form of a percentage rank score. This way you can see how you stack up against the other teams.
So they provide you with a lot of well-constructed data, and the ground truth about that data. “Ground truth” data is the known-correct answers, and you use it to validate the results of your algorithm.
This data is documents about people, and the challenge is to determine all the unique people described in the data set. This data can be your training set. Once you have got your basic algorithm working against the training set, they let you further tune your code by running it against a second test data set and give you instant accuracy feedback in the form of a score. The score depends on how many correct unique people you can identify in the data. This way you can continue to refine your work, and see how you are doing, and how well others are doing.
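The show notes do not say exactly how Spock computes its score, so the following is only a sketch of one common way to compare a predicted grouping against ground truth: a pairwise F1 score over "these two documents are about the same person" decisions. The function names and the sample groupings are hypothetical.

```python
from itertools import combinations


def same_person_pairs(clusters):
    """All document pairs that a grouping places in the same cluster."""
    pairs = set()
    for cluster in clusters:
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_f1(predicted, truth):
    """F1 over same-person pairs; 0.0 if either side has no pairs."""
    pred_pairs = same_person_pairs(predicted)
    true_pairs = same_person_pairs(truth)
    correct = len(pred_pairs & true_pairs)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_pairs)
    recall = correct / len(true_pairs)
    return 2 * precision * recall / (precision + recall)


# Hypothetical ground truth and a prediction that wrongly merges d4.
truth = [["d1", "d3"], ["d2", "d4"]]
predicted = [["d1", "d3", "d4"], ["d2"]]

f1_imperfect = pairwise_f1(predicted, truth)
f1_perfect = pairwise_f1(truth, truth)
```

A perfect grouping scores 1.0, and each wrong merge or split pulls the score down, which gives you a single number to track while you tune against the training set.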
This looks like a great academic challenge. At the end of the contest time, you submit your code, a 3 page description of your approach, pre-built binary executables that can run in isolation on Spock servers, and your results (the “Software Entry”). Spock will select the finalists based upon submissions, and fly the finalists to visit the judges. The winner will win $50,000, 2nd place wins $5000 and 3rd place wins $2000.
Mike: How do people enter?
You may enter the Contest by registering online at www.spock.com/contestregistration . You may register as an individual or as a team. During the registration process, you must provide your name, your age, your email address, and the country you are from. If you are entering on behalf of an organization, a school or a company, you must identify its name. If you are registering as a team, you must provide the same information for each member of your team as well as the identity of a team leader. You will also provide a name for your team or for yourself by which you or your team will be known to other participants in the Contest. Spock may change the name if it feels the name you select is not appropriate for any reason.
Mike: What are the differences between the Spock Challenge and the Netflix Challenge?
From the Netflix website: The Netflix Prize (http://www.netflixprize.com ) seeks to substantially improve the accuracy of predictions about how much someone is going to love a movie based on their movie preferences. Improve it enough and you win one (or more) Prizes.
Winning the Netflix Prize improves Netflix's ability to connect people to the movies they love.
Netflix provides you with a lot of anonymous rating data, and a prediction accuracy bar that is 10% better than what Cinematch can do on the same training data set. (Accuracy is a measurement of how closely predicted ratings of movies match subsequent actual ratings.) If you develop a system that Netflix judges beats that bar on the qualifying test set they provide, you get serious money and the bragging rights. But (and you knew there would be a catch, right?) only if you share your method with Netflix and describe to the world how you did it and why it works.
In addition to the Grand Prize, we’re also offering a $50,000 Progress Prize each year the contest runs. It goes to the team whose system we judge shows the most improvement over the previous year’s best accuracy bar on the same qualifying test set. No improvement, no prize. And like the Grand Prize, to win you’ll need to share your method with us and describe it for the world.
The Netflix contest started October 2, 2006 and continues through at least October 2, 2011.
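The accuracy bar is easier to picture with numbers. The Netflix Prize measured accuracy with root mean squared error (RMSE), where lower is better. The sketch below uses made-up ratings and a made-up baseline standing in for Cinematch; only the 10% improvement requirement comes from the description above.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def beats_bar(candidate_rmse, baseline_rmse, required_improvement=0.10):
    """True if the candidate's error is at least 10% below the baseline's."""
    return candidate_rmse <= baseline_rmse * (1 - required_improvement)


# Made-up ratings on a 1-5 scale, for illustration only.
actual_ratings = [4, 3, 5, 2]
baseline_predictions = [3, 3, 4, 3]   # stand-in for the existing system
candidate_predictions = [4, 3, 4, 2]  # stand-in for a contest entry

baseline_error = rmse(baseline_predictions, actual_ratings)
candidate_error = rmse(candidate_predictions, actual_ratings)
wins = beats_bar(candidate_error, baseline_error)
```

Here the baseline error is about 0.87 and the candidate error is 0.5, so this toy entry clears the bar. On the real data set, squeezing out a 10% improvement is far harder than these four ratings suggest.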
So... back to your question - the Netflix Challenge will run another 4 years; the Spock Challenge has every intention of giving out the grand prize to a team with a reasonable solution at the end of the 6 months.
The Netflix Challenge sets an absolute standard for winning the grand prize; the Spock Challenge intends to award it to the best reasonable solution.
Mike: How about some other companies?
Wink – www.wink.com Similar to Spock – launched a few months ago. Claim that Wink People Search now searches over two hundred million people profiles. Searches people across numerous social networks including MySpace, LinkedIn, Friendster, Bebo, Live Spaces, Yahoo!360, Xanga, Twitter and more. Also included in the results are Web sources such as Wikipedia and IMDB with more coming all the time.
Zoominfo – www.zoominfo.com Specializes in executive searches. Claim 37,131,140 People and 3,518,329 Companies indexed. You can currently search on three categories – people, jobs and companies.
Searchwikia - http://search.wikia.com Jimmy Wales and his open-source search protocol and human collaboration project. From Press release:
"Last week Wikia acquired Grub, the original visionary distributed search project, from LookSmart and released it under an open source license for the first time in four years. Grub operates under a model of users donating their personal computing resources towards a common goal, and is available today for download and testing at: http://www.grub.org/ .
Grub, now open source, is designed with modularity so that developers can quickly and easily extend and add functionality, improving the quality and performance of the entire system. By combining Grub, which is building a massive, distributed user-contributed processing network, with the power of a wiki to form social consensus, the open source Search Wikia project has taken the next major step towards a future where search is open and transparent".
Direct download: niche_search_FINAL.mp3
Category:podcasts -- posted at: 6:36pm EDT