Hidden Websites
A research-focused search engine founded by Human Genome Project scientists is claiming to go where even Google doesn't tread: the deep web.
DeepDyve is designed to search the 99 percent of the web (the company's figure, citing a UC Berkeley study) that other search engines do not pick up. Those engines return pages based largely on interpretations of popularity and work only if a page is findable. Content that is hidden behind paywalls, or that is not linked from enough sites to gain page rank, remains obscure, yet it often contains the source material required for serious research.
It's the classic "needle in a haystack" problem: you know it is there, you know you can get to it, but ... how? DeepDyve attempts to bridge this gap with techniques used in genomics to identify DNA strands, such as pattern and symbol matching.
The company's technology uses an algorithm called “KeyPhrases,” which indexes passages up to 20 words in length -- not just single keywords. Since the technology was conceived to identify long, complex strings of DNA while sequencing the human genome, it needed no semantics, just character recognition.
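To make the idea of indexing passages rather than single keywords concrete, here is a minimal sketch in Python. It is not DeepDyve's KeyPhrases algorithm, whose details the company has not described here; it simply assumes that a "passage" is any run of consecutive words up to the 20-word limit mentioned above. The function names, the sample documents and the whitespace tokenization are all hypothetical.

```python
from collections import defaultdict

MAX_PHRASE_WORDS = 20  # upper bound on passage length cited in the article


def phrases(text, max_words=MAX_PHRASE_WORDS):
    """Yield every run of 1..max_words consecutive words in text."""
    words = text.lower().split()  # assumption: naive whitespace tokenization
    for start in range(len(words)):
        for length in range(1, max_words + 1):
            if start + length > len(words):
                break
            yield " ".join(words[start:start + length])


def build_index(docs, max_words=MAX_PHRASE_WORDS):
    """Map each phrase to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for phrase in phrases(text, max_words):
            index[phrase].add(doc_id)
    return index


# Hypothetical two-document collection, for illustration only.
docs = {
    "a": "gene expression in the human eye",
    "b": "expression of pigment genes in the iris",
}
index = build_index(docs)
```

In this toy index, a lookup of the two-word phrase "in the" finds both documents, while the full six-word passage from document "a" finds only that document. Nothing in the lookup depends on what the words mean, which is the sense in which such matching can be language agnostic.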
“It’s really doing pattern matching; it’s not at all language dependent,” CEO William Park told wired.com. “In fact it’s actually language agnostic.”
DeepDyve’s most interesting feature, and the one that further distinguishes it from the likes of Google Scholar, is the ability to base a search on a large chunk of text, or even a whole article, of up to 25,000 characters. Google limits queries to 32 words.
“If you were trying to look for the sequence for blue eyes, it could be massive in length,” said Park. “The query so to speak has to be very large.”
The engine scans whole strings of text to find familiar segments, ranks and orders them, and finally locates the most relevant article in which they are found.
“It’s purely statistical -- just like genomics,” said Park.
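The scan, rank and locate steps described above can be sketched with a simple stand-in technique: split the query and each article into overlapping three-word segments ("shingles"), count how many segments they share, and sort by that count. This is an illustrative assumption, not the company's actual scoring method. The names `shingles` and `rank`, the segment size of three words and the sample texts are hypothetical; the 25,000-character cap is the query limit quoted in the article.

```python
from collections import Counter

MAX_QUERY_CHARS = 25_000  # query size limit quoted in the article


def shingles(text, k=3):
    """Return the set of k-word sequences found in text."""
    words = text.lower().split()  # assumption: naive whitespace tokenization
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def rank(query, docs, k=3):
    """Order document ids by how many k-word segments they share with the query."""
    query_shingles = shingles(query[:MAX_QUERY_CHARS], k)
    scores = Counter()
    for doc_id, text in docs.items():
        shared = len(query_shingles & shingles(text, k))
        if shared:  # leave out documents with no segment in common
            scores[doc_id] = shared
    return scores.most_common()  # highest overlap first


# Hypothetical articles and query, for illustration only.
docs = {
    "eye": "variation in the oca2 gene is linked to blue eye color in humans",
    "hair": "variation in the mc1r gene is linked to red hair color",
    "soil": "clay soils retain water longer than sandy soils",
}
query = "a study of the oca2 gene is linked to blue eye color"
results = rank(query, docs)
```

Here the "eye" article ranks first because it shares the most segments with the query, the "hair" article follows with a smaller overlap, and the "soil" article is dropped. The score is a pure count of matching strings, with no attempt to interpret them.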
A subscription-based service debuted at the DEMO conference a few months ago, but on Tuesday the company launched a free ad-supported version. It is also actively seeking new publishers willing to open up their content to the public through its search.
“We’re going to publishers and we’re saying let us be your iTunes partner. Let’s build a platform together where we can re-market your content in a very IP/copyright friendly way and we’re going to make your information much more findable,” Park said.
DeepDyve currently indexes about 500 million pages and partners with a number of publications for free access to their content. This quarter the company, which so far focuses on health, life sciences and patents, plans to expand into the physical sciences, including information technology, clean technology and energy.






