How Stanford's computer science department changed the way we get information.
Peter and Maria Hoey
By Richard Brandt
The library of Alexandria in Egypt was one of the great intellectual institutions of the ancient world. For three centuries beginning around 300 B.C., the pharaoh Ptolemy and his heirs amassed virtually all the great Greek literature and philosophy, and tried to collect the whole world’s writings, from cookbooks to medical texts. As a result, the cultures of ancient civilizations changed as their scholars congregated in Alexandria to read, study and write, absorbing Greek influences.
But a library containing more than 500,000 papyrus scrolls is useless if you can’t find the ones you want. So Zenodotus, the first librarian of Alexandria, struck upon the most enduring classification system ever invented: he alphabetized the scrolls. Callimachus, one of his successors, invented the bibliography, organizing the collection into categories. The poet Philetas created the first comprehensive dictionary at the library, which Zenodotus improved by alphabetizing. Didymus wrote commentaries and glossaries of the holdings, and Dionysius Thrax created the first book on grammar.
When you’re faced with a body of knowledge many orders of magnitude larger than anything seen in history, you need to invent new ways to search, organize and study it. Fortunately, a rich intellectual environment enables creative people to rise to the challenge.
The Internet is today’s equivalent of the Alexandria library, with more than 500 billion web pages and growing. Making sense of this morass is more crucial than ever in a world that runs on information. The right—or wrong—intelligence affects decisions from running economies to going to war.
It has become obvious that search technology is the single most important application on the Internet. “The sheer size and comprehensiveness of the Internet, perhaps its greatest feature, would be useless if we didn’t have search to take advantage of it,” notes Esther Dyson, a longtime technology pundit and editor-at-large at CNET Networks.
For more than a decade, the biggest innovations in Internet search technology have come from one place—Stanford’s computer science department. Most of that work was done by graduate students under professors in the department’s database group; much of it was financed by the government-supported Digital Library Initiative—the project that gave birth to search king Google. Without those students, Internet search might be stuck in the pre-Hellenistic age.
The Digital Library Initiative was not intended to create technologies for Internet search; Stanford’s original grant proposal in 1994 made no mention of the Internet at all. The project started as an attempt by the Department of Defense to make it easier to find computer research papers electronically. Stanford and five other universities each received about $800,000 annually to collaborate. By 1998, the project’s budget and scope had grown, as the National Institutes of Health, the National Science Foundation and more universities got involved.
At Google’s core is the PageRank system, invented by Larry Page (and named after him) while he was working on his PhD at Stanford. PageRank, which judges a site’s importance by analyzing outside links to it, was the first true innovation in search technology since the bibliography. It takes advantage of the unique properties of the web: the network of links that makes its name so apt.
Garcia-Molina, Page’s adviser, recalls how it all started. Page came into his office one day in 1995 to show him a neat trick he had discovered. The AltaVista search engine not only collected keywords from sites, but also could show what other sites linked to them. AltaVista did not exploit this link information in the way Google would, but Page suggested it would be a good way to rank sites. He reasoned that those with the most links probably were the most popular and would prove most useful to searchers: they should be listed first in the search results. He began creating his own software for analyzing links between sites.
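To make the idea concrete, here is a minimal sketch of link-based ranking. It is an illustration under stated assumptions, not Google's implementation: the four-page toy web, the function names, the damping value of 0.85 and the fixed iteration count are all hypothetical choices made for this example. The first function simply counts incoming links, as in Page's original suggestion; the second lets each page pass its score along to the pages it links to, so that a link from an important page counts for more.

```python
def inlink_counts(links):
    """Count how many pages link to each page (the simple 'most links wins' idea)."""
    counts = {page: 0 for page in links}
    for source, targets in links.items():
        for target in targets:
            counts[target] = counts.get(target, 0) + 1
    return counts


def pagerank(links, damping=0.85, iterations=50):
    """Repeatedly pass each page's score to the pages it links to.

    The damping value and iteration count are assumptions for this sketch.
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical toy web: each key links to the pages in its list.
toy_web = {
    "a.example": ["c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example", "a.example"],
}
counts = inlink_counts(toy_web)
scores = pagerank(toy_web)
ranking = sorted(scores, key=scores.get, reverse=True)
print(counts)
print(ranking)
```

In this toy web, the page with the most incoming links also ends up with the highest score, and the scores always add up to 1, so they can be read as each page's share of total importance.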
Meanwhile Lent, the student who had all but failed Yahoo!’s business plan, had been working with Brin on a research project within the database group. In 1995, they decided to try a little associative data mining. This is the process of finding pieces of information that commonly occur together. Retailers use it to search through their sales records and determine whether different items are frequently bought at the same time by customers. (They then can place those products as far apart as possible in the store, hoping to lure customers into additional purchases.)
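The retail example can be sketched in a few lines. This is a simplified illustration rather than the method Brin and Lent used: the shopping baskets, the function name and the `min_support` threshold are all made up for the example. It counts how often each pair of items shows up in the same basket and keeps the pairs that occur at least a given number of times.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return item pairs that appear together in at least min_support baskets."""
    pair_counts = Counter()
    for basket in baskets:
        # Sort and de-duplicate so each pair is counted once per basket.
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


# Hypothetical sales records: one list of items per customer visit.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "cereal"],
    ["bread", "butter", "cereal"],
]
pairs = frequent_pairs(baskets, min_support=3)
print(pairs)
```

Only bread and butter are bought together often enough to pass the threshold here. The same counting works if "baskets" are web pages and "items" are the words or names that appear on them.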
Brin and Lent worked on ways to find specific word combinations that often occurred together on the Internet, such as authors and their book titles. This required searching through masses of web data, so Brin wrote a “crawler” program—software that visits websites, summarizes their content and stores the data in a central location accessible to graduate students and search companies.
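A crawler of the kind described above can be sketched as a loop that visits a page, stores a short summary, and queues up the links it finds. The version below is a minimal sketch and not Brin's program: it walks a made-up, in-memory set of pages instead of the real web, and the `fetch` callback, the page names and the 40-character summary are assumptions chosen to keep the example self-contained.

```python
from collections import deque


def crawl(fetch, seed, max_pages=100):
    """Visit pages breadth-first from seed and store a summary of each one.

    fetch(url) is a hypothetical callback returning (text, links), or None
    if the page cannot be retrieved.
    """
    store = {}  # the "central location" for collected data
    queue = deque([seed])
    while queue and len(store) < max_pages:
        url = queue.popleft()
        if url in store:
            continue  # already visited
        page = fetch(url)
        if page is None:
            continue  # broken link
        text, links = page
        store[url] = {"summary": text[:40], "links": list(links)}
        for link in links:
            if link not in store:
                queue.append(link)
    return store


# Hypothetical three-page site; "missing" is a dead link.
TOY_PAGES = {
    "home": ("Welcome to the toy site", ["books", "authors"]),
    "books": ("A list of book titles", ["authors", "missing"]),
    "authors": ("A list of authors", ["home"]),
}
store = crawl(TOY_PAGES.get, "home")
print(sorted(store))
```

Because the stored records keep each page's outgoing links, the same archive can later be used to work out which pages link back to a given site.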
He intended to call the crawler “Googol,” after the word coined by the 9-year-old nephew of mathematician Edward Kasner for the number 10^100, to reflect the enormous amount of data they were collecting. For two years, Lent recalls, they did not realize they were spelling the word incorrectly.
Later, Page combined his method of analyzing “back” links pointing to a given website with Brin’s web crawler, and their combined research moved under the Digital Library umbrella.
Lent, who had a tendency to wander back and forth between university research and corporate life, did not stick around to work with Page and Brin, a decision he confesses he regrets. But in early 1996, Lent explains, “We all said, ‘There will never be another Yahoo!’” Their research seemed purely an academic exercise. Lent was itching to get back into business, so he joined a start-up company.
But the Google search engine, first set up to troll through Stanford’s own web pages, was an immediate hit with students and faculty. Page and Brin became convinced of its commercial potential. With help from Stanford’s Office of Technology Licensing and a number of professors (see sidebar) they managed to get their company funded. To bring in revenue, they borrowed an idea from GoTo.com (later renamed Overture and acquired by Yahoo!), a sort of Yellow Pages search engine that went through ads, not websites. Google now simultaneously searches through websites and its own advertisers, listing the relevant ads next to the search results. This has become the most successful advertising approach on the Internet.
Is it always that easy to start a company out of Stanford? Of course not. But, says Ullman, “The value system we have at Stanford doesn’t sneer at commercial utility.”
Not everyone agrees with that assessment. Scott Hassan, who helped Page and Brin with some of the early programming for Google while in the master’s program, thought work that showed commercial potential was discouraged at the University. “I saw people at Stanford who waited until they left to do interesting things,” he says. But, he adds, “Stanford does make it easy to buy the patents.” Hassan, who co-founded eGroups, later sold to Yahoo!, says he just didn’t realize it while he was there. “Office of Technology Licensing policies are very pro-inventor. They will even help you file the patents. But all that isn’t very well publicized at Stanford.”
Page, MS ’98, and Brin, MS ’95, may have become yet another two PhD students to disappoint their mothers by dropping out of grad school to start a company. But the research they started continues at Stanford, officially encapsulated in a project known as WebBase. Using the techniques first developed by the Google founders, the core of WebBase is a huge archive of websites now stored at the San Diego Supercomputer Center. Researchers from Stanford and other universities around the world can download and work with information about millions of websites as they develop search and retrieval technology.
Stanford has continued to supply Google with brainpower and new ideas in search. For six years, nearly everybody who graduated under a faculty adviser in the database group either stayed in academia or went to work at Google. That record was only recently broken when one alum went to IBM’s Almaden Research Center. “We used to joke that if Google went under, all our grads would be unemployed,” says Professor Jennifer Widom.
As for Lent, he has not given up. He got a call from Microsoft in 2003, telling him the company wanted “to kill Google,” he recalls. He considered joining the team, but decided that if Microsoft could do it, so could he. Lent is now an “entrepreneur in residence” at Silicon Valley venture capital firm Mohr, Davidow Ventures, putting together a start-up team that will tailor search to individuals’ interests.
Lent describes his quest as “a bit psychotic—I mean, who goes after Google?” But he thinks Google left him an opening. “I felt Google was stagnating,” he says. “Their core premise is still link analysis. But the other half of the equation is user behavior.” Lent has an algorithm he calls “Dynamic PageRank,” which adds the dimension of time to web searches in order to better determine people’s interests. How long do people stay on web pages; what hour, day or week are they most active; what ads do they most often click on; and what products do they most often buy? By tracking their interests and behavior, Lent thinks he will be able to give web searchers better results.
Because he “passed on two companies” that spun out of Stanford and became huge successes, Lent notes, “I need to give it a try. Google and Yahoo!, be warned.” Unless, of course, one of the companies becomes impressed enough to buy his start-up.
Google has already bought a company that was developing technology to personalize web searching. That company was founded, you guessed it, by a few Stanford computer science graduate students.
Glen Jeh was in the PhD program in 2003, working within the database group, when he co-wrote (with Widom) a prizewinning conference paper called “Scaling Personalized Web Search.” His approach to personalizing searches lets people specify their interests in advance. The difficulty is that folding individual preferences into web searches is computationally expensive. Since there are millions of users, each with separate criteria, there are simply too many permutations to quickly find all the websites that simultaneously match search terms, have the highest PageRanks and correlate with each user’s list of interests.
Jeh, MS ’03, came up with the idea of “partial vectors,” common preferences shared by many people. Sites that match many of these preferences are given higher priority even before anyone does a search, narrowing the field. Then when an individual does a search, his or her other preferences are calculated in. That can still require a lot of expensive computing power, though, so two other PhD candidates, Taher H. Haveliwala, MS ’01, and Sepandar Kamvar, PhD ’04, improved the efficiency of calculating Jeh’s partial vectors, and the trio set up a company called Kaltix last year. Google snapped it up within months.
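The general idea of doing the heavy computation once and combining the results cheaply for each user can be sketched as follows. This is not Jeh's partial-vector algorithm or the Kaltix technology. It is a simplified illustration built on one assumption: a link-based score that favors a user's preferred pages is linear in those preferences, so scores computed ahead of time for single preferences can be blended by weight instead of being recomputed for every user. The three-page web, the function names, the damping value and the user's weights are all hypothetical.

```python
def personalized_pagerank(links, teleport, damping=0.85, iterations=50):
    """Link-based scores biased toward the pages weighted in teleport.

    Assumes every page in links has at least one outgoing link and that
    every link target is itself a key of links.
    """
    pages = sorted(links)
    rank = {p: teleport.get(p, 0.0) for p in pages}
    for _ in range(iterations):
        # The baseline score goes only to the user's preferred pages.
        new_rank = {p: (1.0 - damping) * teleport.get(p, 0.0) for p in pages}
        for page in pages:
            share = damping * rank[page] / len(links[page])
            for target in links[page]:
                new_rank[target] += share
        rank = new_rank
    return rank


def blend(vectors, weights):
    """Combine precomputed score vectors according to a user's weights."""
    pages = next(iter(vectors.values())).keys()
    return {p: sum(w * vectors[k][p] for k, w in weights.items()) for p in pages}


# Hypothetical toy web of three topic pages.
toy_web = {
    "sports": ["news"],
    "news": ["sports", "cooking"],
    "cooking": ["news"],
}

# Done once, ahead of time: one score vector per single preference.
basis = {page: personalized_pagerank(toy_web, {page: 1.0}) for page in toy_web}

# Done per user: a cheap weighted blend instead of a full recomputation.
user_weights = {"sports": 0.7, "cooking": 0.3}
fast = blend(basis, user_weights)
slow = personalized_pagerank(toy_web, user_weights)
print(fast)
print(slow)
```

The blended scores match the ones computed from scratch for this user, which is why precomputing shared pieces can cut the per-user cost so sharply.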
Some of Stanford’s computer science grads have stayed in academia, and continue to conduct research into the intricacies of web search. Junghoo Cho, MS ’97, PhD ’02, is an assistant professor at UCLA. He’s concerned about Google’s ability to alter the makeup of websites. Since a relatively small number of sites have the most links, and Google retrieves them first, those sites get visited more often and even more people link to them. Cho’s studies indicate that Google in effect drives more and more traffic to fewer and fewer sites.
Search technology research also continues at Stanford. Professor Andreas Paepcke, director of the Digital Library program, and several grad students are working on programs to search through digital photographs. Their technique combines data from the camera’s date/time stamps with information such as birthdays, holidays, vacations and major events—even data from Global Positioning System satellites—to help identify what photographs depict. This is the first step in searching through them.
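As a rough illustration of the first step, the sketch below labels photographs by checking each one's time stamp against a calendar of known events. It is not the Stanford group's software: the event calendar, the file names and the function name are invented for the example, and it ignores location data entirely.

```python
from datetime import date, datetime


def label_photos(photos, events):
    """Attach to each photo the events whose date range covers its time stamp."""
    labels = {}
    for name, taken_at in photos.items():
        day = taken_at.date()
        labels[name] = [
            event for event, (start, end) in events.items() if start <= day <= end
        ]
    return labels


# Hypothetical personal calendar: event name -> (first day, last day).
events = {
    "birthday party": (date(2004, 5, 1), date(2004, 5, 1)),
    "beach vacation": (date(2004, 7, 10), date(2004, 7, 17)),
}

# Hypothetical camera time stamps.
photos = {
    "img_001.jpg": datetime(2004, 5, 1, 15, 30),
    "img_002.jpg": datetime(2004, 7, 12, 9, 0),
    "img_003.jpg": datetime(2004, 9, 3, 18, 45),
}
labels = label_photos(photos, events)
print(labels)
```

Once photos carry labels like these, an ordinary keyword search for "vacation" can find pictures that contain no text at all.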
Chris Manning, a professor in Stanford’s artificial intelligence group, is trying to get computers to understand “natural language,” with all its semantic subtleties, as it is used (and misused) by humans. One of Silicon Valley’s Great Tech Hopes is a “semantic web” that will allow computers employed by search engines and other sites to respond to questions written in plain English, or other languages. This is something the search site Ask Jeeves claims to do, but even Ask Jeeves executives admit their first versions were mainly a gimmick, simply picking out keywords in the questions people typed. The company is trying to improve that technology.
Stanford’s significant role as originator of search technology may be winding down, though. For one thing, this academic year will be the last for Digital Library funding. And leading research is moving into corporations, now that Google has demonstrated how profitable it can be. “We’ve been discussing the question of whether there’s anything new to do in search,” says Garcia-Molina. “With all these big companies out there, what can we do?”
Professor David Cheriton, an early investor in Google, puts it more bluntly. “When you have something like Google occur, where you can hire a bunch of great researchers all motivated by stock options, it’s hard for pure research organizations like universities to compete.”
Did anyone say, “There will probably never be another Google”?
RICHARD BRANDT is a writer and editor in San Francisco whose book, In Search of Google (O'Reilly Media) will appear later this year.