http://www.newscientist.com/article.ns?id=dn6924
Google's search for meaning
11:03 28 January 2005
Exclusive from New Scientist Print Edition
Duncan Graham-Rowe
Web Links
Dutch National Institute for Mathematics and Computer Science
Cyc Project
Automatic Meaning Discovery Using Google, arXiv.org
Computers can learn the meaning of words simply by plugging into Google. The finding could bring forward the day that true artificial intelligence is developed.
Trying to get a computer to work out what words mean - distinguish between "rider" and "horse", say, and work out how they relate to each other - is a long-standing problem in artificial intelligence research.
One of the difficulties has been working out how to represent knowledge in ways that allow computers to use it. But suddenly that is not a problem any more, thanks to the massive body of text that is available, ready indexed, on search engines like Google (which has more than 8 billion pages indexed).
The meaning of a word can usually be gleaned from the words used around it. Take the word "rider". Its meaning can be deduced from the fact that it is often found close to words like "horse" and "saddle". Rival attempts to deduce meaning by relating hundreds of thousands of words to each other require the creation of vast, elaborate databases that are taking an enormous amount of work to construct.
The "Google distance"
But Paul Vitanyi and Rudi Cilibrasi of the National Institute for Mathematics and Computer Science in Amsterdam, the Netherlands, realised that a Google search can be used to measure how closely two words relate to each other. For instance, imagine a computer needs to understand what a hat is.
To do this, it needs to build a word tree - a database of how words relate to each other. It might start with any two words and see how they relate. For example, if it googles "hat" and "head" together it gets nearly 9 million hits, compared to, say, fewer than half a million hits for "hat" and "banana". Clearly "hat" and "head" are more closely related than "hat" and "banana".
To gauge just how closely, Vitanyi and Cilibrasi have developed a statistical indicator based on these hit counts that gives a measure of a logical distance separating a pair of words. They call this the normalised Google distance, or NGD. The lower the NGD, the more closely the words are related.
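The calculation can be sketched in a few lines of Python. This is a minimal illustration, assuming the formula given in the researchers' preprint: NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))), where f(x) is the number of hits for a term, f(x, y) the number of hits for both terms together and N the number of pages indexed. The joint counts and the 8 billion page total are the figures quoted above. The single-word counts are hypothetical placeholders chosen only to make the example run, and the function name `ngd` is likewise an assumption.

```python
import math


def ngd(fx, fy, fxy, n):
    """Normalised Google distance computed from raw hit counts.

    fx, fy -- hits for each term on its own
    fxy    -- hits for pages containing both terms
    n      -- total number of pages indexed
    """
    # A term or pair that is never seen is treated as infinitely distant.
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return math.inf
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


N = 8_000_000_000  # "more than 8 billion pages", as quoted in the article

# HYPOTHETICAL single-word hit counts, not taken from the article.
HAT, HEAD, BANANA = 50_000_000, 200_000_000, 20_000_000

# Joint counts approximate the article's figures:
# "nearly 9 million" and "fewer than half a million".
hat_head = ngd(HAT, HEAD, 9_000_000, N)
hat_banana = ngd(HAT, BANANA, 500_000, N)

print(f"NGD(hat, head)   = {hat_head:.3f}")
print(f"NGD(hat, banana) = {hat_banana:.3f}")
```

With these placeholder counts "hat" comes out closer to "head" than to "banana", in line with the raw hit counts. Real single-word counts would change the numbers, and could in principle change the ordering.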
Automatic meaning extraction
By repeating this process for lots of pairs of words, it is possible to build a map of their distances, indicating how closely related the meanings of the words are. From this a computer can infer meaning, says Vitanyi.
"This is automatic meaning extraction. It could well be the way to make a computer understand things and act semi-intelligently," he says.
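A map of this kind can be sketched as a table of pairwise distances. The Python below is an illustration only, not the researchers' own code: it assumes the NGD formula from their preprint, and every hit count in it is a hypothetical placeholder apart from the 8 billion page total and the approximate "hat" pair counts quoted in the article. The helper names `distance_map` and `nearest` are invented for the example.

```python
import math
from itertools import combinations


def ngd(fx, fy, fxy, n):
    """Normalised Google distance from hit counts (formula as in the preprint)."""
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return math.inf
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


N = 8_000_000_000  # pages indexed, as quoted in the article

# HYPOTHETICAL single-word counts; a real system would query a search engine.
single = {"hat": 50_000_000, "head": 200_000_000, "banana": 20_000_000}

# Joint counts: the two "hat" pairs approximate the article's figures,
# the head/banana count is a HYPOTHETICAL placeholder.
pair = {
    frozenset(("hat", "head")): 9_000_000,
    frozenset(("hat", "banana")): 500_000,
    frozenset(("head", "banana")): 1_000_000,
}


def distance_map(single, pair, n):
    """Return {frozenset({a, b}): NGD(a, b)} for every pair of words."""
    return {
        frozenset((a, b)): ngd(single[a], single[b], pair.get(frozenset((a, b)), 0), n)
        for a, b in combinations(sorted(single), 2)
    }


def nearest(word, dmap):
    """Return the word with the smallest distance to `word`."""
    others = {next(iter(k - {word})): d for k, d in dmap.items() if word in k}
    return min(others, key=others.get)


dmap = distance_map(single, pair, N)
for word in single:
    print(word, "->", nearest(word, dmap))
```

Keying the map on unordered pairs stores each distance once, since NGD is symmetric in its two terms.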
The technique has managed to distinguish between colours, numbers, different religions and Dutch painters based on the number of hits they return, the researchers report in an online preprint.
The pair's results do not surprise Michael Witbrock of the Cyc project in Austin, Texas, a 20-year effort to create an encyclopaedic knowledge base for use by a future artificial intelligence. Cyc represents a vast quantity of fundamental human knowledge, including word meanings, facts and rules of thumb.
Witbrock believes the web will ultimately make it possible for computers to acquire a very detailed knowledge base. Indeed, Cyc has already started to draw upon the web for its knowledge. "The web might make all the difference in whether we make an artificial intelligence or not," he says.