
Natural Language Translation, Natural Language Processing (NLP)

Jorn Barger (updated slightly July 2000)

New: Wired special issue on NLP includes detailed, readable history

At first thought, you'd expect that a dictionary of word-substitutions ought to take us the best part of the way to comprehensible translated text. In fact, though, any brief test of this idea shows it to be massively undermined by two difficulties: 1) many idiomatic phrases don't work at all the same in literal translation (e.g., "I give you my word"), and 2) most words allow several entirely different meanings. AI researchers' first bouts with this problem ran from 1956 until 1966, when the ALPAC report killed (for the time being) all government funding for translation research.
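The 'brief test' is easy to run for yourself. Here is a minimal sketch of dictionary substitution in Python; the five-entry English-to-French word list and the function name are assumptions made up for this illustration, not any real translation system.

```python
# Toy English-to-French word list. The entries are illustrative assumptions,
# not taken from any real bilingual dictionary file.
WORD_LIST = {
    "i": "je",
    "give": "donne",
    "you": "vous",
    "my": "mon",
    "word": "mot",
}


def substitute(sentence, word_list=WORD_LIST):
    """Translate word by word; words missing from the list pass through."""
    words = sentence.lower().split()
    return " ".join(word_list.get(w, w) for w in words)


print(substitute("I give you my word"))  # prints: je donne vous mon mot
```

Every word has been 'translated', yet the result is not French: the idiomatic rendering would be something like "je vous donne ma parole", with a different word order and a different noun. Both difficulties show up in a single five-word sentence.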




A slightly more evolved approach focuses on 'parsing' the syntactic structure of each sentence (i.e., constructing a sentence-diagram) as a way of disambiguating shades of meaning, by eliminating (at least) those readings that imply an impossible part-of-speech. Try-it-yourself parser
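To make the elimination step concrete, here is a rough sketch in Python. The tiny lexicon, the tag names, and the three allowed sentence patterns are all assumptions invented for this example; a real parser would use a proper grammar rather than a list of whole-sentence patterns.

```python
from itertools import product

# Each word lists every part of speech it could be (hypothetical lexicon).
LEXICON = {
    "the": {"DET"},
    "can": {"NOUN", "VERB", "AUX"},
    "rusts": {"NOUN", "VERB"},
    "dogs": {"NOUN", "VERB"},
    "bark": {"NOUN", "VERB"},
}

# The only sentence shapes this toy grammar accepts (hypothetical).
PATTERNS = {
    ("DET", "NOUN", "VERB"),
    ("NOUN", "VERB"),
    ("NOUN", "AUX", "VERB"),
}


def readings(sentence):
    """Return every tag sequence for the sentence that fits a pattern.

    Raises KeyError for words that are not in the lexicon.
    """
    words = sentence.lower().split()
    options = [sorted(LEXICON[w]) for w in words]
    return [tags for tags in product(*options) if tags in PATTERNS]


print(readings("the can rusts"))   # [('DET', 'NOUN', 'VERB')]
print(readings("dogs can bark"))   # [('NOUN', 'AUX', 'VERB')]
```

"Can" starts out three ways ambiguous, and the sentence shape alone settles it: a noun in the first sentence, an auxiliary in the second. Note that this only removes the syntactically impossible readings; it does nothing for a word with two meanings in the same part of speech.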

This school of thought continues to try to add more and more complex algorithms for finding more and more subtle syntactic patterns... but less effort has been put into trying to collect a huge dictionary of the (idiomatic) patterns themselves, probably because the latter task de-emphasizes the element of 'programmer macho' (a factor that steers research directions much more than it ought!).

The poverty of current NLP can easily be seen by exploring the pathetic grammar-checkers offered, e.g., with Microsoft Word! (It's hard to evaluate this one without some preparation...) The minimalist parsers familiar from Infocom-style text-adventure games offer about as much grammatical sophistication as one should expect from algorithms alone. Several toolkits for adventure-game development are available on the Net, allowing one to experiment with parsers and their limitations. Front-ends for databases are another target-domain for parser research - reducing the amount of structure required in "structured query languages".
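For a feel of how little machinery a minimalist adventure-game parser needs, here is a sketch of a two-word (verb, noun) command parser in Python. The vocabulary and the function name are hypothetical; this is not the parser of Infocom or of any particular toolkit.

```python
# Hypothetical vocabulary: synonyms map onto one canonical verb.
VERBS = {"take": "take", "get": "take", "drop": "drop",
         "open": "open", "look": "look"}
NOUNS = {"lamp", "door", "key"}
NOISE = {"the", "a", "an", "at"}  # words silently thrown away


def parse_command(text):
    """Return (verb, noun), (verb, None), or None if no verb is recognized."""
    words = [w for w in text.lower().split() if w not in NOISE]
    if not words or words[0] not in VERBS:
        return None
    verb = VERBS[words[0]]
    noun = words[1] if len(words) > 1 and words[1] in NOUNS else None
    return (verb, noun)


print(parse_command("Get the lamp"))      # ('take', 'lamp')
print(parse_command("look"))              # ('look', None)
print(parse_command("please open door"))  # None
```

The last case shows the limitation: anything outside the expected word order is simply not understood, however reasonable it sounds to a human player.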

Emacs, with its facilities for 'grepping' complex patterns expressed as 'regular expressions', is another useful tool for NLP experiments. Griswold's text languages, SNOBOL and Icon, are similarly useful. Another direction is offered by SGML, TEI, and HTML, three related projects exploiting additional layers of 'markup' within text documents. Automated document analysis can be given an easy boost if the creators of the documents add some signposts to the content via SGML markup.
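The same kind of experiment can be sketched outside Emacs. The example below uses Python's standard `re` module (an assumption of convenience, not a tool named above) to hunt for one idiomatic pattern, "give someone one's word"; the pattern itself is a made-up illustration and covers only a few verb forms and possessives.

```python
import re

# Matches e.g. "give you my word", "gave him her word" and captures the
# person the word is given to. Deliberately narrow: a sketch, not a grammar.
GIVE_WORD = re.compile(
    r"\b(?:give|gave|gives)\s+(\w+)\s+(?:my|his|her|their)\s+word\b",
    re.IGNORECASE,
)

text = "I give you my word. She gave him her word, too."
print(GIVE_WORD.findall(text))  # ['you', 'him']
```

A collection of such patterns, one per idiom, is a crude but workable way to start gathering evidence about how a phrase is actually used in a body of text.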

The Text Encoding Initiative (TEI) has been working out detailed conventions for marking up various classes of literary text, using the Standard Generalized Markup Language (SGML). SGML markup looks like <emphasis>this</emphasis>. The HyperText Markup Language (HTML) is a simple application of SGML to support hypertext linkages within and between documents, and has gained great success via the World Wide Web project.
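Here is a small sketch of why such signposts give document analysis an easy boost: once a span is tagged, a program can pull it out without understanding anything. The Python below uses a plain regular expression and assumes simple, non-nested tags without attributes; real SGML needs a proper DTD-aware parser, which this is not.

```python
import re

# <name>content</name>, where the closing tag must repeat the opening name.
TAG = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def tagged_spans(text):
    """Return (tag name, enclosed text) pairs for simple non-nested tags."""
    return [(m.group(1), m.group(2)) for m in TAG.finditer(text)]


sample = "SGML markup looks like <emphasis>this</emphasis>."
print(tagged_spans(sample))  # [('emphasis', 'this')]
```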

The first round of speech-understanding research was funded by DARPA until 1976, when it became clear that no quick solution was emerging. The state of the art is still limited-vocabulary recognition of speech from a single user, and it probably can't do better until improved language-understanding allows the software to predict which words are likeliest. An accurate mechanical model of speech production would help, too. Handwriting recognition has done a little better, since the time-path of the stylus can now be tracked, but again the big improvements depend on word-prediction.
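As a rough illustration of word-prediction, here is a sketch that simply counts which word follows which in a sample text (a bigram count). This is a swapped-in statistical shortcut, not the language-understanding described above, and the function names and the sample sentence are assumptions for the example.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for each word, how often each other word follows it."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        following[prev][nxt] += 1
    return following


def likeliest_next(following, word, n=3):
    """Return up to n followers of `word`, most frequent first."""
    counts = following.get(word.lower(), Counter())
    return [w for w, _ in counts.most_common(n)]


model = train_bigrams("the cat sat on the mat and the cat slept")
print(likeliest_next(model, "the", 1))  # ['cat']
```

A recognizer that is unsure between two similar-sounding or similar-looking words could use counts like these to prefer the one that more often follows the previous word.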

One important subfield of NLP ought to be focusing on the meanings (especially the emotions) carried by rhythms and tones in ordinary speech. I don't know how far this has gotten - "Sentics" by Manfred Clynes was an interesting half-baked first attempt.

Natural language generation is largely the domain of ELIZA (1966, Weizenbaum) and RACTER (1984, Etter and Chamberlain) and their successors, the chatterbots. The annual Loebner competition for such programs, the closest thing to a real "Turing test", was won in 1992 and 1993 by Joe Weintraub's PC Politician. The remarkable thing about these efforts is their occasionally uncanny successes at mimicking intelligence, despite absurdly primitive knowledgebases. Loebner's home page
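How primitive can the knowledgebase be? The sketch below is an ELIZA-style responder in Python with three pattern rules and a four-word pronoun table. The rules are invented for this example and are not Weizenbaum's original script.

```python
import re

# Swap first-person words for second-person ones in the echoed fragment.
REFLECT = {"i": "you", "my": "your", "me": "you", "am": "are"}

# (pattern, reply template) pairs, tried in order. All hypothetical.
RULES = [
    (re.compile(r"\bi feel (.*)", re.IGNORECASE),
     "Why do you feel {0}?"),
    (re.compile(r"\bi am (.*)", re.IGNORECASE),
     "How long have you been {0}?"),
    (re.compile(r"\bmy (mother|father)\b", re.IGNORECASE),
     "Tell me more about your {0}."),
]


def reflect(fragment):
    """Lower-case the fragment and flip its pronouns."""
    return " ".join(REFLECT.get(w, w) for w in fragment.lower().split())


def respond(line):
    """Answer with the first matching rule, or a stock phrase."""
    cleaned = line.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(cleaned)
        if match:
            return template.format(reflect(match.group(1)))
    return "Please go on."


print(respond("I feel sad about my job."))
# Why do you feel sad about your job?
```

The program knows nothing about jobs or sadness; it echoes the user's own words back inside a canned frame, which is much of the trick behind the occasionally uncanny effect.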

A decent chatterbot that doesn't require Java: http://orlo.emi.net/html/brainframe.htm

The Economist on chatterbots.

BotSpot's survey of chatterbots

Thomas Whalen's 1994 winner can be explored via telnet debra.dgbt.doc.ca 3000 (choose the Sex Expert) while his other work can be explored on the WWWeb.

Kenneth Colby, author of Parry (the first NLP schizophrenic) has a WWWeb site promoting an NLP program called 'Overcoming Depression'.


Resources

comp.ai.nat-lang newsgroup

The standard text on NLP is James F. Allen's "Natural Language Understanding", Addison-Wesley 1988 (A new edition is imminent.)

You can FTP some free Macintosh parsing tools at hjelmslev.ling.gu.se/pub/li

WordNet, a richly interconnected hyper-thesaurus experiment, is available on the WWWeb at: http://www.cogsci.princeton.edu/~wn/

rec.arts.int-fiction is a newsgroup for adventure game programmers.

The text-adventure archives are at: ftp.gmd.de

TADS is the most popular platform.

Interactive Fiction home page

For emacs, see above.

Icon for the Mac is available at: ftp://cs.arizona.edu/icon/ in
library/bipl.hqx (the Icon program library - sample procedures and programs)
library/info.hqx
packages/macintosh/met.hqx (the executables of Icon)

Newsgroup: comp.text.sgml

An attempt to add AI to the WWWeb via Common LISP.

Corpora links page

comp.speech

doctor.el is an implementation of Eliza that comes standard with GNU-Emacs. Invoke it with M-x doctor.

M-x psychoanalyze-pinhead is also amusing, pitting Eliza against Zippy the Pinhead.

AI_ATTIC is an anonymous ftp collection of classic AI programs and other information maintained by the University of Texas at Austin. It includes Parry (hi, Kibo!), Adventure, Shrdlu, Doctor, Eliza, Animals, Trek, Zork, Babbler, Jive, and some AI-related programming languages. This archive is available by anonymous ftp from U of Texas

For a home page on computer-generated writing

For a FAQ on RACTER.

