Note: This paper was written for B651 "Natural Language Processing", a class taught by Mike Gasser at Indiana University, Fall Semester, 1999. The entire text and the program are available here zipped together. Below are the introduction and the bibliography. If you prefer another format or have any questions or comments, feel free to e-mail me.

WordCat

by Sean McLennan

December 14, 1999




1. Introduction

WordCat is a word recognition utility whose architecture was inspired by CopyCat, a system for analogy-making (Mitchell, 1993). One of CopyCat’s most distinctive features is that the text it is presented with is a completely unanalyzed chunk. There are no control or delimiting characters: nothing has an explicit effect on processing other than those letters that are directly perceivable and manipulable by the system. I feel that this approach could be very valuable in a radically different domain, namely text parsing.

Since a complete text parser of this type would be a dramatic undertaking, a sub-task is required. I have chosen word recognition, which I consider to be the lowest level of text parsing of this type. By saying that WordCat “recognizes words” I mean that it is presented with “noisy” text (words with typos or spelling mistakes) and supplies the closest match from its lexicon, hopefully the intended word. WordCat was implemented in C++ and has two broad functions: creating the network that is central to its operation, and using that network for word recognition. At this point, WordCat only recognizes words in isolation, not in context.
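To make the task concrete, the sketch below shows what "supplying the closest match from a lexicon" can look like in its simplest form. It is not WordCat's method: WordCat relies on a CopyCat-inspired network, whereas this baseline swaps in plain Levenshtein edit distance. The names edit_distance and closest_match, and the rule that the first lexicon entry wins ties, are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Levenshtein distance: the minimum number of single-character insertions,
// deletions, and substitutions needed to turn `a` into `b`.
// NOTE: a stand-in baseline, not the network-based approach used by WordCat.
std::size_t edit_distance(const std::string& a, const std::string& b) {
    // prev[j] is the distance between the first i-1 characters of `a`
    // and the first j characters of `b`; curr is the row being filled in.
    std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            std::size_t substitution = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            curr[j] = std::min({prev[j] + 1, curr[j - 1] + 1, substitution});
        }
        std::swap(prev, curr);
    }
    return prev[b.size()];
}

// Hypothetical helper: returns the lexicon entry with the smallest edit
// distance to `input`. The first entry wins ties; an empty lexicon yields "".
std::string closest_match(const std::string& input,
                          const std::vector<std::string>& lexicon) {
    std::string best;
    std::size_t best_distance = 0;
    bool found = false;
    for (const std::string& word : lexicon) {
        std::size_t d = edit_distance(input, word);
        if (!found || d < best_distance) {
            best = word;
            best_distance = d;
            found = true;
        }
    }
    return best;
}
```

With a toy lexicon of "lexicon", "lexical", and "network", the noisy input "lexicn" is one insertion away from "lexicon" and further from the other entries, so closest_match returns "lexicon". A baseline of this kind treats every letter position identically, which is one way to see what a perception-driven architecture might add.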

WordCat’s performance was tested by systematically altering arbitrary words from its lexicon and comparing each original word with WordCat’s output for the altered form. In general, WordCat performed well, making few unreasonable errors, and I believe the problem areas could be ameliorated by implementing further CopyCat-like features.
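As a rough illustration of this kind of test, the sketch below generates altered forms of a word and measures how often a recognizer maps them back to the original. The specific alterations (single deletions and swaps of adjacent letters) and the names noisy_variants and recovery_rate are assumptions; the alterations actually used in the tests are described in the full paper, not here.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: every variant of `word` formed by deleting one
// character or by swapping two adjacent characters. Swaps that would leave
// the word unchanged (doubled letters) are skipped; deleting either of two
// doubled letters yields the same string, so duplicates can occur.
std::vector<std::string> noisy_variants(const std::string& word) {
    std::vector<std::string> variants;
    for (std::size_t i = 0; i < word.size(); ++i) {
        std::string deleted = word;
        deleted.erase(i, 1);
        variants.push_back(deleted);
    }
    for (std::size_t i = 0; i + 1 < word.size(); ++i) {
        if (word[i] == word[i + 1]) continue;
        std::string swapped = word;
        std::swap(swapped[i], swapped[i + 1]);
        variants.push_back(swapped);
    }
    return variants;
}

// Fraction of the variants of `word` that `recognize` maps back to `word`.
// `recognize` is any callable taking a std::string and returning a std::string.
template <typename Recognizer>
double recovery_rate(const std::string& word, Recognizer recognize) {
    const std::vector<std::string> variants = noisy_variants(word);
    if (variants.empty()) return 0.0;
    std::size_t recovered = 0;
    for (const std::string& v : variants) {
        if (recognize(v) == word) ++recovered;
    }
    return static_cast<double>(recovered) / static_cast<double>(variants.size());
}
```

For the word "cat" this produces five variants: "at", "ct", "ca", "act", and "cta". Passing a recognizer as a callable keeps the harness independent of how recognition is done, so the same loop could in principle be run against any word recognizer.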

Section 2 below discusses CopyCat and its most salient features, with special attention to those that have explicit analogs in WordCat. Section 3 details the design, implementation, and use of WordCat itself and describes how it relates to CopyCat. Section 4 summarizes the results of the tests performed (full results appear in Appendix D), and finally, Section 5 discusses WordCat’s advantages, limitations, promise, and applications.


References:

Elman, J. (1995) Language as a dynamical system. In R.F. Port & T. van Gelder (eds.). Mind as Motion: Explorations in the Dynamics of Cognition. Cambridge, MA: MIT Press. 195-223.

Landauer, T., and S. Dumais. (1997) A solution to Plato's problem: The Latent Semantic Analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104:211-240.

Mitchell, M. (1993) Analogy-Making as Perception. Cambridge, MA: MIT Press.

Seidenberg, M. (1997) Language Acquisition and Use: Learning and Applying Probabilistic Constraints. Science, 275:1599-1603.