Smith, T. C., & Witten, I. H. (1993). Language inference from function words (Computer Science Working Papers 93/3). Hamilton, New Zealand: Department of Computer Science, University of Waikato.
Permanent Research Commons link: http://hdl.handle.net/10289/9927
The surface structures of language exhibit regularities that make it possible to learn a capacity for producing an infinite number of well-formed expressions. This paper outlines a system that uncovers and characterizes these regularities through principled, wholesale pattern analysis of large amounts of machine-readable text. The system uses the notion of closed-class lexemes to divide the input into phrases, and from these phrases it infers lexical and syntactic information. The set of closed-class lexemes is derived from the text itself, and these lexemes are then clustered into functional types. Next, the open-class words are categorized according to how they tend to appear in phrases and are then clustered into a smaller number of open-class types. Finally, these types are used to infer, and generalize, grammar rules. Statistical criteria are employed for each of these inference operations. The result is a relatively compact grammar that is guaranteed to cover every sentence in the source text from which it was formed. Closed-class inference compares well with current linguistic theories of syntax and offers a wide range of potential applications.
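The first two steps described in the abstract, deriving closed-class lexemes from the text and using them to divide the input into phrases, can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not state which statistical criteria the system uses, so the sketch assumes a simple frequency cutoff, and it assumes that each closed-class word opens a new phrase. The function names `infer_closed_class` and `split_into_phrases` and the parameter `top_n` are hypothetical.

```python
from collections import Counter


def infer_closed_class(tokens, top_n):
    """Return the top_n most frequent tokens as candidate closed-class words.

    Assumption: raw frequency stands in for the paper's unspecified
    statistical criteria.
    """
    counts = Counter(tokens)
    return {word for word, _ in counts.most_common(top_n)}


def split_into_phrases(tokens, closed_class):
    """Divide a token sequence into phrases.

    Assumption: every closed-class word starts a new phrase, so a phrase is
    a closed-class word followed by zero or more open-class words.
    """
    phrases = []
    current = []
    for tok in tokens:
        # A closed-class word closes the phrase built so far, if any.
        if tok in closed_class and current:
            phrases.append(current)
            current = []
        current.append(tok)
    if current:
        phrases.append(current)
    return phrases


tokens = "the cat sat on the mat on the rug".split()
closed = infer_closed_class(tokens, top_n=2)
phrases = split_into_phrases(tokens, closed)
```

On this toy input the two most frequent tokens are "the" and "on", so the sentence is divided at each occurrence of those words. A realistic corpus would need a far larger sample and a more careful criterion than a fixed cutoff, since frequent content words can outrank rare function words.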
Department of Computer Science, University of Waikato
© 1993 by Tony C. Smith & Ian H. Witten