Loading...
Thumbnail Image
Item

Identifying hierarchical structure in sequences: a linear-time algorithm

Abstract
This paper describes an algorithm that infers a hierarchical structure from a sequence of discrete symbols by replacing phrases which appear more than once by a grammatical rule that generates the phrase, and continuing this process recursively. The result is a hierarchical representation of the original sequence. The algorithm works by maintaining two constraints: every diagram in the grammar must be unique, and every rule must be used more than once. It breaks new ground by operating incrementally. Moreover, its simple structure permits a proof that it operates in space and time that is linear in the size of the input. Our implementation can process 10,000 symbols/second and has been applied to an extensive range of sequences encountered in practice.
Type
Working Paper
Type of thesis
Series
Computer Science Working Papers
Citation
Nevill-Manning, C. G. & Witten, I. H. (1996). Authorship patterns in information systems. (Working paper 96/25). Hamilton, New Zealand: University of Waikato, Department of Computer Science.
Date
1996-11
Publisher
Degree
Supervisors
Rights