Identifying hierarchical structure in sequences: a linear-time algorithm
Citation
Export citationNevill-Manning, C. G. & Witten, I. H. (1996). Authorship patterns in information systems. (Working paper 96/25). Hamilton, New Zealand: University of Waikato, Department of Computer Science.
Permanent Research Commons link: https://hdl.handle.net/10289/1186
Abstract
This paper describes an algorithm that infers a hierarchical structure from a sequence of discrete symbols by replacing phrases which appear more than once by a grammatical rule that generates the phrase, and continuing this process recursively. The result is a hierarchical representation of the original sequence. The algorithm works by maintaining two constraints: every diagram in the grammar must be unique, and every rule must be used more than once. It breaks new ground by operating incrementally. Moreover, its simple structure permits a proof that it operates in space and time that is linear in the size of the input. Our implementation can process 10,000 symbols/second and has been applied to an extensive range of sequences encountered in practice.
Date
1996-11Type
Report No.
96/25
Collections
- 1996 Working Papers [32]