1999 Working Papers
Recent Submissions
High precision traffic measurement by the WAND research group
(Working Paper, Department of Computer Science, 1999-12) Cleary, John G.; Graham, Ian; McGregor, Anthony James; Pearson, Murray W.; Siedins, Ilze; Curtis, James; Donnelly, Stephen F.; Martens, Jed; Martin, Stele
Over recent years the size and capacity of the Internet have continued their exponential growth, driven by new applications and improving network technology. These changes are particularly significant in the New Zealand context, where the high cost of trans-Pacific traffic has mandated that traffic be charged for by volume. This has also led to a significant focus within the New Zealand Internet community on issues of caching and of careful planning for capacity. Approximately three years ago the WAND research group began a program to measure ATM traffic. We were sharply constrained by cost and decided to start by reprogramming some ATM network interface cards. This paper is largely based on our experience as we have broadened this work to include IP-based non-ATM networks and the construction of our own hardware. We have learned a number of lessons in this work, rediscovering along the way some of the hard discipline that all observation scientists must submit to.

The Niupepa Collection: Opening the blinds on a window to the past
(Working Paper, Department of Computer Science, 1999-12) Keegan, Te Taka Adrian Gregory; Cunningham, Sally Jo; Apperley, Mark
This paper describes the building of a digital library collection of historic newspapers. The newspapers (Niupepa in Maori), which were published in New Zealand during the period 1842 to 1933, form a unique historical record of the Maori language, and of events from an historical perspective.
Images of these newspapers have been converted to digital form, electronic text has been extracted from them, and the collection is now being made available over the Internet as part of the New Zealand Digital Library (NZDL) project at the University of Waikato.

Clustering with finite data from semi-parametric mixture distributions
(Working Paper, Dept. of Computer Science, University of Waikato, 1999-11) Wang, Yong; Witten, Ian H.
Existing clustering methods for the semi-parametric mixture distribution perform well as the volume of data increases. However, they all suffer from a serious drawback in finite-data situations: small outlying groups of data points can be completely ignored in the clusters that are produced, no matter how far away they lie from the major clusters. This can result in unbounded loss if the loss function is sensitive to the distance between clusters. This paper proposes a new distance-based clustering method that overcomes the problem by avoiding global constraints. Experimental results illustrate its superiority to existing methods when small clusters are present in finite data sets; they also suggest that it is more accurate and stable than other methods even when there are no small clusters.

A compression-based algorithm for Chinese word segmentation
(Working Paper, Computer Science, University of Waikato, 1999-09) Teahan, W.J.; Wen, Yingying; McNab, Rodger J.; Witten, Ian H.
The Chinese language is written without using spaces or other word delimiters. Although a text may be thought of as a corresponding sequence of words, there is considerable ambiguity in the placement of boundaries. Interpreting a text as a sequence of words is beneficial for some information retrieval and storage tasks: for example, full-text search, word-based compression, and keyphrase extraction. We describe a scheme that infers appropriate positions for word boundaries using an adaptive language model that is standard in text compression.
It is trained on a corpus of pre-segmented text and, when applied to new text, interpolates word boundaries so as to maximize the compression obtained. This simple and general method performs well with respect to specialized schemes for Chinese language segmentation.

Pace Regression
(Working Paper, Computer Science, University of Waikato, 1999-09) Wang, Yong; Witten, Ian H.
This paper articulates a new method of linear regression, “pace regression”, that addresses many drawbacks of standard regression reported in the literature, particularly the subset selection problem. Pace regression improves on classical ordinary least squares (OLS) regression by evaluating the effect of each variable and using a clustering analysis to improve the statistical basis for estimating their contribution to the overall regression. As well as outperforming OLS, it also outperforms, in a remarkably general sense, other linear modeling techniques in the literature, including subset selection procedures, which seek a reduction in dimensionality that falls out as a natural byproduct of pace regression. The paper defines six procedures that share the fundamental idea of pace regression, all of which are theoretically justified in terms of asymptotic performance. Experiments confirm the performance improvement over other techniques.