Recent Submissions
Publication: High precision traffic measurement by the WAND research group (Working Paper, Department of Computer Science, 1999-12)
Cleary, John G.; Graham, Ian; McGregor, Anthony James; Pearson, Murray W.; Siedins, Ilze; Curtis, James; Donnelly, Stephen F.; Martens, Jed; Martin, Stele
Over recent years the size and capacity of the Internet have continued their exponential growth, driven by new applications and improving network technology. These changes are particularly significant in the New Zealand context, where the high cost of trans-Pacific traffic has meant that traffic is charged for by volume. This has also led to a strong focus within the New Zealand Internet community on caching and on careful capacity planning. Approximately three years ago the WAND research group began a programme to measure ATM traffic. Sharply constrained by cost, we decided to start by reprogramming some ATM NIC cards. This paper is largely based on our experience as we have broadened this work to include IP-based non-ATM networks and the construction of our own hardware. We have learned a number of lessons along the way, rediscovering some of the hard discipline that all observational scientists must submit to.

Publication: The Niupepa Collection: Opening the blinds on a window to the past (Working Paper, Department of Computer Science, 1999-12)
Keegan, Te Taka Adrian Gregory; Cunningham, Sally Jo; Apperley, Mark
This paper describes the building of a digital library collection of historic newspapers. The newspapers (Niupepa in Maori), which were published in New Zealand during the period 1842 to 1933, form a unique historical record of the Maori language, and of events from an historical perspective. Images of these newspapers have been converted to digital form, electronic text has been extracted from them, and the collection is now being made available over the Internet as part of the New Zealand Digital Library (NZDL) project at the University of Waikato.

Publication: Clustering with finite data from semi-parametric mixture distributions (Working Paper, Dept. of Computer Science, University of Waikato, 1999-11)
Wang, Yong; Witten, Ian H.
Existing clustering methods for the semi-parametric mixture distribution perform well as the volume of data increases. However, they all suffer from a serious drawback in finite-data situations: small outlying groups of data points can be completely ignored in the clusters that are produced, no matter how far they lie from the major clusters. This can result in unbounded loss if the loss function is sensitive to the distance between clusters. This paper proposes a new distance-based clustering method that overcomes the problem by avoiding global constraints. Experimental results illustrate its superiority to existing methods when small clusters are present in finite data sets; they also suggest that it is more accurate and stable than other methods even when there are no small clusters.

Publication: A compression-based algorithm for Chinese word segmentation (Working Paper, Computer Science, University of Waikato, 1999-09)
Teahan, W.J.; Wen, Yingying; McNab, Rodger J.; Witten, Ian H.
The Chinese language is written without spaces or other word delimiters. Although a text may be thought of as a corresponding sequence of words, there is considerable ambiguity in the placement of boundaries. Interpreting a text as a sequence of words is beneficial for some information retrieval and storage tasks: for example, full-text search, word-based compression, and keyphrase extraction. We describe a scheme that infers appropriate positions for word boundaries using an adaptive language model that is standard in text compression. It is trained on a corpus of pre-segmented text and, when applied to new text, inserts word boundaries so as to maximize the compression obtained. This simple and general method performs well compared with specialized schemes for Chinese language segmentation.
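As a rough illustration of the segmentation idea in the last entry above, the sketch below trains a character model on pre-segmented text (with the space treated as an ordinary symbol) and inserts a boundary wherever doing so shortens the encoding. A smoothed character bigram model stands in for the PPM-style adaptive model used in the paper; the function names, the alphabet size, and the toy corpus are illustrative assumptions, not details from the paper.

import math
from collections import defaultdict

# Toy sketch: train a character model on pre-segmented text (space included as
# an ordinary symbol), then insert a boundary wherever doing so shortens the
# encoding. A smoothed bigram model stands in for the paper's PPM-style model.

def train(segmented_lines):
    counts = defaultdict(lambda: defaultdict(int))
    for line in segmented_lines:
        text = "^" + line                        # "^" marks the start of a line
        for prev, cur in zip(text, text[1:]):
            counts[prev][cur] += 1
    return counts

def code_length(counts, prev, cur, alphabet=5000):
    """Bits needed to encode `cur` after `prev`, with add-one smoothing."""
    total = sum(counts[prev].values())
    p = (counts[prev][cur] + 1) / (total + alphabet)
    return -math.log2(p)

def segment(text, counts):
    """Insert a space before a character whenever that lowers the code length.
    With a bigram context each gap is independent; a longer context (as in the
    paper) would need a dynamic-programming search over boundary placements."""
    if not text:
        return text
    out = [text[0]]
    for prev, cur in zip(text, text[1:]):
        keep = code_length(counts, prev, cur)
        split = code_length(counts, prev, " ") + code_length(counts, " ", cur)
        out.append(" " + cur if split < keep else cur)
    return "".join(out)

counts = train(["ni hao shi jie", "wo hen hao"])  # stand-in for a pre-segmented corpus
print(segment("nihaoshijie", counts))             # boundaries depend entirely on the training data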
Publication: Pace Regression (Working Paper, Computer Science, University of Waikato, 1999-09)
Wang, Yong; Witten, Ian H.
This paper articulates a new method of linear regression, "pace regression", that addresses many drawbacks of standard regression reported in the literature, particularly the subset selection problem. Pace regression improves on classical ordinary least squares (OLS) regression by evaluating the effect of each variable and using a clustering analysis to improve the statistical basis for estimating their contribution to the overall regression. As well as outperforming OLS, it also outperforms, in a remarkably general sense, other linear modeling techniques in the literature, including subset selection procedures; the reduction in dimensionality that such procedures seek falls out as a natural byproduct of pace regression. The paper defines six procedures that share the fundamental idea of pace regression, all of which are theoretically justified in terms of asymptotic performance. Experiments confirm the performance improvement over other techniques.

Publication: Weka: Practical machine learning tools and techniques with Java implementations (Working Paper, 1999-08)
Witten, Ian H.; Frank, Eibe; Trigg, Leonard E.; Hall, Mark A.; Holmes, Geoffrey; Cunningham, Sally Jo
The Waikato Environment for Knowledge Analysis (Weka) is a comprehensive suite of Java class libraries that implement many state-of-the-art machine learning and data mining algorithms. Weka is freely available on the World-Wide Web and accompanies a new text on data mining [1] which documents and fully explains all the algorithms it contains. Applications written using the Weka class libraries can be run on any computer with a Web browsing capability; this allows users to apply machine learning techniques to their own data regardless of computer platform.

Publication: Reduced-error pruning with significance tests (Working Paper, Computer Science, University of Waikato, 1999-06)
Frank, Eibe; Witten, Ian H.
When building classification models, it is common practice to prune them to counter spurious effects of the training data: this often improves performance and reduces model size. "Reduced-error pruning" is a fast pruning procedure for decision trees that is known to produce small and accurate trees. Apart from the data from which the tree is grown, it uses an independent "pruning" set, and pruning decisions are based on the model's error rate on this fresh data. Recently it has been observed that reduced-error pruning overfits the pruning data, producing unnecessarily large decision trees. This paper investigates whether standard statistical significance tests can be used to counter this phenomenon, since the problem of overfitting to the pruning set is precisely what significance testing guards against. We investigate two classes of test, "parametric" and "non-parametric". The standard chi-squared statistic can be used both in a parametric test and as the basis for a non-parametric permutation test. In both cases it is necessary to select the significance level at which pruning is applied. We show empirically that both versions of the chi-squared test perform equally well if their significance levels are adjusted appropriately. Using a collection of standard datasets, we show that significance testing improves on standard reduced-error pruning if the significance level is tailored to the particular dataset at hand using cross-validation, yielding consistently smaller trees that perform at least as well and sometimes better.
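To make the pruning criterion in the preceding entry concrete, here is a minimal sketch of how a parametric chi-squared test might gate a single pruning decision: tabulate, on the pruning set, the class counts reaching each child of a split, and keep the split only if the children's class distributions differ significantly. The contingency-table arrangement, the helper name keep_split, the example counts, and the 0.05 level are illustrative assumptions, not details from the paper; a non-parametric permutation test would instead shuffle class labels among the children.

import numpy as np
from scipy.stats import chi2_contingency

# Sketch: during reduced-error pruning, keep a split only if the class
# distributions of its children differ significantly on the pruning set.
# Rows = children of the split, columns = classes; the table is built from the
# independent pruning data. Assumes every row and column has at least one count.

def keep_split(pruning_counts, alpha=0.05):
    """True if the split is significant on the pruning data (so do not prune it)."""
    table = np.asarray(pruning_counts)
    if table.sum() == 0:
        return False                       # no pruning data reached this node
    chi2, p_value, dof, _ = chi2_contingency(table)
    return p_value < alpha                 # not significant -> collapse the split

# Hypothetical pruning-set counts: two children, three classes.
counts = [[30, 5, 2],
          [4, 25, 6]]
print(keep_split(counts))                  # the distributions clearly differ here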
Publication: The LRU* WWW proxy cache document replacement algorithm (Working Paper, 1999-06)
Chang, Chung-yi; McGregor, Anthony James; Holmes, Geoffrey
Obtaining good performance from WWW proxy caches is critically dependent on the document replacement policy used by the proxy. This paper validates the work of other authors by reproducing their studies of proxy cache document replacement algorithms. From this basis a cross-trace study is mounted, which demonstrates that the performance of most document replacement algorithms depends on the type of workload they are presented with. Finally we propose a new algorithm, LRU*, that consistently performs well across all our traces. (For orientation, a sketch of the baseline LRU policy that such algorithms build on follows the entries below.)

Publication: A survey of software requirements specification practices in the New Zealand software industry (Working Paper, Computer Science, University of Waikato, 1999-06)
Groves, Lindsay; Nickson, Ray; Reeve, Greg; Reeves, Steve; Utting, Mark
We report on the software development techniques used in the New Zealand software industry, paying particular attention to requirements gathering. We surveyed a selection of software companies with a general questionnaire and then conducted in-depth interviews with four companies. Our results show a wide variety in the kinds of companies undertaking software development, employing a wide range of software development techniques. Although our data are not sufficiently detailed to draw statistically significant conclusions, it appears that larger software development groups typically have better-defined software development processes, spend proportionally more time on requirements gathering, and follow more rigorous testing regimes.

Publication: Automating iterative tasks with programming by demonstration: a user evaluation (Working Paper, 1999-05)
Paynter, Gordon W.; Witten, Ian H.
Computer users often face iterative tasks that cannot be automated using the tools and aggregation techniques provided by their application program: they end up performing the iteration by hand, repeating user interface actions over and over again. We have implemented an agent, called Familiar, that can be taught to perform iterative tasks using programming by demonstration (PBD). Unlike other PBD systems, it is domain independent and works with unmodified, widely used applications in a popular operating system. In a formal evaluation, we found that users quickly learned to use the agent to automate iterative tasks. Generally, the participants preferred to use multiple selection where possible, but could and did use PBD in situations involving iteration over many commands, or when other techniques were unavailable.
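The LRU* entry above proposes a replacement policy whose details the abstract does not spell out. Purely for orientation, the sketch below shows the plain LRU baseline that proxy-cache replacement studies typically measure against: documents keyed by URL, bounded by total size, with the least recently used document evicted first. The class name, URLs, and capacity figures are illustrative, and this is not the LRU* algorithm itself.

from collections import OrderedDict

# Baseline only: a plain LRU document cache keyed by URL and bounded by total
# document size. This is the point of departure for replacement-policy studies,
# not the LRU* algorithm, whose details the abstract does not give.

class LRUDocumentCache:
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.docs = OrderedDict()              # url -> size, least recent first

    def get(self, url):
        """Return True on a hit and mark the document most recently used."""
        if url not in self.docs:
            return False                       # miss: the proxy fetches upstream
        self.docs.move_to_end(url)
        return True

    def put(self, url, size):
        if size > self.capacity:
            return                             # never cache oversized documents
        if url in self.docs:
            self.used -= self.docs.pop(url)
        while self.used + size > self.capacity:
            _, evicted_size = self.docs.popitem(last=False)   # evict LRU document
            self.used -= evicted_size
        self.docs[url] = size
        self.used += size

cache = LRUDocumentCache(capacity_bytes=10_000)
cache.put("http://example.com/a.html", 4_000)
cache.put("http://example.com/b.html", 4_000)
cache.get("http://example.com/a.html")         # touch a.html
cache.put("http://example.com/c.html", 4_000)  # evicts b.html, the least recently used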
Publication: Facilitating multiple copy/paste operations (Working Paper, Computer Science, University of Waikato, 1999-05)
Apperley, Mark; Baker, Jay; Fletcher, Dale; Rogers, Bill
Copy and paste, or cut and paste, using a clipboard or paste buffer has long been the principal facility provided to users for transferring data between and within GUI applications. We argue that this mechanism can be clumsy in circumstances where several pieces of information must be moved systematically. In two situations - extraction of data fields from unstructured data found in a directed search process, and reorganisation of computer program source text - we present alternative, more natural user interface facilities that make the task less onerous and provide improved visual feedback during the operation. For the data extraction task we introduce the Stretchable Selection Tool, a semi-transparent overlay augmenting the mouse pointer to automate paste operations and provide information to prompt the user. We describe a prototype implementation that functions in a collaborative software environment, allowing users to cooperate on a multiple copy/paste operation. For text reorganisation, we present an extension to Emacs providing similar functionality, but without the collaborative features.

Publication: Browsing tree structures (Working Paper, Computer Science, University of Waikato, 1999-05)
Apperley, Mark; Spence, Robert; Hodge, Stephen; Chester, Michael
Graphic representations of tree structures are notoriously difficult to create, display, and interpret, particularly when the volume of information they contain, and hence the number of nodes, is large. The problem of interactively browsing information held in tree structures is examined, and the implementation of an innovative tree browser described. This browser is based on distortion-oriented display techniques and intuitive direct-manipulation interaction. The tree layout is automatically generated, but the location and extent of detail shown is controlled by the user. It is suggested that these techniques could be extended to the browsing of more general networks.

Publication: Feature selection for discrete and numeric class machine learning (Working Paper, Computer Science, University of Waikato, 1999-04)
Hall, Mark A.
Algorithms for feature selection fall into two broad categories: wrappers use the learning algorithm itself to evaluate the usefulness of features, while filters evaluate features according to heuristics based on general characteristics of the data. For application to large databases, filters have proven to be more practical than wrappers because they are much faster. However, most existing filter algorithms only work with discrete classification problems. This paper describes a fast, correlation-based filter algorithm that can be applied to continuous and discrete problems. Experiments using the new method as a preprocessing step for naïve Bayes, instance-based learning, decision trees, locally weighted regression, and model trees show it to be an effective feature selector: it reduces the dimensionality of the data by more than sixty percent in most cases without negatively affecting accuracy. Also, decision and model trees built from the pre-processed data are often significantly smaller.
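As a sketch of what a correlation-based filter of the kind described in the preceding entry looks like, the code below scores a feature subset highly when its members correlate with the class but not with each other, and grows the subset greedily until the merit stops improving. Plain Pearson correlation on numeric data stands in for the paper's uniform handling of discrete and continuous attributes; the merit formula, function names, and synthetic data are assumptions for illustration.

import numpy as np

# Sketch of a correlation-based filter: a feature subset scores highly when its
# members correlate with the class but not with each other; the subset is grown
# greedily until the merit stops improving.

def merit(X, y, subset):
    """Heuristic merit: k * mean|r_fc| / sqrt(k + k(k-1) * mean|r_ff|)."""
    k = len(subset)
    r_fc = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_fc
    r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                    for i, a in enumerate(subset) for b in subset[i + 1:]])
    return (k * r_fc) / np.sqrt(k + k * (k - 1) * r_ff)

def select_features(X, y):
    remaining = list(range(X.shape[1]))
    chosen, best = [], 0.0
    while remaining:
        scores = {j: merit(X, y, chosen + [j]) for j in remaining}
        j, score = max(scores.items(), key=lambda kv: kv[1])
        if score <= best:
            break                              # no candidate improves the subset
        chosen.append(j)
        remaining.remove(j)
        best = score
    return chosen

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 2 * X[:, 0] - X[:, 3] + rng.normal(scale=0.1, size=200)   # only two features matter
print(select_features(X, y))                   # typically returns a small informative subset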
Publication: A diagnostic tool for tree-based supervised classification learning algorithms (Working Paper, 1999-03)
Holmes, Geoffrey; Trigg, Leonard E.
The process of developing applications of machine learning and data mining that employ supervised classification algorithms includes the important step of knowledge verification: interpretable output is presented to a user so that they can verify that the knowledge contained in the output makes sense for the given application. As the development of an application is an iterative process, it is quite likely that a user will wish to compare models constructed at various times or stages. One crucial stage where comparison of models is important is when the accuracy of a model is being estimated, typically using some form of cross-validation. This stage is used to establish an estimate of how well a model will perform on unseen data. This is vital information to present to a user, but it is also important to show the degree of variation between models obtained from the entire dataset and models obtained during cross-validation. In this way it can be verified that the cross-validation models are at least structurally aligned with the model garnered from the entire dataset. This paper presents a diagnostic tool for the comparison of tree-based supervised classification models. The method is adapted from work on approximate tree matching and applied to decision trees. The tool is described together with experimental results on standard datasets.

Publication: Generating rule sets from model trees (Working Paper, 1999-03)
Holmes, Geoffrey; Hall, Mark A.; Frank, Eibe
Knowledge discovered in a database must be represented in a form that is easy to understand. Small, easy-to-interpret nuggets of knowledge from data are one requirement, and the ability to induce them from a variety of data sources is a second. The literature abounds with classification algorithms, and in recent years with algorithms for time sequence analysis, but relatively little has been published on extracting meaningful information from problems involving continuous classes (regression). Model trees, decision trees with linear models at the leaf nodes, have recently emerged as an accurate method for numeric prediction that produces understandable models. However, it is well known that decision lists, ordered sets of If-Then rules, have the potential to be more compact and therefore more understandable than their tree counterparts.

Publication: Lexical attraction for text compression (Working Paper, Computer Science, University of Waikato, 1999-01)
Bach, Joscha; Witten, Ian H.
New methods of acquiring structural information in text documents may support better compression by identifying an appropriate prediction context for each symbol. The method of "lexical attraction" infers syntactic dependency structures from statistical analysis of large corpora. We describe the generation of a lexical attraction model, discuss its application to text compression, and explore its potential to outperform fixed-context models such as word-level PPM. Perhaps the most exciting aspect of this work is the prospect of using compression as a metric for structure discovery in text.
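The lexical attraction entry above does not define the model's internals. As background, a common pairwise statistic for this kind of structure discovery is pointwise mutual information between co-occurring words, sketched below; the co-occurrence window, the toy corpus, and the function names are illustrative choices, not the paper's.

import math
from collections import Counter

# Background sketch: pointwise mutual information (PMI) between co-occurring
# words, the kind of pairwise statistic from which a lexical attraction model
# builds its links. Window size and corpus are illustrative.

def pmi_table(sentences, window=3):
    word_counts, pair_counts = Counter(), Counter()
    total_words, total_pairs = 0, 0
    for sent in sentences:
        words = sent.split()
        word_counts.update(words)
        total_words += len(words)
        for i, w in enumerate(words):
            for v in words[i + 1:i + 1 + window]:
                pair_counts[(w, v)] += 1
                total_pairs += 1

    def pmi(w, v):
        if pair_counts[(w, v)] == 0:
            return float("-inf")               # never seen together in the window
        p_pair = pair_counts[(w, v)] / total_pairs
        p_w = word_counts[w] / total_words
        p_v = word_counts[v] / total_words
        return math.log2(p_pair / (p_w * p_v))

    return pmi

corpus = ["the cat sat on the mat", "the dog sat on the rug"]
pmi = pmi_table(corpus)
print(pmi("sat", "on"))                        # positive: the pair co-occurs more often than chance predicts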