Recent Submissions
Text categorization using compression models
(Working Paper, Department of Computer Science, University of Waikato, 2000) Frank, Eibe; Chui, Chang; Witten, Ian H.
Text categorization, or the assignment of natural language texts to predefined categories based on their content, is of growing importance as the volume of information available on the internet continues to overwhelm us. The use of predefined categories implies a “supervised learning” approach to categorization, where already-classified articles, which effectively define the categories, are used as “training data” to build a model that can be used for classifying new articles that comprise the “test data.” This contrasts with “unsupervised” learning, where there is no training data and clusters of like documents are sought amongst the test articles. With supervised learning, meaningful labels (such as keyphrases) are attached to the training documents, and appropriate labels can be assigned automatically to test documents depending on which category they fall into.

Hierarchical document clustering using automatically extracted keyphrases
(Working Paper, University of Waikato, Department of Computer Science, 2000-10) Jones, Steve; Mahoui, Malika
In this paper we present a technique for automatically generating hierarchical clusters of documents. Our technique exploits document keyphrases as features of the document space to support clustering. In fact, we cluster keyphrases rather than documents themselves and then associate documents with keyphrase clusters. We discuss alternative measures of similarity between ‘soft-clusters’ which seed Ward’s hierarchical clustering algorithm, and present the resulting cluster hierarchies that we have produced for a large collection of scientific technical reports.
We analyse the effect of the alternative similarity measures and suggest improvements to our technique.

A comparative transaction log analysis of two computing collections
(Working Paper, University of Waikato, Department of Computer Science, 2000-07) Mahoui, Malika; Cunningham, Sally Jo
Transaction logs are invaluable sources of fine-grained information about users’ search behavior. This paper compares the searching behavior of users across two WWW-accessible digital libraries: the New Zealand Digital Library’s Computer Science Technical Reports collection (CSTR), and the Karlsruhe Computer Science Bibliographies (CSBIB) collection. Since the two collections are designed to support the same type of users (researchers and students in computer science), a comparative log analysis is likely to uncover common searching preferences for that user group. The two collections differ in their content, however: the CSTR indexes a full-text collection, while the CSBIB is primarily a bibliographic database. Differences in searching behavior between the two systems may indicate the effect of differing search facilities and content type.

µ-Charts and Z: Extending the translation
(Working Paper, University of Waikato, Department of Computer Science, 2000-08) Reeve, Greg; Reeves, Steve
This paper describes extensions and modifications to the µ-charts as given in earlier papers of Philipps and Scholz. The charts are extended to include a command language, integer-valued signals and local integer variables. The command language is based on the syntax presented in Scholz’ thesis and the integer-valued signals and local variables are based loosely on Scholz’ earlier work. After presenting the new semantics we turn to extending the µ-charts-to-Z translation that we developed in previous work.
The extensions to the translation process describe both the changes due to the extensions to the µ-charts and a modification to the translation method to more fully capture the beneficial modularisation encouraged by the µ-charts formalism. We finish by giving three complete translation examples. The paper should be read as a record of our gradual development of a Z semantics for µ-charts, hence its sometimes exploratory character or laborious explanations as we come to terms (thinking out loud) with the (sometimes very subtle) meaning of µ-charts, especially with regard to pathological and unusual examples of their use.

Benchmarking attribute selection techniques for data mining
(Working Paper, University of Waikato, Department of Computer Science, 2000-07) Hall, Mark A.; Holmes, Geoffrey
Data engineering is generally considered to be a central issue in the development of data mining applications. The success of many learning schemes, in their attempts to construct models of data, hinges on the reliable identification of a small set of highly predictive attributes. The inclusion of irrelevant, redundant and noisy attributes in the model-building phase can result in poor predictive performance and increased computation. Attribute selection generally involves a combination of search and attribute utility estimation plus evaluation with respect to specific learning schemes. This leads to a large number of possible permutations and has led to a situation where very few benchmark studies have been conducted. This paper presents a benchmark comparison of several attribute selection methods. All the methods produce an attribute ranking, a useful device for isolating the individual merit of an attribute. Attribute selection is achieved by cross-validating the rankings with respect to a learning scheme to find the best attributes.
Results are reported for a selection of standard data sets and two learning schemes: C4.5 and naive Bayes.

A development environment for predictive modelling in foods
(Working Paper, University of Waikato, Department of Computer Science, 2000-07) Holmes, Geoffrey; Hall, Mark A.
WEKA (Waikato Environment for Knowledge Analysis) is a comprehensive suite of Java class libraries that implement many state-of-the-art machine learning/data mining algorithms. Non-programmers interact with the software via a user interface component called the Knowledge Explorer. Applications constructed from the WEKA class libraries can be run on any computer with a web browsing capability, allowing users to apply machine learning techniques to their own data regardless of computer platform. This paper describes the user interface component of the WEKA system in reference to previous applications in the predictive modeling of foods.

Correlation-based feature selection of discrete and numeric class machine learning
(Working Paper, University of Waikato, Department of Computer Science, 2000-05) Hall, Mark A.
Algorithms for feature selection fall into two broad categories: wrappers that use the learning algorithm itself to evaluate the usefulness of features, and filters that evaluate features according to heuristics based on general characteristics of the data. For application to large databases, filters have proven to be more practical than wrappers because they are much faster. However, most existing filter algorithms only work with discrete classification problems. This paper describes a fast, correlation-based filter algorithm that can be applied to continuous and discrete problems. The algorithm often outperforms the well-known ReliefF attribute estimator when used as a preprocessing step for naive Bayes, instance-based learning, decision trees, locally weighted regression, and model trees.
It performs more feature selection than ReliefF does, reducing the data dimensionality by fifty percent in most cases. Also, decision and model trees built from the preprocessed data are often significantly smaller.

One dimensional non-uniform rational B-splines for animation control
(Working Paper, University of Waikato, Department of Computer Science, 2000-03) Mahoui, Abdelaziz
Most 3D animation packages use graphical representations called motion graphs to represent the variation in time of the motion parameters. Many use two-dimensional B-splines as animation curves because of their power to represent free-form curves. In this project, we investigate the possibility of using one-dimensional Non-Uniform Rational B-Spline (NURBS) curves for the interactive construction of animation control curves. One-dimensional NURBS curves have the potential to solve some problems encountered in motion graphs when two-dimensional B-splines are used. The study focuses on the properties of the one-dimensional NURBS mathematical model. It also investigates the algorithms and shape modification tools devised for two-dimensional curves and their port to the one-dimensional NURBS model. It also looks at the issues related to the user interface used to interactively modify the shape of the curves.

µ-Charts and Z: hows, whys and wherefores
(Working Paper, University of Waikato, Department of Computer Science, 2000-03) Reeve, Greg; Reeves, Steve
In this paper we show, by a series of examples, how the µ-chart formalism can be translated into Z. We give reasons for why this is an interesting and sensible thing to do and what it might be used for.

KEA: Practical automatic keyphrase extraction
(Working Paper, University of Waikato, Department of Computer Science, 2000-03) Witten, Ian H.; Paynter, Gordon W.; Frank, Eibe; Gutwin, Carl; Nevill-Manning, Craig G.
Keyphrases provide semantic metadata that summarize and characterize documents.
This paper describes Kea, an algorithm for automatically extracting keyphrases from text. Kea identifies candidate keyphrases using lexical methods, calculates feature values for each candidate, and uses a machine learning algorithm to predict which candidates are good keyphrases. The machine learning scheme first builds a prediction model using training documents with known keyphrases, and then uses the model to find keyphrases in new documents. We use a large test corpus to evaluate Kea’s effectiveness in terms of how many author-assigned keyphrases are correctly identified. The system is simple, robust, and publicly available.

Interactive machine learning–letting users build classifiers
(Working Paper, University of Waikato, Department of Computer Science, 2000-03) Ware, Malcolm; Frank, Eibe; Holmes, Geoffrey; Hall, Mark A.; Witten, Ian H.
According to standard procedure, building a classifier is a fully automated process that follows data preparation by a domain expert. In contrast, interactive machine learning engages users in actually generating the classifier themselves. This offers a natural way of integrating background knowledge into the modeling stage, so long as interactive tools can be designed that support efficient and effective communication. This paper shows that appropriate techniques can empower users to create models that compete with classifiers built by state-of-the-art learning algorithms. It demonstrates that users, even users who are not domain experts, can often construct good classifiers, without any help from a learning algorithm, using a simple two-dimensional visual interface. Experiments demonstrate that, not surprisingly, success hinges on the domain: if a few attributes can support good predictions, users generate accurate classifiers, whereas domains with many high-order attribute interactions favor standard machine learning techniques.
The future challenge is to achieve a symbiosis between human user and machine learning algorithm.

Text categorization using compression models
(Working Paper, University of Waikato, Department of Computer Science, 2000-01) Frank, Eibe; Chui, Chang; Witten, Ian H.
Text categorization, or the assignment of natural language texts to predefined categories based on their content, is of growing importance as the volume of information available on the internet continues to overwhelm us. The use of predefined categories implies a “supervised learning” approach to categorization, where already-classified articles, which effectively define the categories, are used as “training data” to build a model that can be used for classifying new articles that comprise the “test data”. This contrasts with “unsupervised” learning, where there is no training data and clusters of like documents are sought amongst the test articles. With supervised learning, meaningful labels (such as keyphrases) are attached to the training documents, and appropriate labels can be assigned automatically to test documents depending on which category they fall into.

Using compression to identify acronyms in text
(Working Paper, University of Waikato, Department of Computer Science, 2000-01) Yeates, Stuart Andrew; Bainbridge, David; Witten, Ian H.
Text mining is about looking for patterns in natural language text, and may be defined as the process of analyzing text to extract information from it for particular purposes. In previous work, we claimed that compression is a key technology for text mining, and backed this up with a study that showed how particular kinds of lexical tokens (names, dates, locations, etc.) can be identified and located in running text, using compression models to provide the leverage necessary to distinguish different token types.
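
As a rough illustration of the supervised, compression-based categorization described in the "Text categorization using compression models" abstract above, the Python sketch below assigns a document to the category whose training text helps compress it most. It is a minimal sketch under stated assumptions, not the authors' method: zlib is swapped in as a stand-in for the compression models the paper uses, and the function names and toy training texts are hypothetical.

```python
import zlib

# Hypothetical toy training data: one block of already-classified text per category.
TRAINING = {
    "sport": "the team won the match after scoring a late goal "
             "in the final minutes of the game",
    "finance": "the bank raised interest rates as inflation and "
               "bond yields moved markets lower today",
}


def compressed_size(text: str) -> int:
    """Number of bytes zlib needs for the UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def extra_bytes(training_text: str, document: str) -> int:
    """Extra bytes needed to compress the document after the training text.

    A small increase suggests the document resembles the training text,
    because the compressor can reuse strings it has already seen.
    """
    combined = training_text + " " + document
    return compressed_size(combined) - compressed_size(training_text)


def categorize(document: str, training: dict) -> str:
    """Return the label whose training text makes the document cheapest to encode."""
    return min(training, key=lambda label: extra_bytes(training[label], document))


if __name__ == "__main__":
    for doc in ("scoring a late goal in the final minutes",
                "interest rates as inflation and bond yields"):
        print(doc, "->", categorize(doc, TRAINING))
```

With realistic data, each category would be represented by a much larger body of training text, and a general-purpose compressor such as zlib may behave quite differently from the models evaluated in the paper.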