2000 Working Papers

Recent Submissions

  • Item
    Hierarchical document clustering using automatically extracted keyphrases
    (Working Paper, University of Waikato, Department of Computer Science, 2000-10) Jones, Steve; Mahoui, Malika
    In this paper we present a technique for automatically generating hierarchical clusters of documents. Our technique exploits document keyphrases as features of the document space to support clustering. In fact, we cluster keyphrases rather than documents themselves and then associate documents with keyphrase clusters. We discuss alternative measures of similarity between ‘soft-clusters’ which seed Ward’s hierarchical clustering algorithm, and present the resulting cluster hierarchies that we have produced for a large collection of scientific technical reports. We analyse the effect of the alternative similarity measures and suggest improvements to our technique.
  • Item
    A comparative transaction log analysis of two computing collections
    (Working Paper, University of Waikato, Department of Computer Science, 2000-07) Mahoui, Malika; Cunningham, Sally Jo
    Transaction logs are invaluable sources of fine-grained information about users’ search behavior. This paper compares the searching behavior of users across two WWW-accessible digital libraries: the New Zealand Digital Library’s Computer Science Technical Reports collection (CSTR), and the Karlsruhe Computer Science Bibliographies (CSBIB) collection. Since the two collections are designed to support the same type of users (researchers and students in computer science), a comparative log analysis is likely to uncover common searching preferences for that user group. The two collections differ in their content, however; the CSTR indexes a full text collection, while the CSBIB is primarily a bibliographic database. Differences in searching behavior between the two systems may indicate the effect of differing search facilities and content type.
  • Item
    µ-Charts and Z: Extending the translation
    (Working Paper, University of Waikato, Department of Computer Science, 2000-08) Reeve, Greg; Reeves, Steve
    This paper describes extensions and modifications to the µ-charts as given in earlier papers of Philipps and Scholz. The charts are extended to include a command language, integer-valued signals and local integer variables. The command language is based on the syntax presented in Scholz’ thesis and the integer-valued signals and local variables are based loosely on Scholz’ earlier work. After presenting the new semantics we turn to extending the µ-charts-to-Z translation that we developed in previous work. The extensions to the translation process describe both the changes due to the extensions to the µ-charts and a modification to the translation method to more fully capture the beneficial modularisation encouraged by the µ-charts formalism. We finish by giving three complete translation examples. The paper should be read as a record of our gradual development of a Z semantics for µ-charts, hence its sometimes exploratory character or laborious explanations as we come to terms (thinking out loud) with the (sometimes very subtle) meaning of µ-charts, especially with regard to pathological and unusual examples of their use.
  • Item
    Benchmarking attribute selection techniques for data mining
    (Working Paper, University of Waikato, Department of Computer Science, 2000-07) Hall, Mark A.; Holmes, Geoffrey
    Data engineering is generally considered to be a central issue in the development of data mining applications. The success of many learning schemes, in their attempts to construct models of data, hinges on the reliable identification of a small set of highly predictive attributes. The inclusion of irrelevant, redundant and noisy attributes in the model building phase can result in poor predictive performance and increased computation. Attribute selection generally involves a combination of search and attribute utility estimation plus evaluation with respect to specific learning schemes. This leads to a large number of possible permutations and has led to a situation where very few benchmark studies have been conducted. This paper presents a benchmark comparison of several attribute selection methods. All the methods produce an attribute ranking, a useful device for isolating the individual merit of an attribute. Attribute selection is achieved by cross-validating the rankings with respect to a learning scheme to find the best attributes. Results are reported for a selection of standard data sets and two learning schemes, C4.5 and naive Bayes.
  • Item
    A development environment for predictive modelling in foods
    (Working Paper, University of Waikato, Department of Computer Science, 2000-07) Holmes, Geoffrey; Hall, Mark A.
    WEKA (Waikato Environment for Knowledge Analysis) is a comprehensive suite of Java class libraries that implement many state-of-the-art machine learning/data mining algorithms. Non-programmers interact with the software via a user interface component called the Knowledge Explorer. Applications constructed from the WEKA class libraries can be run on any computer with a web browsing capability, allowing users to apply machine learning techniques to their own data regardless of computer platform. This paper describes the user interface component of the WEKA system in reference to previous applications in the predictive modeling of foods.
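
As a rough illustration of the idea summarised in the attribute selection abstract above (rank the attributes, then cross-validate the ranking against a learning scheme to decide how many to keep), the following Python sketch may help. It is not the authors' method and does not use WEKA: the correlation-based score, the nearest-centroid classifier, the toy data set and every function name are assumptions made only for illustration, whereas the paper itself reports results for C4.5 and naive Bayes.

```python
from statistics import mean


def correlation_score(column, labels):
    """Absolute Pearson correlation between one attribute and a numeric class label."""
    mx, my = mean(column), mean(labels)
    cov = sum((x - mx) * (y - my) for x, y in zip(column, labels))
    vx = sum((x - mx) ** 2 for x in column)
    vy = sum((y - my) ** 2 for y in labels)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / (vx * vy) ** 0.5


def rank_attributes(rows, labels):
    """Return attribute indices ordered from highest to lowest score."""
    n_attrs = len(rows[0])
    scores = [correlation_score([r[j] for r in rows], labels) for j in range(n_attrs)]
    return sorted(range(n_attrs), key=lambda j: scores[j], reverse=True)


def centroid_predict(train_rows, train_labels, row, attrs):
    """Nearest-centroid classifier restricted to the chosen attributes (a stand-in learner)."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_labels)):
        members = [r for r, y in zip(train_rows, train_labels) if y == label]
        dist = sum((row[j] - mean(m[j] for m in members)) ** 2 for j in attrs)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def cv_accuracy(rows, labels, k_top, folds=5):
    """Cross-validated accuracy with the top k_top attributes.

    The ranking is recomputed on each training fold so that the
    held-out instances never influence which attributes are chosen.
    """
    correct = 0
    for f in range(folds):
        train_idx = [i for i in range(len(rows)) if i % folds != f]
        test_idx = [i for i in range(len(rows)) if i % folds == f]
        train_rows = [rows[i] for i in train_idx]
        train_labels = [labels[i] for i in train_idx]
        attrs = rank_attributes(train_rows, train_labels)[:k_top]
        for i in test_idx:
            predicted = centroid_predict(train_rows, train_labels, rows[i], attrs)
            correct += predicted == labels[i]
    return correct / len(rows)


def select_attributes(rows, labels, folds=5):
    """Keep the ranking prefix with the best cross-validated accuracy (ties favour fewer attributes)."""
    n_attrs = len(rows[0])
    best_k = max(range(1, n_attrs + 1),
                 key=lambda k: (cv_accuracy(rows, labels, k, folds), -k))
    return rank_attributes(rows, labels)[:best_k]


def make_demo_data(n=40):
    """Hypothetical toy data: attribute 0 tracks the class, attributes 1 and 2 do not."""
    rows, labels = [], []
    for i in range(n):
        y = i % 2
        rows.append([4.0 * y + 0.1 * (i % 3), float((7 * i) % 11), float((3 * i) % 5)])
        labels.append(y)
    return rows, labels


if __name__ == "__main__":
    rows, labels = make_demo_data()
    print("ranking:", rank_attributes(rows, labels))
    print("selected:", select_attributes(rows, labels))
```

Recomputing the ranking inside each fold is a deliberate choice in this sketch: ranking on the full data set before cross-validating would let the held-out instances leak into the selection and tend to overstate accuracy.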