Recent Submissions

Publication: Making oral history accessible over the World Wide Web
Working Paper, University of Waikato, Department of Computer Science, 1998-11
Bainbridge, David; Cunningham, Sally Jo
We describe a multimedia, WWW-based oral history collection constructed from off-the-shelf or publicly available software. The source materials for the collection include audio tapes of interviews and summary transcripts of each interview, as well as photographs illustrating episodes mentioned in the tapes. Sections of the transcripts are manually matched to associated segments of the tapes, and the tapes are digitized. Users search a full-text retrieval system based on the text transcripts to retrieve relevant transcript sections and their associated audio recordings and photographs. It is also possible to search for photos by matching text queries against text descriptions of the photos in the collection, where the located photos link back to their respective interview transcripts and audio recordings.

Publication: Melody based tune retrieval over the World Wide Web
Working Paper, University of Waikato, Department of Computer Science, 1998-11
Bainbridge, David; McNab, Rodger J.; Smith, Lloyd A.
In this paper we describe the steps taken to develop a Web-based version of an existing stand-alone, single-user digital library application for melodic searching of a collection of music. For the three key components (input, searching, and output) we assess the suitability of various Web-based strategies that deal with the now-distributed software architecture, and explain the decisions we made. The resulting melody indexing service, known as MELDEX, has been in operation for one year, and the feedback we have received has been favorable.

Publication: Link as you type: using key phrases for automated dynamic link generation
Working Paper, University of Waikato, Department of Computer Science, 1998-09
Jones, Steve
When documents are collected together from diverse sources they are unlikely to contain useful hypertext links to support browsing amongst them. For large collections of thousands of documents it is prohibitively resource-intensive to manually insert links into each document. Users of such collections may wish to relate documents within them to text that they are themselves generating. This process, often involving keyword searching, distracts from the authoring process and results in material related to the query terms but not necessarily to the author’s document. Query terms that are effective in one collection might not be so in another. We have developed Phrasier, a system that integrates authoring (of text and hyperlinks), browsing, querying and reading in support of information retrieval activities. Phrasier exploits key phrases which are automatically extracted from documents in a collection, and uses them as link anchors and to identify candidate destinations for hyperlinks. The system suggests links into existing collections for the purposes of authoring and retrieval of related information, creates links between documents in a collection, and provides supportive document and link overviews.
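
As a rough illustration of the key-phrase linking idea in the abstract above, the sketch below matches previously extracted key phrases against text as it is authored and proposes the documents those phrases came from as link destinations. It is a minimal, hypothetical sketch; the function names and data layout are assumptions, not Phrasier's implementation.

    # Illustrative keyphrase-based link suggestion (assumes key phrases are
    # already extracted from each document in the collection).

    def build_phrase_index(keyphrases_by_doc):
        # Map each key phrase to the documents it was extracted from.
        index = {}
        for doc_id, phrases in keyphrases_by_doc.items():
            for phrase in phrases:
                index.setdefault(phrase.lower(), set()).add(doc_id)
        return index

    def suggest_links(text, phrase_index):
        # Return (phrase, candidate destination documents) for phrases found in the text.
        lowered = text.lower()
        return [(phrase, sorted(docs))
                for phrase, docs in phrase_index.items()
                if phrase in lowered]

    # Example: as the author types, "digital library" becomes a link anchor
    # pointing at the documents from which that phrase was extracted.
    index = build_phrase_index({
        "doc1": ["digital library", "information retrieval"],
        "doc2": ["hypertext links", "digital library"],
    })
    print(suggest_links("Search interfaces for a digital library collection", index))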

Publication: Naive Bayes for regression
Working Paper, University of Waikato, Department of Computer Science, 1998-10
Frank, Eibe; Trigg, Leonard E.; Holmes, Geoffrey; Witten, Ian H.
Despite its simplicity, the naïve Bayes learning scheme performs well on most classification tasks, and is often significantly more accurate than more sophisticated methods. Although the probability estimates that it produces can be inaccurate, it often assigns maximum probability to the correct class. This suggests that its good performance might be restricted to situations where the output is categorical. It is therefore interesting to see how it performs in domains where the predicted value is numeric, because in this case predictions are more sensitive to inaccurate probability estimates. This paper shows how to apply the naïve Bayes methodology to numeric prediction (i.e. regression) tasks, and compares it to linear regression, instance-based learning, and a method that produces “model trees”: decision trees with linear regression functions at the leaves. Although we exhibit an artificial dataset for which naïve Bayes is the method of choice, on real-world datasets it is almost uniformly worse than model trees. The comparison with linear regression depends on the error measure: for one measure naïve Bayes performs similarly, while for another it is worse. Compared to instance-based learning, it performs similarly with respect to both measures. These results indicate that the simplistic statistical assumption that naïve Bayes makes is indeed more restrictive for regression than for classification.
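
To make the idea of naïve Bayes for numeric prediction concrete, here is a simplified sketch: the numeric target is discretised into bins, per-bin Gaussian likelihoods are estimated for each attribute, and the prediction is the probability-weighted average of the bin centres. This is an illustrative stand-in under those assumptions, not necessarily the estimation procedure used in the paper.

    import numpy as np

    def fit_nb_regression(X, y, n_bins=10):
        # X: 2-D NumPy array of attributes, y: 1-D NumPy array of numeric targets.
        # Discretise the target into roughly equal-frequency bins.
        edges = np.quantile(y, np.linspace(0, 1, n_bins + 1))
        bins = np.digitize(y, edges[1:-1])           # bin index 0..n_bins-1 per instance
        centres = np.array([y[bins == b].mean() for b in range(n_bins)])
        priors = np.array([(bins == b).mean() for b in range(n_bins)])
        # One Gaussian per attribute and target bin (assumes every bin is non-empty).
        means = np.array([X[bins == b].mean(axis=0) for b in range(n_bins)])
        stds = np.array([X[bins == b].std(axis=0) + 1e-6 for b in range(n_bins)])
        return centres, priors, means, stds

    def predict_nb_regression(x, model):
        centres, priors, means, stds = model
        # Posterior over target bins, assuming attribute independence given the bin.
        log_post = np.log(priors) + np.sum(
            -0.5 * ((x - means) / stds) ** 2 - np.log(stds), axis=1)
        post = np.exp(log_post - log_post.max())
        post /= post.sum()
        return float(post @ centres)                 # expected value under the posterior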

Publication: Measuring ATM traffic: final report for New Zealand Telecom
Working Paper, University of Waikato, Department of Computer Science, 1998-10
Cleary, John G.; Graham, Ian; Pearson, Murray W.; McGregor, Anthony James
The report describes the development of a low-cost ATM monitoring system hosted by a standard PC. The monitor can be used remotely, returning information on ATM traffic flows to a central site. The monitor is interfaced to a GPS timing receiver, which provides an absolute time accuracy of better than 1 µsec. By monitoring the same traffic flow at different points in a network it is possible to measure cell delay and delay variation in real time, and with existing traffic. The monitoring system characterises cells by a CRC calculated over the cell payload, so special measurement cells are not required. Delays in both local-area and wide-area networks have been measured using this system. It is possible to measure delay in a network that is not end-to-end ATM, as long as some cells remain identical at the entry and exit points. Examples are given of traffic and delay measurements in both wide- and local-area network systems, including delays measured over the Internet from Canada to New Zealand.
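
The delay measurement described above depends on recognising the same cell at two monitoring points. The sketch below illustrates that matching step under simple assumptions (one record per distinct payload, duplicates ignored): each monitor stores a payload CRC with a GPS-synchronised timestamp, and joining the two traces on the CRC yields per-cell one-way delays. The record layout is hypothetical, not the monitor's actual format.

    import zlib

    def trace(cells):
        # cells: iterable of (payload_bytes, timestamp_seconds) observed at one point.
        # Duplicate payloads are ignored in this sketch.
        return {zlib.crc32(payload): ts for payload, ts in cells}

    def one_way_delays(entry_trace, exit_trace):
        # Cells seen at both points are matched on their payload CRC; the delay is
        # the difference of the GPS-synchronised timestamps.
        common = entry_trace.keys() & exit_trace.keys()
        return {crc: exit_trace[crc] - entry_trace[crc] for crc in common}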

Publication: An analysis of usage of a digital library
Working Paper, University of Waikato, Department of Computer Science, 1998-06
Jones, Steve; Cunningham, Sally Jo; McNab, Rodger J.
As experimental digital library testbeds gain wider acceptance and develop significant user bases, it becomes important to investigate the ways in which users interact with the systems in practice. Transaction logs are one source of usage information, and information on user behaviour can be culled from them both automatically (through calculation of summary statistics) and manually (by examining query strings for semantic clues on search motivations and searching strategy). We conduct a transaction log analysis of user activity in the Computer Science Technical Reports Collection of the New Zealand Digital Library, report the insights gained, and identify the resulting search interface design issues.

Publication: Proceedings of CBISE ’98 CaiSE*98 workshop on component based information systems engineering
Working Paper, University of Waikato, Department of Computer Science, 1998-06
Grundy, John C.
Component-based information systems development is an area of research and practice of increasing importance. Information systems developers have realised that traditional approaches to IS engineering produce monolithic systems that are difficult to maintain and reuse. In contrast, the use of software components, which embody data, functionality, and well-specified and understood interfaces, makes interoperable, distributed and highly reusable IS components feasible. Component-based approaches to IS engineering can be used at strategic and organisational levels, to model business processes and whole IS architectures, in development methods which utilise component-based models during analysis and design, and in system implementation. Reusable components can allow end users to compose and configure their own information systems, possibly from a range of suppliers, and to couple their organisational workflows more tightly with their IS support. These workshop proceedings contain a range of papers addressing one or more of the above issues relating to the use of component models for IS development. All of the papers were refereed by at least two members of an international workshop committee comprising industry and academic researchers and users of component technologies. Strategic uses of components are addressed in the first three papers, while the following three address uses of components for systems design and workflow management. Systems development using components, and the provision of environments for component management, are addressed in the following group of five papers. The last three papers in the proceedings address component management and analysis techniques. All of these papers provide new insights into the many varied uses of component technology for IS engineering. I hope you find them as interesting and useful as I have when collating these proceedings and organising the workshop.

Publication: An entropy gain measure of numeric prediction performance
Working Paper, University of Waikato, Department of Computer Science, 1998-05
Trigg, Leonard E.
Categorical classifier performance is typically evaluated with respect to error rate, expressed as the percentage of test instances that were not correctly classified. When a classifier produces multiple classifications for a test instance, the prediction is counted as incorrect (even if the correct class was one of the predictions). Although commonly used in the literature, error rate is a coarse measure of classifier performance, as it is based only on the single prediction offered for a test instance. Since many classifiers can produce a class distribution as a prediction, we should use this to provide a better measure of how much information the classifier is extracting from the domain. Numeric classifiers are a relatively new development in machine learning, and as such there is no single performance measure that has become standard. Typically these machine learning schemes predict a single real number for each test instance, and the error between the predicted and actual values is used to calculate a myriad of performance measures, such as correlation coefficient, root mean squared error, mean absolute error, relative absolute error, and root relative squared error. With so many performance measures it is difficult to establish an overall performance evaluation. The next section describes a performance measure for machine learning schemes that attempts to overcome the problems with current measures. In addition, the same evaluation measure can be used for both categorical and numeric classifiers.
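
As a rough illustration of the kind of information-based measure argued for above, the sketch below scores a classifier by the number of bits its predicted class probabilities gain over the prior class frequencies on the true classes of the test instances. This is a generic formulation under those assumptions; the precise measure defined in the working paper may differ in detail.

    import math

    def entropy_gain(predicted_dists, true_classes, priors):
        # predicted_dists: one dict (class -> probability) per test instance.
        # priors: dict (class -> prior probability), e.g. training-set class frequencies.
        gain = 0.0
        for dist, cls in zip(predicted_dists, true_classes):
            p = max(dist.get(cls, 0.0), 1e-12)        # avoid log(0) for missing classes
            gain += math.log2(p) - math.log2(priors[cls])
        return gain                                   # total bits gained over the prior

A confident, correct prediction earns close to log2(1/prior) bits, while an over-confident wrong prediction is penalised heavily; dividing by the number of test instances gives a per-instance figure.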

Publication: Experiences with a weighted decision tree learner
Working Paper, University of Waikato, Department of Computer Science, 1998-05
Cleary, John G.; Trigg, Leonard E.
Machine learning algorithms for inferring decision trees typically choose a single “best” tree to describe the training data. Recent research has shown that classification performance can be significantly improved by voting the predictions of multiple, independently produced decision trees. This paper describes an algorithm, OB1, that makes a weighted sum over many possible models. We describe one instance of OB1 that includes all possible decision trees as well as naïve Bayesian models. OB1 is compared with a number of other decision tree and instance-based learning algorithms on some of the data sets from the UCI repository. Both an information gain and an accuracy measure are used for the comparison. On the information gain measure OB1 performs significantly better than all the other algorithms. On the accuracy measure it is significantly better than all the algorithms except naïve Bayes, which performs comparably to OB1.

Publication: Managing multiple collections, multiple languages, and multiple media in a distributed digital library
Working Paper, University of Waikato, Department of Computer Science, 1998-05
Witten, Ian H.; McNab, Rodger J.; Jones, Steve; Cunningham, Sally Jo; Bainbridge, David; Apperley, Mark
Managing the organizational and software complexity of a comprehensive digital library presents a significant challenge. Different library collections each have their own distinctive features. Different presentation languages have structural implications, such as left-to-right writing order and text-only interfaces for the visually impaired. Different media involve different file formats, and, more importantly, radically different search strategies are required for non-textual media. In a distributed library, new collections can appear asynchronously on servers in different parts of the world. And as searching interfaces mature from the command-line era exemplified by current Web search engines into the age of reactive visual interfaces, experimental new interfaces must be developed, supported, and tested. This paper describes our experience, gained from operating a substantial digital library service over several years, in solving these problems by designing an appropriate software architecture.

Publication: An evaluation of passage-level indexing strategies for a technical report archive
Working Paper, University of Waikato, Department of Computer Science, 1998-04
Williams, Michael
Past research has shown that using evidence from document passages rather than complete documents is an effective way of improving the precision of full-text database searches. However, passage-level indexing has yet to be widely adopted for commercial or online databases. This paper reports on experiments designed to test the efficacy of passage-level indexing with a particular collection of a full-text online database, the New Zealand Digital Library. Discourse passages and word-window passages are used for the indexing process. Both ranked and Boolean searching are used to test the resulting indexes. Overlapping window passages are shown to offer the best retrieval performance with both ranked and Boolean queries. Modifications may be necessary to the term weighting methodology in order to ensure optimal ranked query performance.
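
To illustrate the word-window strategy evaluated above, the sketch below splits a document into fixed-size word windows that overlap by half a window, each of which would be indexed as a separate passage. The window size and overlap are arbitrary illustrative values, not the settings used in the experiments.

    def window_passages(text, size=200, step=100):
        # Overlapping word windows: each passage shares `size - step` words
        # with its predecessor, so no relevant span falls across a boundary.
        words = text.split()
        passages = []
        for start in range(0, max(len(words) - step, 1), step):
            passages.append(" ".join(words[start:start + size]))
        return passages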

Publication: Predicting apple bruising relationships using machine learning
Working Paper, University of Waikato, Department of Computer Science, 1998-04
Holmes, Geoffrey; Cunningham, Sally Jo; Dela Rue, B. T.; Bollen, A. F.
Many models have been used to describe the influence of internal or external factors on apple bruising. Few of these have addressed the application of derived relationships to the evaluation of commercial operations. From an industry perspective, a model must enable fruit to be rejected on the basis of a commercially significant bruise, and must also accurately quantify the effects of various combinations of input features (such as cultivar, maturity, size, and so on) on bruise prediction. Input features must in turn have characteristics which are measurable commercially; for example, the measure of force should be impact energy rather than energy absorbed. Further, as the commercial criteria for acceptable damage levels change, the model should be versatile enough to regenerate new bruise thresholds from existing data. Machine learning is a burgeoning technology with a vast range of potential applications, particularly in agriculture, where large amounts of data can be readily collected [1]. The main advantage of using a machine learning method in an application is that the models built for prediction can be viewed and understood by the owner of the data, who is in a position to determine the usefulness of the model, an essential component in a commercial environment.

Publication: New foundations for Z
Working Paper, University of Waikato, Department of Computer Science, 1998-03
Henson, Martin C.; Reeves, Steve
We provide a constructive and intensional interpretation for the specification language Z in a theory of operations and kinds, T. The motivation is to facilitate the development of an integrated approach to program construction. We illustrate the new foundations for Z with examples.

Publication: A logic for the schema calculus
Working Paper, University of Waikato, Department of Computer Science, 1998-03
Henson, Martin C.; Reeves, Steve
In this paper we introduce and investigate a logic for the schema calculus of Z. The schema calculus is arguably the reason for Z’s popularity, but so far no true calculus (a sound system of rules for reasoning about schema expressions) has been given. Presentations thus far have either failed to provide a calculus (e.g. the draft standard [3]) or have fallen back on informal descriptions at a syntactic level (most textbooks, e.g. [7]). Once the calculus is established we introduce a derived equational logic which enables us to formalise properly the informal notations of schema expression equality to be found in the literature.

Publication: Revising Z: semantics and logic
Working Paper, University of Waikato, Department of Computer Science, 1998-03
Henson, Martin C.; Reeves, Steve
We introduce a simple specification logic Zc comprising a logic and a semantics (in ZF set theory). We then provide an interpretation for (a rational reconstruction of) the specification language Z within Zc. As a result we obtain a sound logic for Z, including the schema calculus. A consequence of our formalisation is a critique of a number of concepts used in Z. We demonstrate that the complications and confusions which these concepts introduce can be avoided without compromising expressibility.

Publication: VQuery: a graphical user interface for Boolean query specification and dynamic result preview
Working Paper, University of Waikato, Department of Computer Science, 1998-03
Jones, Steve
Textual query languages based on Boolean logic are common amongst the search facilities of on-line information repositories. However, there is evidence to suggest that the syntactic and semantic demands of such languages lead to user errors and adversely affect the time that it takes users to form queries. Additionally, users are faced with user interfaces to these repositories which are unresponsive and uninformative, and which consequently fail to support effective query refinement. We suggest that graphical query languages, particularly Venn-like diagrams, provide a natural medium for Boolean query specification which overcomes the problems of textual query languages. Also, dynamic result previews can be seamlessly integrated with graphical query specification to increase the effectiveness of query refinement. We describe VQuery, a query interface to the New Zealand Digital Library which exploits querying by Venn diagrams and integrated query result previews.

Publication: Generating accurate rule sets without global optimization
Working Paper, University of Waikato, Department of Computer Science, 1998-01
Frank, Eibe; Witten, Ian H.
The two dominant schemes for rule learning, C4.5 and RIPPER, both operate in two stages. First they induce an initial rule set, and then they refine it using a rather complex optimization stage that discards (C4.5) or adjusts (RIPPER) individual rules to make them work better together. In contrast, this paper shows how good rule sets can be learned one rule at a time, without any need for global optimization. We present an algorithm for inferring rules by repeatedly generating partial decision trees, thus combining the two major paradigms for rule generation: creating rules from decision trees, and the separate-and-conquer rule-learning technique. The algorithm is straightforward and elegant; despite this, experiments on standard datasets show that it produces rule sets that are as accurate as and of similar size to those generated by C4.5, and more accurate than RIPPER’s. Moreover, it operates efficiently, and because it avoids postprocessing, it does not suffer the extremely slow performance on pathological example sets for which the C4.5 method has been criticized.
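
The rule-learning strategy described in the abstract above follows the separate-and-conquer pattern: build one rule, remove the training instances it covers, and repeat. The sketch below shows only that outer loop, with the rule-building step (a partial decision tree in the paper) left as a caller-supplied function; it is an illustration of the pattern, not the paper's algorithm.

    def learn_rule_set(instances, learn_one_rule):
        # instances: list of training instances.
        # learn_one_rule: builds one rule from the remaining instances and returns
        # (rule, covers), where covers(instance) tests whether the rule applies.
        # Assumes each returned rule covers at least one remaining instance.
        rules = []
        remaining = list(instances)
        while remaining:
            rule, covers = learn_one_rule(remaining)  # e.g. read one rule off a partial tree
            rules.append(rule)
            remaining = [inst for inst in remaining if not covers(inst)]
        return rules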

Publication: Boosting trees for cost-sensitive classifications
Working Paper, University of Waikato, Department of Computer Science, 1998-01
Ting, Kai Ming; Zheng, Zijian
This paper explores two boosting techniques for cost-sensitive tree classification in the situation where misclassification costs change very often. Ideally, one would like to have only one induction, and use the induced model for different misclassification costs. This demands robustness of the induced model against cost changes, and combining multiple trees gives robust predictions against this change. We demonstrate that ordinary boosting combined with the minimum expected cost criterion to select the prediction class is a good solution in this situation. We also introduce a variant of the ordinary boosting procedure which utilizes the cost information during training. We show that the proposed technique performs better than ordinary boosting in terms of misclassification cost. However, this technique requires inducing a new set of trees every time the costs change. Our empirical investigation also reveals some interesting behaviour of boosting decision trees for cost-sensitive classification.
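
As a concrete reading of the minimum expected cost criterion mentioned above, the sketch below takes class probabilities (as might be obtained from a committee of boosted trees) and a misclassification cost matrix, and predicts the class with the lowest expected cost. The names and the cost-matrix layout are illustrative assumptions.

    def min_expected_cost_class(probs, cost):
        # probs: dict (true class -> probability).
        # cost: dict ((true class, predicted class) -> misclassification cost).
        classes = list(probs)
        def expected_cost(predicted):
            return sum(probs[true] * cost[(true, predicted)] for true in classes)
        return min(classes, key=expected_cost)

    # With unequal costs the chosen class can differ from the most probable one:
    probs = {"good": 0.7, "bad": 0.3}
    cost = {("good", "good"): 0, ("good", "bad"): 1,
            ("bad", "bad"): 0, ("bad", "good"): 10}
    print(min_expected_cost_class(probs, cost))   # -> "bad"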