Visualising categorical data: Linguistic case studies from te Reo Māori and New Zealand English

dc.contributor.advisor: Apperley, Mark
dc.contributor.advisor: Bainbridge, David
dc.contributor.advisor: Calude, Andreea S.
dc.contributor.author: Trye, David
dc.date.accessioned: 2024-11-18T01:20:58Z
dc.date.available: 2024-11-18T01:20:58Z
dc.date.issued: 2024
dc.description.abstract: Categorical variables are prevalent in real-world datasets across numerous domains, yet few visualisation techniques accommodate them effectively. This is especially true of datasets comprising three or more categorical variables, termed multivariate categorical data. Visualising such data is challenging due to the lack of inherent ordering of nominal categories, the so-called ‘curse of dimensionality’, and the potential variability in the number of categories per variable. Corpus linguistics, which involves the study of large digital collections of naturally occurring language, serves as the primary application domain in this thesis. This domain was chosen because it is rich in multivariate categorical data and, at the same time, is often visualised using only basic techniques. This thesis contributes to the area of categorical data visualisation in several ways. First, we propose a taxonomy of techniques for visualising categorical data, highlight limitations of existing solutions, and identify relevant analysis tasks. Building on this foundation, the thesis introduces novel techniques and enhancements for visualising datasets involving multiple categorical variables. We focus on adapting the layout and interactive capabilities of an existing technique that uses a matrix of heatmaps to represent pairwise category intersections. These modifications show that directly visualising statistical test results for categorical data can be beneficial for exploring bivariate patterns and associations. Furthermore, we contribute the design, implementation and evaluation of a novel technique called MultiCat, which is not restricted to pairwise intersections but rather facilitates analysis of relationships among multiple variables simultaneously. Both these techniques are interactive and offer greater scalability than existing alternatives, thereby affording new possibilities for analysing multivariate categorical data.
However, since categorical variables can occur within more complex data structures, we also consider their presence in networks and hypergraphs, which require specialised methods. To demonstrate the application of these techniques, we draw on two linguistic case studies that focus on languages of special significance in Aotearoa New Zealand. Addressing the low-resource status of Māori, the country’s Indigenous language, we first contribute two related Twitter datasets (a monolingual Māori corpus and a mixed-language Māori–English corpus), together with an architecture for differentiating Māori and English words. Our initial case study uses the monolingual Māori corpus and proposed visualisation techniques to investigate grammatical possession in Māori, offering fresh insights into the linguistic practices of contemporary speakers. The second case study uses networks and hypergraphs with categorical attributes to explore Māori loanword co-occurrence in New Zealand English newspaper articles. We find that loanwords tend not to occur in isolation and that New Zealanders are still importing new (unlisted) borrowings from Māori. Ultimately, the techniques developed in this thesis have broad applications both within and beyond the corpus linguistics community. By enabling more effective visualisation and analysis of multivariate categorical data, this research has the potential to facilitate deeper insights into domains as diverse as education, healthcare, business and science.
dc.identifier.uri: https://hdl.handle.net/10289/17049
dc.language.iso: en
dc.publisher: The University of Waikato [en_NZ]
dc.relation.doi: 10.1109/IV60283.2023.00016
dc.relation.doi: 10.1007/s10579-022-09580-w
dc.relation.doi: 10.3390/languages9080271
dc.relation.doi: 10.1075/ijcl.21124.try
dc.relation.uri: https://aclanthology.org/2022.findings-aacl.11
dc.rights: All items in Research Commons are provided for private study and research purposes and are protected by copyright with all rights reserved unless otherwise indicated. [en_NZ]
dc.subject: visualisation
dc.subject: information visualisation
dc.subject: categorical data
dc.subject: multidimensional data
dc.subject: multivariate data
dc.subject: nominal variables
dc.subject: ordinal variables
dc.subject: te reo Māori
dc.subject: New Zealand English
dc.subject: social media
dc.subject: Twitter/X
dc.subject: minority languages
dc.subject: endangered languages
dc.title: Visualising categorical data: Linguistic case studies from te Reo Māori and New Zealand English
dc.type: Thesis [en]
dspace.entity.type: Publication
pubs.place-of-publication: Hamilton, New Zealand [en_NZ]
thesis.degree.grantor: The University of Waikato [en_NZ]
thesis.degree.level: Doctoral [en]
thesis.degree.name: Doctor of Philosophy (PhD)
uow.thesis.type: Thesis with publication

Files

Original bundle

Name: thesis.pdf
Size: 26.24 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.58 KB
Format: Item-specific license agreed upon to submission