Document DNA: Distributed Content-Centered Provenance Data Tracking

This thesis presents a new content-centered approach to provenance data tracking: the Document DNA. Knowledge workers are overwhelmed because they find it hard to structure, maintain, and locate re-used content within their digital workspace. This issue is aggravated by the growing amount of digital data knowledge workers need to maintain. This thesis introduces a concept for tracing the evolution of text-based content across documents in the digital workspace, without the need for a centralized tracking system. Our concept is inspired by the DNA common to life forms. We present an analysis and comparison of research undertaken to support knowledge workers and review provenance data tracking systems. Provenance data has been used in data security, in databases, and to track knowledge workers' interactions with digital content. However, very little research is available on the usefulness of provenance data for knowledge workers. Furthermore, current provenance data research is based on central systems and tracks provenance at the file level. We conducted three user studies to explore the issues knowledge workers currently face when working with digital content. The first study examined the problems knowledge workers currently encounter when re-using digital content. The second study examined to what extent the issues detected in our first study are addressed by document management systems. We found that document management systems do not fully address these issues, and that not all knowledge workers make use of the document management system available to them. The third study examined reasons for low user saturation of available document management systems. As a result of these three studies, we identified task categories and a variety of related issues. Driven by these findings, we developed a conceptual model for Document DNA, which tracks the provenance of data used in the identified tasks.
To show the effectiveness of our approach, we created a software prototype and conducted a realistic user study. Our software prototype is a Microsoft Word Add-In that tracks the evolution of content included in Microsoft Word documents. In our final user study, participants executed example tasks gathered from real knowledge workers with and without the support of our software prototype. The results of our study confirm that the Document DNA successfully addresses the issues identified. The participants were significantly faster when performing the tasks using the software prototype; most participants using traditional methods failed to identify the provenance of the data, whereas the majority of participants using the software prototype succeeded.
