Oxford-Illinois Digital Libraries Placement Programme project presentations

August 3, 2017 - 14:00 to 15:00
Conference room (room 278)

Oxford e-Research Centre, 7 Keble Road, Oxford, OX1 3QG

  • Seminar
  • No booking required
  • Open to all
  • Coffee and cakes

This week's seminar reports on the work of the two 2017 recipients of Oxford-Illinois Digital Libraries Placement Programme positions, whose placements took place over the summer in the Oxford e-Research Centre and the Bodleian Libraries.

A Case Study on Theses in Oxford’s Institutional Repository:
Challenges Meeting the ISO 19005 Standard

Anna Oates

The Oxford University Research Archive (ORA) hosts rich collections created by student and faculty researchers at the University of Oxford. Among the collections ingested into ORA are student theses. Several institutions across the globe have begun requiring students to deposit their electronic theses as PDF/A (Portable Document Format-Archival) files. PDF/A was established by the International Organization for Standardization (ISO) as the ISO 19005 standard for long-term preservation of electronic documents. While the ISO requirements for a well-formed document ensure sustainability and easy recovery of content, the standard restricts some document features from being incorporated into a well-formed PDF/A. Non-conforming features, including non-Latin glyphs, are found across the language and scientific theses in the ORA theses collection. A further complication for achieving ISO compliance is that validation tools do not always catch non-conformance errors in documents which claim to conform to PDF/A.
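
As a rough illustration of what "claiming to conform" means: a PDF/A file declares its level in its XMP metadata through the pdfaid:part and pdfaid:conformance properties. The Python sketch below only reads that declaration from raw bytes. The function name claimed_pdfa_level and the sample bytes are hypothetical, and the sketch assumes the metadata packet is stored uncompressed. It is not a validator and is not the tooling used in the project.

```python
import re

# The XMP properties may be written as elements (<pdfaid:part>1</pdfaid:part>)
# or as attributes (pdfaid:part="1"); both forms are matched here.
_PART = re.compile(rb'pdfaid:part(?:>|=["\'])\s*(\d)')
_CONF = re.compile(rb'pdfaid:conformance(?:>|=["\'])\s*([A-Za-z])')


def claimed_pdfa_level(data: bytes):
    """Return the declared level (e.g. 'PDF/A-1b'), or None if no claim is found.

    This reports what the file says about itself; it does not check
    whether the file actually conforms.
    """
    part = _PART.search(data)
    if part is None:
        return None
    level = "PDF/A-" + part.group(1).decode("ascii")
    conf = _CONF.search(data)
    if conf is not None:
        level += conf.group(1).decode("ascii").lower()
    return level


# Hypothetical metadata fragment, standing in for the bytes of a real file.
sample = (b"<pdfaid:part>1</pdfaid:part>"
          b"<pdfaid:conformance>B</pdfaid:conformance>")
print(claimed_pdfa_level(sample))
```

A file for which this returns a level may still fail validation, which is the gap between claim and conformance that the abstract describes.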

While PDF/A is a logical solution for long-term preservation of electronic documents, the stringent standard prevents some content that is frequently used in academic research (e.g., non-Latin glyphs) from conforming. This research project investigated the format conformance of a set of born-digital and digitized theses in ORA. From this research, recommendations about tools and a policy on the use of PDF/A will emerge, so that student research can be digitally preserved and accessed in a non-proprietary file format. Further investigation of the format will inform a guide to best practice for the use of PDF/A in electronic thesis and dissertation repositories.

Every Feature that Rises will Converge?
Towards incorporating notions of feature shape in music information retrieval

Yi-Yun Cheng (Jessica)

Features describing aspects of a musical audio signal can approximate semantic descriptions of interest to musicologists, but understanding and making good use of these features is not straightforward. Imagine a context where a musicologist wishing to conduct a harmonic analysis could be guided toward features sharing a “harmonic shape” (operating in the spectral domain), without requiring extensive signal processing background knowledge. We propose to address this issue by conveying information about feature shapes: the characteristics of the feature extraction process that are shared between different subsets of features.

The Audio Feature Ontology and Vocabulary (AFO/AFV) surveyed existing MIR feature taxonomies, enumerating a comprehensive list of audio features, and presenting process descriptions specifying the operation sequence of each feature extractor. For example, for the chromagram feature, AFO/AFV describes its operation sequence as Windowing, Discrete Fourier Transform, Logarithm, and Sum.
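
The chromagram sequence named above (Windowing, Discrete Fourier Transform, Logarithm, Sum) can be sketched for a single audio frame in plain Python. This is one plausible reading of the four steps, not the AFO/AFV definition: it assumes a Hann window, a log2 mapping from bin frequency to one of 12 pitch classes relative to A = 440 Hz, and a sum of spectral power per pitch class. All function names and the test signal are hypothetical.

```python
import cmath
import math


def hann_window(frame):
    """Windowing: taper the frame with a Hann window."""
    n = len(frame)
    return [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
            for i, x in enumerate(frame)]


def dft_magnitudes(frame):
    """Discrete Fourier Transform: magnitudes of bins 0..n/2 (naive O(n^2))."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame)))
            for k in range(n // 2 + 1)]


def pitch_class(freq_hz, ref_hz=440.0):
    """Logarithm: map a frequency to a pitch class 0..11 (0 = A) via log2."""
    return round(12 * math.log2(freq_hz / ref_hz)) % 12


def chroma_frame(frame, sample_rate):
    """Sum: accumulate spectral power into 12 pitch-class bins."""
    n = len(frame)
    mags = dft_magnitudes(hann_window(frame))
    chroma = [0.0] * 12
    for k in range(1, len(mags)):  # skip the DC bin (0 Hz has no pitch)
        chroma[pitch_class(k * sample_rate / n)] += mags[k] ** 2
    return chroma


# Test signal: a 440 Hz sine. The sample rate is chosen so that
# 440 Hz falls exactly on a DFT bin for a 256-sample frame.
sample_rate = 7040
frame = [math.sin(2 * math.pi * 440 * i / sample_rate) for i in range(256)]
chroma = chroma_frame(frame, sample_rate)
print(chroma.index(max(chroma)))  # index of the strongest pitch class
```

Writing each operation as its own function mirrors the idea of an operation sequence: two features that share the windowing and transform steps would share those functions.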

In this research, we use the operation sequences of the AFO/AFV to inform analytical workflows on feature data exhibiting different types of processes. We also explore the feasibility of feature shape-based filtering and querying within the Internet Live Music Archive, a large collection of audio recordings. We further consider the commonalities and divergences between the operation sequences defined by AFO/AFV and analogous processes within the Extracted Feature Dataset of the HathiTrust, a digital library containing a large collection of OCR-digitized volumes, to gain a more generic understanding of feature shape-based exploration in information retrieval.