Digital Collections as Data: An Untapped Resource for Researchers

New pathways for digital research


A couple of years ago, I presented at an Appalachian Studies conference about how data mining and textual analysis can be used to better understand the Appalachian Studies research landscape and the representation of Appalachian Studies in library collections. At the end of my presentation, some students approached me and asked if this methodology could be used for historiographies and other history-related projects. Not having any history background, I was unsure of what a historiography was or what it was used for. Learning more about historiographies, and about the other projects for which the students were considering data mining and textual analysis, opened my eyes to the true potential of collections as data research.

“Collections as data” research is an initiative in which researchers treat digital collections as datasets and analyze them with computational methods, such as text mining, data visualization, mapping, image analysis, audio analysis, and network analysis. For example, if a historian needs to investigate a specific topic over time, manually reviewing every written piece of history that mentions that topic would be a laborious and time-consuming task. Now, let’s relate this hypothetical to a concrete collection. Say that a researcher wants to investigate the history of the Cleveland Play House. This archival collection at Case Western Reserve University comprises 1,681 containers (1,100 linear feet) and would be a large task to investigate physically. Thankfully, a large amount of it is available online, but reviewing it by browsing each digitized object individually would still be just as difficult. To overcome this obstacle, one would need a way to compile and export all the textual data from the collection, use a computer program to analyze the dataset, identify topics, trends, and patterns, and create a visualization of this data.
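
As a rough illustration of the "analyze the dataset and identify trends" step, here is a minimal sketch in Python. It assumes the text has already been exported as (year, text) pairs; the sample documents, the stopword list, and the function names are all invented for illustration and do not come from any real collection or tool. It simply counts how often each term appears per year, which is one of the simplest forms of text mining.

```python
import re
from collections import Counter, defaultdict

# Hypothetical sample: (year, OCR text) pairs standing in for an exported collection.
SAMPLE_DOCS = [
    (1927, "The board discussed the new theatre building and the season budget."),
    (1927, "Season tickets and the budget were reviewed by the board."),
    (1950, "The children's theatre program expanded; the budget was approved."),
]

# A tiny illustrative stopword list; a real project would use a fuller one.
STOPWORDS = {"the", "and", "by", "was", "were", "a", "of", "new"}


def tokenize(text):
    """Lowercase the text and return word tokens that are not stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def term_counts_by_year(docs):
    """Return {year: Counter of term frequencies} for the given documents."""
    counts = defaultdict(Counter)
    for year, text in docs:
        counts[year].update(tokenize(text))
    return dict(counts)


if __name__ == "__main__":
    # Print the three most frequent terms for each year in the sample.
    for year, counter in sorted(term_counts_by_year(SAMPLE_DOCS).items()):
        print(year, counter.most_common(3))
```

The per-year counts could then be passed to a charting library to produce the kind of visualization described above. Dedicated topic-modeling tools go much further than raw frequencies.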

Thankfully, there are options for exporting the text, and there are programs that can complete this analysis. This is where collections as data research comes into the picture and why digital library professionals have strived to alter their procedures and workflows to ensure that users have access to a collection's data. This push from libraries and cultural heritage institutions for a data-focused approach to support this research initiative has been highly visible in the digital collections field. One example is the Always Already Computational: Collections as Data Report, a grant-funded project that “documented, iterated on, and shared current and potential approaches to developing cultural heritage collections that support computationally-driven research and teaching.” This report and its final deliverables were created to serve as a guide for practitioners looking to develop their programs and services for collections as data research. Since this process depends on information professionals, it requires a fair bit of development on the libraries' end as well, but the payoff is astronomical for all fields of study.

After this so-called “epiphany” at the Appalachian Studies conference and my time working with digital collections day-in and day-out, I constantly see opportunities where these collections and the data within them can be used. From digitized rare books and manuscripts to archival collections filled with meeting minutes, reports, and more, the research potential is endless. But sometimes this potential is not outwardly seen, especially by researchers in fields that may not interact with libraries or cultural heritage institutions frequently. Despite the prevalence of digital scholarship and interdisciplinary computational methodologies, digital collections as sources of data are sometimes forgotten. 




Expanding the use of digital collections

Often, researchers mistakenly believe that browsing and downloading individual items are the only practical uses of digital repositories and libraries. Certainly, viewing and downloading these digitized materials are notable uses, but they are not the only practical applications of these collections. In fact, these collections have a wealth of data sitting within them, beyond the PDFs, images, audiovisual files, and more. The textual datasets from the extracted optical character recognition (OCR) files, descriptive metadata, abstracts, and other supplementary files that are present across these collections are an untapped resource waiting to be further explored, whether in a classroom, research, or exploratory setting.

Textual datasets are not the only sources of data available across digital libraries. For example, digital collections with historical maps are the perfect datasets for GIS tools. Dr. Katherine McDonough gave a recent talk at Bucknell University on “Maps as Data” that discussed methods for creating data from historical maps and how these methods can allow researchers to better understand and analyze these historical materials. Additionally, computational photography scans within digital collections, such as reflectance transformation imaging (RTI) files, are some of the best sources “to view interactively with the capability to adjust the virtual lighting source, and to apply filters that permit more detailed visualization and non-destructive analysis than working with the original object” (Library of Congress, 2023). A current research project at Case Western Reserve University is using RTI in the library’s digitization lab to reveal details on a Roman Coin Collection that are not visible in conventional imaging, allowing for the identification of these ancient coins and verification of existing data on them.

Oral histories and audiovisual collections can serve as sources for phonetic and phonological studies across the fields of modern languages, linguistics, speech pathology, and other related areas. One example of these collections with an immense amount of data is the Linguistic Atlas Project, which is made available online through the University of Kentucky Libraries and houses one of the largest sets of regional linguistic and cultural survey data. Audiovisual collections, such as historical LGBTQ radio programs (for example, the Gay Waves Radio Program Collection from the Western Reserve Historical Society), can provide a wealth of data for identity, sociopolitical, and sociocultural research (exemplified by another archival collection in Diversity on Display: Framing in the Gay Perspective Radio Program). Multilingual materials in a variety of formats can become corpora for all kinds of research, such as endangered language research. Historical music scores and musical recordings can give insight into pop culture throughout history and cultural events that may not be depicted in other artifacts. Museum collections can be transformed into computational datasets to better understand and analyze collections, such as the biodiversity data from natural history museums. There are a variety of ways that these sources of data can be used for further and more detailed research, especially with the rise of open-source and easily available data analysis tools.

How GLAM industries are supporting collections as data

Before even considering a method of analysis, one must compile and extract the data from the digital collections, which is sometimes much easier said than done. This is a huge area of development within the digital collections field. Development of “collections as data” features has begun to skyrocket recently to support users. The Always Already Computational: Collections as Data Final Report brought to light current and potential approaches to developing the digital collections practice to support collections as data researchers, establishing frameworks, use cases, methodologies, and guidance for cultural heritage institutions. Galleries, libraries, archives, and museums (GLAM) have developed checklists to help create and evaluate their collections for better data use and reuse.

Producing these reports, checklists, and similar resources is no easy feat because achieving an output from your collections that can be used as data requires conscious and purposeful actions at every step of the digital lifecycle. Even at the point of digitization, these use cases are taken into consideration, both for digital accessibility reasons and for research. This may require alterations to the post-processing of the digital surrogates, such as producing multiple file types, adjusting the metadata, and more. Additionally, this could require the implementation of a different viewer on the user interface to make sure that researchers can fully interact with the digital objects. Sometimes, this may be as simple as ensuring that multiple file types are available. Other times, it can be more complex, such as creating a customized CSV export function catered to specific users who require a particular combination of metadata. There is also the question of how the data (and even the hardware and software needed to view the data) are digitally preserved so that future researchers can continue using them, especially as media and technology rapidly change.
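
To make the idea of a customized CSV export more concrete, here is a minimal sketch in Python using only the standard library. The records, field names, and the `export_csv` function are hypothetical and are not drawn from any particular repository platform. The point is only that the caller chooses which combination of metadata fields ends up in the file.

```python
import csv
import io

# Hypothetical metadata records; the field names are illustrative only.
RECORDS = [
    {"id": "obj-1", "title": "Meeting minutes", "date": "1927-03-01", "creator": "Board"},
    {"id": "obj-2", "title": "Season program", "date": "1950", "subject": ["Theater", "Programs"]},
]


def export_csv(records, fields, multi_sep="; "):
    """Return CSV text containing only the requested fields, in the given order.

    Missing fields become empty cells; list values are joined with multi_sep.
    """
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for rec in records:
        row = {}
        for field in fields:
            value = rec.get(field, "")
            if isinstance(value, list):
                value = multi_sep.join(value)
            row[field] = value
        writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    # One researcher might only need identifiers, titles, and subjects.
    print(export_csv(RECORDS, ["id", "title", "subject"]))
```

A different user could call the same function with `["id", "date", "creator"]` and receive a different slice of the same records, which is the kind of flexibility a tailored export function is meant to provide.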

These are only a few examples based on my experience with archival and special collections materials, but this adjustment in workflows can be much more complex in other sectors of the GLAM field. Overall, the process of creating and making this data available is something that is being deeply evaluated across the field and greatly depends on technology, staffing, financial resources, availability, needs of users, and many other factors.  

Many features and functionalities that cater to collections as data researchers for ease of access and extraction are quite dependent on the digital library platform itself. Some repositories and digital libraries will have an export function or application programming interface (API) for this data, especially for larger quantities. These features are becoming more common in digital library platforms, both proprietary and open-source, and are great features for researchers. 

Other systems may require enlisting the help of the digital collections librarian in order to export larger datasets that are not accessible through the user interface. This is becoming more common in the digital library field and requires information professionals who work with digital collections to have a greater knowledge of database management, software, and programming. In my work, I am frequently querying our repository’s search server (Apache Solr) to compile and export large amounts of metadata, OCR text, and more for other library departments, users, and myself. More often than not, this is because our user interface did not previously allow the gathering of large quantities of data across hierarchical levels, but this is now being fixed to allow for this type of data collection. Hence, these skills are important because it is impossible to do digital collections work without interacting with datasets every day.
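
For readers curious what querying a Solr search server for a large export can look like, here is a minimal sketch in Python. It is not the workflow used at my institution; the base URL, core name, and field names such as `ocr_text` are assumptions made up for illustration. It relies on Solr's standard `/select` handler and its `cursorMark` parameter for paging through large result sets, which requires sorting on the index's unique key field (assumed here to be `id`). The `fetch` argument is any function that takes a URL and returns the JSON response as text, so the paging logic can be tried without a live server.

```python
import json
from urllib.parse import urlencode


def build_select_url(base_url, query, fields, cursor="*", rows=500, unique_key="id"):
    """Build a Solr /select URL that uses cursorMark paging.

    base_url is the core or collection URL, e.g. a hypothetical
    "http://localhost:8983/solr/demo". cursorMark requires a sort that
    includes the unique key field.
    """
    params = {
        "q": query,
        "fl": ",".join(fields),
        "rows": rows,
        "sort": f"{unique_key} asc",
        "cursorMark": cursor,
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


def parse_page(body):
    """Return (docs, nextCursorMark) from a Solr JSON response body."""
    data = json.loads(body)
    return data["response"]["docs"], data.get("nextCursorMark")


def harvest(fetch, base_url, query, fields):
    """Collect every matching document by following nextCursorMark.

    Paging stops when Solr returns the same cursor that was sent,
    which signals that there are no more results.
    """
    docs, cursor = [], "*"
    while True:
        page, next_cursor = parse_page(fetch(build_select_url(base_url, query, fields, cursor)))
        docs.extend(page)
        if next_cursor is None or next_cursor == cursor:
            return docs
        cursor = next_cursor
```

Against a real server, `fetch` could be as simple as `lambda url: urllib.request.urlopen(url).read().decode("utf-8")`, and the harvested documents could then be written to CSV or JSON for the researcher.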

Because of this, digital collections librarians are creating more datasets than ever before to fill this user need, making them a great resource for researchers. At the 2023 DLF Forum, this was a large topic of conversation due to the increased demand from users. The conversation also raised some frequently encountered challenges, such as storage concerns for the datasets, accessibility of the created datasets, advocacy for policies and planning, scalability, usage rights, and more. Collections as data is still a relatively new trend, making these challenges inevitable, but this also leaves room to continue developing solutions, addressing shared needs, and enhancing the research potential of collections.

Making library collections available for computational research is a growing trend across many user groups. While there is still innovation to come within this realm, it is imperative to reflect on the dynamic and diverse ways that digital collections can enhance research in various subject areas, settings, and fields of study. With the pace of new technology and computational methods, the possibilities for creating, extracting, analyzing, and visualizing data from collections are further expanding, making it more important than ever to remember these hidden sources of data.

