Data Mining and Textual Analysis: A Librarian’s New Best Friend

By learning a couple of tools, you can get big results from data mining.

Since my academic path began in sociolinguistics, qualitative data was always my sole priority. Overwhelming quantities of text, whether in the form of corpora, interviews, or other sources, never intimidated me. Looking back now, I clearly had not encountered the truly overwhelming and hard-to-conceptualize data that a library collects. From circulation and reference interactions to cataloging, metadata, and indexes, the data does not stop.

This was something new to me when I entered library school, and it made me curious what this data is used for. As I learned that libraries also do user research, subject analysis, and more, I could not help but wonder why data mining and textual analysis were not discussed more often. Even while working in Reference Services, I asked librarians around me if anyone did anything with the transactional data that we documented after every patron interaction. My supervisor informed me that she sometimes goes through it line by line (in a sea of hundreds of thousands of transactions), but ultimately it goes to the assessment librarian and the rest is unknown.

Why are librarians not taking advantage of the amazing benefits these tools can bring to their duties, especially when they are already recommending these techniques to patrons across multiple academic disciplines?

Where can data mining and textual analysis be used within librarianship?

I was determined to find ways to creatively analyze the textual data that we see all the time in libraries without painstakingly reading a large text file line by line. Because these methods are highly customizable, run on easily accessible platforms, and produce visualizations that can be understood at a glance, data mining and textual analysis allow for more specialized and detailed evaluation of a variety of resources and services within a library, including, but not limited to, collection development, outreach, accessibility, and strategic planning. Even special projects and DEI initiatives can be enhanced through this technology to ensure representation across collections, services, and even accessibility of resources. Truly, the uses of these techniques are limited only by one’s imagination, which is why data mining and textual analysis can become a librarian’s new best friend!

What tools and technologies are available?

A spectrum of tools is available. Open-source tools would be best, but users still need to have easily located documentation, tutorials, and even forum groups/online communities available for troubleshooting, exchange of knowledge, and more. I fortunately got to explore and learn these tools through a classroom, but that will not be the case for everyone, which is why finding these communities is necessary. This process led me to VOS Viewer and RStudio.

Both tools are easily available and highly documented, but they have different interfaces. I like to think of VOS Viewer as a tool that librarians can use to “dip their toes” into data mining and textual analysis through bibliometrics until they become more confident to explore RStudio for Latent Dirichlet Allocation (LDA) modeling, Biterm Topic Modeling (BTM), and other methods.

How do I use VOS Viewer and RStudio for data mining and textual analysis?

VOS Viewer
1. Convert/compile CSV files
2. Open/input the files into VOS Viewer
3. Data is compiled, processed, analyzed, and visualized

RStudio
1. Compile data and convert into CSV files (if not already)
2. Go through text-preprocessing, stemming, frequency analysis
3. Topic modeling/visualization (BTM, LDA, and more)
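
To make the text-preprocessing, stemming, and frequency analysis step listed above a little more concrete, here is a minimal sketch. It is written in Python rather than R purely for illustration, and it is not the workflow used for the figures in this post. The stopword list, the crude suffix-stripping "stemmer," and the sample sentences are all assumptions made up for this example; a real project would likely rely on an established stemming library.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "for", "in", "on", "with", "is", "are"}


def naive_stem(word):
    """Strip a few common English suffixes. A crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three letters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, keep alphabetic tokens, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]


def term_frequencies(documents):
    """Count how often each stemmed term appears across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(preprocess(doc))
    return counts


# Made-up sample text, not real reference transactions.
sample_docs = [
    "Patron asked for help citing sources",
    "Help finding sources for a linguistics paper",
]
freqs = term_frequencies(sample_docs)
print(freqs.most_common(3))
```

The resulting counts are the kind of table that a frequency plot or a topic model would then be built from.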

VOS Viewer example

VOS Viewer has wonderful visualizations and a simple interface to test out these approaches. It is a great tool to ease into learning the ways that data mining and textual analysis can help you!

For instance, if you wanted to compare the keywords used by authors and the index, you can load these datasets into VOS Viewer and compare the visualizations side-by-side. You may notice specific terms used by the author but not by the index. While the index will inevitably have more keywords, which range from vague to specific, for the purpose of creating more access points, it does show some gaps that could hinder a user from discovering these resources, especially within an academic library setting.

Below, in Figure 1, the data was compiled and extracted from the search strings “Linguistics” and “Libraries” (conducted on Scopus). We can see a variety of keywords from the author that are not present within the index (or not as prevalent). Understanding how different groups and systems perceive keywords is of the utmost importance to ensure discoverability and accessibility for users.
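
The author-versus-index comparison can also be approximated outside of a visualization tool. The sketch below is a rough Python illustration, not how VOS Viewer works internally: it normalizes two keyword lists and reports which terms appear only in one of them. The function names and the keyword lists are hypothetical and were made up for this example; they are not taken from the Scopus export.

```python
def normalize(keywords):
    """Lowercase and trim keywords so trivial differences do not count as gaps."""
    return {k.strip().lower() for k in keywords if k.strip()}


def keyword_gaps(author_keywords, index_keywords):
    """Split keywords into author-only, index-only, and shared groups."""
    authors = normalize(author_keywords)
    index = normalize(index_keywords)
    return {
        "author_only": sorted(authors - index),
        "index_only": sorted(index - authors),
        "shared": sorted(authors & index),
    }


# Made-up keywords for illustration only.
author_kw = ["Corpus Linguistics", "digital libraries", "metadata"]
index_kw = ["Digital Libraries", "Metadata", "information retrieval", "linguistics"]

gaps = keyword_gaps(author_kw, index_kw)
print("Used by authors but not the index:", gaps["author_only"])
```

Terms in the author-only group are candidates for the kind of discoverability gap described above.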

Two concept maps showing results for search strings "linguistics" and "libraries"
Figure 1. Author keyword co-occurrences (left) vs. index keyword co-occurrences (right)

RStudio examples

RStudio gives users a bit more wiggle room for analysis and visualization. Personally, my favorite type of modeling is Biterm Topic Modeling (BTM), which I think can be one of the most useful tools for librarians, especially for looking into collection development, DEI initiatives, user research, and reference data to investigate gaps and recurring patterns. Within BTM, topics are extracted based on co-occurrence patterns (biterms). The extracted topics represent semantically related terms and are visualized as the many shapes within a model.
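
For readers wondering what a biterm actually is, the following sketch may help. It only covers the first ingredient of BTM, namely collecting unordered word pairs that co-occur in the same short text and counting them; it does not fit a topic model. It is written in Python for illustration (the figures in this post were made in RStudio), it treats a biterm as a pair of distinct words, which is a simplification, and the tokenized sample texts are made up.

```python
from collections import Counter
from itertools import combinations


def biterms(tokens):
    """Return every unordered pair of distinct words in one short text."""
    unique = sorted(set(tokens))  # sorting makes (a, b) and (b, a) the same pair
    return list(combinations(unique, 2))


def count_biterms(tokenized_docs):
    """Count how often each biterm occurs across a collection of short texts."""
    counts = Counter()
    for tokens in tokenized_docs:
        counts.update(biterms(tokens))
    return counts


# Made-up, already-tokenized short texts (an assumption for this sketch).
docs = [
    ["chat", "citation", "help"],
    ["chat", "database", "help"],
    ["email", "citation", "help"],
]
counts = count_biterms(docs)
print(counts.most_common(2))
```

A full BTM then estimates topics from these pair counts, so pairs that keep turning up together end up in the same topic.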

In Figure 2, the compiled and processed data came from job descriptions at an academic library. This library was beginning a strategic plan that included an investigation into its organizational structure. By taking the job descriptions of all the librarians and making them into a CSV file (which then went through text-preprocessing, cleaning, and data mining processes), the data was made into a Biterm Topic Model.

This figure shows us a lot of vague and repetitive terms that overlap between all the library positions along with their roles and responsibilities. Overlap indicates that there are terms co-occurring across the extracted topics. Additionally, in the survey data and other materials compiled into the strategic plan, there was a lot of feedback pertaining to a lack of organizational awareness. This visualization lets us see this issue directly. Figure 2 can also serve to assist the reorganization (and documentation) of these roles and responsibilities to avoid this issue in the future.

Example of overlapping terms using a BTM analysis in RStudio
Figure 2. Biterm Topic Model of strategic planning data
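
The idea of overlap between topics can also be expressed as a number. The sketch below is one simple way to do that and is not how Figure 2 was produced: it takes the top terms of each topic and computes the Jaccard similarity (shared terms divided by all terms) for every pair of topics. It is written in Python for illustration, and the topic names and term lists are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Share of terms two topics have in common, from 0.0 (none) to 1.0 (identical)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def topic_overlaps(topics):
    """Compute the Jaccard similarity for every pair of topics."""
    return {
        (name_1, name_2): jaccard(topics[name_1], topics[name_2])
        for name_1, name_2 in combinations(sorted(topics), 2)
    }


# Made-up top terms per topic, for illustration only.
topics = {
    "topic_1": ["services", "support", "research", "instruction"],
    "topic_2": ["services", "support", "collections", "outreach"],
    "topic_3": ["metadata", "cataloging", "records", "systems"],
}
overlaps = topic_overlaps(topics)
for pair, score in overlaps.items():
    print(pair, round(score, 2))
```

Pairs with high scores share much of their vocabulary, which is the pattern described above for roles and responsibilities that blur together.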

RStudio is great for administrative data, as seen above, but is also an amazing tool for bibliographic and citation data. One example of the versatility of this technology is for diversity, equity, inclusion, and belonging initiatives. For example, if you need to investigate the descriptors and metadata currently used in a database, catalog, etc., you can compile, preprocess, clean, and visualize the indexed bibliographic data to see if communities and fields are equally represented semantically (seen in Figure 3). Additionally, it serves to investigate the usage of inclusive descriptors to make sure that all feel seen and represented when interacting with any and all aspects of the library. Figure 3 shows a lot of overlap, but there are key areas where topics are not overlapping, such as a few of the medicine and health sciences topics. This lack of overlap could indicate room for improvement in the representation of groups, communities, and inclusive descriptors.

Figure 3. Biterm Topic Model of digital collections

The same principles in Figure 3 can be applied to specific search strings, such as “digital humanities” AND “archives” (Figure 4). Looking into specific search strings, topics, or subject areas can give insight into the historical and current research landscape. It can also assist in locating gaps across digital resources, showing areas of improvement and how keywords are perceived. Below, one can see that literature, language, heritage materials, and a few other topics overlap, showing the prevalence of these terms co-occurring, but “academic library” does not show similar keywords, hinting that there is a gap in the collection. Understanding research landscapes, areas of improvement, and keyword perception can be critical for subject librarians, digital collections librarians, and many other departments.

A cluster of shapes and colors showing a BTM of the search string "digital humanities" AND "archives"
Figure 4. Biterm Topic Model of search string “digital humanities” AND “archives”

The last visualization that I will show you is data compiled from the reference interactions that a library keeps for analytic purposes. This data can be instrumental for seeing resources, services, and departments that need improvements. It can also show trends in frequently used services, user needs, and so much more. Figure 5 demonstrates the highly customizable nature of data mining and textual analysis along with the unlimited uses that it can provide. Below, one can see that within reference, “chat” is a larger term with “email” close behind, showing that far more reference interactions occur through these channels than by phone or walk-up. The figure also surfaces topics that patrons frequently ask about, which may point to a need for more resources in those areas.

Figure 5. Biterm Topic Model of reference data

Truly, if you have the textual data, you can reveal patterns, trends, and gaps that would not be visible otherwise. This more detailed evaluation of resources, services, and information that data mining and textual analysis provide can result in better informed decisions across all levels and departments of libraries.

Can this actually be useful in my role or department?

To drive home the utility of these methods, I want every reader to take a minute to think about the textual or qualitative data they interact with every day. Is it in your department, subject area, or patron community? Is it administrative, strategic planning, or library-wide initiative data? What about cataloging, metadata, and search terms? How about data revolving around your own interactions with patrons, in committees, or even user research, surveys, or feedback? In every scenario you are thinking about, data mining and textual analysis can help; you just need to take the first step and explore it for yourself!

Technology is a bit scary. Where should I even start?

Taking that first step to explore any technology can be intimidating and challenging, but you do not have to take on this expedition alone. One of the benefits of working in an academic setting is that many researchers and colleagues around you may use these tools, know of someone who uses these tools, or can lead you in the right direction to find resources. It is time that we explore how these approaches can further our own professional duties, which in turn will help us to better understand how to meet and exceed the patrons’ and institutions’ needs.
