Internet Archive’s Deputy Director Talks Big Data, AI, and Digital Libraries 

A Q&A with Thomas Padilla

If you’ve spent much time on the library side of Twitter, you will likely recognize the name Thomas Padilla or, as he stylizes it, 🌮 Thomas Padilla 🌮. Though initially intrigued by the taco emojis, I grew curious about Thomas’s professional journey from an academic librarian to the Deputy Director of the Internet Archive’s Archiving and Data Services Department, and curious, too, about the contours librarianship takes on within the massive informational context of the internet.

I hope you enjoy this brief Q&A. Be sure to give Thomas a follow.

It seems you began the current arc of your career with an MLIS degree. What path did you take from grad school to your current position at the Internet Archive? Where does the work of a librarian intersect with the Internet Archive?

Around 2009, I was coming to the end of my time teaching English in South Korea and was about to begin a graduate degree in history. Riding the subway in Seoul, talking with a friend about the future, I said it would be great to become a history professor or perhaps play a role in “preserving the present” (e.g., the web, social media, email, software, etc.). Fast forward a couple of years, and I secured a series of internships through the Hispanic Association of Colleges and Universities (HACU) at the National Archives in St. Louis and the Library of Congress. It is not an overstatement to say that I owe my career in libraries to HACU. I had not entertained the idea of entering the profession previously because of the feeling that an individual needed to take on a series of unpaid internships in order to be competitive. That simply wasn’t an option for me. I can’t help but think expectations for experience continue to negatively impact diversity in the profession.

The internship at the Library of Congress turned into a full-time position supporting the launch of a national digital preservation train-the-trainer program (DPOE). At the time, the digital humanities were experiencing strong growth, and it seemed like there was an opportunity to be had in combining my graduate training in the humanities with my experience in digital preservation. I headed to the University of Illinois at Urbana-Champaign (UIUC) for my MLIS, which was a great experience. The curriculum only had two required courses, so I had my pick of graduate courses in applied computational work, where I learned network analysis and programming with Python, and also continued to deepen my digital preservation knowledge with Kyle Rimkus and the late Professor Jerome McDonough.

From UIUC, I began working in academic libraries—gaining experience in subject librarianship, information literacy, digital scholarship, data science, data visualization, and eventually, machine learning and AI. Along the way, much of that experience provided a foundation for a multiyear community capacity building effort focused on encouraging responsible computational use of gallery, library, archive, and museum collections as data—this effort continues! I also had the privilege of serving as a Researcher in Residence at OCLC Research, where I authored Responsible Operations: Data Science, Machine Learning, and AI in Libraries.

Given that the Internet Archive is a library, being a librarian prepares you well for the work. The organization is an inspiring one. The community to be served is truly global. The mission to provide “universal access to all knowledge” certainly gets you out of bed in the morning!

For many, their primary exposure to the Internet Archive is the Wayback Machine, which preserves entire webpages and more. What other projects does the Internet Archive boast?

At any given time, there are many projects underway at the Internet Archive. Speaking about efforts that I work on directly, I’m pretty excited about ARCH (Archives Research Compute Hub)—the product of a collaboration with Archives Unleashed colleagues at the University of Waterloo and York University supported by the Mellon Foundation. ARCH makes computational research and education with digital collections more accessible. The initial implementation supports working with web archives at scale.

With support from the Institute of Museum and Library Services (IMLS), we are continuing the development of ARCH with Expanding ARCH: Equitable Access to Text and Data Mining Services. Working with partners at the University of North Carolina at Chapel Hill, University of Denver, Williams College Museum of Art, Indianapolis Museum of Art, and the Opioid Industry Documents Archive, we will make it possible to work with a broader set of digital collections (e.g., image collections, text collections, and more).

In addition to this effort, my team continues to work with partners to help build community capacity. Community Webs and the Collaborative Art Archive (CARTA) are two prime examples. Since 2017, Community Webs has supported more than 150 cultural heritage organizations in their mission to document underrepresented local histories with support from the Mellon Foundation. CARTA has supported more than 40 art libraries in the preservation of web-based content related to art history and contemporary art practice with support from the National Endowment for the Humanities and IMLS.

This year, we are hosting AI4LAM, an international conference focused on the use of AI in libraries, archives, and museums. We are also taking steps to archive and permanently preserve open source AI with a number of partners. Given the increasingly pervasive impact of this technology, it is essential that it is preserved for posterity.

The Wayback Machine has always made me wonder, is the internet meant to be around forever? Social media apps such as Snapchat and the once-trendy BeReal take joy in the internet’s ephemerality, and a tossed-off tweet to an audience of ten followers seems to have little historical value. What is the case for preserving the internet?

In my opinion, the case for preserving the internet is no different from the case for preserving any other knowledge. Questions of value shift over time as people and societies change. With the web, we have the ability to collect centralized and decentralized human expression at scale. Curatorial approaches can be many in the space: very broad collecting efforts commingle with highly curated collecting. I think that’s all for the good.

As someone whose career took off during the initial boom in digital humanities, what possibilities do you think “big data” tools such as ARCH offer to scholars and researchers? Is any research using ARCH already underway?

In one sense, working with digitized collections as data allows for leveraging computation to extend disciplinary questions in a manner that wasn’t possible with the resources in their original form (e.g., paper newspapers, paper books, artwork on canvas). In another sense, working with collections as data is to work with things as they are—consider the example of collections born of contemporary knowledge production, e.g., the web, social media, email, etc. All of these knowledge forms are inherently computational in the first instance—they are data through and through—produced and circulated by computation. It follows that in order to study these collections in the manner truest to form, computational means would be integral as would a certain engagement with scale that can feel unfamiliar depending on profession or discipline.

Regarding specific ARCH research use cases, we were fortunate to support a number of different internationally distributed research teams with our Archives Unleashed colleagues. Titles for those research efforts are instructive, so I’ll repeat some of them here: “AWAC2 Analyzing Web Archives of the COVID Crisis through the IIPC Novel Coronavirus dataset” (Valérie Schafer et al.); “Using Web Archives for Mapping the Use of Cultural Practices in Postconflict Societies and During Reconciliation Processes” (Ricardo Velasco Trujillo and Luis Gomez); “Web Archiving and the Saskatchewan COVID Archive: Expanding Coverage to Capture Social Media, Medical Misinformation, and Radicalization” (Jim Clifford et al.); and “Latin American Women’s Rights Movements: Tracing Online Presence through Language, Time and Space” (Sylvia Fernandez et al.). More recently, we’ve seen examples of ARCH research use cases focused on the study of modern music history and library practitioner use cases focused on ARCH as a tool for collection curation at scale.

I’m not surprised to hear that AI is on your mind. You recently gave a talk at ACRL on the “mutualistic relationships” between GLAM professionals and AI tools, which I assume plays into the topic of AI4LAM. How is this relationship mutualistic? Is there a way of conceiving of these tools as mutually beneficial while also bearing in mind the real ethical and privacy concerns these tools raise?

I suppose I should say that I think the relationship between librarians and AI is aspirationally mutualistic. As librarians, we have a long history of making use of various tools and technologies. Across each usage, our work is shaped by the tool we hold in hand, metaphorically and literally. The profession contending with the challenges and opportunities of increasingly ubiquitous access to the internet is an instructive sea-change-y sort of lesson from the not-too-distant past.

What remains unclear to me is the extent to which our profession has the ability to impact the creators of the various, predominant AI and ML tools coming into common use. Whether open or proprietary, significant challenges are presented. Questions of ethics or privacy are salient to either path. On the open side of things, we are presented with a challenge of collective action, which is notoriously tricky to effect. On the proprietary side of things, we encounter a for-profit motive that history has shown rarely balances harms and benefits well.

Not infrequently, I run into people who dismiss concerns about AI with a sort of teleological sense of historical progress, where the inevitable end point of imperfect technologies is a more perfect technology with less harm and more benefit. I find the view distasteful and callous, almost as if harms inflicted over the course of implementation are simply acceptable losses in the grand sweep of history. More people should reflect on what positionality and advantages allow them to slide so easily into a third-person omniscient view.

Yes, the teleological notion that our technological trajectory is outside of human control is a dangerous one. On a closing note, then, as librarianship and libraries integrate further with digital technologies, what is one value librarians shouldn’t forget?

My answer to that question is historical rather than technical. It is important that we not forget our histories. Critical historical reflection grounds us. We should consider the cycle of change we find ourselves in—emphasis on history as cyclical rather than a metaphor of change over time where we sit uncomfortably atop an arrow hurtling toward some unknown future. History is full of people thinking they were sitting on a special arrow when it was really just their turn on the carousel. We can be more confident in our sense of what lies around the bend.

