The HathiTrust has rapidly become the largest digital library in the world, and has proven that seemingly impossible dreams become possible when we work together to achieve them.
In the pre-Web 1980s, research librarians began to catch hopeful glimpses of the kind of “collective memory machine” that Vannevar Bush described in the 1940s in his iconic article “As We May Think.” Bush envisaged a massive information repository connected to a navigation device that mimicked the web-like thought association processes of the human mind. Although Bush didn’t conceive of libraries that were “digital,” he forecast that “the Encyclopedia Britannica could be reduced to the volume of a matchbox [and a] library of a million volumes could be compressed into one end of a desk.” The specifics of imagined technologies aside, Bush envisioned something as big and interconnected as the World Wide Web that harnessed the post-war achievements of science to revolutionize information access for the betterment of society. In many ways, he presaged the Web, as well as the idea of a vast library of information available in a thoroughly networked environment.
Fast forward 60 years, and the seemingly sudden availability of digital content in vast quantities, combined with high-speed global networks, has turned what we may once have considered “dreams of a digital library” into a great frontier for research libraries. And there is some urgency to the trek. The vast majority of library use now takes place beyond library buildings: in labs, offices, and coffee shops around the campus, the state, and the world. The desire for anytime, anywhere electronic access to the books, journals, media resources, and archives of a great research library has clearly become the University community’s dream, if not its expectation. However, building a digital library on the multi-million-volume scale of our great research libraries requires expertise and expense beyond any single institution’s capacity. This can only be the work of many, as one.
Libraries have a long history of committing “acts of cooperation” to achieve a greater public good. Cooperation exists in our cultural DNA, an instinctual response to provide the “best books for the most people at the least cost” (a motto for libraries attributed to Melvil Dewey). The 40-year-old Minitex resource-sharing network and much younger Minnesota Digital Library, both with operations based at the University of Minnesota, give ample evidence that libraries, working in cooperation, can provide the public with a wealth of services that would simply be impossible for many libraries to deliver on their own, ever.
The University of Minnesota has served as the foundation for much of the significant library cooperative achievement in the state and region. But what happens when the University dreams of something that requires effort far greater than what it can do alone?
Our big—and perhaps once wildly futuristic—dreams for research libraries are possible if we take cooperation to the next level, says Columbia University’s Jim Neal, one of the thought leaders in research library administration. Neal calls upon our great university libraries to enter an age of “radical collaboration.” By this Neal means that libraries that share goals and a willingness to share risks must go beyond cooperation to create new structures, services, and breakthroughs. Before our eyes, the HathiTrust Digital Library (hathitrust.org) is emerging as a premier example of Bush’s vision for massively scaled information access brought about through acts of radical collaboration.
A Frontier Unfolding
Since the fall of 2008, over 60 of our nation’s largest and most prestigious research libraries, the University of Minnesota among them, have entered into the HathiTrust partnership to build what some claim to be the world’s largest digital library. Leveraging the digitized copies of books and journals returned to libraries by the Google Book Project, along with digital scans from other initiatives, the HathiTrust library has now grown past 10 million total volumes and 3.5 billion full-text searchable pages (but not all viewable, due to copyright). At the moment, all the books and journals in the HathiTrust would fit on a shelf 120 miles long!
Hathi is a Hindi word for “elephant,” an animal highly regarded for its memory, wisdom, and strength.
With very few exceptions, the HathiTrust collection is now larger than any single university library collection in North America (for example, our collection at Minnesota is just over 7 million volumes). Approximately 500 languages are represented, with publications dating from the pre-1500 period to the present. While the University of Minnesota has to date contributed around 100,000 volumes to HathiTrust, nearly 40% of the works in our collection are already represented in HathiTrust. This overlap is expected to grow to 60% by 2014. This is the power of the “collective collection,” a phrase coined to describe the synergy when research libraries bring together their treasured collections to provide a level of access that no one library can achieve alone.
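For readers curious how an overlap figure like this can be computed, here is a minimal sketch. It assumes each work carries a shared identifier that can be matched across catalogs; the identifiers and the `overlap_share` function are hypothetical illustrations, not HathiTrust's actual matching method.

```python
def overlap_share(local_ids, shared_ids):
    """Return the fraction of local works that also appear in the shared collection."""
    local = set(local_ids)
    if not local:
        return 0.0
    # Set intersection keeps only the works held in both collections.
    return len(local & set(shared_ids)) / len(local)


# Hypothetical identifiers for illustration only; not real catalog data.
local_catalog = ["w1", "w2", "w3", "w4", "w5"]
shared_collection = ["w2", "w4", "w9", "w10"]

share = overlap_share(local_catalog, shared_collection)
print(f"{share:.0%} of local works are in the shared collection")
```

In practice, matching works across catalogs is far messier than comparing clean identifiers, since editions, reprints, and cataloging differences all complicate the comparison.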
Words are read, but in the digital environment, they can also be computed upon in the pursuit of new knowledge. This past year, the HathiTrust Research Center was established to enable advanced computational access to the massive collection of digitized text in HathiTrust. Using advanced software tools, researchers will be able to engage in “text mining” to pursue powerful new avenues of text analysis research.
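To give a flavor of what “computing upon” words can mean, here is a minimal sketch of one of the simplest text-mining operations, counting word frequencies across pages. The sample pages and the `term_frequencies` function are assumptions made for illustration; they do not represent the Research Center's tools or data.

```python
import re
from collections import Counter


def term_frequencies(pages):
    """Count how often each word appears across a list of page texts."""
    counts = Counter()
    for page in pages:
        # Lowercase the text and keep runs of letters; a deliberately crude tokenizer.
        counts.update(re.findall(r"[a-z]+", page.lower()))
    return counts


# Hypothetical stand-in for full-text pages; not real HathiTrust content.
sample_pages = [
    "The library preserves the book.",
    "A digital library shares the book widely.",
]

frequencies = term_frequencies(sample_pages)
print(frequencies.most_common(3))
```

Real text-mining research builds on counts like these at a scale of billions of pages, where patterns emerge that no single reader could detect.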
Long-term access to the information contained in our libraries, whether the works are in print, digital, or other formats, depends on a commitment to preservation. While HathiTrust brings an unprecedented level of digital access to research library collections, it is equally committed to preservation of the digital copy over time. By following digital preservation standards and best practices, and by using robust technologies to ensure multiple copies of digitized books and journals are safely kept in widely separated geographical locations, the HathiTrust is exemplifying preservation practices in the digital world.
Complexities and Challenges
Building a digital library on the scale of HathiTrust is not solely a technology challenge. In fact, when the history of HathiTrust is written, technology may end up viewed as among the easier parts of this remarkable undertaking. Duplicate digital copies resulting from bringing together large research collections, quality control, and intellectual property issues are challenges that cannot be entirely resolved by twenty-first-century digital engineering capabilities. Take copyright, for example. HathiTrust makes full use of works in its collection that are in the public domain. Of its 10.1 million volumes, about 2.7 million volumes (or 27%) are fully searchable and viewable, cover to cover. HathiTrust also aspires to make lawful uses of works that are in copyright (for users with “print disabilities”) or that are of indeterminable copyright status (the so-called “orphan works”). These are areas where complete legal clarity or precedent may be lacking, and HathiTrust is defining its policy and advocacy role.
HathiTrust partners share in addressing these specific challenges, as well as overall planning and governance, technology development and evaluation, and costs of operations. This past fall, the HathiTrust held its first Constitutional Convention, complete with ballot measures, to chart its next stage of governance and programmatic focus. Which brings us back to the long road that HathiTrust has traveled in a remarkably short time to harness vision, use of advanced information technologies, and a daring dose of radical collaboration to achieve the common good.
John Butler is the University of Minnesota Associate University Librarian for Information Technology. He arrived at the University in the late 1980s as a freshly minted, ideas-brimming librarian, holding glimpses of a digital library. He now serves on the HathiTrust Strategic Advisory Board.