A digital archive for mathematics
Frank Quinn
Virginia Tech
quinn@math.vt.edu
Where will today's literature be in fifty years? How will it be found, and how much will it cost? These questions have particular urgency for mathematics because the older literature is far more extensively used than in other disciplines. Fifty years ago the answer was: on the shelves in any comprehensive library, freely available once you get there. Now no library can afford to be completely comprehensive. This incompleteness and the convenience of electronic materials are likely to reduce library usage. Electronic materials have a whole new set of problems. There are already enough dangling links to cause serious concern about navigation fifty years hence. Commercial links may be reliable, but excellent linking or a search tool that locates 30 hits won't help if we have to pay current copy prices (e.g. $1.00+$.25 per page) to check out each one.
This article suggests the establishment of a central archive specifically tuned to the needs of mathematics. Before describing the proposal we discuss other models, and the needs to be addressed.
Archives
The key characteristic of an archive is "a commitment to preservation - the long-term storage and maintenance of digital information objects in accessible form" [1]. Access is important: preserved material is not much good if we can't get to it or read it. Development of "metadata", classification, indexing, certification, cross-referencing, etc. are jobs for primary journals, reviewing journals, libraries, etc. Archives could use and preserve such metadata, but development is not part of the basic mission.
There is an extensive literature on management of digital information. Most of this concerns "library" issues rather than archiving. Material on archiving tends to focus on technical problems: stability of storage media, and obsolescence of software and hardware [2]. These are important issues but they are receiving plenty of attention and we can be confident that satisfactory solutions will be developed. There is some literature on the design and structure of archives [3], but it is rather speculative since there has been so little actual experience. Accordingly we begin with a brief discussion of some recent examples.
There are four current developments with some archival functions: the journal-as-database, commercial archives, preprint databases, and libraries.
Journals as databases
There are a number of electronic-only mathematical journals, so far all free. Each of these is in effect a stand-alone archive. Some journals have addressed the long-term stability problem by having their archive located in and managed by a library [4]. This helps, but there are still concerns about long-term access, support for mathematical formats, etc.
Some commercial publishers are organizing their article files into archives. This is a non-trivial undertaking: translation or clean-up is usually needed before article files are suitable for network use, and a server, network connection, software, and staff support are needed. This means commercial journal databases are not free even to subscribers. American Mathematical Society (AMS) journal files are available for a 15% surcharge to current subscribers [5]. Non-subscribers can purchase individual articles through the "document delivery service".
Commercial archives
The example here is the "JSTOR" project set up by the Mellon Foundation [6]. The brief idea is to digitize the back issues of core journals in a discipline, and offer subscriptions to the resulting database. In mathematics they plan to scan the American Journal, the Annals, and the AMS journals. Much has been made of the $2 billion endowment of the Mellon Foundation, but in fact the JSTOR project is intended to be economically self-sufficient. It is also "nonprofit", but essentially commercial in design, just as the AMS is nonprofit but its journals are indistinguishable from commercial journals. Inducements, including profit-sharing, are offered to publishers for participation, but participation is expected to be limited to a small number of "core" journals.
JSTOR is designed to address the problem of back issues. The plan calls for a single format --- enhanced bitmaps --- obtained by scanning. It really should be seen as a supplement to the recent-issue databases described above. It may evolve beyond that, but currently digitizing the core in other disciplines has a higher priority than expanding beyond the core of mathematics or providing a better interface with the present.
Preprint databases
The models above are published material evolving toward databases. Some preprint databases are evolving toward publication. The AMS preprint server is explicitly ephemeral: postings have expiration dates, and many are not stored in full-text form but as pointers to other servers. By design there is no archival function.
The theoretical physics database, the "archive with an attitude", was established by Paul Ginsparg at Los Alamos National Laboratory (LANL) [7]. This is more ambitious in design since it does contain full texts, and there are no plans to delete any files. Ginsparg's vision for journals in this context is as collections of pointers to material in the database. After peer review and editorial acceptance the file would be "frozen", and these frozen and officially accepted files would constitute the published literature. There are plans to reconstitute at least one American Physical Society (APS) journal, or an isomorph thereof, in this mode. The officers of the APS have limited enthusiasm for this since the APS, like the AMS, depends on income from its journals. The official plan is to offer APS journals in a commercial database much like that of the AMS.
Libraries
Many libraries are experimenting with digital archiving [8], but mostly with access restricted to the local community. Local access is the traditional mission of libraries, and most commercially acquired digital material has strong restrictions on access. Changes in copyright laws are likely to strengthen these restrictions [9]. Libraries could broaden their service by acting as resellers, returning fees to publishers through the Copyright Clearance Center. This is usually seen as contrary to library missions and inappropriate competition with commercial document delivery services. There have been suggestions that libraries could in effect become publishers of scholarly material, and maintain archives to which they themselves hold copyright [11]. There has been some movement in this direction, but it will not be a major theme.
The job
We collect needs and requirements for a digital mathematical archive.
Post-commercial publication
The most immediate need is for a stable environment for the evolution of post-commercial forms of publication. There have been forecasts of the demise of journals as we know them [10]. These may be exaggerated, but there will be serious adjustments and in ten years a substantial part of the literature will not be in commercial journals. In this context we use "publication" to refer to the quality control and certification: peer review, editorial selection, etc. These are vital functions, but how will they work outside the traditional journal framework?
Ideas for non-commercial publication go from volunteer versions of traditional journals to free-lance referees and voting. Caution is required, but it is too soon to rule out even the wildest-sounding scheme. What is clear is that all of these approaches need the environment provided by an archive. Editors of free journals can select and certify, but they cannot guarantee preservation in perpetuity. There must be security mechanisms to ensure that the material accessed in fifty years is the same as that "published", and these mechanisms must be credible to readers, authors, and editors. And if a group wants to try "community consensus" or freelance editing, there should be some oversight or control over who - or what - can declare something "published".
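One possible form for such a security mechanism is a recorded fingerprint of each published file. The following is a minimal sketch, assuming the archive stores a SHA-256 digest at the moment a file is declared published; the function names and the sample document are hypothetical and are not part of any existing archive software.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of a document's bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_unchanged(data: bytes, recorded_digest: str) -> bool:
    """Compare the bytes served today with the digest recorded at publication."""
    return fingerprint(data) == recorded_digest


# Hypothetical sample file; the digest would be stored when it is "published".
published = b"\\documentclass{article}\\begin{document}Theorem 1.\\end{document}"
recorded = fingerprint(published)

same_file = is_unchanged(published, recorded)            # identical bytes
altered_file = is_unchanged(published + b" ", recorded)  # one byte appended
```

A digest of this kind only detects change. To be credible to readers, authors, and editors it would also have to be kept somewhere that those with write access to the file cannot alter.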
This need to support non-commercial activity requires that most of an archive must be freely accessible. There are, and will be, free databases. If the archive is not free then a great deal of the activity it should support will take place in these other databases, and the archive cannot be successful in its most important mission. This conclusion poses a serious constraint and challenge to the design of an archive.
A safety net
There are two ways an archive could serve as a safety net for commercial publication. The first concerns back issues. Many publishers would donate rights, and possibly files, for back issues, as long as it does not endanger current subscriptions. This may be the only way to bring them to life: publishers derive very little revenue from back issues now, and few math journals would get enough from electronic versions to pay conversion and archiving costs, let alone generate a profit.
The second safety net function concerns commercial failures. A commercially non-viable journal could reconstitute itself as an electronic journal available through the archive. This benefits everyone: authors, editors and readers benefit from continuity in the journal. Publishers benefit by being able to disengage from unprofitable journals without damaging or alienating the scholarly community. As indicated above, commercial failures are likely to become an urgent problem in the next decade.
Mathematical needs
Mathematics needs a specifically mathematical archive for many reasons. First, mathematics is unusual among sciences in its use of older literature, so it has more need for long-term stability and access. When networks first developed the belief was that physical location of information would be unimportant: distributed material could be linked and appear functionally as a single huge document. At any given moment this may be true, but machines and addresses change, documents are discarded or updated, and network protocols are unevenly supported. In practice the functionality of widely distributed material decays over time. The best, and perhaps only, hope that links will still work in fifty years is to have it all in one place. There should be backups, mirror sites, etc., but a single authoritative copy.
Mathematical social structure and culture are adapted to a high level of quality control in publication [12], and an archive should be configured to support this. We expect experimentation with different ways to identify material for inclusion into the literature. But some overview process is needed to ensure that however it happens, the outcome is satisfactory. We want to experiment with the "control" not the "quality" in quality control.
Mathematical documents have particular characteristics that raise some problems and avoid others. For example, the need for equations and complex displays is a problem, but the almost universal use of TeX is an advantage. Consequently issues of transmission, display, and migration to new formats can be more effectively addressed in a specialized archive.
A proposal
This section outlines a structure for an archive designed to meet the needs discussed above. The primary constraint is economic: since the greatest need is for a public archive that will not directly produce revenue, expenses must be minimized and other sources of support must be identified. Possible activities of the archive are described in stages depending on resources.
The core mission
The absolute minimal function is to receive and store, fully automatically, digital mathematical material of all kinds, and offer network access to it in the format in which it was received. Information about type and status (article, computer program, font file, style file; preprint, submitted, published, date of last revision, etc.) would be maintained. The archive would control who is authorized to change either a document or its status. For example a published document should not be changed by anyone, and an editor should be able to declare a document published only with the permission of the author. Material would be migrated to new storage media and low-level formats, but processing and high-level format conversion (e.g. LaTeX to PDF or SGML) would not be part of the core mission.
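The two authorization rules given as examples can be sketched as follows. This is only an illustration under stated assumptions: the class and function names are hypothetical, the status list is reduced to three values, and the rule that only the author may revise an unpublished document is an assumption not stated above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    PREPRINT = "preprint"
    SUBMITTED = "submitted"
    PUBLISHED = "published"


@dataclass
class Record:
    author: str
    content: bytes
    status: Status = Status.PREPRINT
    author_permits_publication: bool = False
    published_by: Optional[str] = None


def revise(record: Record, who: str, new_content: bytes) -> None:
    """Replace the content of an unpublished document."""
    if record.status is Status.PUBLISHED:
        # Rule from the text: a published document is changed by no one.
        raise PermissionError("published documents are frozen")
    if who != record.author:
        # Assumption for this sketch: only the author may revise.
        raise PermissionError("only the author may revise")
    record.content = new_content


def declare_published(record: Record, editor: str) -> None:
    """Freeze a document; allowed only with the author's permission."""
    if not record.author_permits_publication:
        raise PermissionError("author has not given permission")
    record.status = Status.PUBLISHED
    record.published_by = editor


# Usage with hypothetical names.
paper = Record(author="author-a", content=b"draft 1")
revise(paper, "author-a", b"draft 2")
paper.author_permits_publication = True
declare_published(paper, editor="journal-x")
```

After the last call any further `revise(paper, ...)` raises `PermissionError`, whoever makes it.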
Although the focus is on the public archive, there might be restricted areas. For instance commercial journals might deposit current files for safekeeping, on the condition that they not be accessible for some number of years, or while they are offered commercially by the publisher.
The second level
This concerns services that would be offered if at all possible, but not absolutely guaranteed. The principal service is format conversion. Typically an article will be stored in a source file (e.g. LaTeX), while the user will want a display format (dvi, PostScript, pdf, html, ...). The archive should support standard source formats, including the use of style files and fonts deposited in the archive. It would be the responsibility of editors or authors to specify the format and be sure that the document actually compiles. The archive would then export it in user-requested formats, and migrate it to new source formats (e.g. SGML) as appropriate. Facilities for automatic format conversion are being actively developed, e.g. at LANL, so these should be available at startup.
Another second-level activity concerns software. There is enormous activity in software development, but it is uneven in both coverage and quality. The archive might advertise math-specific needs (e.g. a hyper-dvi viewer for the Macintosh), provide fonts, and test and recommend plug-and-play browser-viewer configurations. There is a significant population of users who cannot handle technical computer problems. For them the goal of providing access to archived material cannot be fully realized simply by posting files on the network.
There are activities not directly concerned with publication that might be coordinated with the archive. For example the University of Tennessee-based Math Archives is a compendium of mathematical material available on the net. It is not an archive in the sense used here, but there are good opportunities for collaboration.
The third level
At the third level are activities requiring greater resources. The most important of these is digitization of the older literature. The JSTOR project plans to scan the "most important" journals, but most of the literature will not be included. Scanned images might be donated by volunteers organized either by the archive or by "lesser" journals. Some sort of cost-sharing arrangement might be worked out between the archive and the reviewing journals to scan new material as it comes in. In a few cases it might be possible to translate old typesetting files into PostScript or other page images. However it gets done, the archive would be the repository for the result.
There are opportunities in optical character recognition. The complexity and need for accuracy in mathematics is still beyond current capabilities. The archive could develop standard examples and well-defined goals, and provide impartial evaluation of progress toward the goals, even if it were not directly involved in research projects.
Another goal is the activation of bibliographies: adding, to each reference to another work, a link to a copy in the archive. Some of these will be supplied by authors. The LANL group is experimenting with automatic addition of links to appropriately formatted TeX references. Even scanned images are not hopeless. Formatting in bibliographies is distinctive and much more structured than general mathematical text. OCR with an error rate unacceptable for text conversion should still allow capture and matching of a large percentage of references. The archive should provide receptacles for such information, whatever the source. Even readers tracking down links for their own use should be able to donate this information to the archive.
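The claim that imperfect OCR can still match references can be illustrated with approximate string matching. This is a minimal sketch using Python's standard difflib module, not the method used by the LANL group; the identifiers, the 0.8 cutoff, and the simulated OCR errors are all assumptions made for illustration.

```python
import difflib
import re
from typing import Dict, Optional


def normalize(reference: str) -> str:
    """Lowercase and reduce punctuation to spaces so formatting differences do not matter."""
    return re.sub(r"[^a-z0-9]+", " ", reference.lower()).strip()


def match_reference(ocr_text: str, catalog: Dict[str, str], cutoff: float = 0.8) -> Optional[str]:
    """Return the identifier of the best-matching catalog citation, or None if nothing is close."""
    by_citation = {normalize(citation): ident for ident, citation in catalog.items()}
    hits = difflib.get_close_matches(normalize(ocr_text), list(by_citation), n=1, cutoff=cutoff)
    return by_citation[hits[0]] if hits else None


# Hypothetical identifiers; the citations are taken from the notes of this article.
catalog = {
    "id-odlyzko-1995": "Odlyzko, Andrew Tragic loss or good riddance? The impending demise "
                       "of traditional scholarly journals Notices AMS 42 (1995) 49--53",
    "id-steinberger-1996": "Steinberger, Mark Electronic Mathematical Journals "
                           "Notices of the AMS 43 (1996) 13-16",
}

# Simulated OCR output with typical character confusions (e/c, l/1, m/rn).
scanned = ("Odlyzko, Andrcw Tragic 1oss or good riddance? The irnpending demise "
           "of traditional scholar1y journals Notices AMS 42 (1995) 49-53")

found = match_reference(scanned, catalog)
missing = match_reference("Smith, J. An unrelated paper 1990", catalog)
```

Here `found` is the identifier of the Odlyzko entry despite the character errors, and `missing` is `None`. A real collection would need a faster index than pairwise comparison, and some references would still have to be resolved by hand.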
Activities to be avoided
There are a number of things that should not be done by the archive, either because they are someone else's job, or they are beyond the resources available.
There should be no filtering or sorting of the material received. This is the job of journals, editors, or new publication mechanisms, and the objective is to provide an environment for this activity, not usurp it. There may be a lot of junk, but it will be harmless as long as it is not mislabeled. "Metadata" - classifications, indexing, abstracts - should be limited to material supplied by the submitter, editors, and users, and some material automatically generated from this (e.g. reference links). Metadata is the job of reviewing journals, libraries, and others.
The archive should provide very limited search capability. The mission is to store material, not locate it. Since the material would be unfiltered, organizing and locating it should be undertaken by journals and other quality-control mechanisms, and the reviewing journals. Further, good search software is expensive, and configuring a database for searching may also be expensive. Someone else may want to develop search capabilities. The archive could permit this, but should not have search capability as part of its basic mission.
Organization
The need to minimize costs requires a lean organizational structure. There should be a volunteer Director or Curator who would oversee day-to-day operation, formation of boards or task forces, and recruitment and oversight of volunteers. The Director should be accountable to the mathematical community through a board of trustees, much as the Executive Editor of Math Reviews is accountable to the MR Editorial Committee. There should be a committee or other liaison mechanism with the host institution. Further details would have to depend on the context.
Support
Startup costs would probably be in the $100,000 range, though a minimal version could be set up with less. The software developed with NSF support by the LANL group for the physics archive would probably be suitable with minor modification.
The server should be located in another institution so physical maintenance, systems work, network connections, etc. could be provided by their staff on a contract basis. A separate dedicated staff would be expensive and unnecessary. Under these conditions the yearly base costs would probably be in the $20-40,000 range. The proposal here is to raise this through voluntary memberships from mathematics departments and corporations. It may seem odd to propose supporting a permanent archive through contributions. However a product that could be offered commercially could easily cost over ten times as much. It would probably be as easy and reliable to collect $200 from 100 departments voluntarily, as to collect $2,000 from 100 libraries for subscriptions. Indeed it is exactly impending problems with subscription products that provide impetus for an archive. Also it is harder to get startup support for commercial products, or offerings that would compete with a commercial product. Finally, as argued above, it seems unlikely that a subscription product could address the most pressing needs. Nonetheless there is a concern about long-term stability. To address this the host institution might be asked to pledge support, say up to $20,000 yearly, if voluntary contributions fall short.
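The figures in this paragraph can be checked with a few lines of arithmetic. This sketch uses only the numbers quoted above and assumes, optimistically, that all 100 departments or libraries actually pay; the variable names are illustrative.

```python
participants = 100         # departments, or libraries in the subscription case
voluntary_fee = 200        # dollars per department per year
subscription_fee = 2_000   # dollars per library per year
base_cost_low, base_cost_high = 20_000, 40_000  # yearly base cost range
host_pledge = 20_000       # maximum yearly back-up support from the host

voluntary_income = participants * voluntary_fee
subscription_income = participants * subscription_fee

print(voluntary_income)                                   # 20000
print(subscription_income)                                # 200000
print(subscription_income // voluntary_income)            # 10
print(voluntary_income >= base_cost_low)                  # True
print(voluntary_income + host_pledge >= base_cost_high)   # True
```

On these assumptions voluntary memberships alone reach only the low end of the estimated base cost, and it is the host's pledge that covers the rest of the range.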
Location
The archive described here needs a host institution. Criteria are: long-term stability, a provider of digital information services willing to provide maintenance and systems support on a contract basis, a connection with the mathematical community, and finally a willingness to pledge back-up support to guarantee permanence. There seem to be two possibilities: the AMS, or a major university library.
The AMS would seem to be the most natural home for a mathematical archive. Math Reviews is already an AMS subsidiary. There are two problems. First the archive would have to be insulated from the committee structure, much as Math Reviews is insulated. Recall "an elephant is a mouse designed by a committee", and AMS committees have designed many an elephant. Second it would have to be insulated from the staff. The staff has become aggressively commercial in recent years, as part of the effort to be the "publisher of choice" rather than of last resort, and to boost sagging revenues. The archive would be seen as a competitor to the AMS subscription products, and there would be pressure to try to derive revenue from it. There is a history of such conflict between the American Physical Society and the Los Alamos archive, and we should be careful not to repeat it.
A university library might be a good location. Math Reviews has a close and beneficial relationship with the library at the University of Michigan. A university with a great library and a history of strong mathematics (Harvard, Yale, Princeton, Berkeley...) would have appropriate resources and connections to the mathematical community. The local-community orientation of most libraries might be a problem: how does it serve their mission to support something principally for outside use? And particularly why should they pledge back-up support if outsiders won't contribute? There might be pressure to embed it in something larger, including statistics, computer science, or even all science, to make it more highly visible and maybe attract support as an NSF "digital library." And again there is the delicate balance between accountability and autonomy.
Conclusion
Mathematics needs a permanent public digital archive, designed and managed to address the special needs of mathematics. There are many constraints, many jobs to do, and quite a few conflicts among these. The time is ripe for a vigorous discussion of the issues.
Notes
[1] Quotation from Preserving Digital Information, the report of the Task Force on Archiving of Digital Information.
[2] See Rothenberg, Jeff, Ensuring the Longevity of Digital Documents, Scientific American 272 (January 1995) 42-47, and Lynch, Clifford, The Integrity of Digital Information: Mechanics and Definitional Issues, Journal of the American Society for Information Science 45(10) (1994) 737--744.
[3] See [1], and Ackerman, M. S., and R. T. Fielding, Collection Maintenance in the Digital Library (1995).
[4] The New York Journal of Mathematics is described in Steinberger, Mark, Electronic Mathematical Journals, Notices of the AMS 43 (1996) 13-16. The Scholarly Communications Project at Virginia Tech maintains files for 13 electronic journals.
[5] Information about the AMS journals.
[6] See the JSTOR home page, and Bowen, William, JSTOR and the economics of scholarly publication, address to the Council on Library Resources, Sept. 18 1995.
[7] Ginsparg, Paul, The LANL physics archive.
[8] See e.g. Project Open Book at Yale and the Scholarly Communications Project at Virginia Tech.
[9] There is a vigorous literature on copyright developments. See the National Information Infrastructure working group report; the National Conference of Lawyers and Scientists essay How does the Texaco case affect photocopying by scientists?, Science Vol. 270, p. 1450--1; and Okerson, Ann, Whose article is it anyway?, Notices of the AMS 43 (1996) 8-12.
[10] Odlyzko, Andrew, Tragic loss or good riddance? The impending demise of traditional scholarly journals, Notices AMS 42 (1995) 49--53; Odlyzko, Andrew, On the road to electronic publishing, preprint 1996; Okerson and O'Donnell ed., Scholarly Publication at the Crossroads: a Subversive Proposal for Electronic Publishing, Association of Research Libraries, Washington DC 1995; and Quinn, Frank, Postcommercial scholarly publication.
[11] Quinn, Frank, A role for libraries in electronic publication, EJournal vol 4 no. 2 (1994), reprinted in Serials Review 21 (1995) 27-30; and Quinn, Frank (with Gail McMillan), Library copublication of electronic journals, Serials Review 21 (1995) 80-83.
[12] Quinn, Frank, Roadkill on the electronic highway: The threat to the mathematical literature, Notices of the American Math Society 42 (1995) 53-56, expanded version in Publishing Research Quarterly 11 (Summer 1995) 20-28.