Robert D. Cameron CMPT TR 94-08
School of Computing Science
Simon Fraser University
cameron@cs.sfu.ca
Thu Dec 1 10:54:42 PST 1994
With the spread of gopher in 1991 and the World-Wide Web soon thereafter, it became possible to provide organized access to internet resources through well-designed, easy-to-use menus and web pages. Thus began a number of experiments at sites throughout the internet in the creation of internet electronic libraries, variously called ``virtual libraries,'' ``digital libraries,'' ``libraries without walls,'' and so forth. At Simon Fraser University, we have conducted our own experimental development of an Electronic Library, with a particular focus on collections development in computing science. Interest in electronic library development has continued to grow and has seen the establishment of conferences devoted to the topic [SLFM94][Rut94] as well as major projects to investigate new technologies and organizations for digital libraries [WW94][BH94].
In this paper, our concern is with a commonplace problem in present-day electronic libraries: failure to provide the most authoritative, up-to-date, reliable or useful forms of access to internet materials. These failings arise because most of us have tended to employ relatively ad hoc collection management processes as we have experimented in the development of electronic libraries. As a step towards improving the collection management activities in internet electronic libraries, then, this paper proposes four principles for material acquisition and outlines possible techniques and technologies for applying the principles.
To install a new e-text, e-serial or other resource in your collection, the first step should always be to identify and link to the original source. This will provide users with an access method to locate the most up-to-date and authoritative source for the item in question. Of course, it may be that the originating site is remote, unreliable, poorly organized, or accessible using only a lower-level protocol (e.g., ftp). However, these are reasons for providing additional methods of access to the resource, not for omitting the link to the original. It will still be valuable to have the original source available when questions of authoritativeness or currency of information arise.
Failure to install links to the original source can easily cause problems. For example, consider the case of the e-journal The Public-Access Computer Systems Review, published by the University of Houston Libraries. It is not difficult to determine an appropriate original source of this publication on the University of Houston Libraries gopher server, URL gopher://info.lib.uh.edu/11/articles/e-journals/uhlibrary/pacsreview. On October 12, 1994 and again on November 29, 1994, small surveys were conducted to see how various internet electronic libraries provided access to this e-journal. The University of Houston archive understandably had a complete collection up to and including volume 5, number 6 on October 12 and volume 5, number 7 on November 29. In addition to the Houston archive, five other archives were identified at notable gopher-based internet libraries[+]. Unfortunately, none of these archives provided an up-to-date collection at the time of both surveys. Three of the archives were significantly out-of-date, with several missing issues during both surveys. The other two did better and were each up-to-date during one of the surveys and somewhat behind during the other. Fortunately, the CICNet Electronic Journal Collection, widely used as a principal archive for electronic journals, did in fact provide access to PACS Review through a direct link to the Houston server. However, this appeared to be a recent development; at the time of the first survey, veronica searches were still yielding references to a defunct archive that had apparently been maintained previously at CICNet.
In addition to providing an out-of-date collection, failure to link to the original site may also prevent users from discovering important additional resources related to the item. In the case of PACS Review, the University of Houston Libraries menu provides access to all issues of the journal in a self-extracting PKZIP file, information on ordering a print version of the journal, guidelines for prospective authors, and a top-ten listing of journal articles. Furthermore, as one might expect, the organization of the PACS Review menus at Houston is better than that at most of the other sites.
Linking to the original source is such a good first step in adding an item to the collection that it often makes a good final step as well! If the original source is reliably maintained, easily accessible using all the common clients, and provides good organization and indexing for the item in question, there may be little to gain in doing anything else. However, the converse is also true: additional work may well be justified in order to enhance the reliability, accessibility, organization or indexing of an item.
In internet terminology, a mirror is an archive at one network site that faithfully reproduces an original archive at another site. Mirrors provide an efficient means of access for clients that are ``electronically closer'' to the mirror site than to the original archive[+]. Mirrors also provide greater reliability: an alternative source for the material when the original archive is unavailable (either temporarily or permanently). Mirroring is widely used in ftp-space, based on the mirror software package [McL94]. This software automatically tracks changes at originating ftp sites and duplicates those changes at the mirror site, with a variety of options.
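To make the idea concrete, the following sketch (in Python, with hypothetical host and directory names) shows the core of a single mirroring pass: compare the listing at the originating ftp site with the local copy and fetch whatever is missing. The mirror package itself is considerably more capable, comparing timestamps and sizes, handling deletions, and recursing through subdirectories.

    import ftplib
    import os

    # Hypothetical originating site and directory.
    HOST = "ftp.example.org"
    REMOTE_DIR = "/pub/e-texts"
    LOCAL_DIR = "mirror"

    ftp = ftplib.FTP(HOST)
    ftp.login()                        # anonymous login
    ftp.cwd(REMOTE_DIR)
    os.makedirs(LOCAL_DIR, exist_ok=True)

    for name in ftp.nlst():            # names present at the source
        local = os.path.join(LOCAL_DIR, name)
        if not os.path.exists(local):  # fetch only what we lack
            with open(local, "wb") as f:
                ftp.retrbinary("RETR " + name, f.write)
    ftp.quit()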
True mirror technology for other protocols is less well developed. For gopher, proper mirroring does not appear to be available, but copying of remote gopher directories and hierarchies is possible using the gopherdist and gopherclone utility programs available with the Unix gopher software [Unk91]. A program for mirroring in the World-Wide Web also appears to be under development [Kos94].
In the case of LISTSERV archives, however, it is possible to go beyond simple mirroring (creating a duplicated LISTSERV) and instead provide enhanced access by automatically mirroring materials into gopherspace and/or the Web. With a subscription to the LISTSERV in place, the mirroring can be driven by the receipt of newly archived items via e-mail. Scripts in a suitable language (perl, perhaps) can be used to automatically extract relevant information from the incoming e-mail to create the appropriate gopher or WWW data files. One example of this is the Mr. Serials system at North Carolina State University for (semi-)automated acquisition of electronic serials [Mor94]. Although apparently not used in the Mr. Serials project at this time, fully automated processing of incoming e-mail can be achieved using the procmail package [vdB94].
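As a sketch of how such a script might look (in Python rather than perl; the archive path and filename convention are hypothetical, and a real filter would also handle multipart messages and verify the sender), procmail can deliver each newly archived issue to the script's standard input, and the script turns it into a Web page:

    import email
    import re
    import sys

    # Read the incoming issue from standard input, as delivered by procmail.
    msg = email.message_from_file(sys.stdin)
    subject = msg.get("Subject", "untitled")
    body = msg.get_payload(decode=True).decode("latin-1", "replace")

    # Derive a safe filename from the subject line (hypothetical convention).
    name = re.sub(r"[^A-Za-z0-9]+", "-", subject).strip("-").lower()

    with open("/var/www/serials/%s.html" % name, "w") as f:
        f.write("<html><head><title>%s</title></head><body><pre>\n" % subject)
        f.write(body)
        f.write("</pre></body></html>\n")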
Ultimately, reliable, automated mirrors will be increasingly important if internet electronic libraries are to become practical everyday tools. Ideally, library menus for particular items should include links both to the original archive and to any mirrors that exist. The development of better mirroring software and of new client-server protocols that understand the concept of mirrors are important areas for further work.
Beyond establishing links and mirrors for items in the collection, it is also very useful to provide additional information or indexing related to the item in question. For example, in the case of The PACS Review, there is a searchable full-text index (freeWAIS) available in the North Carolina State menu, a facility not available from the Houston archive. This represents a value-added resource for PACS Review and is worth including together with the link to the Houston archive. As a more elaborate example, consider the Communications of the ACM entry in the computing science journals collection of the SFU Electronic Library. This menu links to the publisher (both the ACM Web Page and the ACM gopher server), contains a description of the journal from the publisher, includes three sources for bibliographies in different forms, links to a full-text archive at Arizona and also to a WAIS-searchable full-text archive. In this way, we attempt to provide users with a comprehensive collection of on-line resources related to this item.
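To illustrate the kind of service such an index provides, here is a toy inverted index in Python (the directory of issue files is hypothetical); freeWAIS offers the same service in a far more complete form, with relevance ranking and a network search protocol:

    import collections
    import os
    import re

    # Map each word to the set of issue files that contain it.
    index = collections.defaultdict(set)
    for fname in os.listdir("issues"):
        with open(os.path.join("issues", fname)) as f:
            for word in re.findall(r"[a-z]+", f.read().lower()):
                index[word].add(fname)

    def search(*terms):
        """Return the files containing every query term."""
        sets = [index.get(t.lower(), set()) for t in terms]
        return set.intersection(*sets) if sets else set()

    print(sorted(search("collection", "management")))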
Adding value to items in terms of material organization, indexing, presentation, and so forth marks the key difference between an electronic library that merely groups together existing items on the Internet and one that provides genuine additional value to its patrons.
Finally, it is important to recognize that internet links need regular maintenance to ensure that they are up to date. Internet links typically depend on several attributes: a protocol (e.g., gopher or http) for accessing the item, an internet address for the computer (server) which provides the item in question, a port number at which the server listens for requests using the given protocol, and a local filename, pathname, or other string that the server uses to retrieve the specific item in question. For example, the universal resource locators (URLs) used to make links in the World-Wide Web are essentially notations for specifying such attribute combinations in a form which allows unambiguous retrieval of an item [BL94]. Unfortunately, this means that an installed link in an internet electronic library will need to be updated whenever any one of its attributes changes. Reasons for such changes seem frequent and varied: reorganization of material to a new structure, moving information to a better maintained or more powerful server, switchover from one protocol to another for providing information, and so forth. Even though most serious information providers attempt to provide relatively stable access to their resources, in the present context of the internet, ``stability'' may mean no more than one or two years without change. At present, then, internet librarians must expect a significant percentage of installed links to be changing each month.
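The correspondence between these attributes and a URL can be seen directly. In this Python sketch, the PACS Review URL cited earlier is decomposed into the four attributes (the explicit :70 simply makes gopher's default port visible):

    from urllib.parse import urlparse

    url = ("gopher://info.lib.uh.edu:70"
           "/11/articles/e-journals/uhlibrary/pacsreview")
    parts = urlparse(url)

    print(parts.scheme)    # protocol: gopher
    print(parts.hostname)  # server:   info.lib.uh.edu
    print(parts.port)      # port:     70
    print(parts.path)      # selector: /11/articles/e-journals/uhlibrary/pacsreview

    # A change to any one of these four values at the remote site breaks
    # every installed link that embeds the old value.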
Link maintenance is a two-part activity involving link monitoring to check for broken links and link correction to repair, replace or delete broken links.
One relatively passive technique for monitoring links is to solicit feedback about broken links or other problems directly from the users. Many electronic libraries include an e-mail contact address for reporting problems in a top-level menu. It may be more valuable to include a ``feedback'' selection with every menu or information page, so that it is immediately visible to the user whenever an error occurs. Indeed, the top-level directory of a library may not be directly accessible to a user who has reached a submenu through an external link or as the result of a search. It is now fairly common to find WWW servers with contact addresses on every information page, but the corresponding practice in gopherspace is rare.
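A simple maintenance pass can add such a selection mechanically. The following sketch (contact address and page directory hypothetical) stamps a feedback link onto every information page that lacks one:

    import glob

    FOOTER = ('<hr><address>Problems with this page? Mail '
              '<a href="mailto:library@example.edu">library@example.edu</a>'
              '</address>')

    for page in glob.glob("pages/*.html"):
        with open(page) as f:
            text = f.read()
        if FOOTER not in text:         # idempotent: add the footer only once
            with open(page, "w") as f:
                f.write(text.replace("</body>", FOOTER + "</body>"))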
A much better approach to monitoring, however, is to employ an active process of routinely checking all the external links from your site. It is possible to do this manually, but it is tedious. Fortunately, the process can be automated with ``link checking'' programs that exhaustively test every link at your site. Care should be taken in using such programs because of the potentially high volume of network traffic they can generate. One possibility is to divide up the data directories on your server into separate groups and test one group each night on a rotating basis. The MOMspider [Fie94] and verify_links [Tec94] utilities are two relatively sophisticated programs for automatic checking of links in the World-Wide Web. Only rudimentary link checkers seem to be available for gopherspace, such as the go4check utility [The94]. In any event, link checkers of any kind should be thoroughly evaluated by local technical staff before they are installed and used for regular maintenance activities.
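The rotation idea is simple to express. In this Python sketch (the URLs and grouping are hypothetical, and it is far less polite than MOMspider, which also throttles its requests and honours exclusion conventions), one group of links is tested per night:

    import datetime
    import urllib.request

    GROUPS = [
        ["http://www.example.org/journals.html"],    # group 0: serials
        ["http://www.example.edu/publishers.html"],  # group 1: publishers
    ]

    # Pick tonight's group by the date, so each group is tested in turn.
    tonight = GROUPS[datetime.date.today().toordinal() % len(GROUPS)]

    for url in tonight:
        try:
            req = urllib.request.Request(url, method="HEAD")
            urllib.request.urlopen(req, timeout=30)
            print("ok     ", url)
        except Exception as exc:       # broken or unreachable link
            print("BROKEN ", url, exc)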
Another useful, but somewhat limited, monitoring technique is to have the gopher or Web server software log all attempted accesses so that the logs can later be checked for errors. This can catch internal errors in data directories as well as certain types of error in gatewayed external links. However, errors in most kinds of external links will not show up in server logs. Once a server has returned such an external link to a client, it is not involved in any attempt by the client to access the external resource and hence will be unable to record any log entry for such an access. On the other hand, errors in external links from other sites to your server will show up in the log file. By analyzing these errors, it may be possible to track down outdated external links at remote sites and request that they be updated.
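A small script suffices to extract the failures from a server log. This sketch (log path hypothetical) scans a Common Log Format access log for requests that drew ``not found'' responses, which usually indicate a stale link somewhere:

    import re

    # Match "GET /path HTTP/1.0" followed by a three-digit status code.
    pattern = re.compile(r'"[A-Z]+ (\S+) [^"]*" (\d{3})')

    with open("/var/log/httpd/access_log") as log:
        for line in log:
            m = pattern.search(line)
            if m and m.group(2) in ("404", "410"):
                print("failed request for", m.group(1))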
Once appropriate monitoring techniques to identify broken links are in place, the problem of effectively and efficiently correcting them remains. In small collections, it may be entirely feasible to perform this corrective maintenance by direct manual modification of server data directories. However, as collections grow in size, manual correction of broken links will be increasingly tedious to perform and likely to introduce further errors.
One technique that we are finding useful in maintaining the SFU Electronic Library is to use scripts to generate or regenerate server data hierarchies. This allows maintenance work to be localized in a single place, rather than spread throughout many individual files in a large hierarchy of directories, subdirectories and so on. Furthermore, when a series of links is to be made in some regular fashion to items at a particular remote site, script macros can be used to encapsulate this information. This allows the entire series of links to be easily corrected by updating the macro in cases where the server, port or path information at the remote site is changed without changing the basic organization of the information itself. Script macros also have the advantage of uniformity in the links they generate; this can be quite beneficial for patrons learning to navigate through library menus. Finally, scripts and script macros may make it easier to change to alternative server technologies as they become available, by encapsulating the dependencies of the information hierarchy on the particular conventions of an information server in a single location.
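As a sketch of such a macro (in Python; the volume subdirectory naming is hypothetical, while the host and selector are those of the Houston archive cited earlier), the attributes of the remote site are recorded once and every link in the series is derived from that single record:

    # One record per remote site: host, port, and base selector.
    PACS = {"host": "info.lib.uh.edu", "port": 70,
            "selector": "1/articles/e-journals/uhlibrary/pacsreview"}

    def gopher_url(site, sub=""):
        """Build a gopher directory URL from a site record."""
        return "gopher://%s:%d/1%s%s" % (site["host"], site["port"],
                                         site["selector"], sub)

    # The whole series is regenerated from the one record; if the archive
    # moves, only the PACS record above needs correction.
    for vol in range(1, 6):
        print("Volume %d: %s" % (vol, gopher_url(PACS, "/v%d" % vol)))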
Beyond the use of scripts in this fashion, additional automated tools and techniques to assist in link correction are an important area for further work.
The principles described here are a first attempt at improving the collection management practices of internet electronic libraries. They have largely been developed in response to weaknesses we have identified in our own electronic library work at SFU. However, we have also observed similar failings in other electronic libraries on the internet. In applying the identified principles to the SFU Electronic Library, we expect to considerably improve the quality and reliability of our collections, and we hope that internet librarians at other sites will also find the principles useful.