Subject: Preservation of electronic formats
A Report by Peter Jermann on
THE PRESERVATION OF ELECTRONIC FORMATS
presented by Dr. Michael Spring
at the Preservation Intensive Institute
August 1-6, 1993
University of Pittsburgh

We were bitted, byted, and nibbled until we grumbled. We were ASCIIed and UNICODEd until we groaned. We were TIFFed, JPEGed, and SGMLed until we screamed. And we were RLE, Huffman, and LZW compressed until we exclaimed "Shoot me, please!" But when the volcano of information that is Dr. Michael Spring stopped, the swirling lava of detail became stone, the world became calm, ... and we saw meaning.

The purpose of this report is not to summarize the week but to reflect on the lessons learned. The reflections that follow are the outcome of one of the three discussion groups that met on the final day of class, combined with my own ideas. Where these reflections seem correct, please credit the discussion group; where they seem misdirected, please blame me.

As people concerned about the implications of electronic technology for preservation, we need to understand three basic concepts: 1) all digital information is coded information; 2) digital technology can be used as a tool for preservation; and 3) regardless of whether we exercise the options presented by number 2, we have to cope with analog and digital electronic records that already exist or will be produced. A more detailed consideration of each topic follows.

1. Digital information is coded information.

The first concept, that all digital information is coded, means that in order to preserve digital information and provide future access to it, we must also preserve the key that translates the code. Digital information is merely a series of 1's and 0's (bits) gathered into parcels of eight (bytes), which are in turn gathered into collections called files. The meaning of these bits, bytes, and files can be, and often is, arbitrarily determined by the information's creator.
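To make the point about arbitrary meaning concrete, here is a minimal Python sketch (not from the course). The four byte values and the function name `interpret` are chosen purely for illustration. The same bits are read three ways, and nothing in the bytes themselves says which reading the creator intended.

```python
def interpret(raw: bytes) -> dict:
    """Return several equally valid readings of the same bytes."""
    return {
        # The underlying 1's and 0's, eight per byte.
        "bits": " ".join(f"{b:08b}" for b in raw),
        # Reading 1: each byte is an ASCII character code.
        "ascii": raw.decode("ascii"),
        # Reading 2: the bytes together form one unsigned integer
        # (most significant byte first).
        "integer": int.from_bytes(raw, "big"),
        # Reading 3: each byte is one gray-scale sample (0 = black, 255 = white).
        "gray_samples": list(raw),
    }


# Hypothetical sample data: four arbitrary bytes.
sample = bytes([0x50, 0x49, 0x54, 0x54])
for name, value in interpret(sample).items():
    print(f"{name:>12}: {value}")
```

Without a record of which reading applies (the "key"), a future user holding only the bytes has no way to choose among them.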
Though we may come to excel in preserving the physical media on which digital information is stored, without the key to decipher the information preserved, future access will require the services of a cryptographer.

The solution for those faced with the responsibility of transferring information to, or preserving information in, electronic formats is an awareness of and support for standards that define the meaning of digital information. We need to know who creates standards and how we can influence their development. We need to be aware of existing standards such as ASCII (American Standard Code for Information Interchange) for text and TIFF (Tagged Image File Format) for graphics, as well as the hundreds of proprietary formats established by software vendors. Finally, we need to look to the future and support both the development and the use of emerging universal standards such as UNICODE (ASCII code plus a possible tens of thousands of additional English and non-English national characters), TIFF, and SGML (Standard Generalized Markup Language - a standard used to describe documents which may include textual data, image data or other data in predefined formats).

These reflections led to the following recommendations:

a) A national repository should be established to preserve both public and proprietary standards for interpreting digital information. This recommendation is based on the importance of these standards for the preservation of digital information and the realization that such a task would be impossible for any individual library to assume.

b) As a profession, librarians and preservation professionals need to develop a forum (journal, electronic journal...) where digital standards can be discussed and explained in terms comprehensible to members of the profession.

2. Digital technology as a preservation tool.

Digital technology can be used as a tool for preserving information currently in non-digital format.
The uses of this technology include analog to digital conversion (for sound and video recordings), image to digital (for documents, books, photos, etc.) and text to digital (OCR/ICR - optical or intelligent character recognition).

In order to understand conversions to digital format, we must understand how the digital record relates to the original. What do we gain and what do we lose? All digital conversions are based on a series of discrete samples of the original information, whether an analog recording, a page from a book, or a photograph. The completeness with which the original information is captured is determined by the distance of these samples from one another, in time or in space (e.g. samples per second, dots per inch), and by the quality of each sample taken. The more samples taken and the higher the quality of each sample, the higher the potential resolution of a digitally reproduced copy. The quality of a sample relates directly to the size of the scale by which each sample is measured. For example, if a color photograph is scanned into digital format, we could sample at three quality levels ranging from low to high. We can sample it as a black and white image (using only two values for any given sample), a gray scale image (up to 256 different values of gray per sample) or a full color image (up to 16.7 million different color values per sample).

Once we understand the relationship between the original information and its digitized copy, we need to understand the limits and costs of the technology. What are the costs of information input, information storage, and information output? What information might be lost, or enhanced? What are the limitations of image to digital or audio to digital conversions? Once information is digitized, how do we store it? The quantity of information digitized imposes its own limitations.
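The cost of the three quality levels described above can be estimated with simple arithmetic. The Python sketch below is illustrative only: the page size (8.5 x 11 inches), the 300 dots-per-inch resolution, and the function name `uncompressed_bytes` are assumptions, not figures from the course. The bit depths of 1, 8, and 24 bits per sample correspond to the 2, 256, and roughly 16.7 million values mentioned above.

```python
def uncompressed_bytes(width_in: float, height_in: float,
                       dpi: int, bits_per_sample: int) -> int:
    """Estimate raw storage for a scanned image, before any compression.

    Simplification: bits are packed end to end; real file formats may pad
    each row to a byte boundary and add header information.
    """
    samples = round(width_in * dpi) * round(height_in * dpi)
    total_bits = samples * bits_per_sample
    return (total_bits + 7) // 8  # round up to whole bytes


# Assumed example: a letter-size page scanned at 300 dots per inch.
WIDTH_IN, HEIGHT_IN, DPI = 8.5, 11.0, 300

for label, bits in [("black and white", 1), ("gray scale", 8), ("full color", 24)]:
    values_per_sample = 2 ** bits
    size = uncompressed_bytes(WIDTH_IN, HEIGHT_IN, DPI, bits)
    print(f"{label:>15}: {values_per_sample:>10,} values/sample, {size:>12,} bytes")
```

Under these assumptions the same page needs roughly 1 million bytes as a black and white image, 8.4 million as gray scale, and 25.2 million in full color, which is why the choice of sampling level is also a decision about cost.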
The more information we save (higher sampling rate and/or more values per sample), the higher the cost of processing and storing that information. Storage requirements can be tempered by a variety of data compression schemes. As preservation specialists we must understand that compression algorithms can be lossless (no information is lost on decompression) or lossy (the decompressed information differs from the original, uncompressed information). What do we gain and what do we lose in such schemes?

Once our information is digitized and compressed, how do we organize the data (see the discussion of standards in part 1 above), and how do we retrieve it? What are the limitations of converting a graphic image of text, such as the scanned image of a page from a book, into keyword searchable, character based information? What are the advantages or disadvantages of CD-ROM? How stable is the physical medium?

Only when we understand the technology involved in the hardware and software can we make decisions concerning the uses of digital technology as a preservation tool. These decisions must be guided by answers to the following questions:

- By whom and how will the information be used?
- Can digital technology achieve the quality required by these perceived users and uses at a cost we can afford?
- Will access to a digital copy increase or decrease demand on the original?
- Should textual information be digitized as an image, as is currently done with microfilm, as text that can be indexed and searched on a computer, or should it be digitized in both formats?
- How can we index or catalog the digitized information?

The answers to these questions will help us to answer questions like the following:

- Is digital technology the answer to this particular application or should more traditional means be used?
- At what resolution should an image be scanned or an analog audio recording sampled?
- Should basic black and white printed text be scanned as a black and white image or as a gray scale image?
- Should we use lossy or lossless compression to store our data?

3. Coping with electronic records

As preservation specialists we must learn to cope with existing electronic records as well as those we produce through our preservation efforts. Preservation of electronic formats requires that we know and understand a) the logical format or code by which the information is translated to human terms; b) the technology that can read the information on the particular medium on which the electronic record exists; and c) the life of the medium on which the information is stored.

a) The importance of the encoding scheme or format of digital information places two requirements on preservation specialists. First, as digital information is collected we must acquire knowledge of the format in which it is stored. This knowledge must be inextricably tied to the electronic record through cataloging or other means. Further, it is necessary to ensure that the specifics of the format are preserved somewhere (see part 1 above). If the format is peculiar to the records in hand, as may be the case with a custom-designed software application, then we must also obtain a detailed record of the encoding of that format and ensure that it remains tied to the data. Second, we must support the development and use of universal standards so that the problems associated with the existing standards lessen with time.

b) Whereas the format tells us how information is arranged within a digital file, it tells us nothing about the mechanism required to read the digital information from the particular medium on which it is recorded. Information regarding this reading technology, like the format or coding information, should be tied to the electronic record through cataloging or other means.
We need to know and document the hardware, or combination of hardware and software, that enables us to read the electronic record from the particular medium in our possession. We cannot assume that all similar media require similar technology to read. Magnetic tapes, for example, can contain digital or analog information and can only be read by an appropriate machine. Floppy disks are an example of media that can be physically identical yet incompatible. Disks formatted on an Apple computer are not easily read on IBM-compatible computers, nor are IBM formatted disks easily read on Apple computers.

c) Finally, we need to understand that the life of the medium (magnetic tape, floppy disk, CD-ROM, etc.) on which electronic information is stored depends on a combination of two factors. The first is the rate of the medium's physical decay. How long is the medium capable of maintaining its information intact? The second factor is the life expectancy of the technology used to write to and read from that medium. Should this technology disappear, the information on the medium becomes inaccessible.

This combination of factors affecting the life of an electronic medium requires that the preservation specialist be diligent on two fronts. He or she must act in a traditional sense and monitor the condition of the artifact on which the electronic information is stored. When the artifact can no longer sustain the information, like a brittle book unable to support its printed message, it must be copied or its electronic image refreshed on the existing medium. Unlike the preservation of a brittle book, however, the preservation of electronic media requires that the specialist also monitor the technology that placed the information on the electronic medium. The obsolescence of the brittle book's printing technology has no impact on its preservation. The obsolescence of an electronic reading technology can mean loss of access to the information stored with that technology.
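When a digital record is copied to a fresh medium, as described above, one way to check that nothing changed in the copy is to compare a digest (a fixed-length fingerprint) of the original and of the copy. The Python sketch below is an illustration, not a method presented in the course; the choice of SHA-256 and the function names `sha256_of` and `copy_is_identical` are assumptions.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_is_identical(original_path: str, copy_path: str) -> bool:
    """True when both files produce the same digest.

    Equal digests are very strong evidence, not strict proof, that the
    two files are bit-for-bit the same.
    """
    return sha256_of(original_path) == sha256_of(copy_path)
```

A digest stored with the catalog record could also be recomputed from time to time as one way of monitoring the condition of the stored information.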
Consequently, in addition to monitoring the artifact, the preservation specialist must monitor both the reading technology connected with the artifact and the emerging technologies that will supersede it. It becomes his or her responsibility to migrate data to the newer technology before the old technology disappears. Fortunately, a significant advantage of digital electronic data (though not analog data) is its ability to be refreshed, without loss, on its existing medium or transferred, also without loss, over a wire from one computer to another regardless of hardware and/or software differences. If the reading technology is properly documented (see part b above), any data produced by a given technology can be quickly identified en masse and transferred to a newer medium.

NOTE: Special thanks to Michael Spring, Shannon Zachary, Karen Motylewski, Barclay Ogden and my wife, Mary Jermann, for reviewing the draft of this essay and offering comments, criticisms and encouragement, all of which have made it better than it otherwise would have been.

Pete Jermann
Preservation Officer
Friedsam Memorial Library
St. Bonaventure University
St. Bonaventure, NY 14778
(716) 375-2324

***
Conservation DistList Instance 7:20
Distributed: Monday, August 16, 1993
Message Id: cdl-7-20-001
***
Received on Monday, 16 August, 1993