Tales from “THE disK FILES”: Lessons Learnt from a Data Recovery Project in 2003–2006 at the National Archives of Australia
This case study re-evaluates a large-scale project carried out by the National Archives of Australia (NAA) between 2003 and 2006. The project aimed to identify obsolete digital media (physical data carriers) in its collection and to describe and recover the data from the carriers using a third-party data recovery provider.1 A detailed process for data recovery was developed that included the capture of a full audit trail of steps in the data recovery process. The project was completed in four phases: phase 1 obtained bit-level images from the carriers; phase 2 extracted individual bit-files from the carriers; phase 3 identified duplicate files and proprietary or complex file formats; and phase 4 was a final report that documented processes, made recommendations on future processes, and provided lessons learned. Recent work described in this article indicates that files extracted from the carriers in 2004–2005 can be accurately rendered in current computer environments. The ongoing significance of the project is that it is an early demonstration of the success of bit-level preservation and the need to create disk images as part of a preservation workflow, suggesting a sustainable methodology for digital preservation. The project also influenced archival policy at the NAA and influenced the development of subsequent software tools that became widely known in the broader digital preservation community. The focus on archival principles of authenticity, integrity, chain of custody, and provenance of the recovered records was key to ensuring long-term access and usability. Finally, the metrics resulting from the project, for example, rates of readable carriers and rates of data recovery by carrier type, are useful data from a point in time that correspond quite closely to similar data recovery projects undertaken by other institutions at about the same time and provide a benchmark for future research.
2003
A pair of shadowy figures, clothed in thick, fleece coats to protect them from the cold, pull open the sliding air-lock door and enter the cold room vault. They nevertheless feel the near-zero temperature as they venture into a large, open space and peer into the gloom. One of them flicks a light switch, and rows of fluorescent bulbs illuminate compactus shelving on either side of a narrow corridor leading away into the depths of the repository. Moving past the large, heavy-duty drawer units that contain hundreds of microfilm reels, the figures move further into the vault, conscious of the unsettling stillness and silence around them. The focus of their investigation is an open row of shelves that contain an odd assortment of differently shaped boxes, each with strange hieroglyphics written on its side, in fact, the numbering system controlling the boxes in the series. They open the first box, a square, flat one that looks like a pizza box, and take out a large plastic reel with a scribble of barely decipherable writing on an age-darkened label that is starting to peel away. They huddle over the reel and make some notes in an exercise book, including a transcription of the writing on the label. The next box is a standard archival container that is packed full of thin, square plastic objects in paper sleeves. For the rest of the day, they carefully open each box in the compactus bay and make detailed notes of their contents—tapes, disks, cartridges, of all shapes and sizes. Finally, their work finished, they head out through the air-lock into the warmer air.
“You know, Dave,” the shorter one says, pulling a handkerchief out of the pocket of his corduroy trousers and cleaning his spectacles. “I wonder how much, if any, data we'll be able to recover from this stuff?”
Dave shrugs. “Who knows, Brendan? But there's going to be an awful lot of this work to do in the future, so we'd better get it right this time.”






In telling this story of shadowy figures, lost archives, and secret vaults, we will take you on a trip into the past to a time when digital preservation theory and practice were still in their infancy, when universally agreed digital preservation principles and workflows were still being developed, and before familiar standards such as PREMIS2 and OAIS3 were developed.
Like other government archives, the NAA collects official Commonwealth government records and the personal records of significant individuals closely connected with the government in an official capacity, such as governors-general and prime ministers. The NAA holds over 365 kilometres (226.8 miles) of physical records and over 2 petabytes of digital material (primarily AV material), which is growing rapidly as a result of large-scale digitization work. Some of this digital material is held on obsolete physical carriers such as floppy disks, magnetic tapes, and data cassettes. As long ago as 1991, it was recognized that managing digital material on carriers in an offline storage environment poses a serious risk to ongoing access.4 Published research such as that by Rothenberg5 and Garrett and Walters6 provided further impetus to address risks of carrier and file format obsolescence.
In 2003, the NAA commenced a project to identify digital content on obsolete carriers and describe and recover the data from them. A detailed process for data recovery was developed that included the capture of a full audit trail of steps in the data recovery process to ensure fixity, provenance, authenticity, and the chain of custody for archival management.7 The data recovery project was organized as a four-phase process: phase 1, to obtain bit-level8 disk images9 of all of the content on each physical carrier; phase 2, to extract individual files from each of the physical carriers; phase 3, to analyse and identify duplicate files and proprietary or complex formats; and phase 4, to document the results for future archival reference and preservation processes.10 An additional fifth phase was proposed but not enacted at the time, which was to investigate and use appropriate software to render or display the files recovered in phase 2 or disk images from phase 1 if the files were unrecoverable. These steps are described in detail in the Methodology section following. While the project has been referenced in a number of published articles since it was completed, this article is the first detailed description of the scope, methodology, results, and lessons learned.11
Recent work by the NAA has demonstrated that some of the files extracted from obsolete carriers as a result of this project, and subsequently stored in a preservation system, can be accurately rendered in current computer environments. A key message is that recovering the bits and ensuring they are properly cared for when it is still possible to do so will lead to positive outcomes that may not be fully realized for years into the future.
Agency to Researcher Project
In 2002–2003, as part of a broader digital preservation project called the Agency to Researcher Project, the NAA commenced a number of research studies designed to inform its overall digital preservation approach, to test its assumptions, and to fully understand the environment in which the archive was operating.
The research studies comprised
A report setting out the conceptual understanding of digital records that form the basis of the NAA's approach to digital preservation. The result of this work was the digital preservation Green Paper, An Approach to the Preservation of Digital Records;12
The design and construction of a purpose-built digital repository, open-source XML normalization software (the software was called Xena, which stands for XML normalizing of archives), and workflows to ingest digital records;13
An investigation of researchers' expectations of preserved digital records;14
A test transfer of digital records from a Commonwealth government agency (Australian Wool Research and Promotion Organisation);15 and
A project, formally named the Legacy Media Project, to identify existing digital records on legacy media (physical carriers) already in the NAA's custody and to make those records accessible to researchers.16
One of the objectives of the Legacy Media Project was to develop a methodology for recovering data from legacy physical carriers and to implement that methodology on known legacy carriers in the custody of the NAA. Many of the records on the carriers related to high-profile public inquiries, such as Royal Commissions and Commissions of Inquiry, and so had high secondary value.17 Within the NAA at the time, the speculative expectation was that about 30% of data could be recovered from carriers dating from the 1970s, 1980s, and early 1990s, even though they were stored while in NAA custody in environmentally controlled repositories. It was unknown if the data or the carriers themselves, especially the 9-track ½" magnetic tapes, had become compromised and could be read more than once due to
Obsolescence of the carrier type;
Decay of the data on the carriers;
Obsolescence of the hardware/software mechanisms to access the data on the carriers;
Incorrect storage resulting in failure during the data recovery process from deterioration such as “stiction,” where the tape substrate had bonded together, causing friction that resulted in the stretching and/or breaking of the tape(s).18
It was also unknown how much of the material on the physical carriers was duplicated in paper form or if it was all original archival records. Although uncertainties existed about whether outcomes could be achieved, the NAA decided to undertake the project, first, because doing nothing was not an option because the records were high value and leaving them on the carriers risked record loss, and second, because the project provided the opportunity to test hypotheses and to develop workflows for digital archiving, storage, and long-term preservation.
Legacy Media Project
The project team consisted of two staff, an archivist and a digital archivist, each working on the project at 0.3 full-time equivalent. The archivist carried out the collection survey to identify obsolete carriers for recovery treatment at the beginning of the project in 2003 and was available for consultation on a needs basis afterward. The digital archivist managed the project from 2003 to 2006, including developing project documentation and controls, establishing the contract with the successful vendor, liaising with the vendor, and overseeing data recovery and quality control, as well as developing the control and recording documentation. The project consisted of a number of phases in which the data stored on obsolete carriers was progressively extracted from the physical carriers and a detailed description of each phase produced. In this way, each phase resulted in not only the recovered data, but also a complete record of the recovery process, including establishing a fixity point and a digitally verifiable chain of custody. Work on the project came in peaks and troughs: 2003 consisted of the project initiation, research, and tender process; the recovery work took place between 2004 and 2005; and the lessons-learned documentation was produced in 2006 (see Figure 1). At the time the project was initiated in 2003, external (vendor-developed) data recovery processes were highly proprietary and the domain of specialist operators, and there was limited in-house expertise, capacity, and equipment at the NAA. Besides pioneering work such as that by Woodyard19 and Ross and Gow20 on data recovery, there was little literature on how to conduct a data recovery process in the GLAM sector (in contrast with digital forensics in the legal domain). In the years since, the increase in published accounts of data recovery projects and the development of open-source tools discussed below21 have been exponential.
A considerable amount of work has been published on the application of digital forensics in the GLAM sector.22 However, the pioneering work at the NAA is notable because the lessons learned contributed to the requirements for a number of purpose-built preservation workflows to manage some elements of data recovery from physical carriers, including Prometheus23 and BitCurator,24, 25 and the development of purpose-built knowledge bases on carriers and file formats and their dependencies such as Mediapedia,26 as well as the National Library of Australia's Digital Preservation Knowledge Base.27 The need for institutional knowledge bases and registries continues to be relevant as indicated by recent projects undertaken by the Social Sciences and Humanities Research Council of Canada and the University of Illinois at Urbana-Champaign Library.28



Citation: The American Archivist 85, 2; 10.17723/2327-9702-85.2.359
Audit of Legacy Carriers
The initial phase of the project involved an audit of legacy physical carriers in the NAA collection.29 At the commencement of the project, the full spectrum of carrier types was not known and the audit attempted to identify all digital carriers existing in the collection. The audit was carried out by querying the NAA's descriptive catalog, an in-house developed archival management system called RecordSearch, for terms such as “disk,” “tape,” and “floppy” and by querying the series-level descriptor, Predominant Form, with the attribute “electronic record,” which picked up series whose predominant physical form was a digital carrier. Other sources of information, such as transfer documentation, were also checked. This work was carried out by an archivist, and the results, including any descriptive metadata captured in the catalog, were tabulated for action. Three hundred carriers were identified (see Table 1 and Appendix A, Figures 1–4). Even so, without a full, physical survey of the collection, it was impossible to confirm that all obsolete carriers were identified.




Relevant information such as series and item titles, date range, security classification, location, and any technical information was recorded in a register. Notable was the lack of technical information about the carriers; for example, almost no information about the creating application was recorded for any series, presumably because it was not seen as relevant at the time of acquisition. Following the initial data collection, the carriers were physically checked, and any metadata or information located on the outside or stored with the carriers was recorded in the audit checklist (see Figures 2 and 3).
The audit captured as much descriptive information as possible both from the descriptive catalog and any information stored with the carriers, such as labels or other markings.
To reduce costs, a conscious decision was made at the outset to exclude digital materials described in the catalog as “backups,” as well as personal records collections, that is, the official and private records of significant individuals who served within, or were closely associated with, the Australian Commonwealth government. Nevertheless, some of the materials eventually recovered from the carriers were subsequently found to be backups or duplicates of paper records, once again a failure of the transfer process to record this information when the knowledge of it was readily available. Twenty data cassettes dating from the late 1970s were also identified in the audit; however, the equipment required to read these types of carriers could not be sourced at the time of the project.
As the capability to undertake the work in-house was limited, the NAA issued a request for quote (RFQ) for recovering data from the identified physical carriers. The RFQ outlined the proposed methodology for data recovery and also stipulated that two copies of the recovered data would be burned to Mitsui Gold brand 650 MB CD-Rs (optical discs), an industry standard at the time. One vendor was selected in 2003, and the data recovery was carried out between 2004 and 2005 in three phases. Subsequently, the recovered data were ingested into the NAA's digital archive when it came into production in 2007. Since that time, additional legacy carriers have been identified, either transferred to the NAA at a later time or not identified in the original audit, having been obscured or hidden by information dissociation (one of the ten agents of deterioration/change).30 In addition, over 250 5.25" (5¼") and 3.5" (3½") floppy disks were identified in personal record collections at the time of the project but were excluded because seeking the agreement of the donor or the donor's estate could be time consuming. These floppy disks remain in personal records collections and still require data recovery for the data contained on them to be usable.
Methodology
The methodology adopted was a “belts and braces” approach designed to mitigate the expected data loss resulting from the perceived instability of the obsolete carriers. A cautious approach was therefore adopted: in phase 1, a disk image was taken of the whole contents of the carrier (which included all the data on the disk, including unwanted and possibly deleted content that had not been overwritten); and, in phase 2, access to the file system allowed the individual files (that were identified for recovery) to be extracted from their individual carriers. This process resulted in two copies of the same content (the disk images and the files), as well as backup copies of each. The proprietary processes developed by the provider, and the equipment and software used to extract the data, were recorded for each of the carrier types on Carrier Treatment Procedure Sheets, and the results of each of the treatments were recorded on a Carrier Treatment Check Sheet. The treatment procedure sheets and check sheets were very detailed templates developed by the project team that captured full treatment data to be able to prove the authenticity, integrity, chain of custody, and provenance of the recovered records.
Documenting the Recovery Process
The data captured on the Carrier Treatment Procedure Sheets and the Carrier Treatment Check Sheets were determined by the project team using a risk-based approach: more data captured about the process would reduce the risk of the evidential value of the records being questioned in the future. The procedure sheet for each process provided a full inventory of the hardware, software, and proprietary processes used by the provider, including computers, hard disk drives, operating systems, network details, emulators, checksum algorithms, and so on, as well as a step-by-step description of the treatments for each phase of the project. Some of this information was proprietary to the data recovery vendor. This detailed metadata and descriptive information was designed to be ingested at a later date into a digital preservation or archival management system.
The check sheet was a spreadsheet listing each digital file/object recovered with descriptive and technical metadata, such as date recovered, operator, series, carrier ID, carrier label (if any), carrier type, carrier density, file name, file size, object type (i.e., format), character encoding (if known), and checksum created at the time of recovery, effectively the terminus post quem for proving fixity. These data were used in 2020 to confirm the integrity of the files before some of them were examined. These sheets enabled a consistent stratigraphic view of the relationship of the file system, the file tree, and individual files on each of the carriers. Interestingly, the metadata captured in the check sheets and procedure sheets maps quite closely to elements of the later PREMIS standard, for example Object Characteristics, Environment, and Storage Medium.31
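The kind of integrity check that a recorded checksum makes possible can be sketched as follows. This is a minimal illustration only, not the procedure used by the NAA or the vendor: the source does not name the checksum algorithm, so SHA-256 is assumed, and the column names file_name and checksum are hypothetical stand-ins for the corresponding check sheet fields.

```python
import csv
import hashlib
import io
from pathlib import Path


def file_digest(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file, reading it in chunks."""
    h = hashlib.new(algorithm)  # SHA-256 is an assumption, not from the source
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_check_sheet(sheet_csv, root):
    """Compare each recorded checksum with a freshly computed one.

    sheet_csv: CSV text with the hypothetical columns file_name, checksum.
    Returns a dict mapping file_name to "ok", "mismatch", or "missing".
    """
    results = {}
    for row in csv.DictReader(io.StringIO(sheet_csv)):
        name = row["file_name"]
        path = Path(root) / name
        if not path.is_file():
            results[name] = "missing"
        elif file_digest(path) == row["checksum"].strip().lower():
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results
```

A file reported as "ok" is bit-for-bit identical to the one whose checksum was recorded; "mismatch" and "missing" flag files that need investigation before any further analysis.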
This thorough documentation of process constituted an audit trail of the recovery treatments and was a necessary activity to prove the authenticity and integrity of the recovered information, useful then and for future reference.
Phase 1
The aim of this phase of the project was to obtain exact bit-level images of the contents of each carrier. The disk images were created as a precautionary measure due to the age and potentially degraded state of the physical carriers and their data. To mitigate the risk that a carrier might fail during or just after the first attempt to access the data, this process created a whole disk image of each carrier using a one-read process, which was subsequently copied onto the CD-Rs.
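The one-read idea can be illustrated with a short sketch in which the source is read sequentially exactly once and the checksum is computed from the same chunks that are written to the image. It is an assumption-laden simplification: the vendor's actual process was proprietary, the function name image_once is hypothetical, SHA-256 is an assumed algorithm, and an in-memory stream stands in for a physical carrier. Imaging degraded media in practice also requires read-error handling and specialist hardware, which are omitted here.

```python
import hashlib
import io


def image_once(source, destination, chunk_size=512):
    """Copy a stream to an image in a single sequential pass.

    The digest is updated with the same chunks that are written, so the
    source never has to be read a second time to establish fixity.
    Returns (bytes_copied, hex_digest).
    """
    digest = hashlib.sha256()  # assumed algorithm
    copied = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        destination.write(chunk)
        digest.update(chunk)
        copied += len(chunk)
    return copied, digest.hexdigest()


# Stand-in for a carrier: an in-memory stream rather than a real device.
carrier = io.BytesIO(b"\x00" * 1024 + b"DATA")
image = io.BytesIO()
size, checksum = image_once(carrier, image)
```

All later work, such as file extraction, can then be carried out against the image, leaving the original carrier untouched.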
Notwithstanding concerns about the age of the carriers and storage conditions, after the disk imaging process was completed, the legacy carriers were found to be quite stable. The results of this process identified
257 (86%) carriers with 100% data recovery;
14 (4.7%) with system or known duplicate data;
13 (4.3%) with partial recovery; and
15 (5%) failed the process (see Table 2 and Appendix B).

Phase 2
The aim of phase 2 was to extract individual files from their carriers into a format more acceptable for storage and future access. This phase consisted of copying all the viewable (not hidden) data objects that were able to be recovered from their respective original carrier onto Mitsui Gold brand 650 MB CD-Rs. As in phase 1, two CD-Rs containing recovered data (a master and a copy disk) were obtained.
As a result of the redundancy gained from phase 1, the process of extracting the native file system and contents using a multiple read process upon each carrier could be conducted with less concern for damage to the original carrier (i.e., the bit-level image copies of all carriers provided redundancy). Another result of phase 2 was that some of the file systems and digital objects could be examined, although most could not be opened due to inherent software dependencies. After the file extraction process was completed, the finer granularity resulted in a different outcome compared with the results obtained in phase 1 (see Table 3 and Appendix C, Tables 1–3). The results after phase 2 were
245 (81.9%) carriers with 100% data recovery;
20 (6.7%) system or known duplicate data;
14 (4.7%) with partial recovery;
14 (4.7%) found to be blank; and
6 (2%) failed the process.
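The percentages reported for the two phases can be recomputed from the listed counts, as in the sketch below. The category labels are shorthand introduced here for illustration. As listed, the categories in each phase sum to 299 carriers, one fewer than the 300 identified in the audit, and the reported percentages are consistent with that sum as the denominator.

```python
# Carrier counts as listed for each phase; labels are illustrative shorthand.
phase1 = {
    "full recovery": 257,
    "system or duplicate data": 14,
    "partial recovery": 13,
    "failed": 15,
}
phase2 = {
    "full recovery": 245,
    "system or duplicate data": 20,
    "partial recovery": 14,
    "blank": 14,
    "failed": 6,
}


def rates(counts):
    """Return (total, {category: percentage rounded to one decimal place})."""
    total = sum(counts.values())
    return total, {k: round(100 * v / total, 1) for k, v in counts.items()}


total1, rates1 = rates(phase1)
total2, rates2 = rates(phase2)
```

Comparing rates1 and rates2 shows the shift between phases: the share of carriers classed as failed falls from 5.0% to 2.0%, while a new "blank" category appears at 4.7%.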

Comparison of results between phases 1 and 2 is shown in Figure 2. As mentioned, the difference between the results obtained in phases 1 and 2 is due to the granularity of the processes, operator observations, and access to the file system. Differences in the results were also due to
More carriers being identified as containing system data or duplicate data;
More carriers that were initially identified as failing but on investigation of the file system were subsequently found to be blank. If a disk was formatted but contained no data, it was still imaged in phase 1; and
Lack of clarity about what the vendor defined as a carrier failure in phase 1, when in phase 2 some data were found to have been recovered from the “failed” carrier. It was subsequently apparent that data had been partially recovered from Series AA1979/319/032 and C379p133 before the carrier failed.
Even considering these factors, the results far exceeded the expectation at the start of the project that only about 30% of data could be recovered.
Phase 3
The aim of phase 3 was to examine the results obtained from phase 2 and cull duplicate files, system files, and blank carriers. This phase also included determining other problems such as the presence of data objects not identified in the transfer documentation (e.g., data that were found on the carrier in addition to the records identified for transfer to the NAA). Problem formats such as proprietary Landsat34 data formats were also identified. On the basis of file name, file type, and the operator comments recorded on the Carrier Treatment Check Sheets, decisions could be made about triaging digital objects for preservation actions (see Table 4 and Appendix D). It should be noted that, at this stage, most of the individual file content had not been examined in any detail. However, as much of the content, especially on the older carriers, was encoded in ASCII35 or EBCDIC,36 some of the content could be accessed at the time easily with simple text editors and proprietary file analysis software used by a vendor called InterMedia.37 Therefore, while the work undertaken in phase 3 demonstrated that the recovered data were usable, it was understood that further work was required to effectively render the data and provide access to them. In particular, data encoded in proprietary database formats and unknown geophysics formats, such as the Landsat data, would require further analysis to understand and identify options for accessing the recovered files.
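Reading ASCII or EBCDIC content with general-purpose tools can be sketched as follows. This is an illustration under stated assumptions rather than the method used in the project: the function name decode_record is hypothetical, and cp037 is only one of several EBCDIC code pages, chosen here because Python ships a codec for it; the source does not say which code page the tapes used.

```python
def decode_record(raw):
    """Decode a byte string as ASCII if possible, otherwise as EBCDIC.

    EBCDIC letters and digits all have byte values above 0x7F, so typical
    EBCDIC text fails strict ASCII decoding and falls through to cp037.
    Binary data, or EBCDIC consisting only of low-valued bytes (such as
    spaces), would be misclassified by this simple heuristic.
    """
    try:
        return "ascii", raw.decode("ascii")
    except UnicodeDecodeError:
        return "cp037", raw.decode("cp037")  # assumed EBCDIC code page


# The same text as it would be stored under each encoding.
ascii_bytes = "ROYAL COMMISSION".encode("ascii")
ebcdic_bytes = "ROYAL COMMISSION".encode("cp037")

ascii_result = decode_record(ascii_bytes)
ebcdic_result = decode_record(ebcdic_bytes)
```

Both calls recover the same readable text, which is why character-encoded content on older carriers is comparatively easy to access, whereas proprietary binary formats need format-specific analysis.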

The results from phase 3 also revealed that the magnetic tapes contained identifiable information on each carrier, such as header and footer files, used by the original creating/accessing software, that did not affect the meaning of the content or the ability to access the content using current software applications. There was debate within the NAA about whether this information needed to be retained or securely disposed of. Given the uncertainty surrounding the future research value of this information, and probable technical dependencies to access and render the data, the decision was made to retain it. It is also worth noting that the total amount of data recovered from the process was relatively small, and the amount of data recovered from more modern carriers was likely to be far larger.
Phase 4
The aim of phase 4 was to document processes, make recommendations, and provide lessons learned for future archival workflows at the NAA. The outcome of this phase was an internal NAA report, “Options Paper to Determine How to Proceed with the Legacy Media Project,”38 which provided a good deal of analysis and statistics of methods and outcomes, which have formed the basis of this article. It also provided recovery costs per carrier and per gigabyte, which provided a basis to determine resource requirements for future data recovery projects. Given the high cost associated with relatively small volumes of recovered data, the NAA developed transfer requirements for digital records that prohibit the transfer of legacy or obsolete carriers.39 As part of a broader management regime for digital recordkeeping and information management, agencies will need to decommission systems and migrate data in a timely way to prevent software and carrier obsolescence from occurring.40 Another outcome of the phase 4 work was that the amount of detail required to populate the procedure sheets and check sheets was too resource-intensive for a human operator and that machine-generated technical metadata was the preferred option going forward.
Results
The results of each phase of the project are tabulated in Appendixes B, C, and D. They indicate a high level of success for recovering data from carriers of this age and type. Broadly, of the 300 carriers treated, 257 (86%) achieved 100% reads in phase 1 (i.e., disk images were obtained), and 245 (82%) achieved 100% reads in phase 2 (i.e., complete digital object recovery). Partial recovery of files was achieved in about 5% of cases. A similar result was found in a slightly later project at the British Library involving data recovery from 8", 5¼", and 3½" floppy disks: “ . . . there have been relatively few cases where disks have been entirely unreadable: occasionally degradation can be seen in the physical condition of the disk, ie a light reddish brown surface indicative of oxidisation.”41 Additionally, in the case of the NAA project, analysis of the recovered data indicated that about 5% of the carriers were blank, and about 7% of the carriers contained data duplicated in another form, such as paper printouts. Not surprisingly, the blank carriers were from personal computers rather than a centralized corporate IT function, which showed that they had not been examined by the agency or the NAA before being taken into custody.
The 8" and 5¼" floppy disks achieved 100% data recovery. Data on seventeen 3½" floppy disks could not be recovered, while twelve 9-track ½" magnetic tapes could not be read. These carrier failures belonged to two series,42 and the carrier degradation may relate to how those series were managed and stored prior to transfer to the NAA. Luckily, none of the disks with spanned backup data had failed; if any had, it would have rendered any future recovery process most likely impossible, given that all the data on each consecutive carrier are required in the order of the original backup process. However, rendering these data is still problematic, as without access to the original backup software (if known and/or available), the bits currently cannot be rendered in a meaningful way.
Rendering the Recovered Digital Files
A fifth phase of the project was envisaged but not acted on at the time due to changed organizational priorities. In this phase, rendering or interpretation software was to be identified and used to obtain a usable copy of the files obtained in phase 2 that could be characterized, preserved, and rendered by the NAA digital preservation software. Although phase 5 was not enacted, the fact that the disk images and files were appropriately preserved and stored to ensure their integrity means that this phase can be commenced at any time.
For example, in early 2020, the NAA revisited the Landsat data recovered from two 9-track ½" magnetic tapes, which were part of the 1983 Royal Commission into the use and effects of chemical agents on Australian personnel in Vietnam (see Figures 3 and 4).43 Figure 4 shows one of the bit sequences from the files opened in a modern hex editor, but very little information about the format can be extracted. These recovered files were sent to the US Geological Survey (USGS), the organization responsible for Landsat, which was able to extract the image data and display them using ENVI,44 widely used image analysis software. Note that the recovered images are not perfect, as there is a significant offset between band 1 and the other bands that the USGS was not able to resolve, and there were “garbage” artifacts of some sort on the ends of each line. Nevertheless, as shown in Figure 5, the images were recoverable, and the individual bands were exported as TIFF files, which can also be preserved and accessed. This example shows that files recovered during phases 1 and 2 in 2004–2005, properly stored and managed, can be effectively rendered by analysis and rendering software in 2020. Extracting the bits from obsolete carriers and processing them through robust digital preservation processes has allowed data to be readily available forty years after it had been created in a completely different technological access environment. Today, there are open-source preservation workflows like BitCurator45 and hardware like Kryoflux,46 but these technologies do not allow access to the original carriers without some form of working hardware interface and readable carrier (e.g., bits cannot be removed from physical carriers without the correct access technology), however one might wish it not to be the case!



Future work on the recovered files at the NAA could focus on emulation techniques to reconstruct the performance of the digital records when they were in active use, such as those proposed by the University of Freiburg's bwFLA and the current Emulation as a Service Infrastructure (EaaSI) Project using either the files or the disk images.47 For example, files recovered from 9-track ½" magnetic tape relating to the Costigan Royal Commission into the Painters and Dockers Union (1980–1984) could provide important insights into early computerized records and information management practices.48 Costigan was the first Royal Commission in Australia to use a computer information management system to manage and provide access to a wide range of investigative materials. The system allowed names and crimes to be cross-checked and referenced; and, according to Scott Prasser: “The Costigan Commission pioneered the use of computerised data to trace connections between personnel and transactions.”49 Very little information is available about this information management system; almost no information about the computer system is recorded in transfer or series documentation. However, this is a very fundamental archival expectation: future users of these records will want to understand how they were created, managed, and accessed, and how computerization contributed to what was an extraordinarily controversial public inquiry. Emulation may provide an opportunity to access and understand the recovered files in their original computer environment without altering the files and thus the fixity and provenance data.
The project also raised some important issues and had lessons for future policy development at the NAA.
Descriptive and Technical Metadata
A key finding of the project was the need to capture detailed information about the carrier and its hardware and software dependencies at the point of transfer into the custody of the archive, for example, the specific descriptive and technical metadata, such as carrier type and version, required drives, operating systems, and other software dependencies.
The large amount of audit and integrity metadata recorded on the Carrier Treatment Procedure Sheet and the Carrier Treatment Check Sheet was resource intensive to capture; without automated tools, metadata acquisition did not scale well and was therefore costly. Although an audit trail of treatment is essential, there is a tradeoff among cost, method of capture (supplied, extracted at the point of ingest, or derived from the file afterward), storage, and the management and consistency of the metadata. In this case, although such metadata was developed well before current metadata standards were accepted, it still has high utility (and maps to the later standards). Standards such as PREMIS may provide a baseline of essential audit metadata for projects such as this at the NAA, but they require fit-for-purpose tools and trustworthy registries of technical information to ensure preservation metadata is systematically captured and can be understood and used over time.50
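To illustrate what automated capture of an audit trail might look like, the following is a minimal Python sketch. It is not the format of the NAA's Carrier Treatment sheets, and it is not a conforming PREMIS serialization: the class `TreatmentEvent`, the function names, and the field names are all hypothetical, chosen only to loosely echo the kind of information a PREMIS event carries (event type, date and time, outcome, and the agent and object involved).

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import List
import json


@dataclass(frozen=True)
class TreatmentEvent:
    """One step in a carrier's treatment history (hypothetical structure)."""
    event_type: str       # e.g. "imaging", "file extraction", "fixity check"
    event_datetime: str   # ISO 8601 timestamp, recorded in UTC
    outcome: str          # e.g. "success", "partial", "failure"
    agent: str            # person or tool that carried out the step
    carrier_id: str       # identifier of the carrier being treated
    detail: str = ""      # free-text note from the operator or tool


def record_event(log: List[TreatmentEvent], event_type: str, outcome: str,
                 agent: str, carrier_id: str, detail: str = "") -> TreatmentEvent:
    """Append a timestamped event to the log and return it."""
    event = TreatmentEvent(
        event_type=event_type,
        event_datetime=datetime.now(timezone.utc).isoformat(),
        outcome=outcome,
        agent=agent,
        carrier_id=carrier_id,
        detail=detail,
    )
    log.append(event)
    return event


def to_json_lines(log: List[TreatmentEvent]) -> str:
    """Serialize the log as one JSON object per line for long-term storage."""
    return "\n".join(json.dumps(asdict(event), sort_keys=True) for event in log)


# Example with invented values: two steps recorded for one carrier.
audit_log: List[TreatmentEvent] = []
record_event(audit_log, "imaging", "success", "operator-01", "EXAMPLE-0001")
record_event(audit_log, "file extraction", "partial", "operator-01",
             "EXAMPLE-0001", detail="two unreadable blocks")
print(to_json_lines(audit_log))
```

Recording each step as it happens, rather than transcribing it afterward, is one way the cost of the audit trail could be reduced while keeping the trail itself.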
A related finding was that the NAA's metadata standard for archival control, the Australian Series System,51 was not designed for the management of digital carriers; additional metadata is required to ensure the accessibility of the digital content into the future. Metadata captured at the point of transfer must include additional information, such as the combination of software and hardware used to create and manage government data, as this information might not be extractable or derivable from the file metadata. It is essential for understanding the context of digital records when they were in active use and, if necessary, for data recovery and for future access if, for example, an emulation strategy is used. It is worth noting that the development of descriptive standards in the software preservation domain highlights the need for similar technical properties to be captured by collecting institutions.52
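The kinds of transfer metadata discussed above can be sketched as a simple record structure. The Python example below is illustrative only: the class `CarrierTransferRecord`, its field names, and the sample values are assumptions made for this sketch and do not reproduce any NAA schema or the Australian Series System. The fields follow the items named in the text (carrier type and version, required drives, operating systems, and other software dependencies).

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class CarrierTransferRecord:
    """Hypothetical carrier metadata captured at the point of transfer."""
    accession_number: str
    carrier_type: str                      # e.g. '5.25" floppy disk'
    carrier_version: str = ""              # e.g. format density or variant
    required_drives: List[str] = field(default_factory=list)
    operating_systems: List[str] = field(default_factory=list)
    software_dependencies: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Return the names of fields left empty, so gaps are visible early."""
        return [name for name, value in asdict(self).items() if not value]


# Example with invented values: only the basics were supplied at transfer.
record = CarrierTransferRecord(
    accession_number="EXAMPLE-0001",
    carrier_type='5.25" floppy disk',
)
print(record.missing_fields())
```

Flagging empty fields at transfer, while the creating agency can still be asked, addresses the gap described here, where hardware and software information was never recorded and could not later be derived from the files.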
Another key finding was that a significant amount of archival description work on the recovered data was necessary to facilitate discovery, to understand their context, and to document their relationship to other records, in particular analog records. In many cases, the carrier was transferred into the custody of the NAA with computer printouts and other paper records, but because the data could not be accessed, it was not possible to determine the relationships between different records, including whether or not the data were duplicated in paper form.
Commonwealth Government-wide Issue of Legacy Digital Carriers
There was and continues to be an urgent need to understand the scope of the legacy carrier problem in Australian Commonwealth government agencies. No audit of legacy carriers in agencies has been undertaken, so very little information is available to quantify the risks of data loss.
Since 2011, the NAA has issued a series of rolling government-wide policies, in effect rolling five-year plans, to push government agencies along in digital transition, that include actions, targets, and pathways; online self-assessment kits; annual surveys; and other means to measure progress. The first of these, the Digital Transition Policy, was developed by the Department of Prime Minister and Cabinet, with the NAA as the lead agency, and released in 2011.53 The Digital Continuity 2020 Policy was released in 2015, and the latest policy, Building Trust in the Public Record, came into effect in 2021.54 The current policy emphasizes digital preservation and the risks associated with legacy information assets, including assets stored on obsolete or legacy carriers. A release schedule of products developed by the NAA for government agencies includes advice on identifying, managing, and disposing of legacy information assets.
Managing Legacy Carriers in Custody
Accessing data on legacy carriers is still an issue for the NAA. The project described in this article recovered data from 300 carriers (described in Appendix A, Figures 1–4), but excluded carriers in personal records collections and some carrier types, such as data cartridges. The current register of obsolete carriers includes over 250 items, mainly comprising 5.25" floppy disks in personal records collections. In addition, legacy carriers are still being found in paper files and will continue to be transferred to the NAA in this way. Data recovery tools such as Kryoflux, BitCurator, and others will probably form part of an in-house approach to deal with data recovery of remaining carriers in a more cost-effective way than the outsourced approach adopted in the 2003–2006 project.
Access and Delivery
Providing meaningful access to the recovered data remains a pressing issue. Preservation actions, such as identifying suitable migration paths for the recovered files, even if the format and format version can be identified, in many cases may be difficult or impractical. For example, even records in formats created using early word processing software such as WordStar and Corel WordPerfect are not easy to render in modern software, and studies about conducting the preservation actions necessary for accessing the content suggest that migration is also problematic (complex, costly, time consuming, etc.).55 However, advances in scalable emulation services may provide a more viable means of meaningfully accessing complex data in obsolete formats over time.
Of course, if data are lost, access is not possible. Extracting data from legacy carriers is like a game of Russian roulette: the process may or may not work, or may only partially work, at the time of processing, and in some instances it may destroy the carrier. However, the longer the material is left unattended on the carrier, the more problematic recovery will be. Recovering the bits from legacy carriers in a timely manner remains the critical risk mitigation against catastrophic loss.
2023
The digital archivist logs into the workbench and calls up the emulation environment. Today, she is working on a group of files that had been removed from twenty-five 5.25" floppy disks almost twenty years before: records from the Royal Commission into British nuclear tests in Australia in the 1950s and 1960s, an important group of records about events that had long-term consequences for the Indigenous people who had lived there. The series information recorded in the archival control system gives no information about the operating system and application software or the hardware on which they ran, a serious gap in the information gathered about the disks when they were transferred into custody in 1985. Fortunately, useful technical metadata resulting from the data recovery project provides some guidance; the rest is her job to figure out, including how to make these records accessible to the public. She carries out a fixity check to ensure the data files are the same as at the time of their recovery from the disks, and commences work. . . .
The Legacy Media Project is a case study with many lessons learned. It informed policy decisions, for example the transfer policy, which listed the types of carriers the NAA would accept. The project also gathered important metrics on data recovery, including costs and rates of recovery by carrier type (see Appendixes B, C, and D). The success of the project in recovering data of national significance confirmed that the belt-and-braces approach adopted was warranted; although recovery processes and tools are very different today, they are still based on the need to ensure the authenticity, trustworthiness, and integrity of data. Similarly, at the time of the project, metadata standards for the preservation of digital records were in their infancy and not widely used; consequently, the project developed what was in effect a default standard for managing the recovered data that is still useful today (i.e., the working assumption is that it is better to have some metadata, even though it does not completely conform to modern standards, than to have none). Additionally, the project highlighted the need for an archival approach to data recovery, which led to the creation of or influenced a number of software tools and knowledge bases that are still relevant in 2022. The discussion of this early digital-process “history” is therefore important for understanding the development of digital forensics and preservation in archival and library science, a field that is rightly very different today, and for providing a benchmark for future research.
Although the prospective fifth and final phase of the project—to provide the means to meaningfully render the recovered bits—was not commenced at the time, the fact that the bits were recovered and that metadata was extracted, derived, or provided by an operator and has been managed and stored according to early digital preservation principles means that the files can still be rendered. The risk-based approach to data recovery involving extracting multiple copies of the data and recording detailed information about the process, while resource intensive, may prove critical; for example, the disk images obtained in phase 1 could be critical for accessing data via emulation strategies. The key message is that recovering and preserving the bits while the opportunity for recovery exists is essential for future access.
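A fixity check of the kind mentioned in the vignette above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the source does not say which checksum algorithm or manifest format the project used, so SHA-256 and a simple mapping of file names to expected digests are choices made for this example, and the function names `sha256_of` and `verify_fixity` are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(manifest: Dict[str, str], root: Path) -> Dict[str, str]:
    """Compare each file under root with the digest recorded at recovery.

    manifest maps a relative file name to its expected hex digest.
    Returns a status per file: "ok", "changed", or "missing".
    """
    results: Dict[str, str] = {}
    for relative_name, expected in manifest.items():
        target = root / relative_name
        if not target.is_file():
            results[relative_name] = "missing"
        elif sha256_of(target) == expected.lower():
            results[relative_name] = "ok"
        else:
            results[relative_name] = "changed"
    return results
```

A manifest built once at the time of recovery (for example, `{"file001.bin": sha256_of(root / "file001.bin")}`) can then be re-checked before any later rendering or emulation work, giving evidence that the bits being rendered are the bits that were recovered.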
If a recovery project were carried out today on the same carriers, the results would doubtless be different, even assuming that working instances of the access technology were available and the data were recoverable. The bits would still be the same bits (assuming that they had not degraded), but most of the metadata extracted would result from more automated processes: no longer the product of an artisan activity, but part of a more industrial process as described by Peter McKinney.56
In her book, Romances of the Archive, the American academic Suzanne Keen explores the many representations of archival research in twentieth-century fiction, from the ghost stories of H. P. Lovecraft and M. R. James, in which the labors of unwitting antiquarians unlock bogeys from the distant past, to the detective stories of Colin Dexter and P. D. James, in which insoluble crimes are solved in police archives by intrepid detectives, to the literary revelations of A. S. Byatt's Possession, in which academic research in manuscript archives unlocks startling literary secrets.57 The tale told in this article has also uncovered various chimeras, bogeys, and revelations in the data unlocked from the obsolete carriers of the past—and looks forward to new revelations in the future. For that reason, it is a tale worth telling from a different but not so distant past. This is one of the tales from THE disK FILES.



Timeline of the project phases, timeframes, deliverables, and staff

Comparative results after phases 1 and 2

An image of the 9-track ½" magnetic tape that stored the original data (scale in cm). Descriptive metadata labels on the tape include the NAA accession numbers “C1281_2” and “CRS C1281”; information about the collection of the data on the tape (for details of the tape label, see Figure 4); and technical metadata about the manufacturer and the carrier type and capacity: “Wabash Quadronix 1” and “Certified For Use Up To 6250 bpi (bits per inch).” This type of information was collected and entered into the Carrier Treatment Check Sheets.

Close up of details on the faded tape label: “[undecipherable logo] THAILAND REMOTE SENSING CENTER/ LANDSAT 4 MSS BAND 4, 5, 6, 7, Data/WRS PATH 125 ROW 53 DATE 01/06/83/SCENE ID 4032002435/FORMAT TYPE CCRS-AM/TAPE DENSITY 9 TK 1600 PPI (pixels per inch) SEQ 1 OF 1/ 196 Phahonyothin Rd. Bangkok BKK 10 00/ Te 5790116 TEL. X 82213 NRCTRSD TH/Date received from TLS 30/01/84.” This type of information was collected and entered into the Carrier Treatment Check Sheets.

An image of the bit sequence from one of the Landsat files rendered via the hex editor in 2020

The same image in Figure 3, now presented as a rendered TIFF file in 2020 (image process by US Geological Survey)

Details of 3 1/2" floppy disk types in the data recovery project (including brands, format density, etc.)

Details of Burroughs B20 5 1/4" floppy disk types used in the data recovery project (including brands, format density, etc.)

Details of Wang 8" floppy disk types used in the data recovery project (including brands, format density, etc.)

Details of 1/2" magnetic tape (7- and 9-Track) types used in the data recovery project (including brands, format density, etc.)