Source: The Signal Digital Preservation
By Bill LeFurgy
The following is a guest post by Nicholas Taylor, Web Archiving Service Manager for Stanford University Libraries.
I’m inclined to blame the semantic flexibility of the word “archive” for the fact that someone with no previous exposure to web archives might variously suppose that they are: the result of saving web pages from the browser, institutions acting as repositories for web resources, a navigational feature of some websites allowing for browsing of past content, online storage platforms imagined to be more durable than the web itself, or, simply, “the Wayback Machine.” For all the different policies and practices that guide cultural heritage institutions’ approaches to web archiving, however, the “web archives” that they create and preserve are remarkably consistent.
What are web archives, exactly?
At the most basic level, web archives are one of two closely related container file formats for web content: the Web ARChive (WARC) container format or its precursor, the ARC container format. A quick perusal of the data formats used by the international web archiving community shows a strong predominance of WARC, ARC, or both. The ratification of WARC as an ISO standard in 2009 made it an even more attractive preservation format, though both WARC and ARC had been de facto standards since well before then. First used in 1996, the ARC format is more specifically described by the Sustainability of Digital Formats website as the “Internet Archive ARC file format”, a testament both to the outsized contribution of the Internet Archive to the web archiving field and to the recentness of the community’s broadening membership.
The extensive technical metadata carried in each record's header is what distinguishes a web archive from, say, a copy of a web page. Aside from testifying to the provenance and facilitating temporal browsing of the archived data, the variety and ubiquity of record headers also create intriguing opportunities for metadata extraction and analysis.
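To make the idea of header metadata a little more concrete, here is a minimal Python sketch that parses the header block of a single WARC record into a dictionary. The function name `parse_warc_header` and the sample record are made up for illustration; the field names follow the WARC standard, but the URL, date, and record ID are placeholders. The sketch ignores folded (multi-line) header values, which a fuller parser would need to handle.

```python
def parse_warc_header(raw: bytes) -> dict:
    """Parse the header block of one WARC record into a dict.

    Hypothetical helper: assumes CRLF line endings and no folded
    (continuation) header lines.
    """
    # The header block ends at the first blank line.
    head, _, _ = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    fields = {"version": lines[0]}  # e.g. "WARC/1.0"
    for line in lines[1:]:
        # Split on the first colon only, since values may contain colons.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return fields


# An invented record for illustration; the values are placeholders.
SAMPLE_RECORD = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.org/\r\n"
    b"WARC-Date: 2013-01-01T00:00:00Z\r\n"
    b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000001>\r\n"
    b"Content-Type: application/http; msgtype=response\r\n"
    b"Content-Length: 19\r\n"
    b"\r\n"
    b"HTTP/1.1 200 OK\r\n\r\n"
)

if __name__ == "__main__":
    for name, value in parse_warc_header(SAMPLE_RECORD).items():
        print(f"{name}: {value}")
```

Even this small header answers what was captured (`WARC-Target-URI`), when (`WARC-Date`), and what kind of record it is (`WARC-Type`), which is the sort of provenance information a saved copy of a page does not carry.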
If you want to see for yourself, an appendix to the draft WARC specification contains examples of each of the WARC record types, including archived resources. Internet Archive also provides a set of test WARC files for download. Since even archived binary data is stored as (Base64-encoded) ASCII text, the files are surprisingly legible once unzipped and opened in a text editor. It’s not as seamless a way to navigate the past web as, say, Wayback Machine or Memento, but it will give a deeper understanding of the well-considered and widely used data structure that makes those technologies work.
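For readers who would rather script the inspection than scroll through a text editor, the following is a rough Python sketch that walks a gzipped WARC file and tallies its records by type, using only the standard library. The names `iter_warc_headers` and `count_record_types` are invented for this example, and the sketch assumes well-formed records that each declare a `Content-Length`; it is not a substitute for a dedicated WARC library.

```python
import gzip
from collections import Counter


def iter_warc_headers(stream):
    """Yield a dict of header fields for each record in a binary WARC stream.

    Hypothetical helper: assumes every record declares Content-Length.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # skip the blank lines that separate records
        fields = {}
        for raw in iter(stream.readline, b""):
            text = raw.decode("utf-8", errors="replace").strip()
            if not text:
                break  # a blank line ends the header block
            name, _, value = text.partition(":")
            fields[name.strip()] = value.strip()
        # Skip over the content block so payload lines are never
        # mistaken for the start of a new record.
        stream.read(int(fields.get("Content-Length", 0)))
        yield fields


def count_record_types(path):
    """Count records by WARC-Type in a gzip-compressed WARC file."""
    with gzip.open(path, "rb") as stream:
        return Counter(
            header.get("WARC-Type", "unknown")
            for header in iter_warc_headers(stream)
        )
```

Calling `count_record_types("example.warc.gz")`, where the path is a placeholder for one of the downloaded test files, would return a `Counter` keyed by record type such as `warcinfo`, `request`, and `response`.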