Website archivability

Nowadays, website developers and designers need to take into account criteria such as accessibility, performance, SEO (Search Engine Optimization), compatibility with W3C standards, and usability. Among the good practices coming to the forefront in this sector, one that deserves particular consideration is website archivability. Archivability is a term that brings together the various characteristics that the content, structure, functionality and interfaces of a website need to have to allow its long-term preservation and accessibility with contemporary web archiving tools.

Guidelines for making archivable websites

  1. Ensure that the structure of the website is in line with leading accessibility standards
    Alignment with leading accessibility standards ensures both the usability of the site for all categories of user and its accessibility to archiving web crawlers such as Heritrix.
    For further information: the W3C Web Accessibility Initiative (WAI).
  2. Maintain stable URLs for important content, and redirect to new URLs only when necessary
    Maintaining stable links ensures that users can easily navigate different versions of an archived website over time and that bookmarks and content shared via social media continue to be accessible.
  3. Provide the website with an XML and/or RSS sitemap
    Providing an XML or RSS (RDF Site Summary / Really Simple Syndication) sitemap allows search engines to access all the resources contained within the website, including pages that use Flash or JavaScript navigation, which tend to hide links, and shows crawlers from archiving organisations which content is to be included in or excluded from collection.
  4. Associate an HTML/XHTML link with each item of website content (pages, images, videos, documents)
    It is advisable to avoid JavaScript or Flash content, particularly on the homepage, as the reconstruction of website addresses generated dynamically via JavaScript often results in the creation of non-existent addresses (Error 404). It is also important to remember that as of 1 January 2021, Adobe no longer provides support for Flash technology.
    In any case, it is advisable to also provide an HTML or XHTML text description of non-text content to facilitate both indexing by crawlers and subsequent full-text archive searches.
  5. Remove robots.txt restrictions or limit them to areas that are not required for archiving purposes
    The use of robots.txt to exclude directories containing scripts and stylesheets, which generally does not affect web-page indexing by search-engine crawlers, may however hinder the proper viewing of certain essential resources in archived websites.
    By linking a robots.txt file to an XML sitemap, a website manager can decide what content to include in or exclude from crawler archiving activities.
    If an open-source content management system (CMS) is used, make sure the robots.txt configuration is updated to allow access to the Archive-It bot: archive.org_bot
  6. Avoid proprietary formats for important content, especially the home page
    The use of open standards and formats ensures the long-term accessibility of content, simplifying the archiving and redistribution of content by archiving organisations.
    Make sure that main content is published in established and well-documented open formats and, whenever possible, released under Creative Commons licenses.
  7. Limit the use of content held in third-party websites
    Wherever possible, make sure that video, audio and other similar content is embedded in your own website or web page, rather than being held exclusively in third-party websites: crawling software is not always able to associate content from external websites with the website in which it is used.
  8. Use unambiguous website addresses that contain information on the content
    If your content management system (CMS) allows, configure it so that website addresses include the publication date and at least an abbreviated version of the content title.
    The use of specific page titles and <META> description elements not only improves the presentation of search results, but also allows archiving organisations to define access points and descriptive resource records. A website address that communicates information on the content of the resource provides additional material that can be used to identify the new position of lost sites as well as any previously unidentified versions in the archive.
    Also make sure that the date of publication or of the most recent update is exposed via an HTTP Last-Modified header, which helps users understand when the content was published or last changed.
    Configuring web servers to provide unambiguous and reliable HTTP status codes will facilitate detection and reduce superfluous crawler requests to a minimum, improving the interoperability of the website for both users and search engines.
  9. Indicate the type of media and character encoding used
    Use a Content-Type HTTP header, an HTML meta tag or an XML declaration to indicate the media type and character encoding needed to render content properly: this allows the browser to interpret the page content and facilitates indexing.
    Indicating media types also helps the browser understand which files should be processed directly and which are to be delegated to other support applications.
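
Guideline 2 above can be illustrated with a permanent (301) redirect from a retired address to its stable successor. This is a sketch for an Apache server; the paths are illustrative:

```
# Apache (mod_alias): permanently redirect a moved page to its new stable URL,
# so old bookmarks, shared links and archived copies keep resolving
Redirect permanent /old/annual-report.html /2021/01/annual-report
```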
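
A minimal XML sitemap, as recommended in guideline 3, lists each resource the crawler should collect. The URLs and dates below are placeholders:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- One <url> entry per resource to be crawled and archived -->
  <url>
    <loc>https://www.example.org/2021/01/archivability-guidelines</loc>
    <lastmod>2021-01-15</lastmod>
  </url>
  <url>
    <loc>https://www.example.org/about</loc>
    <lastmod>2020-11-02</lastmod>
  </url>
</urlset>
```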
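
A robots.txt file along the lines of guideline 5 might look as follows; the directory name and sitemap URL are illustrative:

```
# Allow the Archive-It / Internet Archive crawler full access
User-agent: archive.org_bot
Disallow:

# Keep other crawlers out of purely administrative areas only;
# do not block scripts or stylesheets, which archived pages need to render
User-agent: *
Disallow: /admin/

Sitemap: https://www.example.org/sitemap.xml
```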
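
The reliable status codes and Last-Modified header recommended in guideline 8 might appear in a response as follows (all values illustrative):

```
HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Last-Modified: Fri, 15 Jan 2021 10:30:00 GMT
```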
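
For guideline 9, the character encoding can also be declared in the page itself, alongside the Content-Type HTTP header. A sketch with illustrative values:

```html
<!-- Matching HTTP header: Content-Type: text/html; charset=UTF-8 -->
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <!-- Older HTML4-style equivalent: -->
  <!-- <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> -->
  <title>Archivability guidelines</title>
</head>
<body>...</body>
</html>
```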
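
The effect of such robots.txt rules can be checked locally with Python's standard-library urllib.robotparser. A sketch, assuming the hypothetical rules just described:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules in the spirit of guideline 5: the archiving
# bot is fully allowed, other crawlers are kept out of /admin/ only
rules = """\
User-agent: archive.org_bot
Disallow:

User-agent: *
Disallow: /admin/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)
parser.modified()  # mark the rules as loaded so can_fetch() evaluates them

# The archiving bot may fetch everything, including /admin/
print(parser.can_fetch("archive.org_bot", "/admin/"))    # True
# A generic crawler is blocked from /admin/ but not from content pages
print(parser.can_fetch("SomeOtherBot", "/admin/"))       # False
print(parser.can_fetch("SomeOtherBot", "/2021/01/post")) # True
```

An empty Disallow line is how robots.txt expresses "allow everything" for the user agent it applies to.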

Evaluation

The ArchiveReady evaluation tool can be used to assess whether a website meets archivability criteria.


Useful links and sources