Web archiving

As part of the Magazzini Digitali service, the National Central Library of Florence (BncF) collects, preserves and provides permanent access to internet content of Italian cultural and historical interest.


The project

Websites and the documents they contain are often regarded as digital “ephemera”: anyone who has browsed the internet will have come across “broken links” and the resulting 404 errors. Nevertheless, websites have undeniably become an essential source of information for contemporary history and culture.

For this reason, in 2018, as part of its wider long-term service for the preservation of and access to digital publications, the BncF launched a web archiving programme that is similar and complementary to the efforts of major memory institutions around the world.

In line with the legislation on the “Legal deposit of documentation of cultural interest destined for public use” (Law 106/2004 and Presidential Decree 252/2006), the programme focuses mainly on collecting:

  • documentation and websites that ensure continuity with previously established collections, including those held on traditional media or produced with traditional technologies.
  • documentation and websites concerning the scientific output of universities, research centres and cultural institutions.
  • documentation and websites created and published on the internet by public entities.

The library uses the Archive-it platform for harvesting and accessing archived websites.

Unless specific requirements apply, harvesting usually takes place about twice a year.


How to participate

In Italy, legal deposit is not mandatory for documents disseminated over digital networks, so participation in the programme is voluntary.

Requests to participate should be sent by email to bnc-fi.magazzinidigitali@cultura.gov.it. If the resources in question are deemed suitable for archiving, applicants will be asked to fill in the dedicated online form.

The library reserves the right to contact participating organisations and institutions afterwards to define the harvesting arrangements and assess the technical requirements.


Technical requirements for harvesting

In order to allow harvesting (automatic collection), websites must:

  • grant access to the Archive-it crawler, which identifies itself as archive.org_bot.
  • where a robots.txt exclusion file is in place, include an exception for that crawler.
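
As a rough illustration of the two requirements above, the following Python sketch checks whether a given robots.txt would let the archive.org_bot user agent fetch a page. The robots.txt content, the example.org URLs and the helper name crawler_allowed are assumptions made for illustration; only the user-agent name comes from the requirements above. The sketch relies on the standard library's urllib.robotparser, whose matching rules may differ in detail from those applied by the Archive-it crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks /private/ for crawlers in general,
# but adds an explicit exception for the Archive-it crawler.
EXAMPLE_ROBOTS_TXT = """\
User-agent: archive.org_bot
Allow: /

User-agent: *
Disallow: /private/
"""


def crawler_allowed(robots_txt: str, url: str, agent: str = "archive.org_bot") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    page = "https://example.org/private/report.pdf"
    # Check the Archive-it crawler first, then an arbitrary other crawler.
    print(crawler_allowed(EXAMPLE_ROBOTS_TXT, page))
    print(crawler_allowed(EXAMPLE_ROBOTS_TXT, page, "SomeOtherBot"))
```

Running the same check against a site's real robots.txt before requesting participation can reveal whether an exception still needs to be added.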

The following measures are also recommended:

  • Bring together publications of cultural interest on a single page and/or in a single directory of the website (e.g. “Publications”), or in uniform subsections (e.g. “Mobility” > “Documentation”, “Social services” > “Documentation”). This not only makes searching and access easier for general website users, but also speeds up the selection, harvesting and metadata description of material for conservation purposes.
    The Sitemap protocol may also be used to give the Archive-it crawlers more precise indications of which pages should be crawled.
  • Use consistent file names that reflect the content and its relationship to other documents (e.g. different series of a particular magazine, issues in a series).
  • Avoid publishing multiple versions of the same file in different areas of the website; use internal links to a single copy instead.
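
The Sitemap recommendation above can be sketched as follows. This is a minimal example, assuming a site that groups its publications under uniform “documentation” sections; the example.org URLs and the function name build_sitemap are hypothetical, while the XML namespace is the one defined by the Sitemap protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(page_urls: list[str]) -> str:
    """Build a minimal Sitemap-protocol document listing the given URLs."""
    ET.register_namespace("", SITEMAP_NS)  # serialise xmlns without a prefix
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page_url in page_urls:
        url_element = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url_element, f"{{{SITEMAP_NS}}}loc").text = page_url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    # Hypothetical pages that hold the publications to be harvested.
    print(build_sitemap([
        "https://example.org/mobility/documentation/",
        "https://example.org/social-services/documentation/",
    ]))
```

The resulting file is typically published as sitemap.xml at the root of the website, so that crawlers can find the listed pages without having to discover them through navigation.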

Limits to harvesting

  • Websites or sections of websites with restricted access can be harvested if the BncF is provided with the relevant credentials; harvesting is not possible if the website uses CAPTCHA.
  • Websites and/or sections of websites built with Flash or relying heavily on JavaScript are notoriously difficult for search engines to index, because crawlers primarily read HTML; for the same reason, current technology cannot harvest them reliably. The use of these technologies is not recommended.
  • Documentation made available through a viewer integrated into the website (e.g. Sfogliami.it, PressReader, etc.) may be harvested, but can almost never be displayed by the replay systems currently used by Archive-it.
    Where these platforms must be kept for reasons of access, a downloadable version of the documentation or an alternative deposit method should be provided.
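
One way to check the last point is to look at a page's static HTML, without running any JavaScript, and see whether it contains direct links to downloadable documents. The sketch below is only a heuristic under stated assumptions: the class and function names, the list of file extensions and the sample markup are all hypothetical, and it does not describe how the Archive-it crawler itself works.

```python
from html.parser import HTMLParser

# Assumed set of extensions treated as "downloadable documents".
DOCUMENT_EXTENSIONS = (".pdf", ".epub")


class DocumentLinkFinder(HTMLParser):
    """Collect href values of <a> tags that point to downloadable documents."""

    def __init__(self) -> None:
        super().__init__()
        self.document_links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Ignore any query string or fragment when checking the extension.
        path = href.split("#", 1)[0].split("?", 1)[0]
        if path.lower().endswith(DOCUMENT_EXTENSIONS):
            self.document_links.append(href)


def find_document_links(html_text: str) -> list[str]:
    """Return the document links found in the static HTML of a page."""
    finder = DocumentLinkFinder()
    finder.feed(html_text)
    finder.close()
    return finder.document_links


if __name__ == "__main__":
    # Hypothetical page: an embedded viewer plus a plain download link.
    sample_page = """
    <iframe src="https://viewer.example/issue-01"></iframe>
    <a href="/publications/magazine-2023-01.pdf">Download the issue (PDF)</a>
    """
    print(find_document_links(sample_page))
```

A page for which this check finds no links, and which shows its documents only through an embedded viewer, is a likely candidate for adding a downloadable version.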

Website archivability

The library has drawn up a list of criteria for website archivability, based on good practices widely adopted by memory institutions around the world.

These criteria will become mandatory once legislation on the legal deposit of documentation disseminated over digital networks is implemented.


Access to collections

Archived websites are organised into thematic collections within the wider BncF collection on Archive-it:

    1. Associations
    2. Domain .it (2006)
    3. Research organisations and institutions
    4. Cultural organisations and institutions
    5. Institutions belonging to the Ministry of Culture (previously the Ministry of Cultural Heritage and Activities)
    6. Open Access Books
    7. Open Access Journals
    8. Professional registries and associations
    9. Public administration
    10. Local history
    11. News publications and websites

When filling in the form to request participation in the service, website owners can choose whether to allow public access from any online terminal or to restrict access exclusively to the BncF internal network.


Useful links


Online contributions in Italian

The following list is partial and is constantly growing.

2023

2022

2020
Web archiving e pandemia

2019

2018

2006


Contacts

Enquiries can be made by writing to or calling:
Chiara Storti | Head of Magazzini Digitali and Web Archiving
bnc-fi.magazzinidigitali@cultura.gov.it
chiara.storti@cultura.gov.it
tel. 055 24919 73