Digitising Documents

It’s easy to de-prioritise dealing with a paper-based collection of reports, letters, books, press clippings, images and more that has accumulated over the years. But from improving accessibility to enhancing security, there are some good reasons why it’s worth putting in the effort to digitise a paper-based archive.

Digitising essentially means scanning a physical document to create a digital copy that can be stored electronically. A typical digitisation process will most likely include these steps:

  1. Define the goal and scope
  2. Assess what is in your archive
  3. Determine how documents will be accessed
  4. Identify and obtain the human resources and equipment
  5. Test a sample of your documents
  6. Scan, store and index documents

1. Define the goal and scope

There are four main reasons to digitise: to improve accessibility, to enhance security, to save space, and to allow for sharing.

  1. Improving accessibility: Finding the right document in a roomful of paper is challenging. Digitisation can help because you can place your documentation on an online or offline server, or in a private or public digital library, making it easily accessible to staff or the public. Digitisation will inevitably require you to re-organise your documents in a better way, using a file naming convention to rename documents, making an inventory, grouping them into collections, and deciding on metadata such as categories or tags.
  2. Improving security: Do you run the risk that malevolent groups may seek to destroy or confiscate your documents? Do your documents contain sensitive information on sources or witnesses who need to be protected? Or are your documents subject to storage conditions (humidity, insects, rodents) that make digital preservation a necessity? If the answer is yes to any of these questions, it will be worth the time to digitise your documents, or at least part of them.
  3. Saving space: If your offices are overflowing with boxes and folders, digitising may be a way of saving office space. (Note: it may not necessarily be the more cost-effective solution compared to physical storage of paper documents, and in many cases, simply weeding out unnecessary documents will also save a lot of space.)
  4. Allowing for sharing: Your work may benefit from being able to easily share documents with colleagues within your organisation, with other organisations, or with a wider audience.

The goal is a short definition of the purpose of the project. Is it to protect sensitive information? To share litigation correspondence? To preserve valuable information on violations collected decades ago, for a future truth commission? To make key documents available to the public in a digital library? This statement of goal should be agreed upon and supported by key stakeholders at the outset of the project.

You do not necessarily need to digitise all your documents. To identify the scope of this project, it is important to define what you need to digitise and why, based on the purposes listed above. For example, you may want to digitise:

  • Only documents that contain sensitive information.
  • A selection of documents that you need to publish on your website or in a digital library.
  • A collection of documents of historical value that need to be preserved.
  • All testimonies and interviews, so that they can be stored in a digital vault, and the originals destroyed.
  • Court decisions from your older litigation cases so that a complete collection can be published online.

2. Assess what is in your archive

Start with a practical understanding of what documents you have. This is the kind of information about your archive that you will want to explore:

  • Quantity: If you have piles of paper, one way to estimate how much material you have is to measure how many metres of shelving (or how many boxes) the documents fill, and then sample some shelves or boxes to find the average number of pages per metre or per box. For example: your documents are stored in 50 archiving boxes, with an average of 2,000 double-sided pages per box. You then have 50 x 2,000 = 100,000 pages. Another example: your documents are stored vertically on 10 metres of shelving, with approximately 100 single-sided pages per centimetre. You then have 10 x 100 x 100 = 100,000 pages (10 metres x 100 centimetres per metre x 100 pages per centimetre).
  • Classification system: It is helpful to understand the existing classification system for the paper documents. For example, the documents may be organised by year, then by region, then by event. Or by theme, then by year, then by investigation. Or by year, then by case.
  • Physical quality: Review a sample of at least 10% of the documentation to determine its physical state. For example, what portion was damaged by humidity or by rodents?
  • Types of documents: Are these documents letters, thematic reports, annual reports, interviews, testimonies, books, investigation files, litigation case files, periodicals, grey literature, press clippings…?
  • Ownership: Were the documents created by, and are they the property of, the organisation that holds them, or was part of the collection obtained from other organisations? In the latter case, you will have to investigate whether any conditions were attached regarding the publication and confidentiality of the material.
  • Document retention requirements: Are there documents which should be retained for legal reasons? For example, national law may require you to keep original (paper) signed documents from your litigation cases, or financial records for a period of 10 years.
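
The two quantity estimates in the list above can be reproduced with a few lines of Python. This is only a minimal sketch: the function names are hypothetical, and the averages you pass in should come from your own sampling of boxes or shelves.

```python
def pages_from_boxes(num_boxes: int, avg_pages_per_box: int) -> int:
    """Estimate total pages from a box count and a sampled average per box."""
    return num_boxes * avg_pages_per_box


def pages_from_shelving(shelf_metres: float, pages_per_cm: float) -> int:
    """Estimate total pages from shelf length in metres and sampled pages per centimetre."""
    centimetres = shelf_metres * 100  # 100 centimetres per metre
    return round(centimetres * pages_per_cm)


print(pages_from_boxes(50, 2_000))   # 100000
print(pages_from_shelving(10, 100))  # 100000
```

Keeping the estimate in a small script makes it easy to redo the sum when a second sample gives a different average.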

It is important to actually physically work with the documentation to make this assessment. You may be surprised at what has accumulated over the years or what is missing and needs to be located. These findings should be described in a concise but very precise document, which will be a useful basis for further decisions. We are sharing a few examples of our own archive assessment documents so you can see the kind of detail that’s helpful:

  1. This is a document we prepared for the International Commission of Jurists for their digitisation project (2012). HURIDOCS assisted the ICJ by scanning and digitising over 800 publications from the period 1952 to 2007.
  2. This is a short assessment we prepared for the KontraS (The Commission for the Disappeared and Victims of Violence) archive of human rights documents (2012). The KontraS archive was made available online in March 2016.

3. Determine how the documents will be accessed

Once all the documents are scanned, you’ll need a way for people to access them, whether that is just for your team or for the public. This could mean creating a simple shared folder on a particular server, implementing a software programme that allows you to build and share a collection of documents (like the HURIDOCS-developed tool Uwazi), or partnering with an institution that curates human rights archives such as Open Society Archives or Duke University Human Rights Archive.

Whether your documents will be accessed publicly or limited to your team, you will want to develop a file naming convention that can be implemented as they are scanned.
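
As an illustration of what a file naming convention might look like in practice, here is a small Python sketch. The pattern used (collection, ISO date, short title, four-digit sequence number) is an assumption made for this example, not a prescribed standard, and the function names are hypothetical. Adapt the parts to your own classification system.

```python
import re
import unicodedata
from datetime import date


def slugify(text: str) -> str:
    """Lower-case the text, strip accents, and join words with hyphens."""
    ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


def build_filename(collection: str, doc_date: date, title: str, sequence: int, ext: str = "pdf") -> str:
    """Assemble a name such as collection_YYYY-MM-DD_title_0001.pdf (assumed pattern)."""
    return f"{slugify(collection)}_{doc_date.isoformat()}_{slugify(title)}_{sequence:04d}.{ext}"


print(build_filename("Litigation", date(1998, 3, 14), "Court decision: Case 12", 7))
# litigation_1998-03-14_court-decision-case-12_0007.pdf
```

Putting the date in year-month-day order means files sort chronologically within each collection, and avoiding spaces and accented characters keeps names portable across operating systems.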

4. Identify and obtain the human resources and equipment you’ll need

Now that you have a better understanding of the files in your physical archive, and you know the goal and scope for this project, you can begin to identify what resources it will take to carry out your digitisation effort.

First and foremost, you will need to select and purchase the right scanner for your project. What to look for in a scanner:

  • It should have a high duty cycle, meaning you should be able to scan all day, five days a week, without the scanner overheating.
  • It should have a feeder tray, allowing you to scan 50- or 100-page documents in one go.
  • It should be duplex, meaning it can scan double-sided documents.
  • It should be fast – meaning at least 30 pages per minute.

Will your project require optical character recognition (OCR) software? OCR software will help your computer analyse printed text and translate it into something it can process. It is particularly helpful for digital archives of documents because it allows you to search for text within scanned documents. Some scanners come with a version of this software, but your project might require something more robust.
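
To show why OCR matters for retrieval, the sketch below searches the text of scanned documents for a keyword. It assumes the OCR step has already been run by whichever tool you choose and that its output is available as plain text per file. The sample data and the `search` function are hypothetical.

```python
# Hypothetical OCR output: file name mapped to the text recognised in that scan.
ocr_text = {
    "testimony_0001.pdf": "Testimony collected in the northern region in 1994.",
    "report_0002.pdf": "Annual report on litigation activities.",
}


def search(term: str, texts: dict) -> list:
    """Return the names of files whose OCR text contains the term, ignoring case."""
    needle = term.casefold()
    return sorted(name for name, text in texts.items() if needle in text.casefold())


print(search("testimony", ocr_text))  # ['testimony_0001.pdf']
```

Without OCR, a scan is only an image of the page, and a search like this would have no text to look through.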

How many people will be working on the project? Digitisation is a time-consuming task. It’s possible that you may need two people per scanner (one to feed the documents, and one to name the digital files). Depending on your timeframe, you may need to dedicate more human resources to the task.

Like any other organisational activity, a digitisation project needs to be planned, managed and budgeted, and it must receive an appropriate investment of time and/or resources to succeed.

5. Test a sample of your documents

Given all the information you have collected so far, you may be able to make a first prediction. For example, if you have 200,000 pages to scan and your scanner can realistically scan about 2,000 pages per day, then it will take 100 working days with one person, one scanner and one computer. If you work with two teams, each with a scanner and computer, it will take 50 days.

But this example is only a guess and assumes optimal conditions. The real amount of time may be much higher. It will depend on the quality of the paper, the quality of the print, the speed of the scanner and computer, and the speed and accuracy of your staff. For example, non-standard documents (such as very thin or oversized pages) may need to be scanned manually, page by page, while normal A4 paper can be processed more easily via the feeder tray.

In the end, only a pilot test with a sample of your documents will tell you how long scanning and naming will really take.
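
The planning arithmetic above, and the way a pilot test can replace the guessed daily rate with a measured one, can be sketched as follows. The pilot figures (600 pages in 3 hours, a 7-hour working day) are invented for illustration, and the function names are hypothetical.

```python
import math


def scanning_days(total_pages: int, pages_per_day: int, teams: int = 1) -> int:
    """Working days needed, rounded up, when each team has its own scanner and computer."""
    return math.ceil(total_pages / (pages_per_day * teams))


def pages_per_day_from_pilot(pilot_pages: int, pilot_hours: float, hours_per_day: float) -> int:
    """Extrapolate a daily rate from a timed pilot run that includes scanning and naming."""
    return int(pilot_pages / pilot_hours * hours_per_day)


print(scanning_days(200_000, 2_000))           # 100
print(scanning_days(200_000, 2_000, teams=2))  # 50

# Hypothetical pilot: 600 pages scanned and named in 3 hours, 7-hour working day.
rate = pages_per_day_from_pilot(600, 3, 7)     # 1400 pages per day
print(scanning_days(200_000, rate))            # 143
```

In this invented case the measured rate is well below the scanner's nominal throughput, so the estimate grows from 100 to 143 working days.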

6. Scan, store and index documents

Now you’re ready to get started! As you begin the scanning phase, here are a few things to keep in mind:

Naming documents: Remember to give each document a systematic name, using the file naming convention that you developed. Assign the name at the time the document is scanned.

Adding metadata: Metadata refers to the descriptive information stored with your digitised documents, such as author, date of creation or thematic categories. Don’t skip this step: having good metadata will help you and others find your documents in the future.
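
One simple way to record metadata is an inventory spreadsheet with one row per scanned file. The Python sketch below writes such an inventory in CSV format. The column names and the sample record are assumptions chosen for illustration, and a tool such as a digital library platform may expect a different layout.

```python
import csv
import io

# Assumed metadata columns; adjust them to the fields your project has agreed on.
FIELDS = ["filename", "title", "author", "date_created", "categories"]

# Hypothetical sample record.
records = [
    {
        "filename": "reports_1998-03-14_annual-report_0001.pdf",
        "title": "Annual report 1997",
        "author": "Example Organisation",
        "date_created": "1998-03-14",
        "categories": "annual report;accountability",
    },
]


def write_inventory(rows, stream) -> None:
    """Write a header line followed by one CSV line per document record."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()  # in-memory stand-in for a file opened with newline=""
write_inventory(records, buffer)
print(buffer.getvalue())
```

Filling in the row at the moment the file is named keeps the inventory and the scans in step, which is much easier than reconstructing metadata afterwards.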

Before scanning:

  • Check whether there are pages with text on both recto and verso (that is, on both the front and the back of the page)
  • Check whether there are pages on particularly thin paper – these should be scanned one by one
  • Check whether there are pages in a format bigger than A4 – if so, reduce the format (either with the scanner, if it has this option, or otherwise with a photocopier)
  • When you are scanning books, check whether multiple copies are still available. If so, you can remove the binding from one copy and scan it using the feeder tray, which is much quicker than scanning page by page.

After scanning:

  • Ensure that the documents are re-stapled correctly, immediately after scanning
  • Ensure that the physical documents are returned shortly after scanning to the appropriate folder
  • Ensure that folders are returned shortly after scanning to the correct shelf

Further resources