5. Creating your data dictionary

What is this? This resource is intended to help you determine the key questions you want your database to answer, based on your goals and user needs.

Who is this for? This resource is for human rights defenders who are documenting violations in their communities.

How was it created? This resource was created by human rights defenders. Anyone can suggest changes. Ideas that need some expansion are flagged with a sprout 🌱.

How to use this resource #

If you have gone through the process of designing your conceptual data model and developing your controlled list of terms, it is important to document the terminology and data standards that you have set out for the purposes of managing your information. In this resource, we will introduce you to what a data dictionary is and why it is important to the long-term health of your database.

Our intention with this resource is to introduce you to a way of thinking about documenting the rationale underlying the decisions you have made about the design of your database. While it may seem a tedious step to document all of the terminology and data standards you have chosen, it will ultimately be useful to future users of your database.

What is a data dictionary and why is it important? #

A data dictionary is essentially a rulebook for: 

  • what data to include in your database, 
  • how it should be structured, 
  • how it should be entered into the database, and 
  • how it should be accessed. 

This dictionary will be a repository of names, definitions, and attributes that provide contextual information about the data in your database. The dictionary describes each database field with a clear definition of what information is captured in that field and the rules for utilising that field.  

When you begin to develop your data dictionary and determine the elements you should include, it is important to consider what would be important for someone who is not familiar with your database to know about the data held within. It should be the go-to tool for anyone to understand everything about your data set. 

A data dictionary is critical to the sustainability of any database, especially when it is used by multiple people. Perhaps you know how to read and interpret your data. However, if you were to leave your organisation tomorrow, would another person know how to access, read, and interpret the data you have gathered? Your data dictionary will ensure that, no matter what, documentation exists which will explain everything someone needs to know about your database. 

Your database designer can, for example, use the data dictionary to correctly arrange tables and fields in your database, while cataloguers can reference the dictionary to make sure they enter information correctly for each record. Additionally, when you report on the results of your data, having a data dictionary at hand to explain your findings makes your process more transparent. Keeping a data dictionary will help you: 

  • quickly detect anomalies in your data, 
  • maintain the quality and reliability of your data, and 
  • build transparency of data processes across your organisation. 

What should you include in your data dictionary? #

As the ultimate user guide for your database, your data dictionary should clearly explain the components present in your data. You have taken the time to think through why your database is important, how it will address your users’ needs, what your conceptual data model looks like, and what terms should be included in your controlled vocabulary. Your data dictionary is a space to document your decision-making and record the systems you have put in place to organise your data. Some things you should consider including in your data dictionary are: 

  • Your controlled list of terms containing all the names and definitions for the terms in your database. 
  • Your conceptual data model containing: 
    • the entities in your database,
    • the attributes you use to describe the entities in your database, and
    • the relationships you have identified between your entities.
  • Your data entry standards, or rules, which describe the standards for collecting, recording, and representing data in your database.  

As you build your database and begin to introduce it to different users, you might also consider expanding your data dictionary to offer information on the different functionalities of your database and how to access the information it holds.

How to create your data dictionary #

In order to create a data dictionary, it is important to first consider the following questions: 

  1. What does each element in your data represent? What is it describing? 
  2. How did you collect each variable? How did you measure it?
  3. What are the tests you run to ensure the trustworthiness and validity of your data?
  4. Who collected the data and who has interacted with it since its initial collection? It is important to consider how data has been collected and changed throughout its lifespan.

Once you have considered these questions, you will be ready to develop your data dictionary.

Step 1: Pull together your terms #

Throughout this process of preparing for your database, you have identified the entities and attributes you will include in your database. In developing your controlled vocabulary, you have gathered all the terms necessary to answer the questions you would like to ask your database. Now it is time to collect all these terms, along with their definitions, in one central location.

Step 2: Identify important information about each term #

Your data dictionary will identify important information about each element in your database, so consider what a user will find useful when inputting or accessing information. 

Your data dictionary should include the element name exactly as it will appear in your database and the element definition as defined in your controlled terms list. This definition will reflect the way you use the term and intend the term to be used by fellow cataloguers or database users. If we take, for example, the organisation monitoring the killing of journalists in their country, as described in the previous resource section on Determining your controlled list of terms, they would include in their data dictionary the specific definitions of the types of killing they monitor and track in their database to ensure a consistent understanding of each type across all actors in their organisation. They might also add an element description if a longer explanation is needed to clarify the definition of an element.

You should include in your data dictionary which fields are optional or required, as well as the element type which describes whether the field is text, numeric, date/time, enumerated list, look-ups, booleans, unique identifiers, etc. Identifying the element type will help ensure consistency in data entry and ultimately help you to draw well-founded insights from the data. 
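One way to picture a single entry is as a small structured record. The sketch below is only an illustration under assumptions: the class name `DictionaryEntry`, the function `check_record`, and the sample fields (`incident_type`, `notes`, and the values `type_a` and `type_b`) are hypothetical and do not come from any particular database tool. It shows how the element name, definition, element type, and required/optional flag could be stored, and how they could be used to check a record.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DictionaryEntry:
    """One entry of a data dictionary, describing a single database field."""
    name: str                # element name exactly as it appears in the database
    definition: str          # definition taken from the controlled list of terms
    element_type: str        # e.g. "text", "numeric", "date/time", "enumerated list"
    required: bool           # True if the field must be filled in
    description: Optional[str] = None  # optional longer explanation
    accepted_values: List[str] = field(default_factory=list)  # empty = unrestricted


def check_record(record: dict, entries: List[DictionaryEntry]) -> List[str]:
    """Return a list of readable problems found in one record."""
    problems = []
    for entry in entries:
        value = record.get(entry.name)
        if value in (None, ""):
            if entry.required:
                problems.append(f"{entry.name}: required field is empty")
            continue
        if entry.accepted_values and value not in entry.accepted_values:
            problems.append(f"{entry.name}: '{value}' is not an accepted value")
    return problems


# Hypothetical entries, for illustration only.
entries = [
    DictionaryEntry(
        name="incident_type",
        definition="Type of incident, as defined in the controlled list of terms",
        element_type="enumerated list",
        required=True,
        accepted_values=["type_a", "type_b"],
    ),
    DictionaryEntry(
        name="notes",
        definition="Free-text remarks from the cataloguer",
        element_type="text",
        required=False,
    ),
]

print(check_record({"incident_type": "type_c"}, entries))
```

Keeping the accepted values next to the definition means the same entry can serve both as documentation for cataloguers and as a simple check on incoming records.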

Finally, you will want to include in your data dictionary any element rules explaining the process for properly inputting information and the range of values or accepted values for the element. If we take the example from the previous section on recording the location of the killing in a free text field, we remember that differing standards of entering information in the database can result in confusing outputs. Without a rule in place for entering location information, different cataloguers may record the same place in different ways.

This results in inconsistent data that makes it difficult to draw conclusions about the location information available. To avoid this confusion, document the rules for entering free text data to ensure consistent standards across the board. 
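Once a rule for a free text field is written down, it can also be checked automatically. The following is a minimal sketch that assumes a hypothetical rule, "City, Region", with each part starting with a capital letter. The rule, the function names, and the sample values are all invented for illustration, and your own dictionary may set a different standard.

```python
import re

# Hypothetical rule: location is entered as "City, Region", each part
# starting with a capital letter, separated by a comma and one space.
LOCATION_RULE = re.compile(r"[A-Z][^,]*, [A-Z][^,]*")


def follows_location_rule(value: str) -> bool:
    """Return True if the whole value matches the hypothetical rule."""
    return LOCATION_RULE.fullmatch(value) is not None


def flag_nonconforming(values):
    """Return the entries that a cataloguer should review and correct."""
    return [v for v in values if not follows_location_rule(v)]


# Invented sample entries, for illustration only.
sample = ["Exampletown, North Province", "N. Prov - Exampletown", "exampletown"]
print(flag_nonconforming(sample))
```

A check like this only confirms the shape of an entry, not that the place exists, so it complements rather than replaces review by a cataloguer.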

Step 3: Document your dictionary #

Now that you have gathered all the information to include in your data dictionary, document it in a central location accessible to all staff and users for whom it might be useful. Data dictionaries are for sharing. You might consider using a Google Doc or spreadsheet that is open to database users, or you could document your dictionary in a PDF format similar to the Data Dictionary + Controlled Vocabulary created by WITNESS and Berkeley Copwatch. Regardless of the format you choose, it is important that your data dictionary be accessible to everyone who might need to use your database.
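If you keep your dictionary in a spreadsheet, a plain CSV file is one format that most spreadsheet tools can open. The sketch below assumes a hypothetical set of columns (`name`, `definition`, `type`, `required`, `rules`) and an invented sample row. It uses only Python's standard `csv` module and returns the CSV text so that you can save or share it however you prefer.

```python
import csv
import io

# Hypothetical column layout for a shared data dictionary.
COLUMNS = ["name", "definition", "type", "required", "rules"]


def dictionary_to_csv(rows) -> str:
    """Serialise dictionary rows (a list of dicts) to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Invented sample row, for illustration only.
rows = [
    {
        "name": "incident_date",
        "definition": "Date the incident took place",
        "type": "date/time",
        "required": "yes",
        "rules": "Enter as much of the date as is known",
    }
]

csv_text = dictionary_to_csv(rows)
print(csv_text)
```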

Step 4: Revisit and revise your data dictionary #

As you review and update your data model or controlled vocabularies over time, your data dictionary will need to be updated to reflect these changes. It is important to note for users when the dictionary was last updated and to record a plan for revisiting and revising your data model and vocabularies for future staff or users to reference. It may also help future staff and database administrators if you keep, alongside your data dictionary, a brief record of the steps you took to conceptualise your data model and gather your terms. This record could outline the goals you have defined for this information management project, the user personas you envision it being useful for, and the research questions you aim to answer. By planning periodic reviews of the data dictionary in advance, you can ensure that it remains up to date throughout the evolution of your information management project. 

Challenges and advice 🌱 #

When you don’t have the exact date #

How do you create a standard for entering dates when the exact date of an incident or event is unknown? One solution is using “fuzzy dates”, which are essentially incomplete dates. You can refer to ISO standards on date formats (ISO 8601) and capture as much information as you have available. For example, if you do not know the day and month an individual was born, but you do know the year, you can enter only the year in the ‘Birth date’ field of the individual record. 

Such fuzzy date fields allow the following formats:

  • Year, month, and day (e.g. 2011-01-01)
  • Year and month (e.g. 2011-01-)
  • Year only (e.g. 2011-)

Note that, if you are using a spreadsheet to manage your information, you can leave a trailing hyphen to stop the spreadsheet from converting the value to a date, so that the data type remains “text”. 
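The three formats listed above can be checked with a short script. This is a minimal sketch, assuming the trailing-hyphen convention described in this section. The function name `parse_fuzzy_date` is hypothetical, and the month and day range checks are an addition that the formats above imply but do not state.

```python
import calendar
import re
from typing import Optional, Tuple

# Matches "YYYY-", "YYYY-MM-" and "YYYY-MM-DD" (trailing hyphen kept
# whenever the month or day is unknown).
FUZZY_DATE = re.compile(r"(\d{4})-(?:(\d{2})-(\d{2})?)?")


def parse_fuzzy_date(text: str) -> Tuple[int, Optional[int], Optional[int]]:
    """Split a fuzzy date into (year, month, day); unknown parts are None."""
    match = FUZZY_DATE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a recognised fuzzy date: {text!r}")
    year = int(match.group(1))
    month = int(match.group(2)) if match.group(2) else None
    day = int(match.group(3)) if match.group(3) else None
    if month is not None and not 1 <= month <= 12:
        raise ValueError(f"month out of range in {text!r}")
    if day is not None:
        days_in_month = calendar.monthrange(year, month)[1]
        if not 1 <= day <= days_in_month:
            raise ValueError(f"day out of range in {text!r}")
    return year, month, day


print(parse_fuzzy_date("2011-01-01"))
print(parse_fuzzy_date("2011-01-"))
print(parse_fuzzy_date("2011-"))
```

Values that do not follow one of the three formats, such as a year and month without the trailing hyphen, raise a `ValueError` so that they can be corrected at the point of entry.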

Use a confirmatory field for each part of the date, as shown below.

When the full date is known:

  • date: 2011-01-01
  • date_year: 2011
  • date_month: 1
  • date_day: 1

When only the year and month are known:

  • date: 2011-01-
  • date_year: 2011
  • date_month: 1
  • date_day: NULL

This means you can also check your dates by reconstructing the date from the confirmatory fields and seeing if it matches the actual date field.
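A minimal sketch of that check follows. It assumes the field names `date`, `date_year`, `date_month`, and `date_day` used in this section, and it represents NULL as Python's `None`. The function names are hypothetical.

```python
from typing import Optional


def rebuild_fuzzy_date(year: int, month: Optional[int] = None,
                       day: Optional[int] = None) -> str:
    """Rebuild the fuzzy date text from its confirmatory fields."""
    if month is None and day is not None:
        raise ValueError("a day cannot be recorded without a month")
    text = f"{year:04d}-"
    if month is not None:
        text += f"{month:02d}-"
        if day is not None:
            text += f"{day:02d}"
    return text


def date_matches_parts(record: dict) -> bool:
    """Return True if the date field agrees with its confirmatory fields."""
    rebuilt = rebuild_fuzzy_date(
        record["date_year"], record.get("date_month"), record.get("date_day")
    )
    return rebuilt == record["date"]


full = {"date": "2011-01-01", "date_year": 2011, "date_month": 1, "date_day": 1}
partial = {"date": "2011-01-", "date_year": 2011, "date_month": 1, "date_day": None}
print(date_matches_parts(full), date_matches_parts(partial))
```

A record where the two disagree, for example a date of 2011-02- with a month field of 1, would return `False` and can be flagged for review.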

For more information about “fuzzy dates”, see: https://kb.blackbaud.com/knowledgebase/Article/41804

Further Reading #

UC Merced Library. What is a Data Dictionary? (N.d.). Last accessed February 17, 2022 at https://library.ucmerced.edu/data-dictionaries

OSF. How to make a data dictionary – OSF guides. (N.d.). Last accessed February 17, 2022 at https://help.osf.io/hc/en-us/articles/360019739054-How-to-Make-a-Data-Dictionary

ISO. Standards. (February 19, 2021). Last accessed February 17, 2022 at https://www.iso.org/standards.html
