Language is a beautiful means of communication that gives us infinite ways to express ourselves. However, understanding language requires a complex combination of grammatical, contextual, and cultural knowledge. Once this skill is developed, we naturally comprehend the meaning of spoken and written language. This results in the ability to quickly scan through a document, extract relevant information, and summarise its content.
As we are able to access more and more data, quick insights into a document’s content become essential. In the field of human rights, this takes on even greater significance because of the challenge of turning qualitative data into quantitative outputs. With limited resources, knowledge from different sources has to be connected to tell the story of victims, reveal underlying patterns of discrimination, and hold governments accountable. For these purposes, automated text analysis offers valuable opportunities. At HURIDOCS we make the work with document collections as efficient and insightful as possible by using machine learning to support the objectives of our users.
To illustrate this idea, we take a closer look at our work with ICAAD, the International Center for Advocates Against Discrimination. ICAAD published an in-depth analysis of judicial sentences in sexual and gender-based violence (SGBV) cases in the Pacific Islands. The analysis assesses over 30 variables for each individual case, from the age of the victim to the final sentence received by the perpetrator.
However, the main focus is on whether judges base their sentencing decisions on gender discrimination, such as stereotypes, rape myths, or cultural reconciliation practices. ICAAD’s work is extremely valuable because it uncovers to what extent court decisions are influenced by gender biases and how access to equal protection and justice is denied to women.
Besides raising awareness, training judges, and empowering affected women, ICAAD is extending this analysis of judicial sentences to other countries in the Pacific Islands and making it accessible to a greater number of stakeholders, including women’s rights organisations, prosecutors, judges, and law students. To accomplish this, we combine Uwazi, our new open-source tool for working collaboratively on collections of documents, with different machine learning approaches.
How we identify relevant cases using machine learning
First, the SGBV cases have to be identified among all published case law in the Pacific Islands. In machine learning jargon, this task is called classification: each document is assigned either the label SGBV or the label not SGBV.
To achieve this, we train an algorithm by showing it many cases with their corresponding labels (8,000 cases that ICAAD had prepared and analysed for a previous report). Based on statistical correlations, the algorithm detects words and patterns that appear often in SGBV cases but rarely in other documents, and that therefore provide evidence for how to classify a new document. For example, the word violence appears in many different contexts and therefore does not carry enough information to decide whether a document is an SGBV case or not. However, combined patterns like sexual violence or domestic violence are strong indicators of SGBV cases. The more documents we use for training, the better the algorithm becomes at detecting such patterns, leading to more accurate identification of SGBV cases.
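To make the idea of learning from labelled examples more concrete, here is a minimal sketch using scikit-learn. It is an illustration only, not our production pipeline: the choice of TF-IDF features with logistic regression is an assumption, the eight training sentences are invented toy data (not ICAAD cases), and the helper name sgbv_probability is hypothetical. Setting ngram_range=(1, 2) lets the model weigh word pairs such as "domestic violence" separately from the single word "violence".

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy examples for illustration only (not real case law).
train_texts = [
    "The accused is charged with sexual violence against the complainant.",
    "The defendant pleaded guilty to domestic violence against his wife.",
    "The court heard evidence of rape and sexual assault.",
    "The victim suffered domestic violence over several years.",
    "The protest ended in violence and damage to property.",
    "The appellant disputes the land boundary set by the registrar.",
    "The company failed to pay the contract price for the goods.",
    "The robbery involved violence against a shop owner.",
]
train_labels = ["SGBV"] * 4 + ["not SGBV"] * 4

# Unigrams and bigrams, so that "domestic violence" becomes its own feature.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(),
)
model.fit(train_texts, train_labels)


def sgbv_probability(texts):
    """Return the estimated probability of the label SGBV for each text."""
    sgbv_column = list(model.classes_).index("SGBV")
    return [float(row[sgbv_column]) for row in model.predict_proba(texts)]


new_cases = [
    "He was convicted of domestic violence against his partner.",
    "The tenant disputes the contract price for the land.",
]
for text, probability in zip(new_cases, sgbv_probability(new_cases)):
    print(f"{probability:.2f}  {text}")
```

With so few examples the probabilities themselves mean little; the point is only that the first sentence scores higher than the second. A real classifier needs thousands of labelled cases and a held-out test set to measure its accuracy.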
How we use machine learning to extract information from the text
After the relevant cases are identified and uploaded to Uwazi, additional text properties can be added. For example, ICAAD adds properties such as the court, the year of a case, and information about the victim such as their age, gender, and relation to the accused.
To automatically extract this kind of information from the documents, the text is analysed so that structured information about its meaning is obtained. This includes searching for typical word sequences that express a certain property. For example, the age of someone often appears in patterns such as is X years old, at the age of X but also is younger than X, is of legal age, between X and X years, etc. As you can see, even this one property can be expressed in numerous ways, which makes it very difficult to capture with manually written patterns alone. Therefore, this rule-based approach is often combined with the previously explained learning from examples.
These patterns, however, are not sufficient to identify who is X years old. Is it the victim, the accused, or does the age appear in a completely different context? To solve this, the sentence structure is analysed and split into components such as subject, verb, object, etc., so that the person to whom a certain text property relates can be identified.
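The rule-based part of this step can be sketched with regular expressions from the Python standard library. This is a simplified illustration under stated assumptions: the pattern list covers only the phrasings mentioned above, and the names AGE_PATTERNS and find_age_mentions, as well as the labels exact, upper_bound, range, and legal_age, are hypothetical and not part of Uwazi.

```python
import re

# Each entry pairs a label for the kind of age statement with a pattern.
AGE_PATTERNS = [
    ("exact", re.compile(r"\b(?:is|was|aged)\s+(\d{1,3})\s+years?\s+old\b", re.I)),
    ("exact", re.compile(r"\bat\s+the\s+age\s+of\s+(\d{1,3})\b", re.I)),
    ("upper_bound", re.compile(r"\byounger\s+than\s+(\d{1,3})\b", re.I)),
    ("range", re.compile(r"\bbetween\s+(\d{1,3})\s+and\s+(\d{1,3})\s+years\b", re.I)),
    ("legal_age", re.compile(r"\bof\s+legal\s+age\b", re.I)),
]


def find_age_mentions(text):
    """Return (kind, numbers, matched_text) tuples in order of appearance."""
    mentions = []
    for kind, pattern in AGE_PATTERNS:
        for match in pattern.finditer(text):
            numbers = tuple(int(group) for group in match.groups())
            mentions.append((match.start(), kind, numbers, match.group(0)))
    mentions.sort()  # order by position in the text
    return [(kind, numbers, matched) for _, kind, numbers, matched in mentions]


sample = (
    "The victim was 14 years old. The accused, who is of legal age, "
    "committed the offence at the age of 32."
)
for mention in find_age_mentions(sample):
    print(mention)
```

Every new phrasing needs a new rule, and a phrase that no rule anticipates is silently missed, which is exactly why hand-written patterns alone do not scale.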
How we use machine learning to explore the document collection
Besides these targeted approaches of identifying relevant cases and extracting information from text, machine learning also makes it possible to gain new insights. Topic modeling, for example, extracts the different topics that are covered in a collection of documents. It shows which documents are related to a certain topic and which documents are similar based on the topics they cover. Using this approach, users can explore their collection of documents from a different perspective and identify interesting patterns that have not been considered so far. By integrating these approaches into the Uwazi platform, we hope to empower human rights advocates by making their search for information as efficient, effective, and beneficial as possible.
These algorithms, however, are far from perfect. One limitation we’ve found is the knowledge gap between the data analysts who build such models and the experts who know their data and want to exploit its full potential. We envision Uwazi as an interface that brings both kinds of expertise together: users profit from automated information extraction, and at the same time they can adapt the algorithms by correcting the results. This completes the feedback loop and improves future performance.
In the long run, organisations that work with large collections of documents will spend less time and money on searching for and extracting relevant information, and can instead focus on their actual mission.
Join us in the demo room at RightsCon to learn more about this initiative!
The successful use of machine learning algorithms in the field of human rights is only possible due to their implementation in open-source toolboxes like scikit-learn (http://scikit-learn.org/stable/).
Thanks to Friedhelm Weinberg, Hansdeep Singh, Hyeong-sik Yoo, Tomàs Andreu, and Kristin Antin for valuable input and feedback!