Applying machine learning to human rights documentation – an interview with Natalie

"It was very exciting to use machine learning when people can directly make use of the results."


At HURIDOCS, we are exploring how to use machine learning to help human rights advocates sift through large document collections to find the needles in the haystack. Possible use cases: to find the right legal arguments in the jurisprudence of the Inter-American Court of Human Rights, or to to find traces of discriminatory sentencing in large collections of court decisions.

We had the pleasure of working with intern Natalie Widmann from February 2016 to September. This blog post is a summary of a conversation we recently had with Natalie about her interests and her work with us.

Natalie tell us more about you, who you are, and especially how and why you got into machine learning?

I am 25 years old, I love travelling and all kind of sports, and currently doing a Master’s Degree in Artificial Intelligence.

After high school I was totally fascinated – and I still am – by the extraordinary abilities of our human brain. I wanted to know how it works and why we behave the way we do. Therefore, I started a Bachelor’s Degree in Cognitive Science that consisted of different fields like psychology, neuroscience, linguistics, philosophy and computer science – a lot of computer science. Even though I had a hard time getting a good grip on programming, I somehow got enthusiastic about it and wanted to dive deeper into computational models of the brain and current machine learning approaches. This is why I entered the Artificial Intelligence Programme in the Netherlands. The idea is to explore and understand how the human brain stores informations, solves problems, and develops new skills in order to apply this knowledge to technology.

What led you to HURIDOCS? How did you find us?

Within the Master’s programme we are required to do an internship for a few months. Most of the students either look for a research project at university or go into industry where mainly big companies have the capacity to offer interesting machine learning projects.

As I neither plan to do a university career nor want to work in an artificial intelligence department with the goal of maximizing profit, I started to look for options to apply the things I learned for a worthwhile objective. HURIDOCS – as an NGO that is working at the intersection of human rights and information technology – caught my attention. I asked for more information and after the first call with Tomàs, I was convinced that the internship at HURIDOCS will be an interesting experiment and a great opportunity at the same time.

What exactly is machine learning and why is it useful?

Machine Learning is a fast-developing field of computer science with a focus on the idea that computers can learn specific tasks without being explicitly programmed. One approach, called supervised learning, gives a lot of examples to the computer and by telling it which ones have a certain property and which ones do not, the computer figures out  patterns in all the information it gets in order to determine whether new data also has this property. Humans and animals learn in a similar way – we adopt certain behaviours by combining situations with the feedback we get from our environment.

Can machines really learn like us humans?

Until now, machine learning algorithms are adapted to very specific applications, like object or face recognition, text classification or certain games. But they aren’t able to connect and generalize these skills. For example, we can train a model to play chess even better than the human world champion. But the same model would fail to recognize faces even though the underlying algorithm is the same.

This is totally different from the brain and is one reason why it is so amazing. We have the capacity to learn and develop various skills, integrate information from different senses, and combine knowledge such that more complex behaviour emerges.

Can you share an example?

Imagine you are working in a post office and you want to automate the sorting process of the letters. To do so, you need a machine that can read the postcode. For humans, distinguishing digits is very easy – we already learn it in primary school. But a machine represents an image of a digit as grey-scaled pixels which don’t have a special meaning. So how can we teach a machine to recognize handwritten digits?

One possibility is to define simple rules: If there is only a vertical line, it’s a 1. If the vertical line has a hook at the top left, it is also a 1. But what if our handwriting is not very clear or if we are writing in a hurry? Would our rules be able to distinguish a 1 from a 7, or 6 from 0 and 8? Obviously, as handwriting differs a lot between persons, it is almost impossible to manually define exact rules.

This is where machine learning comes into play. Instead of defining rules, we let the machine itself figure out how the different digits look like. Therefore, we present hundreds and hundreds of handwritten digits to the machine, each of the digits with its correct label. This is the so-called “training phase” of the model. After the training we can ask anybody to write a digit and based on the examples the machines has seen it will predict a certain label – hopefully the correct one.

What was the most exciting thing you worked on during your internship with HURIDOCS?

It was very exciting to use machine learning when people can directly make use of the results. In one specific HURIDOCS project I worked on, we were working with a huge collection of documents (9,000) that represented judicial decisions from many different fields. The organisation we were working with had already put a lot of work into assigning metadata to each document so we already knew which cases are about sexual assault or in which case a family member was involved. This gives us the possibility to train and evaluate a machine learning model in order identify documents with similar characteristics (e.g. all documents about sexual assault). Moreover, we can extract different properties of the documents, like how often a certain word appears or which other documents are similar, to figure out which one of these properties carries more information about whether the document belongs to sexual assault cases. This helps to extract relevant information and to improve the accuracy of the model.

How did you do that exactly? What tools did you use?

First of all, we had to make the semantic information in the documents accessible to a computer that doesn’t understand natural human language. One basic approach is to count the occurrences of words in a document assuming that relevant words appear more often than irrelevant ones.

The screenshot above shows how the number of occurrences of the word “rape” in a document relates to whether the document is about sexual assault or not. We use these findings to better understand which properties of a document are relevant.

But there are also more sophisticated approaches, like topic modeling. Topic modeling looks at all the documents and tries to extract topics defined as a semantically associated group of words. Hence, we can use the percentage that a specific topic is covered in a document as a new way to represent the document. We combined both approaches and then trained a supervised classifier, in this case a decision tree, to predict if a document has a certain property (e.g. a sexual assault case).

The screenshot above shows some results we get from the topic modeling. Note “topic 12” covers the topic around sexual assault.

In general, I program in python and make use of scikit-learn, a powerful machine learning library with many different algorithms implemented and great tutorials to get started. For topic modeling I used gensim and now try to move to bnpy which is more sophisticated.

How do you imagine machine learning being useful to human rights defenders?

We are living in a world full of data – knowing how to use this data is a powerful tool. In almost all technologies and applications we use daily, machine learning techniques can be found. They offer the possibility to structure and map information, but also to customize tools and to gather knowledge about the user.

But there is also enormous possibilities for application in the human rights sector. It starts with making information easily accessible to everyone and building tools such that information can be organized and connected. For example, connecting reports from witnesses with images and videos can help to find evidence for human rights violations or misleading information in the media. Furthermore, it is possible to analyse and predict the spread of diseases in order to adapt political decisions like the closing of schools and public institutions. Another example, is the coordination of help in earthquake regions based on social media messages, satellite images and phone call data.

In this HURIDOCS project, machine learning facilitates the work of human rights defenders by filtering documents. So, if you are interested in a certain topic you don’t have to go through all documents and manually find out if the topic appears. Machine learning automates the process and gives information about what other topics appear and how they are proportioned. Moreover, it can recommend similar documents and extract properties like the age of the victim. This builds the basis for data analysis and allows insights into the relation and co-occurrence of properties. Due to the huge amount of documents and possibly complex interactions of properties, it might not be possible to achieve this without text mining and machine learning approaches.

In general, machine learning is thought to support the work of human rights defenders by letting them explore and structure their information in a new way. Even though we have just started to explore applications of machine learning in human rights work, there are already significant benefits and I’m sure that there is a lot more to achieve.

What was it like to spend your internship in Valencia, working with our developer team: Tomàs, Alberto, Carlos, Joan, and Dani?

In the beginning it was a huge change: working remotely, without fixed working hours, gave me the possibility to adapt my schedule to my current state of mind.

With regular meetings we kept track of the current process, interchanged ideas and planned the next steps. All of us benefitted from the different working experiences and backgrounds in the group. I learned a lot about good coding structure and a clean programming style. But the main reason why I really enjoyed these meeting is because they combined productive work with a lot of fun, typical Spanish food, breaks in the swimming pool, and a great atmosphere. In an environment like this, hard work feels like hanging out with friends, and the results we achieved are even better.

HURIDOCS Development Team

What did you enjoy about your internship?

Even though I was working on a very specialised topic, I experienced a lot of diversity during my internship: I got to know a new field of application, met very interesting people at conferences, worked in an independent and at the same time guided manner, spent a lot of time on coding, gave a talk at the WomensTechmaker conference, learned how to communicate the concepts of machine learning, and was part of an organisation that is constantly reflecting on its impact. Overall, I had a great time at HURIDOCS.

What do you want to do next?

Difficult question! Actually, I am quite happy with the current way of working. The internship showed me that there is a need for machine learning in the social sector and therefore I would like to support human rights organisations and investigative journalism.

So, let’s see…

Can you recommend a couple of articles and a youtube video where I can learn more about machine learning?

Machine learning is getting more and more popular so there is a lot of material out there that gives more insights into the topic. This blog post gives a gentle introduction to machine learning, its fields of applications, and the main concepts and algorithms. A really nice graphically illustrated example of a specific type of machine learning approach can be found here.

And if you already are fired with enthusiasm and you want to jump right into the algorithms and the maths behind it, you will find many free machine learning courses, for example: this one on coursera or this one on udacity.

What do you advise others interested in machine learning to do, if they want to do socially useful work? Where should they focus their efforts?

Actually, it is not that difficult. Social organisations often deal with similar problems and similar types of data as big companies. It is about transferring knowledge to a different field of application and connecting social organisations with machine learning and data science enthusiasts. Therefore, I think Hackathons provide a great possibility to get started. After two or three days of intense work full of great ideas, interesting discussions and a lot pizza, the results of the projects which often relate to social or environmental topics, are usually very impressive. Just have a look when the next hackathon will take place in your city.

Besides this project-oriented work, it is also important to make machine learning tools understandable and accessible to people and organisations outside of the big companies and the academic data science world.

Posted in: