The Ethical Data Scientist

[Editor's note:

We all like to rely on the certainty of numbers. A formula treats all input the same way, and we place our trust in its output. But there’s a problem with that. If the formula is based on flawed, biased, or merely unexamined assumptions, then the output will not be objective information. And the further we are from the assumptions employed or the data set used, the poorer we are at judging the reliability of the information derived.

“There are three kinds of lies: lies, damned lies, and statistics.” Mark Twain attributed this to Benjamin Disraeli.

This is an excellent piece that asks data scientists to examine the assumptions they make when working with data.]



People place too much trust in numbers as if they were intrinsically objective.

In the waning months of the Bloomberg administration, I worked for a time in a New York City Hall data group within the Health and Human Services division. One day, we were given a huge dataset on homeless families, which included various characteristics such as the number and age of children and parents, previous ZIP code, the number and lengths of previous stays in homeless services, and race. The data went back 30 years.

The goal of the project was to pair homeless families with the most appropriate services, and the first step was to build an algorithm that would predict how long a family would stay in the system given the characteristics we knew when they entered. So one of the first questions we asked was, which characteristics should we use?

Specifically, what about race? If we found that including race helped the algorithm become more accurate, we might be tempted to use it. But given that the output might decide what kind of homeless services to pair with a family, we knew we had to be extremely careful. The very fact that we were using historical data meant that we were “training our model” on data that was surely biased, given the history of racism. And since an algorithm cannot see the difference between patterns that are based on injustice and patterns that are based on traffic, choosing race as a characteristic in our model would have been unethical. Looking at old data, we might have seen that families with a black head of household were less likely to get a job, and that might have ended up meaning less job counseling for current black homeless families. In the end, we didn’t include race.
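The decision described above, deliberately excluding a protected attribute from the model's inputs, can be sketched in code. This is a minimal, hypothetical illustration: the field names (`num_children`, `prior_stays`, `race`, and so on) are invented for the example and are not from the actual City Hall dataset, and real pipelines would need far more care (including checking for proxy variables that encode race indirectly).

```python
# Hypothetical sketch: strip protected attributes from an intake record
# before it is used as model input. Field names are illustrative only.

SENSITIVE_FIELDS = {"race"}

def select_features(record, sensitive=SENSITIVE_FIELDS):
    """Return a copy of an intake record with sensitive fields removed,
    so the length-of-stay model never sees them as predictors."""
    return {k: v for k, v in record.items() if k not in sensitive}

# An example intake record for one family (invented values).
family = {
    "num_children": 2,
    "parent_age": 34,
    "prior_stays": 1,
    "prior_stay_length_days": 120,
    "race": "black",
}

features = select_features(family)
# `features` now holds every field except "race"; only these
# remaining fields would be fed to the predictive model.
```

Note that dropping the column is only a first step: other features (a previous ZIP code, for instance) can act as proxies for race, so an ethical review has to look at what the remaining inputs encode, not just at what was removed.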
