"[T]echnology is neither radical nor revolutionary unless it benefits every single person."
Welcome to "Cookies, Coffee, and Data Ethics"! My name is Jeff Kampfe and I am a senior at Santa Clara University, studying Economics and Philosophy. I am also a Hackworth Fellow at the Markkula Center for Applied Ethics. This article is the fourth in a series of interviews that involve hot coffee, tasty baked goods, and the complex issue of data ethics. The goal of these interviews is to see what individuals with unique and insightful viewpoints can teach us about the field of data ethics, where it is heading, and what challenges we may face along the way. Thanks for stopping by!
The following is an edited transcript of a conversation with DJ Patil.
DJ Patil was appointed by President Obama to serve as the first U.S. Chief Data Scientist; his efforts in that position led to the establishment of nearly 40 Chief Data Officer roles across the Federal government. He also established new health care programs, including the Precision Medicine Initiative and the Cancer Moonshot, and new criminal justice reforms, including the Data-Driven Justice and Police Data Initiatives that cover more than 94 million Americans.
In industry, he led the product teams at RelateIQ (which was acquired by Salesforce); he was also a founding board member for the Crisis Text Line, which uses new technologies to provide on-demand mental health and crisis support. He has also served as Chief Scientist, Chief Security Officer, and Head of Analytics and Data Product Teams at the LinkedIn Corporation—where he co-coined the term Data Scientist. Prior to that, he held a variety of roles at Skype, PayPal, and eBay.
Can you tell me a bit about your past experience with data science?
I think the synopsis of what has happened with this unique way of approaching data has been the incredible rise of computing power, cloud computing, and advanced algorithms. You can do so much more computing in a very small amount of physical space. You have a lot of processors, storage, and memory on the devices themselves, and then you also have the technology and algorithms that have made a disproportionate impact, from deep learning to the cloud infrastructure that allows you to try things out very quickly and look for patterns.
So for me, a lot of this has come down to playing with data to understand things. I was a mathematician, but I was also interested in understanding patterns in the weather. How can you do a better job of improving weather forecasts? I was practicing with the weather what you might call data science today, but back then it was more like “Well, I’m just going to try to find patterns and insights based on what I can get my hands on.” I was able to find datasets of atmospheric data from the National Weather Service, so I downloaded them, analyzed them, and was able to help make improvements in weather forecasting. Since then it has been off to the races.
How did the experience of being the first Chief Data Scientist of the United States impact what you are pursuing now?
Well, the biggest impact was realizing that technology is neither radical nor revolutionary unless it benefits every single person. There are so many places where you are seeing technology and data do really incredible things, but actually not benefiting everyone. You talk about the mobile phone being amazing with regards to connectivity, and while that may be true, it is not benefiting everybody. A child who lives near a jail will not get the same amount of investment that children in other areas might receive. That kid never chose to be born near a jail, but they are disproportionately impacted by lack of resources and investment.
How did that realization play into how you see policy surrounding data science?
The first mistake that we often make is that we don’t have the people who are going to be impacted directly at the table as we are building new technologies. Often times we have personas for what a person might say or feel, but often times those personas look very much like [the pictures in] photo frames of people when you go to a store. They are all happy, nice, and joyful. But where is the persona of a domestic violence victim? Or the persona of a veteran? Or the persona of a drug addict? Where are those personas? What are we doing to empower the LGBT community member? We don’t say these people’s names and we treat them as data points. When we do say people’s names, they are very Anglo names. There is no Juan, or Rashika, or anybody else. We skew our mentality in ways that feel very minor, but are deeply, deeply insensitive. We have to be very careful because when we are building something, it is easy to say “Eh, that’s just an edge case.” But the edge cases have names. The edge cases have families. We cannot lose sight of that.
Can you say more about how describing people as data points can impact their dignity?
The first point is that if you have that person represented at the table, you have to be able to look them in the eye and tell them what you’re about to do. Often times, they won’t let you get away with those things. Or they will ask you a very elementary question.
We usually think of data abuse as the perversion of the way in which data is being used, but it is also the perversion of the way in which data is being locked up. If you talk to a family member of a child who has a chronic condition that is genetic and rare in nature, they might want their data opened up in the hope of a cure. A lot of times we project our ethos on there, but we don’t necessarily think of it in other ways. How do we make sure that the data is safe, but also supports this example of someone who might have terminal cancer and needs their data out there so that they can have a shot at a cure?
How might we as a society balance the benefits that data analytics gives us (in health care, security, efficiency, innovation) against some of the potential threats it poses to privacy and autonomy?
There are two critical things right off the bat. One is that every college curriculum, massive online course, or any place where education is happening needs to teach ethics as well as security as part of the key curriculum. These can’t be electives—they have to be part of the core curriculum. From that, I think we will start to see people solve and approach problems differently. I’ll give you an example. We ran a mini AI class for some 6th graders this past summer. They were trying to figure out what to build and they decided on a face-tracking device that would fire a Nerf gun whenever their sister or brother walked into their room. It was a cute idea, but then we suggested: what if it was an autonomous weapon instead? That was too much for them to understand, so we pivoted to the classic trolley problem exercise. We played devil’s advocate all day long and eventually we got, within the exercise, to “what if it is one of your parents on the track?” Would you still pull the lever to save the five lives if it meant the train runs over your Mom? Suddenly, you could see they understood. It was no longer just a data comparison of 1 vs. 5; it was tangible. Now when you ask again what they want to build as their project, the autonomous gun seems less appealing. So overall that is what is in bucket #1: ethics and security. You have to have security along with ethics because things can always be abused or hijacked.
The second idea comes from the book that Hilary Mason, Mike Loukides, and I wrote. Any technology team needs to be able to ask “How could this go wrong?” and “What are the consequences?” I think a lot of the time people are looking for solutions around ideas such as “How can we make sure this algorithm is fair?” I don’t want to take away from any of those efforts and I think we need to invest in them, but we shouldn’t just put a stop to using technology because we don’t understand certain related aspects. We didn’t understand the implications of things like antibiotics when they were first deployed. There was a crisis ahead of us and we created these technologies because there were real problems to solve. Sometimes there aren’t ways to find solutions without technology.
When we think about community policing and having the ability to look through datasets to find where and under what conditions officers might use excessive force, that is an instance where we want the technology to lead the way forward. There is too much on the line. There are too many cancer patients with their lives at risk to tell them “Sorry, we can’t use data collection to help with your condition.” We have to be more aggressive at figuring out what helping people collectively looks like.
What do you think are some of the virtues or values that make a good data scientist? What do those look like in practice?
I actually believe our system of the 5 C’s is applicable here, so let’s talk about where those 5 C’s come from [The five C’s are what we ought to keep in mind when building data products: They are consent, clarity, consistency, control (and transparency), and consequences (and harm).] So to develop the 5 C’s, we basically did an experiment. Everyone talks about having a code of ethics or a type of oath that people might take, and while those are great, they aren’t that effective in actually policing an organization. They are necessary, but not sufficient. So we asked ourselves “Where have been the greatest atrocities, involving data gone wrong, that have resulted in systematic changes?” The answer to that is the atrocities committed by the Nazis. We then began to look at the Nuremberg Trials, the Nuremberg Code, and the resulting notion of consent in bioethics. Even after this conception of consent was rolled out, we still got the prisoner experiment, the Tuskegee experiment; afterward, the rules of consent really became codified. So as that was taking off, we looked at what biomedical practices “broke,” in a sense.
We then asked what would have happened if you re-did the Nuremberg Trials and the Nuremberg Code in a world where the Nazis had access to the technology we have today. Then you realize consent is good, but not enough. If the Nazis had big data and large databases, they would begin to gain richer insights. I think that everyone who is practicing within the data privacy space ought to do that thought exercise as homework. Let’s also remember that the Nazis did use technology to find people (with the help of IBM). In this day and age we have much more sophisticated technology to track and monitor people. There are conversations and questions going around about Salesforce and U.S. Immigration and Customs Enforcement (ICE) regarding the separation of families. You have questions about open source technologies, questions about artificial intelligence and its military uses. We should ask ourselves, “What would this technology look like in the hands of Nazis or other groups that might weaponize this technology for oppression?” Questions like these require us to think harder about the topics we are talking about.
What are your thoughts on data ownership?
I think ownership and rights surrounding data are still quickly evolving. One of the things I am most concerned about is that we are having a conversation about today versus tomorrow. This landscape is going to look so dramatically different, simply because of the rate at which things are changing. Take genomic medicine, for example. Say two spouses give up their DNA. If they are parents to a child, effectively that child’s DNA is given up as well. What does that mean for the child? I don’t think we fully understand yet.
On the other side, we also have the question of transparency about how people are using our data. That is on the side of Facebook or other advertisers, or even hospitals selling our data. We don’t know what they are doing. We don’t have the ability to see or correct the information that is being used. For example, what if it’s an African-American family: Is their data being used in a way that disproportionately disadvantages them as opposed to somebody else? How do we have transparency within this issue?
I do think we need some type of group that is constantly looking at and evaluating these things. Just like we have the Consumer Financial Protection Bureau to look at finances, I think there should be a version of a consumer data protection group whose job is to look into things like this.
The other conversation that is still developing is with physicians, hospitals, and other portions of the medical industry, which believe that medical records are theirs and do not belong to the patient. They believe they can do what they want with that data, but then the data ends up impacting the patients. This data is also prone to security breaches, so they are not treating it as if it were their own data. I wonder if they would be willing to put their own personal data within their own data set.
Is it ethical for organizations (including companies, governments, and nonprofits) to keep data sets indefinitely? If not, how should that issue be addressed?
That’s a great question and I don’t think I know the answer. It depends on the situation. For example, there are certain datasets that we want to live a long time. Usually those are very science-based, such as data collected about the climate. Those are historical records that give us great benefits. This might also be the case with healthcare data. Perhaps over a long arc of observations, someone might identify an interesting trend that will be a big benefit to the people who come after us.
There is the other side of what happens when data has been collected, and the company that collected it goes bankrupt, and another company buys it only for the data assets. Is that data still being used in ways consistent with the previous terms of service agreement? These questions get into the interesting ethical space about what is acceptable. What should we expect about our data? Right now, I think the only expectation that we all have is that our data is going to be abused, leaked, or stolen. We feel really powerless to do anything about it or to know who is doing anything about it. That’s our expectation right now and that’s perverse. That is just a short step away from that data being weaponized against us.
Who would you say should make decisions about what types of data are considered more or less sensitive?
The biggest challenge is that we have only lawyers who are looking at the problem. Or in other situations we have only technologists who are looking at the problem. In none of the situations do we have people who are actually impacted by the problem. GDPR is a really interesting example, because GDPR was done without any technologists really at the table. As a result, it’s really confusing. Some things are great and are done with good intentions, but they have poor implementation. That is a growing realization: if you don’t have technologists at the table, you won’t get it right. The same is true if you don’t have, at the table, the people who are impacted. What we have to do is make the table bigger and have the discussions at the table be more intellectually honest. And we have to ask the question not just about what happens for today but what happens for tomorrow. One specific example of this is the Genetic Information Nondiscrimination Act. It has not been tested in court anytime recently, but it is already kind of outdated. At the time it was implemented there was genetic testing, but it was not anticipating a world in which there was such a high number of tests that yielded such specific results.
What do you think is the obligation of consumers to attempt to change some of these practices?
I think it’s tough to put the onus on the idea that a user should know what is happening, when we can’t see all of the things that are being tracked. The ability to marry large data sets together, for example, allows companies to do things we are not aware of. We are very focused on talking about the internet companies, and I don’t want to give them a pass, but we also have to think about all the data brokers out there. (There is also the question of whether the internet companies are actually themselves data brokers.) There’s a great report from the FTC (that Congress has not acted on) that offers recommendations about looking into data brokers. Think of examples like the large data breaches from companies such as Equifax, who reached out to their “customers” who, it turns out, had not actually realized that they were part of Equifax’s service. We can have this conversation about the government’s data collection as well: the Patriot Act, things we’ve seen in disclosures from Snowden, and other similar issues have not gotten as much congressional oversight as they warrant.
For prior articles in this series, see "On Data Ethics: An Interview with Jacob Metcalf," "On Data Ethics: An Interview with Shannon Vallor," and "On Data Ethics: An Interview with Michael Kevane."