"Data is not truth... [It] is necessarily a pragmatic integration."
Jacob Metcalf
Welcome to "Cookies, Coffee, and Data Ethics"! My name is Jeff Kampfe and I am a senior at Santa Clara University, studying Economics and Philosophy. I am also a Hackworth Fellow at the Markkula Center for Applied Ethics. This article is the first in a series of interviews that involve hot coffee, tasty baked goods, and the complex issue of data ethics. The goal of these interviews is to see what individuals with unique and insightful viewpoints can teach us about the field data ethics, where it is heading, and what challenges we may face along the way. Thanks for stopping by!
The following is an edited transcript of a conversation with Jacob Metcalf.
Jacob (Jake) Metcalf, PhD is a consultant and scholar specializing in data and technology ethics. He is passionate about helping people understand how seemingly small decisions about data technologies can have significant and disparate consequences in society. His academic background centers around the applied ethics of science and technology. Jake’s scholarship in data ethics is recognized as influencing this nascent field, particularly around issues of research ethics, policy, and practice in academia and business. He lives amongst the redwoods of the Santa Cruz mountains, where he fulfills his lifelong dream of running a philosophy business from a converted barn.
Can you tell me a bit about your past experience with data science?
My experience has been coming at data science from philosophy. My dissertation and a lot of my grad school work was on environmental ethics and bioethics. That gave me a lot of time to think about how you resolve open-ended problems in science and technology. It was sort of happenstance that I went toward data ethics after grad school. Opportunities came up, it seemed interesting, and it looked like it was going to be a legitimate issue going forward. Computer ethics was definitely a topic ten years ago, but data ethics was not.
It has been obvious for a while that Silicon Valley needed input from ethicists and social scientists. Companies such as Xerox and Microsoft Research have been able to employ a lot of social scientists, so there is a history of people with humanities and social science educations being employed by the tech industry. What there hasn't primarily been is advice about how to reason rigorously through ethical problems. The role of people with humanities backgrounds in business has been oriented more towards product development than towards governance. It's been my opinion for a while that there was going to be a need for more governance and guidance, so I leapt at the opportunity to follow that thread towards research and consulting.
Is it possible to develop a set of practices around what makes a good data scientist? If so, what do those look like? What are the virtues of a good data scientist?
The ability to see context. To have a sociological imagination and to see that there are ripple effects to introducing technologies. Knowing that even seemingly small design decisions can profoundly affect how people live their lives. That's not just a matter of principle; it's a matter of skills. Someone might agree fully with my statement that even small design decisions in algorithmic systems can have significant effects on individuals' lives and on society, but the ability to see that trend and then make a choice is not itself a principle but a practice.
It's multidisciplinary; you have to read broadly and you have to take the time to understand other people's perspectives. These are things that data scientists and computer scientists are doing; however, the way that they are trained in school and what their employers expect of them doesn't allow space for that investigation. I view the goal of data ethicists working inside industry, in consulting, or as researchers as building that capacity: having an ethicist on site to help run through design questions, not someone to tell you what to do or what's right or wrong, but someone to help reason through outcomes.
There are training programs that can be done, but first of all it requires time. Time for additional humanities research is not something that industry typically provides engineers. It's a skill that can be taught if it's allowed to be practiced. Examples of this practice could be working through case studies, or having a lunchtime reading group that reads an article a week and then discusses it.
What are your thoughts on data ownership? What are some of the largest issues that need to be resolved surrounding this topic?
I actually don't think ownership is a particularly useful model for what it is that companies actually do, what it is that consumers should want, and what we deserve to know about our data. Who owns your data is almost inconsequential compared to who processes it. The old privacy model is that privacy controls who knows what. If you want something private, it's your possession, and you can determine where that information goes. That is a very hard model to enforce. It is a very hard model to gain any ground on in the age of data analytics, because every data set is built on the assumption that it can talk to every other data set.
Let's say company A owns a dataset with your data in it. Company B wants to process that data against a data set it holds about you. The risk is that analyzing those data sets together exposes you to a new kind of risk. The fact that one of them owns one kind of data set and the other owns a different kind is immaterial, because they can have a contract that allows those data sets to talk to each other. The whole data economy is about licensing. It's about getting data sets to talk to each other.
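(To make that risk concrete, here is a minimal sketch with entirely hypothetical data and column names. It shows how two data sets that share quasi-identifiers, in this case a zip code and birth year, can be joined so that one reveals something the other never disclosed on its own.)

import pandas as pd

# Company A: a retail data set keyed only by quasi-identifiers (no names).
company_a = pd.DataFrame({
    "zip_code": ["95060", "95060", "95014"],
    "birth_year": [1989, 1975, 1989],
    "purchase": ["prenatal vitamins", "garden tools", "office chair"],
})

# Company B: a membership data set that shares the same quasi-identifiers.
company_b = pd.DataFrame({
    "zip_code": ["95060", "95014"],
    "birth_year": [1989, 1989],
    "name": ["Alice", "Bob"],
})

# Neither data set alone links a name to a purchase; the licensed join does.
linked = company_a.merge(company_b, on=["zip_code", "birth_year"])
print(linked[["name", "purchase"]])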
There are some interesting models that attempt to keep your data in a "personal cloud" that you could then connect to different service providers, who compute on your cloud as opposed to a centralized cloud. New projects, such as those headed by Tim Berners-Lee, have the goal of decentralizing the data economy so that individuals hold their own data. Each person has one unit of social networking data, one unit of health data, etc., that can be transferred between providers because everything functions within universal standards. It works as privacy preservation through decentralization. His project is a recognition that, with the way data is architected and the way corporations are governed, there is basically no model of ownership that is going to favor individuals. So what we need to focus on instead is control over flows.
Is it ethical for organizations (including companies, governments, and nonprofits) to keep data sets indefinitely? If not, how should that issue be addressed?
You should be able to put a clock on your data. When you leave Facebook, you still have the option to delete your account and fully delete all of your data. Almost every company gives you the option to delete your data, but if you don't, the company will hold it, presumably indefinitely. Lots of different entities holding your data indefinitely greatly multiplies your risk of disclosure. Without a right for users to know, there are tremendous security and privacy risks.
Does describing people as data points lose a sense of their humanity and dignity? Or is that simply part of the process of doing data science?
I think we capture the aspect of human dignity by thinking about what’s done with the data. I don’t think there is anything inherently undignified about a data ecosystem. The dignity comes through our decisions about how to treat each other. There is nothing necessarily undignified about a data economy. It can have some very particular weaknesses regarding human dignity, and we can see the ways it goes wrong. It is easier to be a data authoritarian than it is to be a data revolutionary. Data science does have some risks to human dignity, but that is not inherent to the technology itself. We haven’t really had enough time to figure out what the risks are. I think we can see creeping around the edges what a data dystopia looks like, but the solution to that isn’t giving up the data economy. It’s figuring out what our values are and what governance structures are available to us.
Do you have recommendations, whether top-down or bottom-up, for how this gets addressed? Is legislation appropriate, or does the market seem to correct itself regarding what people want in terms of privacy?
Obviously regulation is part of it, a significant part. In order to have something better in terms of regulations, we need to have new kinds of infrastructures. In order to push companies to a new kind of data economy, you must provide different kinds of incentives. It’s not that companies couldn’t make a ton of money under other models; they just won’t switch models unless someone either forces them to or incentivizes them to. Regulation is absolutely critical, but you also need companies who can dream up other models. I think there is absolutely space for creative thinking about how governance meets ethics for tech companies.
How might we as a society balance the benefits that data analytics gives us (in health care, security, efficiency, innovation) against some of the potential threats it poses to privacy and autonomy?
I think we need to conceive of privacy not as something that happens to an individual, but as something that happens to a society. You can see, through things like Cambridge Analytica, how the consequences of privacy permeate across society. Where the privacy violation ties in is that Facebook didn't have adequate policies regarding who could scrape data from friends' networks. They had almost no governance about who could do what on their platform using their API. That allowed actors to use data to anti-democratic ends.
The Cambridge Analytica case further demonstrates the risk of having multiple datasets talk to each other. Cambridge Analytica couldn't have happened if we had any kind of controls whatsoever over voter information databases. Both Cambridge Analytica and the Trump campaign were able to produce these insane numbers of ads and beta test them on huge numbers of people in very small tranches. This is because they were able to take these stolen Facebook psychological profiles and combine them with voter data sets that have almost no rules governing them whatsoever.
Now, is that a privacy violation? You could say every single use of a stolen profile is a privacy violation, but their purpose wasn't to violate the privacy of individuals. It was to manipulate society and to push ideas that voters would typically reject. To what extent that is a privacy violation, and to what extent it is just good old propaganda, the jury is still out. But what I know is that privacy is not something that just happens to individuals anymore.
This question might be a bit more philosophical, but bear with me. Is data only an instrumental good or can it be valuable for its own sake?
(Laughs) I’m actually not sure that I believe in inherent value. Data is not truth. I don’t know how to measure inherent value, but let’s say truth has inherent value. Data is a particular form of information produced entirely by humans for the use of humans. Inherent value, to me, means something that is transcendental. Freedom, knowledge, truth, things like that have both inherent value and pragmatic value. There are a lot of things you can’t do if you don’t have truth. I think data is necessarily a pragmatic integration.