Data Ethics Club meeting 08-11-23, 1pm UK time

Meeting info

Description

You’re welcome to join us for our next Data Ethics Club meeting on Wednesday 8th November at 1pm UK time. You don’t need to register, just pop in.

This time we’re going to watch/read The Dimensions of Data Labor: A Road Map for Researchers, Activists, and Policymakers to Empower Data Producers by Hanlin Li, Nicholas Vincent, Stevie Chancellor, and Brent Hecht, which is a paper from this year’s ACM FAccT conference.

It’s a fairly long one, so if you’re pressed for time we recommend sticking to Section 3, which discusses the dimensions developed by the authors. We’ve also included a brief summary below to make the discussion more accessible.

Paper summary

The aim of the paper is to provide an actionable roadmap for researchers, activists and policymakers to empower data producers by synthesising information into six dimensions of ‘data labor’. The authors defined data labor as:

“Activities that produce digital records useful for capital generation.”

So, it must create (or enhance) data, and it must generate capital. This definition includes both those who knowingly work in data generation (e.g. labelling datasets) and those who produce data, such as social media posts, that is then scraped without their knowledge and used by companies to build tech.

The six dimensions below were developed through discussion between the authors.

1. Legibility

Data labor can be illegible or legible, depending on whether data laborers know their labor is being captured.
That is, how obvious is it that someone is creating data that is going to be used?
If you are being paid to do labelling work on Amazon Mechanical Turk, it is legible.
However, it might be illegible if you are labelling images for reCAPTCHA to access a website, or generating user logs by interacting with a website. The authors argue that illegible data labor is harder to understand and organise against, and so disempowers those who are producing data.

Researchers might create tools that communicate the value of data, like the Facebook Data Valuation Tool, which calculates the worth of users’ attention in real time. For legible data labor, those doing the labor may choose to organise against unwanted practices by striking.

2. End-Use Awareness

Do data laborers understand how their labor is being used?
For example, the end use of providing ratings for a recommender system is obvious, but people writing Wikipedia articles might not realise their work is contributing to training LLMs.

We should aim to increase end-use awareness because “end-use awareness is likely to lead to more power to data labor. If one understands the downstream capital generation implications of the data they produced, it is possible to alter such labor purposefully.” They suggest that researchers and activists might also be interested in how to use end-use awareness to organise collective action (e.g. data strikes or data poisoning) that protests unwanted uses of data.

They also suggest regulatory measures, like users being able to say that their data should not be used for surveillance.

3. Collaboration Requirement

Some data is produced in collaboration, such as Wikipedia articles. Other data, like a completed reCAPTCHA, is produced individually.
Working in collaboration makes it easier to organise, but it also means that people engaging in activism may fear losing their communities.

4. Openness

How accessible is the downstream data to the public?
Editing open source information like OpenStreetMap is very accessible; your personal browser history is not.

Open data is great for sharing knowledge, but many private companies benefit from this free open source data; for example, Wikipedia is used to train LLMs.
The authors suggest that mapping how open source datasets get used in private technologies may help us trace the downstream impacts of data labor. On the other hand, new ideas for more collaborative data governance systems (e.g. participatory data stewardship) can support keeping data private unless it is used for good or approved causes.

5. Replaceability

Is this data labor replaceable?
Labelling images is likely to be, but adding specific data about your neighborhood to OpenStreetMap is much more difficult to replace.

This links to supply and demand: people with rare data to contribute have more power in the system because what they have is desirable and unusual. However, for very replaceable data it would be hard to exert much power against big organisations without a huge general data strike.

6. Livelihood Overlap

Is the data labor part of the creators’ occupational responsibilities?
For example, journalists writing news articles published online and used for model training have a high overlap.

There is a need to allow people to exert power over data created as part of their livelihoods, like artists giving consent for their images to be used in model training.
It’s also important to keep an eye out for how new developments in tech might become jobs; image labelling, for example, is now a type of paid labor.

Discussion points

There will be time to talk about whatever we like, relating to the paper, but here are some specific questions to think about while you’re reading.

  • What examples of data labour do you perform and what are the most relevant dimensions? How does this compare with any data labour that is used in your work?

  • What do you feel are the strongest opportunities for empowering data labourers as defined by the article? Can you think of any important opportunities that were not mentioned?

  • How would you incentivise or enact the changes mentioned in the paper’s roadmap in your field?