Dimensions of Data Labor

What’s this?

This is a summary of Wednesday 8th November’s Data Ethics Club discussion, where we spoke and wrote about the paper The Dimensions of Data Labor: A Road Map for Researchers, Activists, and Policymakers to Empower Data Producers by Hanlin Li, Nicholas Vincent, Stevie Chancellor, and Brent Hecht. The summary was written by Jessica Woodgate, who tried to synthesise everyone’s contributions to this document and the discussion. “We” = “someone at Data Ethics Club”. Huw Day, Nina Di Cara and Natalie Thurlby helped with the final edit.

What examples of data labour do you perform and what are the most relevant dimensions? How does this compare with any data labour that is used in your work?

The paper aims to provide actionable guidelines for addressing the challenges associated with data and power inequalities, arguing that treating data as an outcome of social labour enables power to be distributed more broadly. Data labour is defined as “activities that produce digital records useful for capital generation”, covering both compensated labour (e.g. Mechanical Turk) and uncompensated data production (e.g. generating behaviour logs). We liked that the intention of the paper seemed to be “this is happening; what can we do about it?”.

To form a roadmap for empowering data labourers, the paper synthesises the relevant literature into six key dimensions:

  1. legibility (do people know they’re doing data labour)

  2. end-user awareness (do people know their labour creates capital)

  3. collaboration (do you have to work with others to do the labour)

  4. openness (who can use the results of the data)

  5. replaceability (can anyone do the labour)

  6. livelihood overlap (is the labour part of occupational activities).

We found these dimensions interesting; however, there does seem to be some crossover between them, which makes them difficult to distinguish clearly. We wondered whether these were the best dimensions to pick, or the best ways of defining them. For example, we didn’t like the name “legibility” and preferred the label “visibility” for that category. Legibility is defined in the Cambridge Dictionary as “the fact of being easy to read, or the degree to which something is easy to read”. Visibility may more accurately capture the intended meaning of this dimension: “do data labourers know their labour is being captured?”. Sometimes systems obscure data collection (intentionally or otherwise), and it can be hard to track how legible something is.
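
To make the taxonomy concrete, here is a minimal sketch (ours, not the paper’s) of an activity scored along the six dimensions; the field names paraphrase the paper’s labels and the scores are purely illustrative judgements.

```python
from dataclasses import dataclass

@dataclass
class DataLabourProfile:
    """Rough 0 (low) to 1 (high) judgements along the paper's six dimensions."""
    legibility: float          # do people know they're doing data labour?
    end_user_awareness: float  # do they know their labour creates capital?
    collaboration: float       # do they have to work with others?
    openness: float            # how widely can the resulting data be used?
    replaceability: float      # could anyone else do the labour?
    livelihood_overlap: float  # is it part of their occupational activities?

# Two contrasting examples from the discussion (scores are illustrative):
mechanical_turk = DataLabourProfile(0.9, 0.8, 0.1, 0.2, 0.9, 0.9)
behaviour_logs = DataLabourProfile(0.1, 0.1, 0.0, 0.1, 1.0, 0.0)
```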

Many of us participate in various forms of data collection at work; we found it surprising that the paper brought up Reddit a few times, as Reddit (and similar activities) seems quite different to the sorts of data collection we do. Some of us had taken part in crowdsourcing research for ecology, tagging videos of animals. When it wasn’t part of our own research it felt interesting and exciting, but if it became part of our job we might find it boring. As PhD students, some of us had experience scraping databases, looking for associations, and trying to build our own datasets; we found this to be an enormous amount of very boring work. We also talked about cleaning internal data, solving CAPTCHAs, and giving Uber ratings.

On the industry side, some of us had experience working for companies that collect data themselves or use third parties to collect data from social media. How this data is used isn’t very transparent, and platforms generate a huge amount of money from it: companies measure dollar revenue from post engagement. For the most part this money does not reach the people generating the content; people even have to pay not to see adverts.

The mechanics of how to redistribute revenue were not discussed in the paper. Examples of revenue redistribution we could think of include the huge industry YouTube has created through revenue sharing. We wondered if revenue sharing could happen on Reddit, for example by paying people in proportion to their karma. However, we thought that bots would still be an issue. Redistributing revenue seems especially difficult for more downstream data labourers, where data production is fairly passive.
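
As a back-of-envelope sketch of the karma idea (ours, not a mechanism from the paper), splitting a revenue pool in proportion to karma is a one-liner, and it also makes the bot problem visible: farmed karma converts directly into payout.

```python
def share_revenue(pool: float, karma: dict[str, int]) -> dict[str, float]:
    """Split a revenue pool between users in proportion to their karma."""
    total = sum(karma.values())
    return {user: pool * k / total for user, k in karma.items()}

# Illustrative numbers only; a karma-farming bot captures most of the pool.
print(share_revenue(1000.0, {"alice": 1200, "bob": 300, "karma_bot": 8500}))
# {'alice': 120.0, 'bob': 30.0, 'karma_bot': 850.0}
```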

This brought us to the question of what really counts as data labour. While we thought the concept is generally a good idea, we had some issues with how it’s described. Many of the processes for generating data are very diffuse, and this is where the data labour paradigm breaks down. We thought that companies offering to pay people for small interactions (such as comments on social media) would just generate a flood of useless comments. It is better to treat this kind of diffuse data as a natural resource; the work comes in when someone comes along and says they will mine it. This way, we can think about data labour along the conventional lines by which we separate work from resources.

One way of understanding the separation between work and resources is by reference to social contract theory, which argues that by participating in society and benefiting from its provisions you enter into a social contract. For example, if you walk down a road built by a government you enter society and have some obligation to it. From a data collection standpoint, an example could be that you don’t have to walk down streets with AI facial recognition surveillance in London, but if you do, you will be tracked. Viewed from a social contract perspective, you do not have to participate in diffuse methods of data collection (e.g. commenting on social media), but if you do, you should accept that your participation will be treated as a resource. However, questions arise about how much autonomy we really have over these “choices”. Refusing to participate in diffuse data collection might be incredibly difficult in some domains, for example if we have to take an extremely long detour to avoid streets with facial recognition.

The paper defines data labour as data production which is useful for generating capital. This suggests that the step from generation to labour is dependent on what happens to the data after it is produced. The paper mentions a plugin tool which calculates how much you earn while scrolling Facebook. We wondered if similar tools could be used to put a valuation on data production over the course of a human life.
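
We don’t know how that plugin arrives at its figure, but a crude lifetime version of the same estimate is just arithmetic. Every constant below is an assumption we have made up for illustration, not a number from the paper.

```python
# Back-of-envelope valuation of a lifetime of data production.
annual_revenue_per_user = 40.0  # assumed advertising revenue per user, USD/year
platforms_used = 5              # assumed number of platforms a person uses
active_years = 60               # assumed years of online activity

lifetime_value = annual_revenue_per_user * platforms_used * active_years
print(f"~${lifetime_value:,.0f} of revenue attributable to one person's data")
# ~$12,000 of revenue attributable to one person's data
```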

We argued that a lot of the data-related activities mentioned in the paper aren’t labour. Scrolling through Twitter isn’t, and will never be, labour. Leaving a Yelp review probably isn’t labour. Clicking through reCAPTCHA isn’t labour. Defining everything which isn’t capital as labour is overly simplistic. It feels disrespectful to those who really do undertake data labour, and who should be unionised and protected. It also feels disrespectful to those who have had their work stolen and appropriated, such as artists.

Defining more passive activities, like commenting on Reddit, as labour did not seem accurate to us. However, the paper is making the point that your approach to these activities changes if you conceptualise them as labour. We wondered what would change in people’s behaviour and attitudes if every blog post and Google search were defined as a piece of labour. We thought there might be better labels than “labourers” for this, such as “data producers”. The definition would also be improved by more descriptive categories denoting where different activities fall. We should assume that everything we put online might be used for other purposes; for example, platforms such as Twitter are being used in datasets for AI training. Some forms of data generation might be used unethically in certain capacities but still be worth doing. For example, it is hugely valuable to create Wikipedia articles, even though we know that LLMs are being trained on them.

Data extraction seems so embedded in online participation that the trade will always be unfair. The amount of storage, analysis, and power that big tech companies have makes it very hard to change the system. Companies don’t want to inform data labourers of their rights; it is not in their interests to put effort into educating workers when they could just hire cheaper labour elsewhere. There are so many useful things which aren’t shared even in common pipelines, such as how no one doing a degree or PhD is taught about pensions.

What do you feel are the strongest opportunities for empowering data labourers as defined by the article? Can you think of any important opportunities that were not mentioned?

One of the issues with the type of data labour the article defines is that it is a very easy space to atomise. The labour is diffused across many small tasks, making it easy to exploit. We saw parallels with the Ford production line, which divided work into many smaller subtasks: one worker would do one part of the process, making it easy for them to slot in and out of the workflow, and easy for them to be replaced. This transformation of labour was done partly for efficiency and partly for control of the workforce. We thought it was remarkable how easily similar segmentations of labour can occur in data. This is another example of society being broken up into pieces, diminishing collaborative aspects of life (sad capitalism klaxon). We thought that an important opportunity here would be to push back against the atomisation of data labour, promoting collaborative work between people who can collectively do a wide range of tasks.

One of the reasons data labour is so easy to atomise is that data resources are so easy to access, and we have so few rules about them. We thought that the paper does a good job of identifying desirable end results for redistributing power in data labour, but not the mechanisms necessary to get there. An immediately important step is to improve legal protection for workers in this field; Mechanical Turk workers, for example, have terrible working conditions. Up until now, there has been a bit of a gold rush around big data. To change how this work is done, there needs to be a complete revolution, and this would have to be a forced change, not a choice. The best way to force a paradigm shift is to make people go through a process to gain access to data. This would necessitate the labour of someone who is qualified and trained to do the task.

However, the internet is geographically distributed, and data labourers emerge from many different pipelines with different levels of skill. This makes it challenging to create universal standards and regulations, and for data labourers to unite as a collective. To think about how we might bring about these changes, we should recognise them as extensions of current law. For example, US legislation makes it easy to sue for copyright infringement; we should push to get this recognised with regard to data labour. There has been some progress in regulation since the introduction of GDPR, which has especially impacted the social media insights industry. TikTok doesn’t have a public API; LinkedIn does, but its data is aggregated to preserve anonymity. However, it is difficult to keep regulation up to date when the industry changes so rapidly.

The future of data production might look very different to how it does today. We are interested to see whether large AI models continue to learn, or end up in huge feedback loops as they learn from their own (often incorrect) generated content. There are valid use cases for LLMs, but we think there is also a lot of unnecessary hype. It remains to be seen whether big data becomes the foundation of a stable economy. A certain amount of the hype around big data and LLMs might stick around, but we argued that the situation as it stands won’t last. Data about people may continue to be accrued, but there might be a saturation point beyond which more data doesn’t let us do anything more useful. We wondered whether, currently, there is a selectivity to the data LLMs learn from (e.g. articles with a certain number of downloads). This might be a reasonable expectation of the direction large AI models will take: improving the datasets selected for learning, rather than improving the model architecture itself.

A problem with the datasets used to train current LLMs is that a lot of artists’ work has been appropriated. Art is another important opportunity for reassessing data labour: if we think of artists as data labourers, we may reassess the value of what they produce. If profit is generated by including art in datasets used to train AI, this should be an opportunity to give more value to artists. Rather than completely dismantling capitalism and recognising every fragment of labour, we should adjust our perspective on what real labour is, provide universal basic income, and give more recognition to the greater feats, like art. However, aligning the cultural value of art with financial compensation is hugely challenging.

How would you incentivise or enact the changes mentioned in the paper’s roadmap in your field?

One way to restructure financial compensation for data labour would be to make it possible for people to dictate that their labour should not be included. This would dramatically reduce the size of datasets and make it in companies’ interests to incentivise people to include their labour (e.g. artwork). Adobe are promising that artists will be fairly compensated by their new generative AI tool: a large proportion of Adobe’s customers are artists and designers whose jobs might be on the line because of generative AI, so to retain their customer base they have to provide some method of compensation. For other companies, however, there seems to be little incentive to compensate data labour; currently, you could easily run a business model which steals art and sells it with little consequence.
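
One small existing mechanism for this kind of opt-out (not discussed in the paper) is the crawler directive: OpenAI, for example, documents a GPTBot user agent that respects robots.txt, so a site can ask for its content to be left out of training crawls. It only binds crawlers that choose to comply.

```
# robots.txt at the site root: asks OpenAI's GPTBot not to crawl this site.
User-agent: GPTBot
Disallow: /
```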

Reanalysing financial compensation for labour should be accompanied by more open conversation about salary levels. Many of us find it very taboo to talk about how much you earn; however, being more transparent about our salaries might be a step towards improved fairness. Companies don’t just take advantage of this taboo; they seem to actively manufacture it. We should encourage talking about this more in our fields.

Bonus Question: What change would you like to see on the basis of this piece? Who has the power to make that change?

Other ways we could enact change include antagonistic data collection, such as lying on forums, or putting overlays on top of images so that AI can’t read them (image poisoning). It is people who generate the resource, so it is also people who could make it completely useless if they wanted to. Yet the vastness of the internet makes this challenging: you can’t get the whole internet to agree not to use a platform, and everyone has principles and personal ethics around AI until they realise they can make Disney posters of their pets. On the other hand, we have just seen a large strike in Hollywood push the film industry to the edge. The reasons for this strike predominantly related to scriptwriters’ fears around AI and the lack of residual payments for writers from streaming services. The strike was fairly successful, and could give us some insight into what’s to come regarding automation in the workforce.
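
For the overlay idea, here is a toy sketch only: it adds a barely visible uniform-noise layer to an image. Real image-poisoning tools compute carefully optimised perturbations, and this naive version would not actually defeat a model; the filenames are placeholders.

```python
import numpy as np
from PIL import Image

# Load the artwork and widen the dtype so the overlay can go negative.
img = np.asarray(Image.open("artwork.png").convert("RGB"), dtype=np.int16)

# A low-amplitude random overlay, barely visible to a human viewer.
rng = np.random.default_rng(0)
overlay = rng.integers(-8, 9, size=img.shape, dtype=np.int16)

poisoned = np.clip(img + overlay, 0, 255).astype(np.uint8)
Image.fromarray(poisoned).save("artwork_overlaid.png")
```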

To make further progress, we need to focus on education and awareness. People should learn about the whole spectrum of ways in which we participate in data labour and contribute to data production. Considering how much time children spend online today, data awareness must be taught in schools.

Regarding the paper itself, we thought that a table would have worked far better than a long paper; most people do not have time to comb through such a long piece. It would be more effective to have a condensed summary of the six dimensions with actionable points.

Attendees

  • Nina Di Cara - RA @ Uni of Bristol

  • Jessica Woodgate, PhD student, University of Bristol

  • Huw Day, JGI Data Scientist, University of Bristol

  • Natalie Zelenka, Senior Research Fellow (Health Data Science) @ UCL

  • Melanie Stefan, Medical School Berlin

  • Amy Joint, Publishing Manager, F1000

  • Lucy Bowles, Data Scientist, Brandwatch

  • Helen Sheehan, PhD student, University of Bristol

  • Euan Bennet, Lecturer, University of Glasgow

  • Kamilla Wells, Citizen Developer, Queensland Public Service, Australia