Data Ethics Club: OpenAI Furious DeepSeek Might Have Stolen All the Data OpenAI Stole From Us#

What’s this?

This is a summary of Wednesday 19th February’s Data Ethics Club discussion, where we spoke and wrote about the New Republic article OpenAI Furious DeepSeek Might Have Stolen All the Data OpenAI Stole From Us by Jason Koebler. The summary was written by Jessica Woodgate, who tried to synthesise everyone’s contributions to this document and the discussion. “We” = “someone at Data Ethics Club”. Huw Day helped with the final edit.

Article Summary#

In January 2025, DeepSeek – a Chinese AI startup specialising in large language models (LLMs) – released its R1 model, which shocked the world by demonstrating performance comparable to OpenAI’s models whilst, DeepSeek claim, costing significantly less money and using older chips. Unlike other LLMs, R1 develops its reasoning capabilities purely through reinforcement learning (RL) techniques.

Whilst the model was released open source alongside a (non-peer-reviewed) research paper, questions remain regarding the mechanics of R1’s training. Model distillation is where smaller, faster “student” models are created by compressing the knowledge of a larger, more complex “teacher” model. The student model is trained to replicate the teacher’s output distributions by asking the teacher lots of questions and mimicking the teacher’s reasoning process. Suspecting that R1 was trained using distillation, Microsoft and OpenAI are investigating whether DeepSeek used outputs obtained from OpenAI in an unauthorised manner.
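To make the distillation mechanics concrete, below is a minimal sketch of the standard distillation loss (soft-label matching with a temperature), written in PyTorch. The temperature value and the random logits standing in for teacher and student answers are illustrative assumptions; this is not a claim about DeepSeek’s or OpenAI’s actual pipelines.

```python
# Minimal sketch of a standard distillation loss (illustrative only; not
# DeepSeek's or OpenAI's actual pipeline). The student is trained to match
# the teacher's softened output distribution over the vocabulary.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student outputs."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature**2

# Toy usage: a batch of 4 "questions" over a vocabulary of 10 tokens.
teacher_logits = torch.randn(4, 10)                        # stands in for the teacher's answers
student_logits = torch.randn(4, 10, requires_grad=True)    # the smaller model's answers
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()                                            # gradients only update the student
```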

The issue raises questions about where the line lies between using available resources to drive progress and stealing. In terms of using available resources, DeepSeek could be seen as following standard software development practice, wherein research generally works iteratively, taking into account what has been done before and building on it to create something novel. If DeepSeek’s methodology is cast as stealing, OpenAI could be seen as hypocritical: OpenAI itself has been widely criticised for scraping large amounts of data from the internet to train its systems, and is currently being sued by the New York Times for unpermitted use of articles to train LLMs.

Discussion Summary#

Should these companies be publicly owned as a way to nationalise or democratise AI, since the models are trained on publicly available data?#

The question of whether a company should be publicly owned is a political question, not an ethical or technical one. On the one hand, if AI is really going to be as transformative as the hype bubble suggests, perhaps it should be publicly accountable instead of led by commercial interest. Currently, decision making in AI development is top-down, with a few big players controlling innovation and setting benchmarks. When using LLMs, you are implicitly subscribing to the ideologies of the companies and power structures behind their development. Redirecting the driving force and guidance for AI development to come from the ground up may help society regain control over the direction of travel and ensure developments are for the benefit of all. Democratising generative AI could involve some sort of voting system. On the other hand, it is not clear how this would work in practice, given the logistical challenges and the amount of money involved in the AI sector.

Governments are becoming more explicitly involved in the development of AI, and national support does make a difference in the AI sector. Empowering the general population to have influence over AI development will require improved AI literacy so that we can have more informed conversations. The lack of literacy drives hype cycles, which are societal dynamics that work by triggering emotions to subdue political and regulatory questions. However, empowering the public is different to nationalising a model to make it publicly owned.

Another approach to harnessing the benefits of language models for society could be to train small language models on company documentation, rather than using large language models trained on web-scraped data. Perhaps companies should be required to pay licensing fees in order to use publicly available data for training. Environmental concerns about the resources required to train models would remain, however.

What do you think best practices should be for model distillation (i.e. one model learning off another)? Is it really stealing if OpenAI trained their model using vast amounts of publicly available data collected through web scraping?#

Best practices for training models should prioritise transparency, in order to facilitate public scrutiny and hold companies accountable. In contexts like medicine where the stakes are high, transparency is key to maintaining respect for patients’ autonomy. One way to achieve transparency is by making models open source, which means making the design of a model publicly accessible so that people can modify and share it. The programming language Python, for example, is open source. The military uses open source software; US Department of Defense policy requires that commercial software comes with either a warranty or source code so that the software can be maintained. Open source software has played an important role in intelligence in the Russia-Ukraine war.

As well as transparency, open source brings advantages for security by inviting scrutiny from a wider audience, allowing for vulnerabilities to be identified quickly. When models are open source, providing model cards further increases transparency by making it easier to see where copy-pasting or code reuse has occurred, for instance. However, the efficacy of model cards in practice is debatable as there has been mixed evidence for how consistently they are filled out. When tools are open source there are also accountability challenges, as it is difficult to allocate ownership and trace work back to its original input, essentially anonymising the work.

We wondered whether it is appropriate for OpenAI to still have ‘open’ in its name when it no longer centres open source, having shifted its structure and values to prioritise the pursuit of artificial general intelligence (AGI) and profit. OpenAI has an unusual business model, as it transitioned from a non-profit organisation created by influential leaders in technology to a for-profit organisation in partnership with Microsoft. Revenue comes primarily from licensing agreements, subscription services, and partnerships (notably with Microsoft, which made a $1 billion investment in 2019). OpenAI (and DeepSeek, especially if distillation was used) have gained significant economic benefit from exploiting grey areas of property rights.

In contrast to the approaches of other companies like OpenAI, DeepSeek have chosen to make R1’s weights open source, in addition to sharing technical details in a publicly available report. In sharing these details, DeepSeek demonstrate more transparency than the developers of many other LLMs; however, there are still aspects of opacity, such as questions about what chips were used in training, how much it really cost to train, and what happens to users’ data. Systems have been criticised for claiming to be ‘open’ yet remaining closed in important ways. To investigate the missing pieces in understanding how R1 works, HuggingFace are attempting to reverse engineer the R1 pipeline from DeepSeek’s tech report.

If DeepSeek did use distillation, we wondered why OpenAI has not used the same techniques to improve performance. Possible explanations include that OpenAI has tried distillation and it did not work as well, or that model collapse happened. Model collapse is where training generative AI models on AI-generated content leads to a decline in performance, because generative AI models produce datasets with less variation than the original data distributions. Alternatively, perhaps OpenAI is overcommitted to its existing architecture, in which case R1 could be a positive catalyst for innovation by demonstrating ways to break out of existing dogma.
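As a toy illustration of the model collapse mechanism described above, the sketch below treats the “model” as nothing more than a Gaussian repeatedly refitted to its own samples; the sample size and number of generations are arbitrary assumptions, and this is obviously far removed from a real LLM pipeline. The point is only that variation tends to be lost when each generation is trained on the previous generation’s output.

```python
# Toy illustration of model collapse: each "generation" is fitted to samples
# drawn from the previous generation's fit, so variation is gradually lost.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0                            # the original "real" data distribution
for generation in range(1, 51):
    samples = rng.normal(mu, sigma, size=20)    # "train" on the previous model's output
    mu, sigma = samples.mean(), samples.std()   # refit the model to generated data
    if generation % 10 == 0:
        print(f"generation {generation}: fitted sigma = {sigma:.3f}")
```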

Regarding the attitudes of tech companies towards model distillation, rights of use, and property rights, it is interesting to consider the appropriateness of OpenAI calling distillation stealing. Whilst there are ambiguities about how ethical it is to distil models without explicit consent, or to train models on data scraped from the internet, it is not obvious that either equates to stealing. On the one hand, if you steal from someone who stole from someone else, it is still stealing. On the other hand, reusing other people’s code is common practice in dev culture, where projects build on chains of code reuse as developers borrow code from others who in turn borrowed it themselves. The TV show Silicon Valley explores the effects of disruptive technologies and the issues of property rights that accompany them. By labelling DeepSeek’s approach as stealing, OpenAI has been criticised as hypocritical for protesting about practices that can be viewed as analogous to its own, reflecting the similarities between distillation and OpenAI scraping publicly available data.

Cartoon about programming and plagiarism: ‘Middle school: “plagiarism is unacceptable”; High school: “plagiarism is unacceptable”; University: “plagiarism is unacceptable”; Work (programmers): “Man, I stole your code.” “It’s not my code.”’

There are interesting anthropological perspectives on the labelling of actions as stealing, for example, the influence of global politics: actors from western countries tend to be labelled as “innovators” whilst actors from other countries behaving in similar ways are labelled as “thieves”. The anthropologist Cori Hayden has written about the enclosures of public knowledge, and how the line between proper and improper copy is policed and influenced by global politics.

DeepSeek’s R1 was trained more cost-effectively and with less powerful hardware but still performed as well as OpenAI’s model, attributed in part to its new architecture rather than just throwing more data + compute at the problem. Do you think that constrained environments can generally be a good catalyst for innovation?#

Much of the push for AI innovation comes from the idea of an “AI arms race”, primarily between the US and China. We wondered how much we should care about this supposed race: whether it is advancing technological development, risking the destruction of humanity, or just bluster orchestrated to further cement the position of dominant players in the economy. It is also unclear what countries involved in the arms race want to do with the AI they develop: whether they want to use it for the benefit of citizens by creating economic value, or for the military and warfare.

Supporting China’s position in the supposed arms race, Chinese media has focused on the technological breakthroughs achieved in R1, arguing that DeepSeek demonstrates China’s increasing capability to develop cutting-edge models independently of Western technology. However, there is conflicting evidence surrounding how R1 was trained. DeepSeek claims it did not use Nvidia H100 chips, which US export controls prohibit from being sold to China, but some Chinese reporting states that DeepSeek did train R1 on H100s.

Whilst there is debate around R1’s novelty, there are other examples of truly significant innovation that have come out of the AI sector. For example, the paper “Attention Is All You Need” introduced the transformer architecture, a type of deep learning model that is highly effective at capturing dependencies and contextual relationships. Transformers use attention mechanisms to focus on specific parts of the input sequence. Attention works especially well in natural language processing (NLP) tasks, where the meaning of a sentence is generally influenced by its context. The paper provided the foundation for powerful large language models built on generative pre-trained transformers (GPTs), which identify relationships between words and thereby retain relevant context. Transformers have been revolutionary in NLP domains such as translation, and are also being used for time series forecasting, which predicts future trends based on historical data.
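As a rough sketch of the attention mechanism the paper introduces, the snippet below implements single-head scaled dot-product attention in plain numpy; the sequence length and embedding size are illustrative, and the learned query/key/value projections of a real transformer are omitted for brevity.

```python
# Single-head scaled dot-product attention: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # relevance of each token to each other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the sequence
    return weights @ V                              # context-weighted mix of the values

# Toy usage: a "sentence" of 4 tokens, each embedded in 8 dimensions.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
# In a real transformer, Q, K and V come from learned projections of x;
# reusing x directly keeps the sketch short.
output = scaled_dot_product_attention(x, x, x)
print(output.shape)                                 # (4, 8): one context-aware vector per token
```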

Generally, however, we feel that “innovation” is largely used as a buzzword to inflate the capabilities of creators and the importance of the tools they provide, as part of a push towards replacing human involvement with AI. We found it interesting that phrasing such as “knowledge” and making iterations “smarter” is used when advances are essentially refinements of statistical probabilities. We define innovation as “developing novel methods to address new problems”. There is nothing inherently wrong with the pursuit of knowledge. Humans love a puzzle; finding creative solutions to problems is an essential part of the human experience, where “necessity is the mother of invention”. The crux is how that knowledge is used.

Whilst there has been hype and overexaggeration of the capabilities of generative AI, it is important to acknowledge that the functionality of ChatGPT has evolved over time, moving from conversation sequencing, which utilises what a speaker has said to make a relevant response and ensure steps in a conversation are related, to collaboration and provoking more actionable interactions. ChatGPT has been criticised as difficult to verify because it does not provide citations, although it is getting better at referencing sources and there are ways to encourage it to cite.

Innovation should be focused so that resources are not spent on generating tools that contribute little value; constraints mean that we can direct the aims of innovation. Some restrictions on development are necessary to ensure that products are ethical and do not exploit people. Innovation for the sake of innovation and the “move fast and break things” paradigm have real effects on real people’s lives. Constraints shouldn’t be aimed at preventing superintelligent AI, but should be focused on the actual harms that are happening today, such as misuse of data and environmental impact. To foster benevolent innovation, players should be supported in pursuing knowledge, with society then choosing the constraints on the application of that innovation.

In the domain of language models, constraining the size of models could enhance human creativity, rather than restrict or bypass it. When we have access to large and adaptable LLMs, it can be tempting to try and use those tools for every problem. Yet, these tools frequently hallucinate and have substantial environmental impact in terms of resources consumed in training and data storage.

Constraints could also improve the quality of content being generated. Initially, there was a presumption that more data is always better. However, the fact that it is possible to obtain more data does not entail that the result will be better; there are real costs to having too much data. LLMs are also contributing to the enshittification of the internet, as the stakeholders being valued shift from users to shareholders. Devaluing the experience of users leads to the web slowly being filled with low-quality generated slop from AI-enhanced search engine optimisation (SEO) farming websites. More data entails more noise, masking any important and meaningful content.

There are physical resource constraints that digital technologies are subject to, despite the conceptual distancing. Big tech’s energy demands are set to steadily increase, and there are competing interests between those demands and the state’s demands for energy generation. Datacentres also use a huge amount of water to keep compute equipment cool, leading to concerns over water scarcity and desertification, especially as datacentres are being built in already water-restricted areas such as Arizona.

What change would you like to see on the basis of this piece? Who has the power to make that change?#

LLMs have the potential to do harm or good on a substantial level. As LLMs are trained on data generated by society, they tend to reflect the stereotypes and prejudices that exist in the data they were trained on, and thereby in society. If there is more information about one group of people in the training data, the model will make more accurate predictions for that group. If a model favours one group above another, it should not be deployed, as it will exacerbate existing inequalities, and possibly create new ones, in populations that are already disadvantaged.

However, the biased models that have been deployed have shone a spotlight on inequalities that exist in society. Making these inequalities visible could help us to address them. Humans are biased, but this can be difficult to prove. For example, visiting a clinician in hospital could lead you to suspect that biases are influencing their diagnosis, yet it is often difficult to be sure of this. With AI models, it is possible to quantitatively measure bias and pinpoint where it exists.
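As a hedged illustration of what quantitatively measuring bias can look like, the sketch below compares a hypothetical model’s positive-prediction rate and accuracy across two made-up groups; the data, labels, and group names are entirely invented for the example.

```python
# Simple group fairness check on hypothetical data: compare the rate of
# positive predictions (demographic parity) and the accuracy per group.
import numpy as np

predictions = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])       # hypothetical model outputs
labels      = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 1])       # hypothetical ground truth
groups      = np.array(["A", "A", "A", "A", "A",
                        "B", "B", "B", "B", "B"])             # hypothetical group membership

for group in np.unique(groups):
    mask = groups == group
    positive_rate = predictions[mask].mean()                  # demographic parity check
    accuracy = (predictions[mask] == labels[mask]).mean()     # performance per group
    print(f"group {group}: positive rate = {positive_rate:.2f}, accuracy = {accuracy:.2f}")
```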

Attendees#

  • Huw Day, Data Scientist, University of Bristol: LinkedIn, BlueSky

  • Jessica Woodgate, PhD Student, University of Bristol

  • Euan Bennet, Lecturer, University of Glasgow, BlueSky

  • Christina Palantza, PhD student, University of Bristol

  • Vanessa Hanschke, Lecturer, University College London

  • Arun Isaac, postdoc, University College London

  • Michelle Venetucci, PhD student, Yale University

  • Kamilla Wells, Citizen Developer, Australian Public Service, Brisbane

  • Mirah Jing Zhang, PhD student, University of Bristol

  • Adrianna Jezierska, PhD student, University of Bristol: LinkedIn