Data Ethics Club: OpenAI Furious DeepSeek Might Have Stolen All the Data OpenAI Stole From Us#

What’s this?

This is a summary of Wednesday 19th February’s Data Ethics Club discussion, where we spoke and wrote about the New Republic article OpenAI Furious DeepSeek Might Have Stolen All the Data OpenAI Stole From Us by Jason Koebler. The summary was written by Jessica Woodgate, who tried to synthesise everyone’s contributions to this document and the discussion. “We” = “someone at Data Ethics Club”. Huw Day helped with the final edit.

Article Summary#

In January 2025, DeepSeek – a Chinese AI startup specialising in large language models (LLMs) – released its R1 model, which shocked the world by demonstrating performance comparable to OpenAI’s models whilst, DeepSeek claim, costing significantly less money and using older chips. Unlike other LLMs, R1 develops its reasoning capabilities purely through reinforcement learning (RL) techniques.

Whilst the model was released open source alongside a (non-peer-reviewed) research paper, questions remain regarding the mechanics of R1’s training. Model distillation is where smaller, faster “student” models are created by compressing the knowledge of a larger, more complex “teacher” model. The student model is trained to replicate the teacher’s output distributions by asking the teacher lots of questions and mimicking the teacher’s reasoning process. Suspecting that R1 was trained using distillation, Microsoft and OpenAI are investigating whether DeepSeek used outputs obtained from OpenAI in an unauthorised manner.
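To make the distillation mechanics concrete, below is a minimal sketch of the standard distillation loss (soft-label matching with a temperature), written in PyTorch. The temperature value and the random logits standing in for teacher and student answers are illustrative assumptions; this is not a claim about DeepSeek’s or OpenAI’s actual pipelines.

```python
# Minimal sketch of a standard distillation loss (illustrative only; not
# DeepSeek's or OpenAI's actual pipeline). The student is trained to match
# the teacher's softened output distribution over the vocabulary.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student outputs."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature**2

# Toy usage: a batch of 4 "questions" over a vocabulary of 10 tokens.
teacher_logits = torch.randn(4, 10)                        # stands in for the teacher's answers
student_logits = torch.randn(4, 10, requires_grad=True)    # the smaller model's answers
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()                                            # gradients only update the student
```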

The issue raises questions about where the line lies between using available resources to drive progress and stealing. In terms of using available resources, DeepSeek could be seen as following standard software development practice, wherein research generally works iteratively, taking into account what has been done before and building on it to create something novel. If DeepSeek’s methodology is cast as stealing, OpenAI could be seen as hypocritical: OpenAI itself has been widely criticised for scraping large amounts of data from the internet to train its systems, and is currently being sued by the New York Times for unpermitted use of articles to train LLMs.

Discussion Summary#

Should these companies be publicly owned as a way to nationalise or democratise AI, since the models are trained on publicly available data?#

The question of whether a company should be publicly owned is a political question, not an ethical or technical one. On the one hand, if AI is really going to be as transformative as the hype bubble suggests, perhaps it should be publicly accountable instead of led by commercial interest. Currently, decision making in AI development is top-down, with a few big players controlling innovation and setting benchmarks. When using LLMs, you are implicitly subscribing to the ideologies of the companies and power structures behind their development. Redirecting the driving force and guidance for AI development to come from the ground up may help society regain control over the direction of travel and ensure developments are for the benefit of all. Democratising generative AI could involve some sort of voting system. On the other hand, it is not clear how this would work in practice, given the logistical challenges and the amount of money involved in the AI sector.

Governments are becoming more explicitly involved in the development of AI, and national support does make a difference in the AI sector. Empowering the general population to have influence over AI development will require improved AI literacy so that we can have more informed conversations. The lack of literacy drives hype cycles, which are societal dynamics that work by triggering emotions to subdue political and regulatory questions. However, empowering the public is different to nationalising a model to make it publicly owned.

Another approach to harnessing the benefits of language models for society could be to train small language models on company documentation, rather than using large language models trained on web-scraped data. Perhaps companies should be required to pay licensing fees in order to use publicly available data for training. Environmental concerns about the resources required to train models would remain, however.

What do you think best practices should be for model distillation (i.e. one model learning off another)? Is it really stealing if OpenAI trained their model using vast amounts of publicly available data collected through web scraping?#

Best practices for training models should prioritise transparency, in order to facilitate public scrutiny and hold companies accountable. In contexts like medicine where the stakes are high, transparency is key to maintaining respect for patients’ autonomy. One way to achieve transparency is by making models open source, which means making the design of a model publicly accessible so that people can modify and share it. The programming language Python, for example, is open source. The military uses open source software; US Department of Defense policy requires that commercial software comes with either a warranty or source code so that the software can be maintained. Open source software has played an important role in intelligence in the Russia-Ukraine war.

As well as transparency, open source brings advantages for security by inviting scrutiny from a wider audience, allowing for vulnerabilities to be identified quickly. When models are open source, providing model cards further increases transparency by making it easier to see where copy-pasting or code reuse has occurred, for instance. However, the efficacy of model cards in practice is debatable as there has been mixed evidence for how consistently they are filled out. When tools are open source there are also accountability challenges, as it is difficult to allocate ownership and trace work back to its original input, essentially anonymising the work.

We wondered whether it is appropriate for OpenAI to still have ‘open’ in its name when it no longer centres open source, having shifted its structure and values to prioritise the pursuit of artificial general intelligence (AGI) and profit. OpenAI has an unusual business model, as it transitioned from a non-profit organisation created by influential leaders in technology to a for-profit organisation in partnership with Microsoft. Revenue comes primarily from licensing agreements, subscription services, and partnerships (notably with Microsoft, which made a $1 billion investment in 2019). OpenAI (and DeepSeek, especially if distillation was used) have gained significant economic benefit from exploiting grey areas of property rights.

In contrast to the approaches of other companies like OpenAI, DeepSeek have chosen to make R1’s weights open source, in addition to sharing technical details in a publicly available report. In sharing these details, DeepSeek demonstrate more transparency than the developers of many other LLMs; however, there are still aspects of opacity, such as questions about what chips were used in training, how much it really cost to train, and what happens to users’ data. Systems have been criticised for claiming to be ‘open’ yet remaining closed in important ways. To investigate the missing pieces in understanding how R1 works, HuggingFace are attempting to reverse engineer the R1 pipeline from DeepSeek’s tech report.

If DeepSeek did use distillation, we wondered why OpenAI has not used the same techniques to improve performance. Possible explanations include that OpenAI has tried distillation and it did not work as well, or that model collapse happened. Model collapse is where training generative AI models on AI-generated content leads to a decline in performance, because generative AI models produce datasets with less variation than the original data distributions. Alternatively, perhaps OpenAI is overcommitted to its existing architecture, in which case R1 could be a positive catalyst for innovation by demonstrating ways to break out of existing dogma.
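As a toy illustration of the model collapse mechanism described above, the sketch below treats the “model” as nothing more than a Gaussian repeatedly refitted to its own samples; the sample size and number of generations are arbitrary assumptions, and this is obviously far removed from a real LLM pipeline. The point is only that variation tends to be lost when each generation is trained on the previous generation’s output.

```python
# Toy illustration of model collapse: each "generation" is fitted to samples
# drawn from the previous generation's fit, so variation is gradually lost.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0                            # the original "real" data distribution
for generation in range(1, 51):
    samples = rng.normal(mu, sigma, size=20)    # "train" on the previous model's output
    mu, sigma = samples.mean(), samples.std()   # refit the model to generated data
    if generation % 10 == 0:
        print(f"generation {generation}: fitted sigma = {sigma:.3f}")
```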

Regarding the attitudes of tech companies towards model distillation, rights of use, and property rights, it is interesting to consider the appropriateness of OpenAI calling distillation stealing. Whilst there are ambiguities about how ethical it is to distil models without explicit consent, or to train models on data scraped from the internet, it is not obvious that either equates to stealing. On the one hand, if you steal from someone who stole from someone else, it is still stealing. On the other hand, reusing other people’s code is common practice in dev culture, where projects build on chains of code reuse as developers borrow code from others who in turn borrowed it themselves. The TV show Silicon Valley explores the effects of disruptive technologies and the issues of property rights that accompany them. By labelling DeepSeek’s approach as stealing, OpenAI has been criticised as hypocritical for protesting about practices that can be viewed as analogous to its own, reflecting the similarities between distillation and OpenAI scraping publicly available data.

Cartoon about programming and plagiarism: ‘Middle school: “plagiarism is unacceptable”; High school: “plagiarism is unacceptable”; University: “plagiarism is unacceptable”; Work (programmers): “Man, I stole your code.” “It’s not my code.”’

There are interesting anthropological perspectives on the labelling of actions as stealing, for example, the influence of global politics: actors from western countries tend to be labelled as “innovators” whilst actors from other countries behaving in similar ways are labelled as “thieves”. The anthropologist Cori Hayden has written about the enclosures of public knowledge, and how the line between proper and improper copy is policed and influenced by global politics.

DeepSeek’s R1 was trained more cost-effectively and with less powerful hardware but still performed as well as OpenAI’s model, attributed in part to its new architecture rather than just throwing more data + compute at the problem. Do you think that constrained environments can generally be a good catalyst for innovation?#

Much of the push for AI innovation comes from the idea of an “AI arms race”, primarily between the US and China. We wondered how much we should care about this supposed race: whether it is advancing technological development, risking the destruction of humanity, or just bluster orchestrated to further cement the position of dominant players in the economy. It is also unclear what countries involved in the arms race want to do with the AI they develop: whether they want to use it for the benefit of citizens by creating economic value, or for the military and warfare.

Supporting China’s position in the supposed arms race, Chinese media has focused on the technological breakthroughs achieved in R1, arguing that DeepSeek demonstrates China’s increasing capability to develop cutting-edge models independently of Western technology. However, there is conflicting evidence surrounding how R1 was trained. DeepSeek claims it did not use Nvidia H100 chips, which US export controls prohibit from being sold to China, but some Chinese reporting states that DeepSeek did train R1 on H100s.

Whilst there is debate around R1’s novelty, there are other examples of truly significant innovation that have come out of the AI sector. For example, the paper “Attention Is All You Need” introduced the transformer architecture, a type of deep learning model that is highly effective at capturing dependencies and contextual relationships. Transformers use attention mechanisms to focus on specific parts of the input sequence. Attention works especially well in natural language processing (NLP) tasks, where the meaning of a sentence is generally influenced by its context. The paper provided the foundation for powerful large language models built on generative pre-trained transformers (GPTs), which identify relationships between words and thereby retain relevant context. Transformers have been revolutionary in NLP domains such as translation, and are also being used for time series forecasting, which predicts future trends based on historical data.
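As a rough sketch of the attention mechanism the paper introduces, the snippet below implements single-head scaled dot-product attention in plain numpy; the sequence length and embedding size are illustrative, and the learned query/key/value projections of a real transformer are omitted for brevity.

```python
# Single-head scaled dot-product attention: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # relevance of each token to each other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the sequence
    return weights @ V                              # context-weighted mix of the values

# Toy usage: a "sentence" of 4 tokens, each embedded in 8 dimensions.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
# In a real transformer, Q, K and V come from learned projections of x;
# reusing x directly keeps the sketch short.
output = scaled_dot_product_attention(x, x, x)
print(output.shape)                                 # (4, 8): one context-aware vector per token
```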

Generally, however, we feel that “innovation” is largely used as a buzzword to inflate the capabilities of creators and the importance of the tools they provide, as part of a push towards replacing human involvement with AI. We found it interesting that phrasing such as “knowledge” and making iterations “smarter” is used when advances are essentially refinements of statistical probabilities. We define innovation as “developing novel methods to address new problems”. There is nothing inherently wrong with the pursuit of knowledge. Humans love a puzzle; finding creative solutions to problems is an essential part of the human experience, where “necessity is the mother of invention”. The crux is how that knowledge is used.

Whilst there has been hype and overexaggeration of the capabilities of generative AI, it is important to acknowledge that the functionality of ChatGPT has evolved over time, moving from conversation sequencing, which utilises what a speaker has said to make a relevant response and ensure steps in a conversation are related, to collaboration and provoking more actionable interactions. ChatGPT has been criticised as difficult to verify because it does not provide citations, although it is getting better at referencing sources and there are ways to encourage it to cite.

Innovation should be focused so that resources are not spent on generating tools that contribute little value; constraints mean that we can direct the aims of innovation. Some restrictions on development are necessary to ensure that products are ethical and do not exploit people. Innovation for the sake of innovation and the “move fast and break things” paradigm have real effects on real people’s lives. Constraints shouldn’t be aimed at preventing superintelligent AI, but should be focused on the actual harms that are happening today, such as misuse of data and environmental impact. To foster benevolent innovation, players should be supported in pursuing knowledge, with society then choosing the constraints on the application of that innovation.

In the domain of language models, constraining the size of models could enhance human creativity, rather than restrict or bypass it. When we have access to large and adaptable LLMs, it can be tempting to try and use those tools for every problem. Yet, these tools frequently hallucinate and have substantial environmental impact in terms of resources consumed in training and data storage.

Constraints could also improve the quality of content being generated. Initially, there was a presumption that more data is always better. However, the fact that it is possible to obtain more data does not entail that the result will be better; there are real costs to having too much data. LLMs are also contributing to the enshittification of the internet, as the stakeholders being valued shift from users to shareholders. Devaluing the experience of users leads to the web slowly being filled with low-quality generated slop from AI-enhanced search engine optimisation (SEO) farming websites. More data entails more noise, masking any important and meaningful content.

There are physical resource constraints that digital technologies are subject to, despite the conceptual distancing. Big tech’s energy demands are set to steadily increase, and there are competing interests between those demands and the state’s demands for energy generation. Datacentres also use a huge amount of water to keep compute equipment cool, leading to concerns over water scarcity and desertification, especially as datacentres are being built in already water-restricted areas such as Arizona.

What change would you like to see on the basis of this piece? Who has the power to make that change?#

LLMs have the potential to do harm or good on a substantial level. As LLMs are trained on data generated by society, they tend to reflect the stereotypes and prejudices that exist in the data they were trained on, and thereby in society. If there is more information about one group of people in the training data, the model will make more accurate predictions for that group. If a model favours one group above another, it should not be deployed, as it will exacerbate existing inequalities, and possibly create new ones, in populations that are already disadvantaged.

However, the biased models that have been deployed have shone a spotlight on inequalities that exist in society. Making these inequalities visible could help us to address them. Humans are biased, but this can be difficult to prove. For example, visiting a clinician in hospital could lead you to suspect that biases are influencing their diagnosis, yet it is often difficult to be sure of this. With AI models, it is possible to quantitatively measure bias and pinpoint where it exists.
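As a hedged illustration of what quantitatively measuring bias can look like, the sketch below compares a hypothetical model’s positive-prediction rate and accuracy across two made-up groups; the data, labels, and group names are entirely invented for the example.

```python
# Simple group fairness check on hypothetical data: compare the rate of
# positive predictions (demographic parity) and the accuracy per group.
import numpy as np

predictions = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])       # hypothetical model outputs
labels      = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 1])       # hypothetical ground truth
groups      = np.array(["A", "A", "A", "A", "A",
                        "B", "B", "B", "B", "B"])             # hypothetical group membership

for group in np.unique(groups):
    mask = groups == group
    positive_rate = predictions[mask].mean()                  # demographic parity check
    accuracy = (predictions[mask] == labels[mask]).mean()     # performance per group
    print(f"group {group}: positive rate = {positive_rate:.2f}, accuracy = {accuracy:.2f}")
```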

Attendees#

  • Huw Day, Data Scientist, University of Bristol: LinkedIn, BlueSky

  • Jessica Woodgate, PhD Student, University of Bristol

  • Euan Bennet, Lecturer, University of Glasgow, BlueSky

  • Christina Palantza, PhD student, University of Bristol

  • Vanessa Hanschke, Lecturer, University College London

  • Arun Isaac, postdoc, University College London

  • Michelle Venetucci, PhD student, Yale University

  • Kamilla Wells, Citizen Developer, Australian Public Service, Brisbane

  • Mirah Jing Zhang, PhD student, University of Bristol

  • Adrianna Jezierska, PhD student, University of Bristol: LinkedIn