---
blogpost: true
date: March 19th, 2025
author: Jessica Woodgate
category: Write Up
tags: synthetic biology, AI, AlphaFold, explainability, protein folding
---

# Data Ethics Club: [The Most Useful Thing AI Has Ever Done](https://www.youtube.com/watch?v=P_fHJIYENdI): AlphaFold

```{admonition} What's this?
This is a summary of Wednesday 19th March’s Data Ethics Club discussion, where we spoke and wrote about the video [The Most Useful Thing AI Has Ever Done](https://www.youtube.com/watch?v=P_fHJIYENdI). The summary was written by Jessica Woodgate, who tried to synthesise everyone's contributions to this document and the discussion. "We" = "someone at Data Ethics Club". Huw Day helped with the final edit.
```

## Article Summary

Proteins are fundamental building blocks of living organisms, performing a vast array of functions from catalysing metabolic reactions to providing structure to cells. Proteins consist of strings of amino acids that coil up on themselves when they are pushed and pulled. The way a protein is coiled or folded is directly tied to its function, so understanding the structure of the protein relays information about how the protein works, what other molecules it interacts with, and how mutations may cause diseases. This information can be applied to drug design and discovery, or to breaking down environmentally harmful substances such as methane and plastic.

Initially, experimental techniques to determine protein structure worked by creating a crystal out of the protein, which is lengthy and costly; the first protein structure took 12 years to determine. Over six decades, scientists working on protein folding discovered about 150,000 protein structures. AlphaFold took the field by storm, unveiling over 200 million protein structures – nearly all proteins known to exist in nature.

Trained on protein structures from the [Protein Data Bank](https://www.rcsb.org/), AlphaFold2 used [transformers](https://www.datacamp.com/tutorial/how-transformers-work) - a deep learning architecture which relies on [attention](https://www.geeksforgeeks.org/ml-attention-mechanism/). Attention has the effect of adding context to sequential information by breaking it down into chunks, converting these into numerical representations (embeddings), and making connections between the embeddings. Attention is used in large language models (LLMs) to predict the most useful word in a sentence; AlphaFold uses attention over amino acid sequences to predict how they fold. The transformer in AlphaFold has two “towers”: one biology tower, containing information about how amino acids change across living species, and one geometry tower, containing information about pair representations, or which amino acids are related in the structure. As the towers process inputs, they pass “clues” to one another, exchanging information 48 times before sending the geometric features that they have learnt to the structure module. The structure module predicts the appropriate translation and rotation of the amino acids, and outputs a 3D protein. The whole cycle is performed another 3 times before outputting a prediction.

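
The attention mechanism described above is easier to see in a toy example. Below is a minimal sketch of scaled dot-product self-attention over a handful of made-up embeddings, written in plain NumPy; it illustrates the general mechanism rather than AlphaFold’s actual implementation, and all of the array sizes and weights are arbitrary:

```python
import numpy as np

# Toy "embeddings": 4 sequence positions, each represented by a 6-dimensional vector.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6))

# Projection matrices for queries, keys, and values (random here, purely for illustration;
# in a real transformer these are learned during training).
d_k = 6
W_q, W_k, W_v = (rng.normal(size=(6, d_k)) for _ in range(3))
q, k, v = x @ W_q, x @ W_k, x @ W_v

# Scaled dot-product attention: every position scores every other position,
# the scores are normalised with a softmax, and the values are mixed accordingly.
scores = q @ k.T / np.sqrt(d_k)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
output = weights @ v  # each row is now a context-aware representation of its position

print(weights.round(2))  # each row sums to 1: how much each position "attends" to the others
```

Each position ends up represented as a weighted mixture of all the others, which is the sense in which attention “adds context” to sequential information.
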
## Discussion Summary

### Who “owns” the knowledge generated by AlphaFold, and should there be special patent considerations for discoveries made primarily via AI-driven insights, such as big pharmaceutical companies discovering new drug compounds?

To assign ownership of knowledge generated by AlphaFold, one could look at who is funding its development and deployment, such as who is paying for material requirements like the electricity bill and any grants used. If a company (in this case, [DeepMind](https://deepmind.google/)) has ownership of the technology, the credit will commonly be assigned to them. Those who have funded or patented a tool may have ownership of the *technology*; however, this is arguably different to owning the *knowledge* that a tool produces. Designating ownership of knowledge is a more challenging task than ownership of technology.

Difficulties with assigning ownership of knowledge are illustrated by the different issues that arise with author attribution. In theory, individual attribution seems like a good idea, but in large projects discerning authorship is often not a simple task. Different disciplines assign first or last author in different ways, for example, some list authors alphabetically and some list the lead supervisor last. In some academic disciplines and institutions, e.g. [CERN](https://cds.cern.ch/collection/CMS%20Papers), papers have long lists of authors. If everybody is listed as an equal author, it is difficult to assign credit at a more granular level. However, in large communities such as CERN, perhaps people prefer to share the credit of collective efforts.

If the outputs of AlphaFold are integral to a paper, one may wonder if AlphaFold itself has some degree of knowledge ownership and should be listed as a contributor. Yet, we did wonder if the outputs themselves are something that would be publishable, or if they are just information that needs to be applied and turned into something novel to provide a valuable research contribution.

In addition, we should be cautious of distancing automated tools from the humans behind them. Even if we don’t fully understand the inner workings of a tool or its application, it is important to remember that in the background somewhere there is still a human making inputs to a model that then goes on to make predictions. Maintaining focus on the actual people who develop and deploy a tool is key to attributing accountability. AlphaFold is an example of how powerful computing can be in accelerating solutions to certain problems, but we must acknowledge that it was people who did the work in developing the tool, and accountability still comes down to them.

In addition to those directly involved in the development and deployment of the tool, the general public may have some rights to knowledge generated by AlphaFold. AlphaFold was trained on the [Protein Data Bank](https://www.rcsb.org/), a public repository of over 170,000 protein sequences and structures. The code for AlphaFold2 [is open source](https://github.com/google-deepmind/alphafold); those of us who work in similar areas thought this is quite unusual. AlphaFold3 was initially released as [a web server limited to non-commercial use](https://alphafoldserver.com/welcome) in [lieu of code](https://www.nature.com/articles/d41586-024-01463-0). The opacity [was criticised](https://www.genengnews.com/topics/artificial-intelligence/alphafold-3-angst-limited-accessibility-stirs-outcry-from-researchers/), including in [a letter signed by more than 1000 scientists protesting Nature’s decision to publish without the code](https://zenodo.org/records/11391920). Now, however, [AlphaFold3 is open source](https://github.com/google-deepmind/alphafold3).

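
Both the Protein Data Bank and the public AlphaFold structure database are openly accessible over the web, which is part of what makes this knowledge feel communal. As a small illustration, here is a minimal sketch of fetching one experimentally determined structure and one AlphaFold prediction; the download URL patterns and example identifiers are assumptions based on the public endpoints at the time of writing and may change:

```python
import urllib.request

# Sketch: download one experimental structure and one predicted structure.
# The URL patterns and identifiers below are assumptions and may change over time.
downloads = {
    "experimental_crambin.pdb": "https://files.rcsb.org/download/1CRN.pdb",  # PDB entry 1CRN
    "predicted_haemoglobin_alpha.pdb": "https://alphafold.ebi.ac.uk/files/AF-P69905-F1-model_v4.pdb",  # AlphaFold DB, UniProt P69905
}

for filename, url in downloads.items():
    urllib.request.urlretrieve(url, filename)
    print(f"saved {url} -> {filename}")
```
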
To be practical, we thought that it has to be open source; AlphaFold is a “big tool” that will be widely used, so “big tool” rules apply. Regarding the discovered protein folds, it isn’t clear whether they are open source (usually, we thought, this kind of information would be clearly stated), but it doesn’t seem to be the case that DeepMind is patenting them.

Some of us did not think that the outputs of AlphaFold should be patented; we did not want to see big pharmaceutical companies holding ownership of protein folding. When universities research drug development, it is done for the common good. When pharmaceutical companies use these tools, it is so they can develop drugs to make a profit. Arguably, this shifts value from societal benefit to monetary gain and undermines the work people have done to make new discoveries. The outputs of AlphaFold could be thought of like a periodic table, insofar as nobody owns it but it is a source of information to be drawn from. Sharing and accessing this knowledge could be protected by [Creative Commons (CC)](https://creativecommons.org/) licences to maintain a balance between reserving the rights of content owners over their work and allowing creators freedom in what they can use.

Looking back into the history of computerised methodologies provides some insight into how to read the role of AI in the sciences. Instead of viewing AlphaFold as part of the recent history of AI, we should consider how it fits into a longer history of computational methods as applied to theoretical problems. [Numerical analysis](https://en.wikipedia.org/wiki/Numerical_analysis) is the study of algorithms using numerical approximation for mathematical analysis. Numerical analysis has been in use for more than 2000 years, far predating the invention of modern computers, and numerical methods have been part of physics B.Sc. courses for many years. [Terry Tao](https://terrytao.wordpress.com/) is a professor of mathematics who has been giving talks about the history of numerical methods. “Computers” used to refer to [people, often women, who carried out calculations by hand](https://en.wikipedia.org/wiki/Women_in_computing). One of the first computers was [Maria Mitchell](https://en.wikipedia.org/wiki/Maria_Mitchell), who worked on computing the motion of the planet Venus, and [Ada Lovelace](https://en.wikipedia.org/wiki/Ada_Lovelace) is often regarded as the first programmer for her work on [Charles Babbage’s analytical engine](https://en.wikipedia.org/wiki/Analytical_engine). The film [Hidden Figures](https://www.imdb.com/title/tt4846340/) narrates the stories of NASA computers and mathematicians [Katherine Goble Johnson](https://en.wikipedia.org/wiki/Katherine_Johnson), [Dorothy Vaughan](https://en.wikipedia.org/wiki/Dorothy_Vaughan), and [Mary Jackson](https://en.wikipedia.org/wiki/Mary_Jackson_(engineer)). AI builds on the history of computation to open up knowledge generation in two ways: making computing more efficient and providing new ways of structuring and analysing information. We thought that AlphaFold is generally a positive application of AI, as the tangible examples of AlphaFold’s output are wide-ranging.

### How should researchers and policymakers respond when the model’s predictions can no longer be fully understood or straightforwardly verified by humans?

Model outputs should be handled with the consideration that they are predictions and could be wrong or change.
The probabilistic nature of machine learning (ML) means that the same input will not always yield the same output, and results might not be replicable. AI tools like AlphaFold are attractive to humans as they offer a useful way to frame problems into observable configurations. However, predictive models are not rich enough to solve every problem. Condensing phenomena into predictable patterns reduces complexity, and in some problems that complexity is irreducible. There was an interesting part of the model explanation which seemed to insinuate “don’t worry about the physics for a minute, just fold away and hope it works”.

In our experience in social work, we have come across models being used to predict which children might be at risk of future gang affiliation. However, there are so many external factors that come into play to influence the outcome; reducing this prediction to a binary classification based on a discrete array of input features does not account for the complexity of social phenomena. In the medical domain, if you autopsied someone who passed away from Alzheimer’s disease, you may find a complex case with other factors influencing their prognosis, not just misfolded proteins. In [a previous Data Ethics Club](https://dataethicsclub.com/write_ups/2024-10-09_writeup.html), we looked at precision medicine, which attempts to use ML to tailor diagnosis and prognosis to individuals, yet lacks robust evidence of its superiority over health professionals. Possibly, this is associated with the difficulty of performing causal inference without making assumptions to reduce the complexity of health conditions. Tools like AlphaFold which hide away complexity may stop people being curious and thinking outside of the box.

The generalisability of AlphaFold is further limited by its lack of interpretability and verifiability. Verifying is important: the [outputs of AlphaFold are good and useful predictions, but they are not real protein structures](https://www.chemistryworld.com/opinion/why-alphafold-wont-revolutionise-drug-discovery/4016051.article). Large pipelines over massive datasets already require an element of trust; on the scale of AlphaFold, considering the wide-ranging possibilities for application, the problem might be even bigger. It is difficult to guarantee how trustworthy the outputs are, although we know AlphaFold has been partly verified on known examples, and the model reports its own per-residue confidence scores (a small example of inspecting these follows below). Until someone comes up with a separate approach to predict protein structure, it is difficult to independently verify and replicate AlphaFold’s findings. We wondered if it is possible to reverse engineer AlphaFold’s outputs and find out why it gives the answers it gives.

Connecting findings from tools like AlphaFold with real applications is constrained if we don’t understand the underlying mechanisms, as explainability is needed in order to assess model outputs within contexts. There is a high level of model complexity, and it can be difficult to interpret outputs. Synthetic biology is pushing boundaries, and the directions we’re going in may be very different from where we have been before. There are already limitations with understanding underlying mechanisms of experimental biology, so this is not necessarily a new problem. We wondered if we could restrict ourselves to low-risk applications, and whether we can know what the risk impacts of protein folding will be.

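
One concrete handle on trusting individual predictions is AlphaFold’s per-residue confidence score (pLDDT), which, in the model files distributed through the AlphaFold Database, is stored in the B-factor column of the PDB format. A minimal sketch of reading those values from a downloaded model file might look like the following (the file name is a placeholder, and the column positions follow the standard PDB ATOM record layout):

```python
# Sketch: read per-residue pLDDT confidence scores from an AlphaFold model file.
# Assumes a PDB-format file from the AlphaFold Database, where the B-factor column
# (columns 61-66 of each ATOM record) holds the pLDDT value for that residue.
plddt_by_residue = {}

with open("AF-P69905-F1-model_v4.pdb") as handle:  # placeholder file name
    for record in handle:
        if record.startswith("ATOM"):
            residue_number = int(record[22:26])   # residue sequence number, columns 23-26
            plddt = float(record[60:66])          # B-factor field, columns 61-66
            # Every atom in a residue carries the same pLDDT, so keeping one copy is enough.
            plddt_by_residue[residue_number] = plddt

low_confidence = [r for r, p in plddt_by_residue.items() if p < 70]
print(f"{len(low_confidence)} of {len(plddt_by_residue)} residues fall below pLDDT 70")
```

Regions with low pLDDT are exactly the parts of a prediction that most need the kind of independent verification discussed above.
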
When building and using ML tools, especially in medical and biological domains, it is important to think about the reasons underpinning the data collection. It is also important to consider potential nefarious applications. AlphaFold is influential in drug development because it enables much more specificity in targeting design. This does open up avenues for misuse, but all of the practical aspects require significant resources and money, so the realistic implications may be small. From our experience in lab work, there is a lot of rigour in genetics and in implementing the findings of synthetic biology, as there have historically been a lot of scares in genetics and consequently quite a few safeguards.

### Does lack of understanding about how AI systems work/glossing over how intricate these models are harm scientific development by leading too many people to just “try AI”?

An issue that arises time and time again with AI is the gap between the people who understand how the tools are working, and the people who want to use the tools. Tools are increasingly accessible, but this is not mirrored by general understanding of their mechanics or suitable applications. A lot of us, for example, do not really understand protein folding. Lacking in-depth understanding of a tool is not always an issue; for example, drugs that act on serotonin uptake do help treat depression. We might not fully understand the mechanism, but they still work, and not understanding how they work shouldn’t stop us from using them.

Yet, the gap between those who understand and those who use or interact with AI has real implications. In a randomised trial, you may not know all of the mechanisms, but a clear distinction between group outcomes may be sufficient comprehension. Modelling is different: users should have a clear idea of what’s going into the model, how it works, and why. Allowing a process that we don’t fully understand to push forward advances in a field has potentially immense implications for bad science. It all comes down to what we do with the information provided by the tools; problems arise when people don’t understand how the tool fits into the wider picture, or how to apply their judgement to the tool. If someone misinterprets the gap and thinks that the distance between their knowledge and what is actually going on is small (but it isn’t), they might assume that AI is a magic tool that can solve any problem. We’ve seen people be taken by surprise when they find that they have to learn technical skills like coding. Meanwhile, there are some people who might be really good statisticians with a deep understanding of the mechanics of ML, for whom it could be another tool in their arsenal, but who don’t want to use it because they find it intimidating.

As a policymaker, it is difficult to make informed policy decisions about AI unless you have some understanding of the mechanisms. This difficulty is exacerbated by the speed at which AI innovations are taking place, which is much faster than policy can adapt to. Ideally, users of AI will be able to understand the outputs, the performance of the model, and the range of errors so that they can properly evaluate outputs. We don’t think that users need to be able to derive a model from first principles to be able to use it. A decent intuition of how it works under the hood, and of how to use it in practice, is a good place to be for basic applications.
With AlphaFold, if you understand the science, it is easier to evaluate outputs and spot errors; the error metrics are strong and reliable. It is also not so much a tool that hobbyists will use, so there are fewer chances of issues arising in terms of volume of use and misuse compared to LLMs, for example.

## Attendees

- Huw Day, Data Scientist, University of Bristol: [LinkedIn](https://www.linkedin.com/in/huw-day/), [BlueSky](https://bsky.app/profile/huwwday.bsky.social)
- Kieren Sharma, PhD in AI for Synthetic Biology, University of Bristol: [LinkedIn](https://www.linkedin.com/in/kierensharma/)
- Holly Fraser, Postdoc in Immunopsychiatry, University of Bristol
- Amy Joint, Programme Manager, ISRCTN Registry: [LinkedIn](https://www.linkedin.com/in/amyjoint)
- Vanessa Hanschke, Lecturer, University College London
- Joe Carver, Data Scientist, Brandwatch
- [Kamilla Wells](https://www.linkedin.com/in/kamilla-wells/), Citizen Developer, Australian Public Service, Brisbane
- Caroline Schreiber, PhD Student, University of Bristol
- [Robin Dasler](https://www.linkedin.com/in/robindasler/), Data Product Manager, California
- Noshin Mohamed, Principal Social Worker & QA lead for Children's services
- Adrianna Jezierska, PhD Student, University of Bristol