Data Ethics Club: AI Snake Oil Book Club#

What’s this?#

This is a summary of the discussions held in the Data Ethics Club book club over summer 2025, where we spoke and wrote about AI Snake Oil by Arvind Narayanan and Sayash Kapoor. The summary was written by Jessica Woodgate, who tried to synthesise everyone’s contributions to this document and the discussion. “We” = “someone at Data Ethics Club”. Huw Day helped with the final edit.

Chapter 1 – Introduction#

Chapter Summary#

The term “AI” has been appropriated for a multitude of different technologies and applications, confusing discussions about its strengths and limitations, and misleading people about its capabilities. The introduction begins by distinguishing between generative AI, which produces various types of content such as text or images, and predictive AI, which produces outputs forecasting the future to inform decision-making in the present.

The waters surrounding the definition of AI are muddied by historical usage, marketing, and more. Some problems AI attempts to address, like vacuuming, are solvable; others, like predicting whether someone will commit a crime, are not. Some applications, such as facial recognition, were once thought difficult but AI has proven to be very good at them. However, even if the mechanism is good, the tool can fail in practice, e.g., if the data is noisy or biased, or the tool is abused with malicious intent.

Sold as an accurate and capable technology, predictive AI has been widely adopted. However, there are significant issues with the premises predictive AI is based on, as it attempts to translate and reduce high-dimensional phenomena to computationally feasible outputs. In practice, there is missing data, bias, and participants attempting to game the system. Generative AI, the chapter argues, holds more promise and yet also runs into significant issues. Factual inaccuracies in outputs are common, and misuse is rampant, from error-filled news articles to AI-generated books.

Misinformation, misunderstandings, and mythology about AI are fed by a combination of researchers, companies, and the media. In research, the buzzier the research topic, the worse the quality of research tends to be. Companies feed off hype, seeking to maximise profits whilst rarely being transparent about the accuracy of their products. Crumbling media revenue, combined with access journalism, where outlets rely on good relationships with companies to be able to interview subjects, further propagates hype narratives dictated by AI companies.

The limitations, hype, and misconceptions surrounding AI lead the chapter to analogise AI to a “snake oil” product: a miracle cure advertised under false pretences. The book aims to identify AI snake oil – AI that does not and cannot work.

Discussion Summary#

What do “easy” and “hard” mean in the context of AI? Does it refer to computational requirements, or the human effort needed to build AI to perform a task, or something else? And what does easy/hard for people mean?#

The chapter doesn’t specifically define what makes a problem “easy” or “hard” for AI, but it does give some examples of problems previously thought to be difficult for AI which the technology is now very capable of solving, such as image classification. Spellcheck is another example of an application previously thought to be hard but now so embedded in our everyday lives that we might not even think of it as AI.

Sometimes what is “easy” or “hard” for AI is contrary to what is easy or hard for humans – a phenomenon known as Moravec’s paradox: the observation that reasoning (which is difficult for humans) requires relatively little computation, while sensory and perception skills (which are easy for humans) are highly computationally expensive. Computational complexity is a well-studied field examining the amount of resources required to run an algorithm. We wondered if easy and hard in relation to AI are somehow related to how difficult it is for AI to produce an accurate output.

Tasks that are easy for humans are those that we can do quickly and without much concentration. Pinpointing what these tasks are is tricky; we don’t know what we know. People aren’t good at breaking tasks down to a finer granularity than they are used to, or at deducing which of the resulting subtasks are easy. This difficulty with breaking down tasks is why some people find programming really tricky. What counts as easy might also depend on the sample population, e.g., a roomful of mechanics will find it much easier to replace an engine than a roomful of mathematicians.

Discerning what is easy or hard for AI is not straightforward, as it depends on the lens you examine the tool through and the context it is situated within. LLMs now seem to be very good at predicting text, giving the impression that the task is easy. However, an extraordinary amount of resources goes into training an LLM, suggesting it is actually a hard problem. Facial recognition is accurate under the right conditions, but there are many concerns about how it is used, making it hard to delineate appropriate use cases. Perhaps easy and hard aren’t the right terms; we should instead be asking how much context is needed to solve the problem, and how hard the execution is.

The underlying message of the introduction is that certain tasks are made easier or harder by the application of certain AI systems, and the way we sell those systems affects how they are used. For example, linear regression is really good at some statistical modelling problems. However, if we sold it as a silver bullet to solve any problem, it would be (and, as the chapter demonstrates, has proven to be) rubbish.

The barriers to entry for AI have been drastically lowered in recent years. On the deployment side, LLMs are now very easy to set up on a personal laptop. On the user side, generative AI interfaces are widely accessible. It is actually starting to require more effort not to use LLMs in some settings, such as searching, where search engines display an “AI overview” before presenting the results. Tasks that LLMs previously found difficult are slowly getting easier, making the models increasingly easy to deploy and use. For example, maintaining context over time was something that LLMs previously found very difficult.

However, the context window, which dictates how much input the LLM can consider when calculating its output, is slowly getting larger. A larger context window entails that an LLM can consider more information in its answers. In addition, ChatGPT now incorporates some “memory” functionality, where information from previous conversations is maintained. However, sometimes relevant context goes beyond conversations, such as body language. This raises important questions about what kind and how much information goes into the model, e.g. whether it should consider current affairs or interpersonal relationships.

More data and computational power could mean that AI gets better, but this is not always true. It may depend on the use case and type of AI. For example, given enough training, AI can beat the best human chess player, but how long will it be before AI can make a cup of tea to your preferences, and is more computational power enough to solve this? If the task is something that the user doesn’t know much about, or finds difficult to express, the usefulness of AI goes down. Training and human feedback are required on both sides. It is difficult to know where the improvements will stop or how the evolution of AI will unfold, especially considering the transformative effect of previous industrial revolutions.

Based on your definitions of these terms, pick a variety of tasks and try to place them on a 2-dimensional spectrum where the axes represent people’s and computers’ ease of performing the task. What sort of relationship do you see?#

Attendees placed tasks on a grid according to whether they are easy or hard for computers, humans, or both:

  • Easy: coding; sense of right and wrong; image classification; pattern recognition in big sets of data; ironing; speech to text; analytical decision-making (depending on algorithm and input data); chess; remembering lots of information; recall (for some computers, but difficult for LLMs); writing a poem in a certain style; sucking in created content.

  • Hard: ironing; coding; creativity; ethical decisions / judicial; content creation; reading emotions; holding/understanding context; sarcasm/jokes; moving around in a new environment; certainty of answers; language manipulation; counting the “r”s in the word “strawberry”.

The text gives many examples of AI that quietly works well, like spellcheck. Can you think of other examples? What do you think are examples of tasks that AI can’t yet perform reliably but one day will, without raising ethical concerns or leading to societal disruption?#

Applications of AI that we could envision being less ethically concerning include scientific pursuits, such as tracking animals for conservation, biodiversity monitoring, birdsong recognition, weather prediction, pollution analysis, or finding biomarkers for diseases. These use cases do not directly dictate outcomes for people’s lives and can be verified by complementary scientific methods.

With respect to AI being used in everyday life, AI should act as a facilitator for human flourishing, rather than a substitute for it. We would much rather have AI do our laundry whilst we make art than have AI generate “art” whilst we do laundry. If ChatGPT were reliably accurate, we could imagine it being useful as a learning tool, e.g. by quizzing students on their homework. Tools like Grammarly are great in certain contexts, because native English speakers can be harsh when assessing people’s writing. When people don’t have English as a first language, grammar-correcting tools can help them to be sure they are saying the right things.

There are many AI tools we use every day without noticing, but that does not mean that they are working well. Spellcheck is integrated into many applications; however, we do not think it works well. If you write in more than one language, it falls apart and doesn’t understand what context it’s in. We tend to have counterintuitive expectations of the abilities of these tools; we recently encountered someone praising ChatGPT for being helpful with writing code about 60% of the time. If someone asked us for help and we got it wrong 40% of the time, we would not expect them to ask us for help again.

We would be hard-pressed to find an AI tool that won’t raise ethical concerns or cause some disruption. If something could replace a job, it could cause disruption. There are lots of areas in which people don’t care how good the labour is, they just want an output – even if it’s inaccurate generative AI nonsense. Even seemingly innocuous applications like spellcheck or Grammarly have rippling implications, e.g. for students. We have experienced spellcheck changing our answers and causing us to fail a quiz. If students can use something to do their work, it affects the work of teachers.

Once a tool can provide an answer it would take a human a while to come up with, it is easy to slip into cognitive offloading. We’ve noticed that tools like Grammarly are now pitched at native English speakers, potentially discouraging them from improving their own grammatical abilities. Lots of native English speakers who should already have these skills will defer to grammar checking tools, assuming that the tools will be better. People have authority bias, doubting their own knowledge because the tool has told them something different.

The disruptive potential of AI may depend on the application and sensitivity of the data. It is important to take into account the whole pipeline, considering not just what the tool does but the energy it consumes and where the data used to train it comes from.

What change would you like to see on the basis of this piece? Who has the power to make that change?#

Careful consideration of our attitudes towards AI is crucial in shaping the role it plays in our lives. It is difficult to maintain healthy scepticism of AI in the face of overwhelming hype. In general, the chapter sets a good tone, balancing criticisms with an awareness of potential benefits. Some of us have come to the book with a pro-AI attitude, looking to engage with it as a challenging view. Predictive and generative AI have beneficial use cases, e.g., generative AI can help people get started with a prototype for an idea even if they don’t know how to code. Others among us are looking to ask whether these benefits are worth the costs.

Some of us are becoming increasingly hardline anti-AI, finding that the chapter doesn’t go in as hard as it could (or maybe should) against AI. The chapter focuses, rightly, more on predictive AI, but it could have been more critical of generative AI. Fundamentally, many of us believe generative AI cannot do anything new. The outputs might be useful but are fundamentally unreliable in the sense that there is no validation of their correctness. The chapter talks about how “facial recognition works well enough to be harmful, and badly enough to be harmful” – this probably applies to LLMs as well.

To highlight the problems with applying AI, we considered the difference between using facial recognition to block someone from attending an event, and having a bouncer at the door with a list of people not allowed in. The difference may lie with accountability; you can argue with a person or ask them to take action, but with an automated system you will get nowhere. A person might lose their job if they do it incorrectly; if a system makes a mistake, it does not pay for the consequences, and the costs of rollback are often too high to warrant a system change. A human security guard can’t be scaled up to surveil 100,000 people at once and get 5% wrong with no recourse.

Distinguishing between the technology itself and the people behind it is increasingly tricky, as those people are increasingly distant from, and invisible behind, the tool itself.

Attendees#

  • Huw Day, Data Scientist, University of Bristol: LinkedIn, BlueSky

  • Jessica Woodgate, PhD Student, University of Bristol

  • Liz Ing-Simmons, Research Software Engineer, King’s College London: Mastodon 👩‍💻

  • Vanessa Hanschke, Lecturer, University College London

  • Julie-M. Bourgognon, Lecturer in neurosciences, University of Glasgow

  • Euan Bennet, Lecturer, University of Glasgow: LinkedIn, BlueSky

  • Paul Matthews, Lecturer, UWE Bristol: BlueSky, Mastodon

Chapter 2 - How predictive AI goes wrong#

Chapter Summary#

The chapter discusses how companies make strong claims about the utility of automated decision-making systems that use predictive AI in order to sell them. Models are claimed to be accurate, efficient (requiring little to no input from humans), and fair. One of the appeals of predictive AI is that it can reuse datasets that have already been collected for other purposes, such as record keeping or bureaucratic tasks. Models are also appealing because they promise to predict the future, as people struggle to deal with uncertainty or randomness and seek methods that enable control. These systems are used to allocate resources, provide or withhold opportunities, and predict people’s future behaviour.

Yet, whilst a model may make a decision from an input without human involvement, human decisions still exist throughout the pipeline. Humans dictate the design of the model, the data that is collected, and the methods of deployment. It cannot be guaranteed that these decisions are unbiased or fair. Models tend to make predictions that are correct according to the way they were designed, but issues can arise in deployment. Important data may be missed or misunderstood, the system may change, or people may employ gaming strategies.

Once in place, systems are extremely hard to reverse, and people are unable to challenge decisions. Decision subjects are frequently unaware that they are being evaluated by AI, yet the decisions made can have life changing implications and mistakes are hard to fix. Even if human oversight is in place, it is often inadequate. Costs of flawed AI disproportionately harm groups that have been systematically excluded and disadvantaged in the past. Instead of treating people as fixed and their outcomes as predetermined, the chapter argues that we need to accept the inherent randomness and uncertainty in life. Institutions should be built with an appreciation that the past does not predict the future.

Discussion Summary#

Predictive models make “common sense” mistakes that people would catch, like predicting that patients with asthma have a lower risk of developing complications from pneumonia, as discussed in the chapter. What, if anything, can be done to integrate common-sense error checking into predictive AI?#

The idea of being able to model common sense is difficult to accept, as it is a changeable and contested concept. Common sense mistakes happen with or without AI, thus perhaps automated decisions are not necessarily worse than those made by humans. Models are a product of how they are trained; a model is only as good as the modeller. If what goes into a model is rubbish, you can expect that what comes out will be rubbish too: “garbage in, garbage out”. Algorithms are trained on data that reflects biases and prejudices held in society, such as racial inequities. We’ve seen evidence of this in our own research investigating how well a neural network handles differences in skin tone in images of skin cancer lesions: the network performed poorly, which has troubling implications for detection rates across racial groups.

In addition to the data that goes into a model, the design of the model itself, such as the variables included, influences the predictions the model makes. It is crucial and difficult to select variables that are meaningful with respect to the question being asked. Causal inference is hard, and not every relationship you see in data is a causal relationship. Many AI models today have far too many variables to exhaustively check; GPT-4 is estimated to have 1.8 trillion parameters.

Considering the inherent difficulties with building common sense into a system, perhaps the best approach is to focus on employing common sense in the application of AI. To cultivate common sense application, the general perception of computational systems as infallible will need to change. People tend to treat AI systems as more knowledgeable than humans, thinking that computers don’t make mistakes and over-crediting their veracity. Humans transfer social trust onto machines, projecting an idea of “expertise” as systems appear to give confident answers on the basis of some hidden knowledge. There may be some correlation between size and scepticism; there tends to be more apprehension around small studies compared to larger ones, and similarly we are more likely to mistrust a smaller model compared to a massive one. However, just because something is big doesn’t mean it is right.

To properly evaluate a model, the broader context must be considered. Caution does not tend to be incorporated into models, yet it plays an important role in the way humans make and apply decisions. It is also important to distinguish between different sorts of problems, asking if the model itself is bad, or if the application of it is inappropriate. One should consider how the data was collected and what the shape of the problem looks like. For example, when modelling blood pressure against age, most people don’t have issues until they are older, giving the relationship a “U” shape. Applying a linear model to this problem will therefore fit it poorly.
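
As a rough illustration of that point, here is a minimal sketch (with entirely invented numbers and a hypothetical U-shaped outcome, not anything from the chapter) that fits both a straight line and a quadratic to the same simulated data; the straight line cannot capture the curvature, which shows up in its much lower R².

```python
import numpy as np

# Toy illustration of model mis-specification: a U-shaped relationship
# fitted with a straight line. All numbers are made up for this sketch;
# nothing here comes from the chapter or from real health data.
rng = np.random.default_rng(0)
age = rng.uniform(20, 80, 500)
outcome = 0.02 * (age - 50) ** 2 + rng.normal(0, 1, 500)  # hypothetical U-shaped signal

def r_squared(y, y_hat):
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)

# Straight-line fit: the model family cannot express the curvature.
slope, intercept = np.polyfit(age, outcome, 1)
print("linear R^2:   ", round(r_squared(outcome, slope * age + intercept), 3))

# Quadratic fit: same data, but a model that matches the shape of the problem.
coeffs = np.polyfit(age, outcome, 2)
print("quadratic R^2:", round(r_squared(outcome, np.polyval(coeffs, age)), 3))
```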

Incorporating human checks and oversight throughout the AI pipeline could help mitigate unwanted side effects. To avoid responsibility “washing”, wherein companies claim to have oversight without implementing proper procedures, oversight will need to be carefully defined, including the likely failure modes to watch for. Systems will need to be explainable so that overseers can understand what they are looking at. Perhaps legislation is one solution: requiring companies to explain the weaknesses of their models and highlight where oversight should be more closely applied, similar to how companies are required to list side effects for medication.

Statistical modelling is generally thought of as reliable, requiring deliberate choices which are made explicit. In machine learning (ML) contexts, sometimes those choices are not as transparent. There tends to be a lot of opacity in the decision-making that goes into the development and application of models. To understand how black box models are working, the criteria that the model uses to make decisions should be drawn out. Sometimes, machines are making connections we would not know to make, or inferences that are not what they seem. For example, in healthcare applications, image classifiers have been found to base their decisions on incidental elements of an image, such as the presence of a ruler.

If we are incapable of understanding a model, such as a cancer screener, we wondered whether we should be using it at all. It may not be essential to understand the mechanics of all the technology we use. Most of us use black box applications every day without question, such as computers, central heating, or aspirin. Sometimes more transparency might not be appealing if revealing the criteria for decisions makes it easier for applicants to game the system. In these settings there might be specific audiences to whom the system should be transparent, e.g. the system should be transparent to hiring managers, but not to applicants. On the other hand, perhaps applicants would like to know the criteria they are being judged by, and withholding this information may be unfair.

Think about a few ways people “game” decision-making systems in their day-to-day life. What are ways in which it is possible to game predictive AI systems but not human-led decision making systems? Would the types of gaming you identify work with automated decision-making systems that do not use AI?#

There is an increasing sense for upcoming generations that to stay ahead they need to game systems, and those who don’t realise systems are gameable are put at a disadvantage. For example, understanding how to answer personality tests prevents job applicants from minority groups being screened out. Part of the hacker mindset is the idea that you are smart if you are able to hack things.

Even in non-AI contexts, there are plenty of examples of systems being gamed. People frequently game each other through manipulative tactics. Knowing what to say can get you the right hospital treatment or make you eligible for benefits in pre-screening processes. We wondered where the line is between tailoring appropriate communication to a particular audience and gaming a system. Perhaps there is some relation to whether one’s actions are working towards a particular desired outcome.

Settings in which people game predictive AI systems include search engine optimisation (contributing to the enshittification of the internet by filling websites with algorithmic buzzwords), job applications, and social media. Industry resumes tend to be shorter than academic resumes, incentivising the use of buzzwords and listing skills like “leadership”, “Java”, etc., that will bump the applicant up the list. To get your profile seen, we feel an increasing pressure to make our job applications and cover letters “LinkedIn friendly”. A funny side effect of this is that being told you are good at LinkedIn can feel like an insult, as it implies inauthenticity.

Buzzwords are also used to game funding or grant applications. We have seen examples of proposals suggesting a project will be using “AI” in its methods, where in reality it is not really an appropriate application of AI. Sometimes people want to use AI because it is trendy, whether or not it would actually solve the problem. Many people do not really understand what AI is, or what the differences are between specific terms such as ML and AI (ML is a subfield of AI). Being clear about terminology, especially in the mainstream, is complicated by the hype cycle, which propagates uncertainty and sensationalist narratives.

The hype cycle encourages people to adopt trending methods even if they do not have sufficient background or training to understand how the methods work, which is detrimental to the quality of research. Bad research deteriorates public trust in science and scientific outcomes. Interdisciplinary collaboration is essential to better support researchers, and perhaps the importance of various disciplines should be discussed at lower levels of education to foster this. Psychology has suffered credibility crises, but this has led to stronger research practices such as hypothesis pre-registration and more publishing of negative outcomes in journals.

In which kinds of jobs are automated hiring tools predominantly used? How does adoption vary by sector, income level, and seniority? What explains these differences?#

We weren’t sure which fields automated hiring tools are currently being used in, so instead discussed the fields such tools might be best used in. Hiring tools could be helpful for positions with a high ratio of applicants, as sifting through thousands of applications is a time consuming and repetitive task. Another appropriate application for automated tools may be entry level jobs in which group interviews are already commonplace, to help speed up the process.

What change would you like to see on the basis of this piece? Who has the power to make that change?#

People gaming automated systems may change the systems themselves, depending on how and what the algorithms are learning. For instance, hiring algorithms may learn to favour people who figured out how to game them and thereby further incentivise those behaviours. Gaming the system can push things towards homogeneity, which we have seen in other applications such as TikTok, where monetisation depends on the length of engagement and so videos tend towards a minimum length.

As AI is adopted by both hirers and applicants, a feedback loop is formed where LLMs are used to write and apply for jobs, other AI systems sift through the applications, and each side learns to game the other. Increasing dependence on LLMs will lead to more mistakes: LLMs are producing less and less accurate results, and are shown to repeatedly hallucinate and backtrack. It is important that society finds ways to resist slipping into homogeneity and error-strewn information as a consequence of LLM overuse.

Attendees#

  • Huw Day, Data Scientist, University of Bristol: LinkedIn, BlueSky

  • Julie-M Bourgognon, Lecturer, University of Glasgow

  • Vanessa Hanschke, Lecturer, University College London

  • Paul Matthews, Lecturer, UWE Bristol 🦣 https://scholar.social/@paulusm

  • Liz Ing-Simmons, Research software engineer, King’s College London (my day: :hot_face: (it’s too hot here!)) | Mastodon

  • Nicolas Gold, Associate Professor, UCL

  • Martin Donnelly, Principal Research Data Steward, UCL, martin.donnelly@ucl.ac.uk (whatever the emoji for being embarrassed at not having read the chapter yet is)

  • Robin Dasler, data product manager, California

Chapter 3 - Why can’t AI predict the future?#

Chapter Summary#

The sheer number of correlations that certain AI methods can identify in data, and the usefulness of those methods in identifying which correlations are important, has contributed to the popularisation of AI to predict the future. Despite the efficacy of AI in identifying correlations, the world is profoundly complicated, and phenomena are often compounded by a myriad of variables. The weather, which humans have been trying to predict for thousands of years, is a good example of how easily real-world complexity interferes with prediction methods. In the 1960s, Edward Norton Lorenz found that rounding the numbers of weather simulations to three decimal places instead of six gave vastly different results. This finding led to the coining of “the Butterfly Effect”: the observation that small errors in measurement (e.g. of temperature) lead to exponentially increasing errors later. The farther away the prediction, the larger the error.

There are two main paradigms for predicting the future: simulation, and ML. Simulation is based on the idea that the future evolution of a system can be predicted using the current state of the system and equations describing how the system changes over time based on the interactions between its components. Simulations have proven useful for some applications, like the weather, and terrible at others, such as social phenomena like modelling a whole city. ML on the other hand uses past data to learn underlying patterns and make predictions about the future, without fixed rules about how the future will play out given the past. Predictions and patterns can adapt and change over time, but the “rules” determining this are based on how the system behaved in the past. The chapter argues that simulation is more suitable for predicting things about collective or global outcomes, whereas ML is better suited to predict things about individuals. ML is appropriate in settings where there is a lot of data to train on, such as examples of spam or not spam emails, whereas simulation is appropriate where there is domain knowledge but not enough examples, such as food shortages.

The line between what is and isn’t feasible in predicting social phenomena is ill-defined. It is too broad to state that we can’t predict any social phenomena, as some dynamics, like traffic or how busy a store will be on a certain day, can be predicted reasonably well. However, predicting a person’s future is hard because of a combination of the real-world utility of what can be done with the prediction, the moral legitimacy of the application, and the prediction’s irreducible error (error that won’t go away with more data and better models). Statistical or AI models used to predict life outcomes have repeatedly been shown to be little better than random, with complex models offering no substantial improvement over simple baseline models.

A problem in predicting the future is the difficulty in defining the accuracy of a prediction: the range of output that is considered accurate (e.g. for the weather, whether temperature is accurate within one degree or five degrees); the type of output (e.g. if it will reliably predict it will rain regardless of temperature); how the accuracy of one domain compares to another (e.g. weather patterns compared to sales patterns). To assess if a prediction task can be done well, there are qualitative criteria that can be examined, such as looking at what kinds of prediction tools people actually use, what can be done with a tool, relevant power dynamics and moral implications, and whether a prediction will improve with more data and better models.

It is not yet possible to conclude whether there are fundamental limits to predicting human lives, as has been proven in other domains such as planetary orbits or thermodynamics. We do not know what the irreducible error is for social prediction; it may be that the predictability of life outcomes would be improved by the availability of more data. Yet, it seems likely that the irreducible error is high, as chance events can drastically alter the path of a person’s life, either through large shocks or small (dis)advantages that compound over time. Patterns underlying social phenomena are not fixed and differ greatly based on context, time, and location. The amount of data needed to predict life outcomes would be very high because social datasets have a lot of noise, and the number of samples needed to create accurate models sharply increases as the samples become noisier. There are thus some limits to predicting the future that could be overcome with more and better data, whilst other limits seem intrinsic.
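
To give a feel for that last point, here is a minimal back-of-the-envelope sketch (with invented noise levels, not drawn from any real social dataset) of how the number of samples needed to estimate a quantity to a fixed precision grows with the noise in the measurements.

```python
import numpy as np

# Rough illustration of why noisier data needs more samples: estimating a
# quantity to the same precision requires n to grow with the square of the
# noise level. The noise levels are invented; nothing here reflects real
# social datasets.
rng = np.random.default_rng(42)
target_standard_error = 0.1

for noise_sd in (1.0, 2.0, 4.0):
    # The standard error of a mean is noise_sd / sqrt(n), so hitting the
    # target precision needs roughly n = (noise_sd / target)^2 samples.
    n_needed = int(np.ceil((noise_sd / target_standard_error) ** 2))
    sample = rng.normal(0.0, noise_sd, n_needed)
    empirical_se = sample.std(ddof=1) / np.sqrt(n_needed)
    print(f"noise sd {noise_sd}: ~{n_needed} samples needed, empirical SE {empirical_se:.3f}")
```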

Discussion Summary#

The authors list 3 main criteria for good prediction use cases: real world utility, moral legitimacy, and irreducible error (error that won’t go away with more data and better methods). Is this list complete?#

To understand what makes a good use case for prediction, it is important for both scientists and non-scientists to be clear about the limits of probability. It is impossible to predict everything, as probability entails there will always be a likelihood of false positives (when a result incorrectly indicates the presence of a condition). Scientists should be aware of the characteristics and implications of probability, but that awareness might not extend to those without a scientific education. Human intuition about probability is sometimes good, and sometimes terrible. We’ve seen examples of the fallibility of intuition when teaching probability using loaded dice at science festivals, finding that children figure out the problem much faster than adults. The reasons behind this may be associated with ingrained expectations and assumptions held by adults which children are not exposed to.
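
As a worked example of how intuition about false positives can go astray, the base-rate calculation below uses invented prevalence and error rates (not taken from the chapter) to show that even an accurate test for a rare condition produces mostly false positives.

```python
# A classic worked example of why intuition about false positives misleads:
# even a fairly accurate test applied to a rare condition yields mostly
# false positives. The rates below are invented for illustration only.
prevalence = 0.001         # 1 in 1,000 people actually have the condition
sensitivity = 0.99         # true positive rate
false_positive_rate = 0.05

p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
p_condition_given_positive = sensitivity * prevalence / p_positive

print(f"P(condition | positive test) = {p_condition_given_positive:.1%}")
# Roughly 1.9%: the overwhelming majority of positive results are false positives.
```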

As the rules of probability entail fundamental limits to prediction, a use case may be acceptable if the model fulfils criteria that make it “good enough” and the consequences of incorrect predictions are not too severe. For example, weather prediction is good enough for most people if it roughly predicts the temperature, chance of rain, and other important features. If weather forecasts missed severe weather events, their utility would be greatly reduced. A system may only pass the “good enough” test if its use is restricted to certain kinds of prediction.

Prediction should be reasonably accurate, but defining what is reasonable is hard. All of the criteria the chapter discusses (real world utility, moral legitimacy, and irreducible error) are interlinked. The higher the moral stakes, the higher the need for accuracy and demand for lower irreducible error. Moral legitimacy includes aspects such as the impact on people’s lives, e.g. the impact of accusation or incarceration in the case of criminality prediction.

The impact of prediction on life outcomes draws attention to the fact that sometimes the prediction itself can change the outcome. If there is a really good way of predicting the future, the predictor might undermine themselves (the TV show Devs explores the spookiness of the relationship between prediction and free will). Because of the potential influence of prediction in changing outcomes, in some countries it is illegal to release polling data before elections as it could impact polling behaviour. On the other hand, there are situations where this is harnessed so that the purpose of the prediction is to instigate people to do something differently; the act of prediction does not thereby undermine the predictor. This however means you are no longer predicting the future, as you are making a prediction with the intention to stop it from happening.

The contradictory effects of prediction on changing outcomes made us wonder why polls are carried out in the first place; what the real world utility is and who they are intended to benefit. Sometimes, it is useful to know if the person in your seat can never win so that you can vote strategically, but in non-strategic voting systems this does not matter. There is a difference between polls used to tell people what the current voting preferences are, and exit polls which are taken immediately after voters have left the polling station to ask who they voted for. Exit polls are used as early indications for the outcome of the vote.

Some of the examples in the chapter seemed a bit contrived and some of us raised questions about the 8 billion problem (that we can’t make accurate social predictions because there aren’t enough people on earth to learn all the patterns that exist). There are some things that we can predict with a high amount of certainty, for example, I can predict that if I take a flight tomorrow, I’ll probably land safely. Whether it is impossible for 8 billion people to make a dataset that is big enough may depend on what we are looking for. Often, we have enough data for people within the distribution we are looking at, but problems might arise with edge cases (e.g. rare diseases). For edge cases, we might have enough rows (i.e. participants) but not enough columns (i.e. features). GPs in the UK use QRISK to predict heart attacks, which was developed only on several hundred people. Improvements on QRISK may come from the addition of more features.

Cumulative advantage implies a lot of success comes down to luck, which challenges a meritocratic world view. How do you feel about this?#

Putting too much emphasis on luck can be detrimental; we don’t want the takeaway to be that everything is luck so there is no point in trying. It’s still important to strive for things we think or know are impossible, as incremental progress can be made. Some of us thought that we should strive for more meritocracy because of this. We wondered if we would be able to predict success better if we lived in a more meritocratic society. Overemphasis on meritocracy may however run into the problem of predicting and thereby pre-empting success, promoting those who are predicted to be successful rather than those that have proven to be successful. One of the main criticisms of meritocracy is that it ends up supporting those who already have an unfair advantage due to societal power dynamics.

Cumulative advantage is something that we have observed in our day-to-day life such as by clicking more on the most downloaded songs and other demonstrations of herd behaviour. For music and movies there are other factors that intertwine with herd behaviour and may factor into cumulative advantage such as the effect of nostalgia and role of cultural background.

Social sciences focus on understanding causal mechanisms rather than predicting associations, as opposed to typical machine learning. What do both fields have to learn from one another?#

Traditional statistical models are based on explicit mathematical equations that can be interpreted by experts and explained to others. In comparison, understanding how a neural network makes a particular decision is very difficult. The connections the network learns between data are masked by the copious number of parameters involved, making it difficult to draw out what exactly happens within a model and how it is handling the data. This intrinsic implicitness complicates identifying how best to interpret the model and understand how the outputs are produced from the inputs.
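
As a small illustration of the contrast, the sketch below (with made-up data and variable names) fits an ordinary linear model whose handful of coefficients can be read off directly; a neural network offers no such direct reading of its weights.

```python
import numpy as np

# Minimal sketch of why a traditional statistical model is interpretable:
# the fitted model is just a handful of named coefficients. The data and
# variable names are invented purely for illustration.
rng = np.random.default_rng(1)
n = 200
floor_area = rng.uniform(30, 150, n)       # square metres
distance_to_city = rng.uniform(1, 30, n)   # kilometres
price = 50 + 2.5 * floor_area - 3.0 * distance_to_city + rng.normal(0, 10, n)

# Ordinary least squares on the design matrix [1, floor_area, distance_to_city].
X = np.column_stack([np.ones(n), floor_area, distance_to_city])
coefficients, *_ = np.linalg.lstsq(X, price, rcond=None)

for name, value in zip(["intercept", "floor_area", "distance_to_city"], coefficients):
    print(f"{name}: {value:.2f}")
# Each coefficient reads as "holding the other variable fixed, one more unit of
# this variable shifts the prediction by this much" - a statement with no direct
# analogue among the millions of weights inside a neural network.
```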

Causality is a murky question and a philosophical rabbit hole, so there is a tendency among statisticians to avoid it. Yet, if we predict what will happen based on what has already happened, we will propagate and amplify historical issues into the future. Instead of using the past and all the baggage it comes with, predictions regarding human lives should place more emphasis on causality.

Attempts to understand causality must be accompanied by domain-specific knowledge and a scientific mindset. A good example of the success of domain knowledge and scientific method coming together is the true story adapted into the film Moneyball, in which baseball pundits had poor judgement about what made good players. Billy Beane, the general manager of the Oakland Athletics baseball team, hired a Yale economics graduate, and together they were able to derive what made players valuable and build a successful team.

ML models are very good at identifying patterns, but the utility of those patterns diminishes if they are detached from domain expertise and we are unable to understand how the patterns are made. Domain expertise is essential to pursue actually useful models; without domain expertise to define what exactly we are looking for before starting the modelling process, there is little chance of finding useful patterns. In ML development there seems to be a tendency to chuck a bunch of data into black box models that the developers can’t explain, rather than putting in the work to incorporate domain expertise. The mystery embedded in this process reinforces the image of “tech bros” who are smart at STEM, so people assume they are good at everything else. This characterisation is bad – it stops people from questioning and criticising them, enabling bad decision-making and unaccountability.

As more and more information becomes available to us, we need more and more time to become experts. It is crucial to recognise expertise across fields and support experts in communicating with each other. Some of us wondered if this is an opening for generative AI to emulate expertise across an expanse of fields. Others worried that the use of generative AI here would mean skipping a communication step, and perhaps it would be more useful to use it to augment and support communication rather than replace it. If generative AI were used as a replacement, we wondered if there is a point in proving a theorem if no one can understand or replicate it. However, generative AI surpassing human ability in certain settings might not be important if the application of the technology furthers human knowledge; there are cases where geniuses like Stephen Hawking have proven something that nobody else could have.

To delineate (in)appropriate uses for generative AI we need to be clear about what our goal is; whether we want to learn something, or whether we just want a task to be completed. For some of us, proving something that no humans can understand is less interesting even if we don’t think it’s less valuable. At the very least, someone should be able to understand it. In education settings, there is too often an overemphasis on doing assignments and turning them in, rather than spending the effort to truly learn the lesson underpinning the assignment.

What change would you like to see on the basis of this piece? Who has the power to make that change?#

Limitations to ML methods entail that when approaching a problem, it is important to ask if ML is actually the appropriate tool to use, or if there is a faster and cheaper way. In the book, the examples of successful predictive models were interpretable mathematical models and the ML examples either failed or were worse than “simpler” models. We wondered if there are any examples of ML models that work, and work better than simpler models. The lack of examples of successful ML may be because ML is a (relatively) young field. ML may become increasingly better than simpler models as the field matures.

Attendees#

  • Huw Day, Data Scientist, University of Bristol: LinkedIn, BlueSky

  • Jessica Woodgate, PhD Student, University of Bristol

  • Euan Bennet, Lecturer, University of Glasgow, LinkedIn, BlueSky

  • Virginia Scarlett, Open Source Programs Specialist, UC Santa Barbara

  • Robin Dasler, data product manager on hiatus, California

  • Julie-M. Bourgognon, Lecturer, University of Glasgow

Chapter 4 - The Long Road to Generative AI#

Chapter Summary#

Generative AI builds on a long history of computational innovation, from the development of the perceptron, based on a 1940s mathematical model of a neuron, through the booms and busts of neural networks over the latter half of the 20th century, to the massive neural networks of today that can generate diverse outputs, from text to images to audio and more. The chapter argues that generative AI can be hugely beneficial for a range of use cases, from assisting knowledge workers, who “think for a living”, to improving accessibility such as for the partially sighted.

One application in which deep neural networks have proven really effective is image classification. Historically, labelling and classifying images required a marathon of human effort. Today, it is something that can be done in seconds using AI. Building on the success of neural networks, generative AI has received an escalating level of attention over recent years. There is a huge community of people working on the technology, contributing to its application in an ever-increasing range of use cases and to the rapid evolution of the technology itself.

As we see time and time again, alongside excitement for a new AI technology comes a range of harms. The rapid uptake of a technology before it is properly understood, and before potential harms have been rigorously studied, opens the door to destructive consequences. Generative AI has produced outputs that are racist, inappropriate, or oversexualised; chatbots have claimed to be sentient and been declared sentient by prominent engineers; models have been used in real law cases to generate citations that proved to be fake, have been associated with suicides when used as companion bots, and have been used for non-consensual pornographic deepfakes. Copyright issues have led to a number of legal cases currently passing through the courts concerning the appropriation of artists’ work to train models. Models do not just adapt the works they have been trained on to produce new outputs, but in some cases can almost exactly replicate instances of their training data, such as the Mona Lisa.

After training LLMs on copious amounts of data, additional fine-tuning has led to highly convincing chatbots which are good at producing believable responses but lack real-world grounding and verification. At their core, chatbots are statistical engines. They have incomplete internal representations of the world around them and thus lack true “understanding”. LLMs have a tendency to produce “bullshit”: outputs that attempt to persuade without regard for the truth. Researchers’ understanding of these internal representations is still rudimentary due to the lack of interpretability in what neural networks encode. We do know, however, that because LLMs learn from historical data, they will also learn any historical biases contained within that data. Attempts to mitigate these harms come in the form of more human annotation, which raises further issues with workers’ rights, as the labour tends to be outsourced to countries poorer than the one where the company is located, capitalising on low wages and weaker regulation.

Discussion Summary#

Various AI systems, such as the neural networks in the ImageNet competitions, are designed to automate something that is expensive to do en masse (e.g. labelling images) and where the process is not valuable to the humans performing it. What tasks do you see GenAI (both LLMs and image generators) being used for, and how much are they being used for processes where there is value in humans performing them?#

Many tech leaders are loudly proclaiming the promise of generative AI to automate work as we witness a wave of mass firings and human workers being replaced by AI (explored in a previous Data Ethics Club). Generative AI proponents advocate that the technology will free humans up for meaningful work. The implication of these messages is that the work people are attempting to replace with AI, including some of our jobs or things that we do in our jobs, is not meaningful.

Defining what it really means for something to be valuable to humans is critical to ensuring that generative AI is used in ways that truly benefit society, and yet it seems to be often brushed aside. The current landscape of generative AI tools tends to deploy them in obvious applications that the tools can speed up, such as writing emails, coding, or labelling images, without delving deeper to ask whether these are the sorts of things that we actually should be using them for. If we want to use AI to label people in photographs, who is that valuable to, and why is it valuable? Many tools have been proposed as a solution for efficiency, without critically assessing whether there is more besides efficiency that needs consideration.

Perhaps generative AI should be used for doing tasks that nobody wants to do and that we would rather a machine do than a human. If people do not understand or appreciate the effort required for a task, perhaps it is not worthwhile putting a lot of time into it, and we should use whatever means available to us to expedite it. An application of computer vision that seems intrinsically valuable to us is medical imaging, where previously we worked on teams with many people spending hours labelling images. Many of those labelling images found the task boring, and it did not leave them room to grow in their jobs. Today, AI demonstrates high utility for completing these sorts of tasks quickly and at high volume, leading to significant scientific contributions such as creating a complete map of the neurons in a fly’s brain and identifying tumours.

Other touted applications of AI may not be as appealing as they seem when we delve deeper into them. We have conducted work anonymising text data about medical conditions, which is a task that could arguably be automated by AI with advantages for efficiency. However, whilst we found the experience to be quite heartbreaking, it also really motivated us to work on the project and helped us appreciate the stories within the data. We wouldn’t want to lose this connection with stories; if you just deal with the summary level you do not get the same richness as reading the actual interviews. These experiences highlight the value that lies in painstaking qualitative data analysis. Some of us think that a hybrid approach could work, where generative AI is used for qualitative data analysis but its output still needs to be checked.

Concerns with the trustworthiness of generative AI give us reason to be cautious in our use of it and to question how valuable it is for day-to-day use. A lot of the marketing surrounding generative AI tools centres on automating admin tasks, but we were unsure if we would trust an AI agent to perform these tasks correctly and in privacy-preserving ways. It may take a significant amount of effort and careful wording to prompt an AI in such a way that it completes your requests truly according to your specifications.

There is an important distinction between the value of the product and the value of the process. Something that is valuable to do intrinsically is not the same as something that is valuable for its outcome. Sometimes the process itself is valuable, such as in making art; artists are often admired for the effort that goes into creating it. The hype narrative centring on using AI to automate “boring” tasks is a narrow view that just focusses on speeding up the process to get to the product faster, rather than thinking holistically about the whole pipeline. Using AI to write code has become very popular, with many users just wanting something that runs without error even if it does not compute the right answer. A lot of students and staff we encounter use LLMs like they’re a search engine and expect the output to always be correct. Outsourcing the search strategy to an LLM shortcuts what could be a valuable process of evaluating sources and their utility. Many people don’t know this process exists, let alone recognise its value. Searching and analysing sources is a critical skill taught by librarians that should be more highly recognised.

A question that seems to have been skimmed over by the AI hype narrative is whether we really want everything to be maximally efficient. There is an appeal to doing mundane work, and many of us want some parts of our jobs to be boring. We can’t be switched on and functioning at 100% all the time, and there are other things that matter to us in life aside from work which we want to save energy for. Getting bored gives space for creativity to flourish, to get curious, and to find meaning in life. Monotony also provides opportunities to socialise. We have had jobs doing repetitive factory work but found them to be very social, as we chatted with the people around us while we worked. Across many sectors of society more and more distance is being created between people, disincentivising interaction with others. In our offices people used to spend more time standing around the water fountain chatting. Shopping used to involve talking to someone at the checkout and someone helping you pack your bags, but now that checkouts are automated it is harder and harder to talk to an actual live person. Sometimes when interacting with a company you don’t even know if you are interacting with a human or a bot. We’ve had interactions that seemed like we were talking to a bot but turned out to actually be with people in India.

Throughout our lifetimes, we have seen many fads come and go. For example, there has been excitement about robots performing surgery since the 1970s, yet we are still operated on primarily by humans. The extent of the hype and disruption around AI, however, makes us wonder if it is a craze that will not tail off. Hype cycles are problematic (explored in a previous Data Ethics Club). One of the many issues is that hype generates over-trust in machines, leading to technologies not being properly checked. The consequences of improper checking can be life-changing, as in the Post Office scandal in the UK, where hundreds of people were prosecuted because of faulty accounting software. Rather than investigate and fix the problems with the software itself, the Post Office blamed branch operators for financial discrepancies.

It is also important to acknowledge that it is possible to automate systems without using AI, such as using statistical models for medical imaging rather than AI and ML techniques. For medical imaging, ML methods do not address causation, which is central to identifying diseases. If we are using ML for these sorts of applications, then proper checks and balances are indispensable.

In automating processes and evaluating models we must be clear about how we verify what is considered good or bad. This is easier in some domains than others. For example, if we know what a good scientific image diagram looks like, where there are well-defined rules and criteria, we can evaluate whether an AI has done a good job in producing one. It is harder to evaluate whether an AI summary of a long article is good if we haven’t read the original article or know the domain in depth. Some of us wondered if perhaps no-one without a PhD should be using certain AI tools, and outputs should be checked by multiple people with PhDs. Similar to how new drugs are evaluated in pharmaceuticals, all consequences and side effects should be considered. In AI settings, this includes bias and institutional and historical power dynamics; as ML relies on and continues to propagate the past, issues of the past will percolate into the future if unmitigated.

Our discussions make some of us yearn for a simple formula to work out if something is a good use case for AI. Important questions to ask would include how boring the work is, how verifiable the results are, how risky the setting is, whether there is enough data to make an accurate prediction, and whether it removes the accountability for a decision from a human. These are all essential considerations but must be held alongside the knowledge that there is often not a simple answer and ethics is grey.

Discuss the environmental impact of generative AI. What, if anything, is distinct about AI’s environmental impact compared to computing in general or other specific digital technologies with a large energy use such as cryptocurrency?#

An important distinction between generative AI and other energy intensive technologies, such as mining cryptocurrency or running complicated physics simulations, is that generative AI is extremely easy to use. The ease of access prevents people from thinking about environmental concerns; because it’s easy to prompt, the impression is given that it’s computationally easy to “do”. Many people see computers as magic boxes that do things and are not aware of their computational costs. Even for technical people it’s hard to find information about the environmental impact of computers, especially regarding the emissions associated with manufacturing. For example, manufacturers do not provide information about the impact of GPUs compared to CPUs. It’s important to widen awareness of the mechanics behind AI tools, how they work, and what they’re good and bad at. Equipping people with this knowledge helps them to be more selective in their use, so that instead of jumping straight to an LLM they first use a calculator or Wikipedia.

Attendees#

  • Huw Day, Data Scientist, University of Bristol: LinkedIn, BlueSky

  • Jessica Woodgate, PhD Student, University of Bristol

  • Liz Ing-Simmons, RSE, King’s College London :wave:

  • Virginia Scarlett, Open Source Programs Specialist, UC Santa Barbara :coffee:

  • Beverly Shirkey, Medical Statistician (Clinical Trials) University of Bristol :smile:

  • Paul Matthews, Lecturer, UWE Bristol

  • Jessica Bowden, Research Associate, University of Bristol :cat:

  • Robin Dasler, data product manager, California