Data Ethics Club: Time to reality check the promises of machine learning-powered precision medicine#
What's this?#
This is a summary of Wednesday 9th October's Data Ethics Club discussion, where we spoke and wrote about the article Time to reality check the promises of machine learning-powered precision medicine by Jack Wilkinson, Kellyn F Arnold, Eleanor J Murray, Maarten van Smeden, Kareem Carr, Rachel Sippy, Marc de Kamps, Andrew Beam, Stefan Konigorski, Professor Christoph Lippert, Professor Mark S Gilthorpe, and Peter W G Tennant. The summary was written by Jessica Woodgate, who tried to synthesise everyone's contributions to this document and the discussion. "We" = "someone at Data Ethics Club". Huw Day helped with the final edit.
Article Summary#
Precision medicine attempts to tailor healthcare pathways to the needs of individuals rather than the "average person". Machine learning (ML) has been suggested as a technique that can be used to tailor therapies to each person as an individual, as well as to automate diagnosis and prognosis. The paper questions the capabilities of ML with respect to precision medicine, asking whether it is realistic to assume that ML could achieve accurate and personalised therapy. Firstly, there is a lack of robust scientific evidence to support the superiority of ML over health professional assessment in diagnostic accuracy. Secondly, it is unlikely that ML will identify the best treatment for individuals, as causal inference is difficult to achieve without making assumptions based on scientific insight. Most health states are so complex that the chance of something happening is extremely difficult to predict at an individual level. The complexity of health conditions is not resolvable by collecting more data or building more elaborate models, as there are fundamental limitations in our understanding of physical and biological processes.
The paper suggests that it may be more pragmatic to aim towards stratified medicine, identifying and predicting subgroups, rather than personalised medicine. However, due to the challenge of differentiating true signal from noise, and the difference between deducing association versus causation, this route is more complex than simply applying ML to data. ML does have the potential to advance scientific knowledge, but it is important not to make overinflated promises. If rhetoric surpasses actual capabilities, public trust in ML could be irreparably damaged when ML does not meet those expectations.
Discussion Summary#
How should we more meaningfully assess applications of ML to medicine (and other fields)?#
To meaningfully assess applications of ML, it is important to fight against the hype surrounding AI. Part of countering hype involves being intentional with how we talk about ML and how we foster critical thinking. As a society, we aren't effectively teaching how to analyse AI. Healthy scepticism should be baked into education to prevent people from just accepting the information that they are presented with. When discussing AI tools, as well as talking about the potential of what they can do, we should carefully consider their mechanics and where the "learning" comes from. We wondered about the kinds of skills necessary to critique AI tools, and whether current medical professionals are equipped with those skills.
How we assess ML changes if we consider it as a tool versus as a replacement for a doctor. The paper seems to be warning against implementing ML to replace medical professionals. We also felt it important to consider ML technologies as tools to enhance practices, rather than to replace what is already there. There is a concern that using ML for precision medicine could be an attempt to take clinicians out of the loop, as the people who build and purchase ML are not necessarily those who will be in clinical practice. Precision medicine tools could end up similar to digital pregnancy tests, which are now much more common than non-digital tests that require a medical professional. It is common sense to compare the outcomes of a model against the advice of medical professionals, looking for meaningful outcomes and using expert opinion as context. An actual clinician can give a contextual perspective on what the most likely outcomes are. The ability of ML to unpick complex data without worrying about the clinical context might be a strength rather than a weakness, if we view ML as an assistive tool rather than a replacement for a doctor.
Using ML as a tool to assist clinicians, rather than replace them, can free up clinicians' time to do other things. It seems more appropriate to apply ML to admin tasks than to medical decisions. Having ML cover mundane admin tasks could add a lot of value, as admin can be much more of a burden to medical practitioners than the medicine itself. The NHS trialled a tool that would summarise GP consultations and provide suggestions; one GP said that "this was the first day of my job that I ever left on time". Stratifying cases could be a valid application, helping to minimise the risk of catastrophic mistakes. A stratified approach could help to raise the most at-risk patients to the top of the list so that they are seen sooner. Partitioning images to find the "interesting" areas could be a worthwhile application. There are also many complex conditions which are hard to categorically diagnose, which clinicians address by asking more and more questions and slowly ruling things out. ML could help clinicians rule out possibilities sooner.
ML should be assistive rather than a replacement because ML isn't actually very well designed for a lot of problems. There is a narrative around ML as a magical fix-all solution, but its capabilities are often overhyped: "when you have a hammer, every problem looks like a nail". In an industry that prioritises innovation, there is a lot of intentional ignorance and bullshitting. Some of the discourse around ML hype also comes from politicians looking to resolve workforce issues, for example by replacing radiologists. Radiology is a good example of an area of medicine which has had a lot of ML attention, but there are less urgent settings outside of medicine (e.g. astrophysics) which are less likely to get the same amount of funding.
Hyping up ML tools is exacerbated by technology companies who promise to solve problems that they don't really understand. Studies have found that digital health companies often lack an understanding of clinical robustness. A lot of the time, it seems that the intention behind development is to find out what can be done, rather than what should be done. The technology industry has a tendency to develop tools for the sake of it, before figuring out if the use case is valid. Launching projects before properly assessing their validity is amplified by a lack of user-centric design. To evaluate whether we should be using ML in healthcare, we should ask what the biggest barriers to achieving positive patient outcomes are, including whether algorithms are on this list, or whether outcomes are affected more by the workload of clinical staff and the cost of, and access to, healthcare.
The unsuitability of ML for certain tasks is highlighted by the opaqueness of its decision-making and how easily that decision-making can be biased. As a human, it is challenging to understand whether a model has come to a decision for what we would consider to be valid reasons, or because (for example) the model is good at identifying dark spots on the left-hand side of an image. ML is highly affected by the quality of the training data it receives; if white skin is overrepresented, then models will not work as well for black skin. Whilst startups may go through all the appropriate regulation and testing procedures, they could still be using algorithms that are ten years out of date because of cost and convenience, rather than newer methods which might be more suitable.
In addition to issues with transparency and bias, difficulties with data collection make the usual ML methodology inappropriate in the context of medicine. ML generally works best by gathering all the data you can and letting the algorithm decide what's relevant. Many medical ML tools use big data and deep learning to discover when conditions are interconnected through common pathways (e.g. omics and big data). ML with massive datasets has its uses, but relying on manually collected data is unlikely to be useful. In many cases, gathering lots of data simply isn't feasible, such as with patients for whom intrusion and unnecessary procedures are harmful. Collecting data about health conditions may also contribute to class imbalance problems, where there are disproportionately more data available for some classes than others. We are only putting things into the model which we have already thought to measure; there may be many more factors that we aren't aware of which influence the state of a health condition. We did not think it possible to predict individual outcomes from population data, as inferring individual outcomes is intensely complex. For some conditions, the state of a disease is on a continuum of risk which is not fully understood, making it difficult to use those conditions as training sets. When working on an individualised basis, it is difficult to identify what is actually working versus what is a placebo effect. The bodies and health of individuals are constantly changing; even if it is possible to decipher one point in time for one individual, it will quickly change.
Whilst there are challenges with collecting data in sensitive applications such as medicine, extra data would undoubtedly improve ML models, at least for stratified groups. Beneficial data would include previous medical history and longitudinal studies collecting samples from patients, regardless of disease condition. One could also examine whether there are genetic markers for neurodivergence, and look at ethnic populations where certain conditions are more prevalent.
However, rather than just trying bigger and better models, it is important to weigh up the financial benefit of increasingly complex versus simpler models. We would like to see more evaluations, such as cost/benefit analyses, of the potential gains of adopting more complex models. Having an unexplainably complex model is unreasonable if the people who actually use it won't be able to understand it. If users have an incomplete or incorrect understanding, there is a higher chance that they will use the model badly. Making deterministic inferences which humans are able to interpret is more valuable than a "magic box" producing an output that the user has to blindly trust.
Why do you think most of the reviewed ML methods are producing classifications (i.e. diagnosed or not diagnosed) instead of predicting a continuum of risk?#
Producing classifications instead of predicting continuums may provide benefits by reducing the number of possible outcomes. Simplifying outcomes through classification can reduce the variability of model performance, as there are fewer possible outcomes to predict. Using classifications instead of continuums also has benefits for human readability. Even if an outcome does naturally exist on a continuum, humans would often prefer to conceptualise it as a classification in order to make the outcome easier to understand and explain.
One of the reasons it is tricky to understand ML models is the difficulty of communicating uncertainty. It is important to accurately portray risk and uncertainty, but there is a gap between the statistical literacy of the general public and current ways of explaining uncertainty. In terms of public education, there are lots of misconceptions about how to use scatter plots to show relationships and confidence intervals. Regarding the models themselves, classifiers often do not tell you the certainty with which they produce classifications. ML will typically produce a "final" dataset or answer given the patterns that the model has detected. Clinicians, on the other hand, naturally integrate uncertainty into their processes, as they are trained to become more certain in their understanding by asking the right kinds of questions.
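To make this concrete, here is a minimal sketch (using scikit-learn, with entirely made-up data) of the gap between a hard classification and the underlying continuum of risk. Many classifiers can report either, but the hard label discards the certainty information:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                         # hypothetical patient features
y = (X[:, 0] + rng.normal(size=200) > 1).astype(int)  # hypothetical binary outcome

model = LogisticRegression().fit(X, y)

new_patient = X[:1]
print(model.predict(new_patient))        # hard label: 0 or 1, certainty hidden
print(model.predict_proba(new_patient))  # per-class probabilities, e.g. [[0.9, 0.1]]
```

Reporting the probability rather than the label at least exposes the model's uncertainty, although, as noted above, translating that number into something patients and clinicians can act on is its own challenge.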
The way uncertainty is presented can significantly affect the implications of the output. For example, stating that "this person is 60% ready for discharge" is quite different to "this person is ready for discharge". In making ML models more interpretable, it is important to carefully consider the level of detail necessary to communicate the right meaning.
Tolerating uncertainty is especially difficult for patients; an incentive for implementing AI in medicine is the hope that AI can increase the certainty of doctors. However, the promise of AI that can provide "objective" answers with certainty is misleading, as it will be influenced by the biases of those who developed the tool. Automating Inequality by Virginia Eubanks explores how the values of the people who develop ML tools are baked into those tools. Historically, there has not been a lot of diversity among developers of ML tools, limiting the range of values integrated into those tools, with adverse consequences for those affected by the tools' use.
Do you think precision medicine itself is an epistemological dead end? What about stratified medicine? (identifying and predicting subgroups with better and worse responses)#
To help unpick the definitions of precision and personalised medicine, we thought that the definition of personalised medicine could have been drawn out more clearly in the paper. Regarding the distinction between personalised and stratified, we wondered if stratification is a "type" of personalisation, and how valuable the distinction is for understanding which type of personalisation is most appropriate for particular applications. We wondered if it would be possible to have precision medicine tailored to a specific genome to target a tumour or "reprogram" the immune system.
Precision medicine seems to be surrounded by a lot of misconceptions, especially regarding how to make inferences from data. It is not possible to make inferences based on data that we don't already have. We need to be honest about what we don't know, which, in terms of the causes of individuals' medical conditions, is a lot. Diversity in data is important for making classifications; if you have 100 patients and only 1 of them has a disease, it is unlikely that the model will pick the disease up. Difficulties with missing or sparse data make it hard to see how ML could personalise medicine in a way that will truly add value.
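As a toy illustration of the 1-in-100 point (hypothetical numbers, using scikit-learn's metrics), a model that never predicts the disease looks highly accurate while missing every case:

```python
from sklearn.metrics import accuracy_score, recall_score

y_true = [0] * 99 + [1]   # 100 patients, only 1 of whom has the disease
y_pred = [0] * 100        # a model that always predicts "no disease"

print(accuracy_score(y_true, y_pred))  # 0.99 -- looks impressive
print(recall_score(y_true, y_pred))    # 0.0  -- misses the one patient with the disease
```

This is why accuracy alone is a poor yardstick for rare conditions, and why diversity in the data matters if the model is to pick the disease up at all.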
Attempting to make inferences from non-existent data is another example of an "AI" problem which is actually a data problem. Quite often, companies will think that they need AI when they actually have a data problem that can't be solved with AI.
In addition to issues with missing data, precision/personalised medicine has a causal inference problem, as discussed in the paper. Those of us who have worked with prediction models have encountered the causal inference problem, frequently asking "it's interesting, but what's the utility?". Outcomes, especially in healthcare, are very complex, and it is extremely challenging to know which one of many potential mechanisms was the most effective.
With stratified medicine, it is possible to perform statistical analysis on groups of people, predicting likelihood based on the average of a sample. With precision medicine, people seem to be treated as a one-off certainty. Treating people as a one-off certainty doesn't make sense statistically; it thus doesn't seem reasonable to expect an ML model to make accurate one-off predictions.
To illustrate why it seems unreasonable to predict individualised outcomes, we discussed some of our research into predicting horse fatalities. In the USA and Canada, the incidence of fatality among racehorses is about 1.3 per 1000 race starts. A model predicting which horses will die will never beat the ideal "horses are immortal", which is correct over 99.8% of the time. As what we are looking for (a horse death) is a sparse data point, which ML finds difficult to accurately predict, it is not a productive use of time to build a model which tries to predict individual fatalities. Instead, we can focus on using explainable models to understand the risk factors for horse fatality. We can then use the information we know about risk factors to identify the highest-risk horses based on their individual histories. In this way, we can explain to the industry what interventions they could put in place to reduce the risk of horse fatality. The result of our ongoing work is that the incidence of horse fatalities in the USA and Canada has decreased by 38% since 2009. We asked: if we'd spent our resources on a fancy, computationally expensive, unexplainable ML model, would it have had the same impact? In our informed reckoning, absolutely no chance.
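As a hedged sketch of what that alternative can look like in code (all feature names and numbers below are illustrative inventions, not the study's actual variables), one could fit an interpretable model on candidate risk factors, inspect its coefficients, and use predicted risk to prioritise interventions rather than to call individual deaths:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 5000
horses = pd.DataFrame({
    "starts_last_90_days": rng.poisson(4, n),   # hypothetical risk factors
    "prior_injuries": rng.poisson(0.3, n),
    "age_years": rng.integers(2, 10, n),
})
# Hypothetical outcome at roughly the 1.3-per-1000 base rate
fatality = np.zeros(n, dtype=bool)
fatality[rng.choice(n, size=7, replace=False)] = True

model = LogisticRegression().fit(horses, fatality)

# Coefficients are directly inspectable: which factors raise or lower risk?
print(dict(zip(horses.columns, model.coef_[0])))

# Rank horses by predicted risk and flag the highest-risk for intervention
risk = model.predict_proba(horses)[:, 1]
print(horses.assign(risk=risk).nlargest(5, "risk"))
```

The value of the explainable model lies in the coefficient table and the risk ranking, not in per-horse death predictions, mirroring the risk-factor approach described above.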
What change would you like to see on the basis of this piece? Who has the power to make that change?#
Appropriate benchmarks will need to be decided on to measure the success of precision medicine applications: for instance, whether we compare outputs against how often they match the answers clinicians give, or against how successful ML-guided treatments are compared to treatments from clinicians. It is important to look at outcomes and processing speeds; for example, if ML can catch conditions early, it may improve survival chances or open up more appropriate treatment options. We were uncertain as to how it would be possible to test the effectiveness of ML, and whether it would be ethical to run a clinical trial comparing medical professionals against precision medicine tools.
In addition to assessing the accuracy of tools for precision medicine, it is important to think about the effects of predictions on privacy and security. Unintended effects can be especially impactful in sensitive applications like healthcare. For example, disclosing private details has significant repercussions, as in the example of Target predicting that a teenager was pregnant before she had told her family, discussed in both Hello World by Hannah Fry and Data Feminism by Catherine D'Ignazio and Lauren F. Klein. We also had concerns regarding how precision medicine tools would store personal data; companies such as 23andMe are potentially at risk of selling customers' data as the business undergoes significant leadership changes.
In the world of equitable health, we wondered what the cost implications of fields like precision medicine are. If precision medicine were to become an effective pathway, the amount of resources it requires (e.g. processing of data, generation of individual treatments) suggests that it could be out of reach of the masses, furthering the gap between the ultra-rich and everyone else. We see similar problems in domains like preimplantation genetic testing, which evaluates embryos for single-gene disorders before transferring them to the uterus.
Instead of wasting resources on fancy models which are inappropriate to deploy in reality, we should spend those resources on supporting doctors better, which will in turn improve patient outcomes. Reducing the workloads of clinical staff would also give them more time to properly engage with research, cultivating a more appropriate array of intersectional perspectives.
Integrating ML tools into human professions may have adverse consequences for intellectual capabilities. In the Cautionary Tales podcast, Tim Harford discusses how relying on AI to support us with tasks that require skill can deskill us: we miss the opportunity to build experience in those skills, leaving us less capable when the AI fails. For example, using Copilot and ChatGPT in coding classes stops people from getting help from other humans and learning the processes themselves. It is crucial that we use AI for tasks which won't deskill our working population. To think about how AI can help us now, we also need to think about where the gaps in human knowledge will be in ten years' time.
Attendees#
Huw Day, Data Scientist, Jean Golding Institute, LinkedIn
Amy Joint, Programme Manager, ISRCTN Clinical Study Registry, LinkedIn, [Twitter](https://twitter.com/AmyJointSci)
Dan Levy, Data Analyst, BNSSG ICB (NHS, Bristol), LinkedIn
Euan Bennet, Lecturer, University of Glasgow
Virginia Scarlett, Data and Information Specialist, HHMI Janelia Research Campus