Garbage in, garbage out: How bad use of data harms marginalised groups#
What's this?#
This is a standalone blogpost written by Euan Bennet and Hannah O'Donoghue in September 2022. Euan Bennet is a university lecturer and Hannah O'Donoghue is a trainee nursing associate and disability advocate. All opinions expressed are strictly their own. Nina Di Cara, Natalie Thurlby and Huw Day helped with the final edit.
Introduction#
A common theme in data ethics discussions is the harm that can be caused to marginalised groups in society through the bad use of data. There are many, many manifestations of this (usually unintentional) harm, and the responsible data ethics ethos is that data scientists should be aware of this potential and take steps to prevent it from happening.
In this blog post we want to present a general example of an area where the bad use of data has historically caused, and continues to cause, harm to one particular marginalised group of people. As a specific case study, we will address a recent paper that illustrates the problem well. This post is by no means exhaustive; it serves as just one example of a field where "data hazards" are commonly found.
The field which this blog post seeks to highlight is research into the genetics of autism. This is a contentious area because of the history of autism research and the related treatment of autistic and neurodivergent people. There is far too much to get into here, but suffice it to say that many autistic people are squeamish about any research relating to potential screening for autism in utero, or of a "cure", due to the potential route to eugenics that such tools might very well enable. For the purposes of this blog post, however, we want to highlight how studies involving genetics and autism cause harm not only by furthering these routes of investigation, but also by being based on bad data and thus producing inherently unreliable results.
The three fundamental problems affecting the wider field of autism research are:
Every data set of people "medically diagnosed" as autistic is heavily biased, because of systematic under- and mis-identification of autistic people who do not fit harmful stereotypes.
Any focus on genetics ignores a whole host of social and societal factors which cannot be neglected as part of a complex picture.
Many researchers continue to frame autism entirely under the medical model of disability (hence the use of medical databases), and completely ignore the social model of disability, which is a far more appropriate framework for properly understanding autism and other neurodiverse neurotypes.
Systematic bias in the data#
Autism is commonly under- and mis-identified in several groups in society. The reasons for this include historical biases dating back to the earliest identification of the neurotype - even today the myth that "autism is only found in boys" is pervasive. This is compounded by historical biases in identifying autism in anyone who isn't a white boy conforming to a very narrow range of behaviour stereotypes. Assumptions based on outdated and wrong information have contributed to diagnostic bias against women, non-binary people, and people of colour.
Specific Example#
A recent paper published in Cell Genomics claims to have found evidence of a "female protective effect" against autism. The authors combined two data sources to reach this conclusion: an epidemiological study of a Danish medical database, and two genetic libraries. They claim that the findings "strongly support a Female Protective Effect against Autism Spectrum Disorder's common inherited influences".
In the final section of this blog post, we will use this paper as an example of how biased data can be used to draw conclusions that have the potential to harm marginalised groups, and set out the case for doubting the conclusion that there is a "female protective effect" against autism.
As discussed above, medical "diagnosis" of autism still suffers from historical biases and stereotypes which have yet to be corrected. As such, attempting to draw conclusions about a "Female Protective Effect" using a data set based on past "diagnoses" is akin to the ghost of Steve Jobs conducting a survey of people's favourite Android phones in the Apple Store: you just wouldn't be looking in the right place for the answers that you seek.
In focussing on genetic effects, the authors neglect all of the social and societal factors that further contribute to bias in the medical databases used in the study. Finally, the language used throughout the paper clearly demonstrates the authors' commitment to the medical model of disability, which adds a further layer of bias to the study design given the historic under- and mis-identification of autism in anyone who isn't a white boy.
Considering the historic under- and mis-identification of people of colour is perhaps beyond the scope of the paper, but it is not even mentioned. The paper specifies that only people of "European ancestry" were included. This methodology is not clearly explained in the paper, but it appears to be a euphemistic way of saying that only white people were considered in the study cohort. This is a problem in wider genetics research as well.
Summary#
Many researchers studying the genetics of autism use data sets (medical "diagnoses" of autism) which:
Have an inherent systematic bias due to historical and current diagnostic criteria.
Fail to account for additional social/societal bias due to different experiences of socialised gender expression.
Fail to account for additional social/societal bias due to historic mis-identification of people of colour.
In the Cell Genomics paper we saw how the authors use these data to attempt to ascribe a genetic basis to the observations and extrapolate this to the entire population, without accounting for the acquisition bias inherent in the data set. To put it less technically, this is a perfect example of "Garbage In, Garbage Out" (GIGO).
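The GIGO mechanism can be made concrete with a toy simulation. All of the numbers below are hypothetical, chosen only to illustrate the point: even if the underlying autistic population had a perfectly balanced sex ratio, a biased identification step would produce a recorded data set that looks like one group is "protected".

```python
# Toy simulation of acquisition bias ("Garbage In, Garbage Out").
# All figures are hypothetical and for illustration only.
import random

random.seed(0)
N = 100_000  # hypothetical autistic population with a true 1:1 sex ratio

# Hypothetical biased diagnostic process: men are far more likely
# to be identified than women.
p_identified = {"men": 0.9, "women": 0.2}

recorded = {"men": 0, "women": 0}
for _ in range(N // 2):
    for group in ("men", "women"):
        if random.random() < p_identified[group]:
            recorded[group] += 1

ratio = recorded["women"] / recorded["men"]
print(f"Recorded women per recorded man: {ratio:.2f}")
# The recorded data show far fewer women, even though the true ratio is 1:1.
```

A study of the recorded data alone would "discover" far fewer autistic women, and could wrongly infer a biological protective effect when the entire effect lives in the identification process.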
Data Hazards#
The Data Hazards project exists to encourage data scientists to consider the ethical implications of their work. The creators have designed a series of labels which are similar to the classification system used for hazardous chemicals. As with chemicals, the intent is not to discourage people from using them, but to ask people to proceed with care. For further explanation of the Data Hazards labels, see https://datahazards.com/contents/data-hazards.html.
Let's look at the Data Hazards that could apply to the research study we have been discussing:
Reinforces existing biases - it turns out that if you use data produced by outdated and biased diagnostic tools, you get results that reflect the inherent bias of the data! This includes the underrepresentation of people of colour in the data, and the historic under-diagnosis of women that the data reflect.
Classifies or ranks people - this is more present in the language used to present the results (for example). One could equally conclude that "maleness protects from neurotypicality".
Lacks community involvement - the funders of this research advocate for "treatment" and, like the researchers, clearly subscribe to the medical model of disability. Best practice for this research would have been to gather perspectives from those in the autistic community who prefer the social model of disability, and to take care to avoid ableist language such as describing autism as a "disorder" and autistic people as "disorder cases".
Danger of misuse - this is the most serious Data Hazard involved in the study, because some of the conclusions are edging in the direction of eugenics. It would not be a stretch to imagine certain groups who have a desire to "cure" autism latching on to this paper and using it as part of their campaign to actively harm autistic people. Research that wrongly attributes genetic causes to observations which are heavily influenced by social and societal effects could easily be misused and abused by those who wish to avoid addressing the societal effects.
Conclusions#
This paper is a fantastic example of how bad use of data can misrepresent marginalised groups in society and reinforce existing stereotypes. By framing these conclusions in ableist language and failing to even consider the social model of disability, many researchers in the genetics of autism are contributing to harming the very people their work affects. Work such as this impedes progress towards a more equal society, and could even be misused by those who have an interest in making society even less equal.
Do better, researchers!
Social and societal factors#
These historical biases are amplified by other social and societal issues. Autistic people report that one of the biggest challenges to thriving in a society designed around neurotypical social norms is the constant pressure to "mask" their autism by conforming to the expected norms of society which neurotypical people take for granted.
One reason for the under-diagnosis of autistic women is that societal pressures to behave in certain ways are much stronger on people socialised as women. As a result, it has been estimated that as many as 80% of autistic women are unidentified as autistic at age 18. Far from the 1:4 ratio of autistic women to autistic men currently recognised by existing diagnoses, attempts to estimate the true ratio have found it to be at least 3:4. Again, due to the historical biases of diagnostic criteria, these estimates are by definition not exact, but equally there seems to be no evidence as to why the ratio shouldn't be 1:1.
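A quick back-of-envelope calculation shows how large a correction that 80% figure implies. This sketch uses only the estimates quoted above, plus one simplifying (and almost certainly too generous) assumption: that all autistic men are identified.

```python
# Back-of-envelope: how under-identification distorts an observed sex ratio.
# Figures are the illustrative estimates quoted in this post, not new data.

observed_women = 1.0      # observed diagnosed ratio: 1 woman...
observed_men = 4.0        # ...for every 4 men
unidentified_rate = 0.80  # estimated share of autistic women missed by age 18

# If 80% of autistic women are never identified, the observed count is only
# 20% of the true count. Assume (generously) that all autistic men are identified.
true_women = observed_women / (1 - unidentified_rate)

print(f"Observed ratio (women:men): {observed_women:.0f}:{observed_men:.0f}")
print(f"Implied true ratio: {true_women:.0f}:{observed_men:.0f}")
# Under these assumptions the implied true ratio is 5:4 - more balanced even
# than the "at least 3:4" estimate above, and consistent with there being no
# clear reason the ratio shouldn't be 1:1.
```

Since some autistic men are also missed by diagnosis, the real correction would pull the ratio somewhere between this figure and 1:1, which is exactly the point made above.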