What do we want from fair AI in medical imaging?
I have been thinking recently about bias and fairness in AI and how they relate to healthcare, and more specifically to medical imaging. I should say first that I’m not writing this blog to try to educate people: I don’t think I’m the right person to do that, and I honestly don’t think I have the knowledge and expertise to do it well. Partly, I wanted to use it as an exercise to organise my thoughts, because the more I thought about these issues, the more complex and thorny they became in my mind. But I also wanted to set out what I see as the important considerations for fair AI in medical imaging, to try to stimulate discussion and the sharing of ideas. I certainly don’t have all the answers, and I suspect nobody does yet, but I do feel strongly that we should be discussing these issues, as they could affect the quality of healthcare that we all receive in the future.
Fairness in AI
First, some background. Fair AI is a relatively new field but a very active one. I won’t attempt to give a thorough review of the state-of-the-art, as there are several review papers that do this quite well. I would suggest the interested reader checks out the papers by Mehrabi et al and Du et al as good entry points to the field. But basically, fair AI partly involves assessing the performance of AI models (mostly classification models so far) for different subgroups of subjects. Note that in fair AI, these subgroups are commonly referred to as protected groups and the variables used to form the groups (for example, race or biological sex) are known as protected attributes. A landmark paper by Buolamwini and Gebru, for example, found that commercial gender classification models performed better for lighter-skinned subjects than for darker-skinned ones, a disparity that is commonly attributed to lighter-skinned faces being better represented in the data used to develop such models. Fair AI also deals with methods to mitigate these biases. Once we start to try to mitigate bias we have to consider what we actually mean by the concept of fairness. A number of definitions exist, such as demographic parity, equalised odds, equal opportunity, etc. (see the Mehrabi et al paper for more details on these and other definitions of fairness), and I’ll come back to the idea of fairness definitions a bit later.
But if fair AI in general is a new field, fair AI in medical imaging is even newer. It’s only really in the last couple of years that papers have started to emerge on analysing bias in medical imaging-based AI. Of note here I would highlight the work by Larrazabal et al and Seyyed-Kalantari et al, who found bias due to gender, race and age in AI classifiers for X-ray imaging. We have also performed some work in this area, finding a racial bias in AI models for cardiac MR segmentation (see Puyol-Anton et al). Other work, not strictly focusing on bias but certainly related, has shown that race can be predicted from different medical imaging modalities (see e.g. Banerjee et al). This work is relevant because if there are features in the images that can be used to identify race (and other protected attributes) then there is the potential for models trained using imbalanced data to be biased. Using biased AI systems in healthcare could be extremely damaging. For example, in our work (Puyol-Anton et al, preprint) we have shown that less accurate cardiac MR segmentations for under-represented protected groups would result in higher errors in derived biomarkers such as cardiac ejection fraction. Ejection fraction is one of the main biomarkers for diagnosing heart failure, so a biased segmentation model could lead directly to higher misdiagnosis rates for heart failure for some protected groups, potentially denying people life-saving treatment. Personally, I find the small number of papers in fair AI for medical imaging, compared to the number in computer vision and general machine learning, quite surprising, since the potential impact on people’s lives of biased AI systems in medicine is obviously huge. Let’s hope this situation changes in the next few years.
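To make the ejection fraction point a little more concrete, here is a minimal sketch of how a segmentation error might propagate into the biomarker. It uses the standard definition of ejection fraction, EF = (EDV - ESV) / EDV, where EDV and ESV are the end-diastolic and end-systolic volumes. Everything else is an assumption made purely for illustration: the volumes, the 10% under-segmentation and the 40% cut-off are made-up values, and none of them are taken from our paper.

```python
def ejection_fraction(edv_ml: float, esv_ml: float) -> float:
    """Ejection fraction (%) from end-diastolic and end-systolic volumes (ml)."""
    if edv_ml <= 0:
        raise ValueError("End-diastolic volume must be positive")
    return 100.0 * (edv_ml - esv_ml) / edv_ml


def ef_with_volume_error(edv_ml, esv_ml, edv_rel_err=0.0, esv_rel_err=0.0):
    """EF (%) when a segmentation mis-estimates each volume by a relative error.

    A relative error of -0.10 means the volume is under-estimated by 10%.
    """
    return ejection_fraction(edv_ml * (1.0 + edv_rel_err),
                             esv_ml * (1.0 + esv_rel_err))


# Hypothetical subject: these volumes are made up for illustration only.
true_edv, true_esv = 150.0, 90.0
true_ef = ejection_fraction(true_edv, true_esv)

# Hypothetical scenario: the model under-segments the end-systolic frame
# by 10% for this subject's (under-represented) protected group.
biased_ef = ef_with_volume_error(true_edv, true_esv, esv_rel_err=-0.10)

# Illustrative decision threshold (an assumption, not a clinical recommendation).
ILLUSTRATIVE_CUTOFF = 40.0

print(f"True EF:   {true_ef:.1f}% -> flagged: {true_ef <= ILLUSTRATIVE_CUTOFF}")
print(f"Biased EF: {biased_ef:.1f}% -> flagged: {biased_ef <= ILLUSTRATIVE_CUTOFF}")
```

With these made-up numbers, a 10% under-estimate of the end-systolic volume moves the EF from 40% to 46%, which is enough to push the subject across the illustrative cut-off. Real segmentation errors are unlikely to be this tidy, but the sketch shows why modest group-specific errors in volume can matter for diagnosis.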
It is important to consider the context of AI in medicine. In particular, I think there is an increasing realisation that we are not dealing with a situation in which modern medicine is universally fair and when introducing AI techniques we just need to maintain that fairness. On the contrary, there has been a lot of work that has highlighted disparities in the quality of healthcare received and health outcomes, for example by race or sex. As an example of this, body mass index (BMI) is commonly used to assess adiposity (having too much body fat) but its “normal” ranges were based upon studies on white males, meaning that clinical decisions made based on BMI may be biased against other protected groups. This means that one of our motivations when developing AI in medicine should be to address these inequalities and to improve healthcare for groups that currently experience poorer outcomes, and this is where fair AI has a role to play.
Now let’s return briefly to the idea of how to define fairness and consider a simple example from medicine for illustration purposes. One common definition of fairness is demographic parity, which states that (for classification problems), each protected group should be equally likely to be classified as positive. Suppose we have developed an AI model to diagnose a certain disease and we want to analyse its fairness in terms of demographic parity. We test its performance for different races and we find that it diagnoses this disease much more often for race A than for race B. Does this mean we have a problem with bias/fairness? What if the disease happens to have a higher prevalence in race A than race B? In this case the model would be correct (and fair) in giving a positive classification more often for race A. In such situations, other definitions of fairness such as equal opportunity (equal true positive rates for each protected group) or equalised odds (equal rates for true positives and false positives) would seem more appropriate. This may seem obvious but I think it illustrates that in medicine it is extremely important to consider the medical context when defining and attempting to ensure fairness. And this is before we even move on to other types of problem, such as regression and segmentation, for which fairness definitions can be even more problematic.
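The difference between these definitions is easier to see with numbers, so here is a small sketch in Python (using NumPy). The function names and the toy data are my own assumptions for illustration: the data are constructed so that the disease is more prevalent in race A than in race B, while the hypothetical model has the same true positive and false positive rates for both groups.

```python
import numpy as np


def group_rates(y_true, y_pred, groups):
    """Per-group positive rate, true positive rate and false positive rate.

    Assumes every group contains at least one diseased and one healthy
    subject; otherwise the corresponding rate is undefined (NaN).
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    groups = np.asarray(groups)
    rates = {}
    for g in np.unique(groups):
        in_group = groups == g
        rates[str(g)] = {
            # Demographic parity compares this rate across groups.
            "positive_rate": float(y_pred[in_group].mean()),
            # Equal opportunity compares true positive rates.
            "tpr": float(y_pred[in_group & y_true].mean()),
            # Equalised odds compares both TPR and FPR.
            "fpr": float(y_pred[in_group & ~y_true].mean()),
        }
    return rates


def gap(rates, key):
    """Largest difference between any two groups for one of the rates."""
    values = [r[key] for r in rates.values()]
    return max(values) - min(values)


# Toy data (made up): 10 subjects per race, prevalence 60% in A and 20% in B.
groups = ["A"] * 10 + ["B"] * 10
y_true = [1] * 6 + [0] * 4 + [1] * 2 + [0] * 8
# Hypothetical model output: detects half of the diseased subjects and
# raises a false alarm for a quarter of the healthy ones, in both groups.
y_pred = [1, 1, 1, 0, 0, 0] + [1, 0, 0, 0] + [1, 0] + [1, 1, 0, 0, 0, 0, 0, 0]

rates = group_rates(y_true, y_pred, groups)
for name in ("positive_rate", "tpr", "fpr"):
    print(f"{name}: gap between groups = {gap(rates, name):.2f}")
```

In this toy example the model satisfies equal opportunity and equalised odds (no gap in true or false positive rates), yet it violates demographic parity, simply because the disease is more common in race A. Which of these gaps we should care about is exactly the kind of question that needs the medical context.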
What do we know and when do we know it?
But actually I think the issue of defining what we want from fairness in medicine is even more complex than this, as I hope to outline now. Before we consider what we can do about fairness we need to think about what information we can reasonably expect to have about protected attributes. Furthermore, when are we likely to know this information, i.e. during model training and/or during model application? It is tempting to assume that we will always know e.g. the sex or race of a subject when training a model, but there are good reasons to suspect that this might not be the case. For example, if we are training a model using a large-scale database that has already been acquired, the protected attributes might not have been recorded. This may have happened for privacy reasons or simply oversight, i.e. it wasn’t believed to be important at the time. What is known at inference time (i.e. model application) really depends upon what we plan to do with the model. Inference may, for example, consist of applying the model to an unseen subset of the same database used for training. This would be the case if we wanted to derive a new biomarker for subsequent clinical use. Alternatively (and maybe additionally), inference might consist of applying a trained model to newly acquired clinical images, e.g. for automated diagnosis in a clinical workflow, like the disease diagnosis example given above. In this case we likely do know protected attributes at inference time, but again there may be privacy reasons for them not being available to the model. Even if protected attributes are available, attributes like race and sex are not as clearly defined as we might think. For instance, in the (relatively few) public databases that do record race, “mixed race” is considered to be a single category, meaning that a subject with Chinese/Caucasian race is considered to be the same as one with Black/South Asian race.
So now let’s consider two different “use cases” for fair AI and think about what we might want in terms of fairness in each situation.
We know nothing about protected attributes during model training: This might be the case, for example, if we are trying to learn a new biomarker from a previously acquired imaging database which did not include protected attribute data, as described above. In this situation, we might know the protected attribute when the biomarker is used later in clinical practice, but when training and evaluating the AI model we will know nothing. The biomarker might be biased, but we won’t know for sure whether it is or not. We might think that in this case there is nothing we can do during model training, and that any assessment of bias would have to come from gathering data after deployment of the model. But there is some very interesting research in this area from the general machine learning community that shows that we might, after all, be able to do something to mitigate potential bias in such situations. For instance, the work by Lahoti et al tries to make use of “correlates” of the protected attribute to identify potential bias. As a concrete example, even if the protected attribute of race is not available to the model at training time, ZIP code (postcode in the UK) might be and there is likely a correlation between ZIP code and race. This means that, even if you don’t know which variables are correlated with the protected attribute, misclassifications due to bias are more likely to “cluster” in the representation space of the model, and so can be considered to be “computationally identifiable”. The method makes use of this computational identifiability to minimise potential bias in the model. However, I suspect that we will always be limited in what we can do without knowledge of protected attributes during training, so I’ll leave this for now and move on to a scenario in the field of medical imaging where we will be more likely to be able to mitigate bias.
We know protected attribute status during training: This might be the case if we use a large-scale database (with protected attributes) to learn a new biomarker for e.g. disease diagnosis. Therefore, protected attribute status is known during training and model evaluation, and we can make use of it if we want to reduce bias in the model or models we produce. However, in order to be able to decide what, if anything, we do with this information, we also need to consider what we will know when the model is deployed. First, it is possible that we won’t know the protected attribute status at that stage. There might be privacy reasons for this, or we may just not particularly trust the information we do have (e.g. the mixed race example given above). But in many cases we will know the protected attribute(s) at deployment time. How will this affect what we do at training time? I will consider this in more detail in the next section.
First, do no harm
Although the precise oath taken by medical practitioners these days varies, the historical maxim of “first, do no harm” is a useful framework for considering bias and fairness in AI. So let’s consider our different scenarios in this light.
For the second use case above (protected attribute status is known during training), consider first what we should do if we will not know protected attribute status when the model is deployed. In this case, applying a biased model may harm groups who were underrepresented in the training data. So should we train a “fair” AI model that performs equally well for all protected groups? But this might “harm” the majority group by performing less well for them than a biased model would. It might even perform less well overall, i.e. considering all of the subjects used for validation (which might be imbalanced, admittedly). So do we accept this “harm” to avoid the arguably greater “harm” of perpetuating healthcare inequalities? It’s questions like these that I don’t think should be answered (just) by computer scientists like me, but it is our role to make others aware that these are valid questions, the answers to which will affect what we do as AI developers.
Now let’s consider what we should do if we do know protected attribute status at model deployment. In this case, should we still deploy our “fair” AI model that minimises disparities in performance for different protected groups? Consider the situation where we have a model that performs very well for race A who were well represented in the training dataset, but less well for race B who were less well represented (i.e. it is a biased model) – do we really want to apply a fairer model for race A when we know we have a better model for them (i.e. the biased one)? This could potentially do harm to members of race A because we would be applying a model that performs less well for them than another one we have chosen not to use. But likewise, we should not be using the biased model for race B if the fairer model results in better performance for them. In this case, perhaps we should be “cherry-picking” the best model for each protected group? For example, we could use the biased model for race A and a different (likely fairer) model for race B? Race B’s “best” model might be one that was trained using only data from race B, or it might be one that was trained using a fair AI technique on the entire (imbalanced) training set. If this is what we decide is the best approach, then the challenge for fair AI methods shifts from mitigating bias to maximising performance for (under-represented) protected groups, and this would be an important distinction to make which could affect future research efforts in this field.
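As a rough sketch of what this “cherry-picking” might look like in practice, the snippet below picks, for each protected group, the candidate model with the best validation score for that group. The model names, the scores and the helper functions are all hypothetical and chosen only to mirror the race A / race B example above; a real system would also need to decide how large a difference is meaningful before preferring one model over another.

```python
def select_model_per_group(val_scores):
    """Pick the best-scoring model for each protected group.

    val_scores maps model name -> {group: validation score}, where a
    higher score is better. Returns {group: name of the best model}.
    """
    groups = set().union(*(scores.keys() for scores in val_scores.values()))
    best = {}
    for group in sorted(groups):
        candidates = {model: scores[group]
                      for model, scores in val_scores.items()
                      if group in scores}
        best[group] = max(candidates, key=candidates.get)
    return best


def pick_model(group, best_model_per_group, fallback):
    """Model to apply at deployment; use the fallback if the group is unknown."""
    return best_model_per_group.get(group, fallback)


# Hypothetical validation scores (e.g. a segmentation overlap measure).
val_scores = {
    "baseline": {"A": 0.93, "B": 0.85},     # trained on the imbalanced set
    "fair": {"A": 0.90, "B": 0.89},         # trained with a fair AI technique
    "group_b_only": {"A": 0.80, "B": 0.88}, # trained on race B data only
}

best_model_per_group = select_model_per_group(val_scores)
print(best_model_per_group)

# If the protected attribute is missing or not trusted at deployment time,
# fall back to a single model (here, arbitrarily, the "fair" one).
print(pick_model(None, best_model_per_group, fallback="fair"))
```

Note that this only works when the protected attribute is known, and trusted, at deployment time. The choice of fallback model for everyone else brings us straight back to the questions raised for the previous scenario.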
To be honest, I don’t know the answers to these questions, but I don’t think it’s as simple as just applying a magic “fair AI” model to make the problem go away. Not always anyway. There are several different scenarios we need to consider depending on what we know and when we know it, and the answers in each case may be different. In my opinion it’s certainly not satisfactory to just ignore the problem. I also believe quite strongly that AI developers should not be making these decisions on their own. We need input from clinicians, medical ethicists and patients to help us find the right path forward. But now is the time to have these conversations because AI models are starting to be used in healthcare, and they may well be biased. We need to decide what to do about this and decisions we make now will shape the direction of future research in what I hope will be a growing field in years to come.