
Where next for fair AI in medical imaging?


By Andrew King, January 2024

Introduction

I’m not a prolific blogger. In fact, I’ve only ever written one blog post before and that was in late 2022. I was prompted to write that one because I felt quite strongly at the time about the importance of research into the potential for bias in AI in the context of medicine, and more specifically medical imaging. At the time, Fair AI in medical image analysis was a relatively new research field and it was not clear to me what the key challenges and priorities of this field should be. I tried to use the blog to organise my thoughts and stimulate some discussion about this as well as highlight the importance of research into Fair AI in a medical context. (I am not going to talk again about what Fair AI actually is. If you feel like you need a brief primer on this you can read my previous blog post or one of the nice review papers that have been written about the subject.)

Since then, things have moved on. Fair AI in medical imaging is becoming an increasingly popular area of research. At FAIMI (Fairness of AI in Medical Imaging) we have now organised two annual free online symposia, the most recent of which had an attendance of more than 160 people, and we have also held the first in-person academic workshop on the subject, as part of MICCAI 2023 in Vancouver, Canada. We are seeing more and more papers published on bias and fairness in medical image analysis and governments worldwide are introducing guidelines or legislation on responsible use of AI, most of which mention bias and fairness to some degree.

I’ve been quite encouraged by this upsurge in interest and in general I am now feeling a lot more positive about the future (equitable) benefits of AI in medical imaging. But this optimism is tempered slightly by a growing frustration that more is not being done. It is still the case that most research papers being published on AI for medical image analysis either do not mention bias/fairness at all or relegate it to a passing comment in the Discussion section. Companies are releasing AI products with apparently little transparency about the demographics of the data used to train the models, or about the performance of the models for different demographic subgroups. One of the early and most influential works on Fair AI – the Gender Shades study on video facial analysis – highlighted bias in commercial AI tools, so it is quite surprising and concerning that commercial products are still being released in this way in such a high-stakes field as healthcare. There are also some positive signs, such as companies emerging to offer AI evaluation services that include assessing demographic fairness, but in my opinion not enough is being done yet to address the valid concerns about AI bias in healthcare applications. The frustration that I feel about this is what has prompted me to write this (my second) blog post.

What I’d like to do here is to respond to what I see as the main perceived concerns that are preventing researchers from paying more attention to bias/fairness in their work. I say ‘perceived’ because what I hope to show is that these should not be obstacles to focusing on what I believe to be an essential prerequisite to gaining true clinical benefit from research into AI for medical image analysis. The points I list below are based partly on reading the literature, partly on personal conversations and partly just on thinking about what possible barriers people could be worried about. If I’ve missed any out, then please let me know!

Concern 1: I don’t know enough about the subject of Fair AI

Many people are reluctant to get involved in something that they feel they don’t know enough about. I can relate to this. I am personally often reluctant to move into a new research area because I think that there are many other people who have been working for years in this area and they will know much more than me, so what useful contribution can I bring? I would answer this point in two ways. First, Fair AI in medical imaging is still a very young field. The first papers only really started to emerge around 2020-21, so there isn’t really a huge body of literature to catch up on. At FAIMI we have made a resource list that highlights what we believe are some of the key works to help you get up to speed. So, it might not be as much work as you think to get involved. The resource list also contains links to some software toolkits that can help you to get started in Fair AI research, so you shouldn’t need to spend a lot of time on implementation. And I for one would certainly not resent other people trying to ‘muscle in’ on this field. I am very keen for as many people as possible to become active in Fair AI research and I think others within the field would feel the same. We are all working in this area because we believe it’s important and there are currently lots of aspects of Fair AI that are not being investigated because of a lack of people to do the work.

Second, I think this concern stems from a slight misconception about Fair AI as a field. Whereas other medical imaging research fields like segmentation, diagnosis or image registration can be viewed as distinct fields of their own, I don’t think Fair AI is like that. Rather, I see Fair AI as a ‘cross-cutting theme’ that should impact all research areas. AI models developed for segmentation can be unfair, as can models for diagnosis or image registration and countless other areas. So, you don’t need to abandon your current area of expertise to be able to work in fairness research and you don’t need to ‘muscle in’ on somebody else’s research field. Rather, you just need to be aware of some basic techniques, tools and metrics to enhance your current research. Of course, if you really want to focus specifically on Fair AI you can do that too, but considering potential bias in your AI models should be for everybody, not just those with a specific research focus on Fair AI.
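
To make the ‘basic techniques, tools and metrics’ point concrete, here is a minimal sketch of the kind of subgroup analysis these toolkits make easy, using the open-source Fairlearn library as one example (I am not claiming it is the only or best choice, and the arrays below are hypothetical placeholders for your own predictions and metadata):

```python
# Minimal sketch: disaggregating standard classification metrics by a
# demographic attribute using the open-source Fairlearn toolkit.
# y_true, y_pred and sex are hypothetical arrays standing in for your own
# model outputs and metadata.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score
from fairlearn.metrics import MetricFrame

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)           # ground-truth labels
y_pred = rng.integers(0, 2, size=200)           # model predictions
sex = rng.choice(["female", "male"], size=200)  # demographic attribute

mf = MetricFrame(
    metrics={"accuracy": accuracy_score, "sensitivity": recall_score},
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features=sex,
)

print(mf.overall)                              # metrics over the whole test set
print(mf.by_group)                             # the same metrics per subgroup
print(mf.difference(method="between_groups"))  # largest gap between subgroups
```

A handful of lines like this is often enough to turn an aggregate result into one that can be reported per subgroup.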

Concern 2: Fairness is not relevant to my application

Another common (mis)conception about Fair AI is that it is only applicable in certain scenarios. For example, many people are aware that AI models trained to analyse dermatology images can be biased by skin tone. In other words, if they are trained on mostly lighter-skinned subjects’ images the resulting models will not work so well on darker-skinned subjects, and vice versa. In this case there is an obvious (and visible) difference between the data of the two groups of subjects, i.e. their skin tones are different. This can clearly lead to a distributional shift between the imaging data of the two groups, which can cause bias in the internal representation of the AI model and consequent bias in performance. Another, slightly less obvious example is bias in chest X-ray classification models by sex. In this case the distributional shift is caused by the presence of breast tissue lowering the signal-to-noise ratio of the female subjects’ images. Again, we can quite clearly see where the bias comes from and can understand (perhaps retrospectively) why it was worth investigating. However, there are other applications in which bias has been found even though its presence was not so predictable. As an example, take our work on cardiac MR segmentation. I don’t think any cardiologist would claim to be able to predict a patient’s race from their MR scan, and probably most would have been sceptical about any possible race bias in AI models for cardiac MR image analysis. But such bias was found and cannot (yet) be attributed to any confounder. Furthermore, subject sex, which might be a more obvious source of potential bias in cardiac MR due to the difference in the size of the heart between men and women, was found to be much less significant. So, in short, it is worth investigating possible bias even if you are sceptical about whether it will be present (as, indeed, we were). What you find might surprise you, and lead to a new research direction opening up. Indeed, recent work has shown that demographic information such as race, age and sex can be predicted from a range of medical imaging modalities, so it seems likely that the distributional shifts that can cause bias in AI models are more widespread than we might think.
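
As an aside on that last point, a simple way to probe whether demographic information is encoded in your own data or model is to train a small classifier to predict the attribute and compare its performance to chance. The sketch below is only illustrative; features and race are hypothetical stand-ins (they could be, for example, embeddings taken from an intermediate layer of your existing model and the corresponding demographic labels):

```python
# Sketch of a simple 'demographic probe': can a linear classifier predict a
# demographic attribute from image-derived features? 'features' and 'race'
# are hypothetical placeholders for your own data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
features = rng.normal(size=(300, 128))  # e.g. 128-D embeddings per subject
race = rng.integers(0, 3, size=300)     # e.g. three demographic groups

probe = LogisticRegression(max_iter=1000)
scores = cross_val_score(probe, features, race, cv=5,
                         scoring="balanced_accuracy")

chance = 1.0 / len(np.unique(race))
print(f"probe balanced accuracy: {scores.mean():.2f} (chance ~ {chance:.2f})")
```

If such a probe performs well above chance, a distributional shift between groups is present, and it is worth checking whether it translates into a performance gap.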

I would also add that the presence of bias is not the only thing that we should be interested in. Take the dermatology and chest X-ray examples mentioned above. In these two scenarios the characteristics of the bias were quite different. In the dermatology example, worse AI model performance was found for the underrepresented groups, whereas in the chest X-ray example, performance was always worse for females, even when the model was trained solely on female images (although relative performance was still affected by the degree of training set imbalance). These findings have important implications for how we should go about addressing the bias, and so are important to understand fully. This means that it is always worth thoroughly investigating the nature of bias, even when we think we already know whether or not it will be present.

Concern 3: My data doesn’t allow me to look into fairness

Many publicly available medical imaging datasets do not contain the associated demographic information that is necessary for most Fair AI research, and this can prevent researchers from looking into bias in their AI models. This is a reasonable point, and I would certainly argue that more datasets should be fully transparent and provide demographic information to enable evaluations of AI fairness. But this seems to me to be an easy ‘get out of jail free’ card. Should we just sit back and wait for others to address this problem, or do we all have an active role to play in promoting the move towards more transparent datasets? There are actually quite a few medical imaging datasets that do contain demographic information. The list of resources on the FAIMI web site contains a summary of some that we know of, but there are almost certainly others, and a little research might uncover something that you weren’t previously aware of. Stephen Aylward’s excellent list of open access medical imaging repositories would be a good starting point for such a search. If the dataset you want to use does not provide demographic information, why not contact the providers to see if it can be added? There is also a case to be made that, wherever possible, we should try to use datasets that are transparent in terms of releasing demographic information. This might lead to a kind of ‘consumer pressure’ on those that curate and release such datasets – if they see that transparent datasets are being used (and cited) more, it might help to make future datasets more transparent. The organisers of large conferences such as MICCAI could also have a role to play. These conferences typically host several challenges each year, and the data used in these challenges often become the open access datasets of tomorrow, which can shape future research directions. Why not make it a requirement of all such challenges to release demographic data along with the imaging data? In fact, MICCAI has recently announced the concept of ‘lighthouse challenges’ at future conferences (starting in 2025), which are intended to be held to a higher standard in terms of quality and impact. Why not make publishing demographic information a requirement for these?

Concern 4: Fairness is complicated, and I can’t solve all problems

The final concern I have heard expressed is that fairness is just too complex to fully address in every paper. People using this excuse might point out that demographics such as race, sex and age are just one source of variability in the performance of AI models, and that there are others, such as acquisition details (e.g. MR scanner), pathology and other patient characteristics such as height, weight, clinical risk factors, etc. To do full justice to the issue of variability in AI model performance you should consider all of these influences and their interactions, and this is beyond the scope of most papers. To an extent I agree with this point. Fair AI is a very complex subject, and there are indeed unanswered questions about how different (combinations of) attributes can affect model performance and how robust these effects are to different forms of distributional shift. This is why Fair AI has become the vibrant and exciting research field that it is today. However, just because you can’t do everything doesn’t mean you should do nothing. If you have access to demographic information, it is easy to be transparent and report the performance of your model for different subgroups. This will just add a few extra columns to your results table, and it could help to boost the impact of your work by showing that your model does not contain bias for these groups. If bias is observed, it should not prevent your work from being published; rather, you should be commended for your transparency, and this could also open up interesting future work for yourself or others. In short, it’s just good science. The bare minimum is this simple transparency in reporting, and often you will not need to do more than this. Furthermore, there may be some ‘low hanging fruit’ that you can take advantage of. Simple bias mitigation techniques such as Group DRO or oversampling have been shown to be quite effective in some limited scenarios. So, if you do find bias, why not try one of these techniques to address it? If they work, this can also boost your research message and impact.
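
For the oversampling option mentioned above, here is a minimal sketch of the idea in PyTorch, assuming image-style training data with a demographic group label per sample (the tensors and group encoding are hypothetical, and this is not a definitive implementation of any particular published method):

```python
# Minimal sketch of oversampling as a simple bias mitigation step in PyTorch.
# Each training sample gets a weight inversely proportional to the size of its
# demographic group, so minority-group samples are drawn more often.
# train_images, train_labels and train_groups are hypothetical placeholders.
import torch
from torch.utils.data import TensorDataset, DataLoader, WeightedRandomSampler
from collections import Counter

train_images = torch.randn(1000, 1, 64, 64)  # e.g. 2-D image patches
train_labels = torch.randint(0, 2, (1000,))  # task labels
train_groups = torch.randint(0, 2, (1000,))  # demographic group per sample

group_counts = Counter(train_groups.tolist())
sample_weights = torch.tensor(
    [1.0 / group_counts[g] for g in train_groups.tolist()], dtype=torch.double
)

sampler = WeightedRandomSampler(
    weights=sample_weights, num_samples=len(sample_weights), replacement=True
)
loader = DataLoader(
    TensorDataset(train_images, train_labels), batch_size=32, sampler=sampler
)
# 'loader' now yields batches in which the demographic groups are roughly
# balanced, without changing the model or the loss function.
```

Group DRO takes a little more work, since it adaptively reweights the training loss towards the worst-performing group, but the broader point stands: these mitigation strategies are a few extra lines of code, not a change of research field.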

My final response to this point is that demographic attributes like race, sex and age are not just another source of variation in data and model performance; they are objectively more important than other sources. Often they are protected attributes, which means that in many countries, such as the UK and USA, it is illegal to discriminate against people based on them. With existing and forthcoming government AI legislation placing a focus on fairness and bias, making your AI models fair with respect to these protected attributes is likely to become a legal requirement, so it is best to get ahead of the game if you are serious about your models being used in the real world.

Closing thoughts

I hope I have shown that the issue of bias and fairness in AI models in medicine is something that we should all be aware of and concerned about. If we want our models to be used in the real world and make a positive difference to (all) people’s lives, then it is important that they be trusted by patients and clinicians alike. This trust will not happen unless we are fully transparent about what our models can (and can’t) do. Therefore, considering AI bias and fairness shouldn’t be just a niche research area; it shouldn’t even be just a major research area; it should be something that all of us do as a matter of course. It should be like good validation practice. I hope that none of us would be happy to knowingly publish work with questionable validation results, since this would provide misleading information to others who read our work and try to build upon it. Scientific progress is made because of openness and transparency in reporting. I see reporting of AI bias in the same way. By not reporting the bias in our models, we are hiding an important aspect of their performance. I’m not claiming that I have always been perfect in this regard, and I probably won’t be 100% perfect in the future either, but I can promise that I will try my hardest to do things in the right way, and I hope that others will join me in this. Let’s all try to make a ‘fairer’ future for AI research in medical imaging 😊.


If you would like to stay informed about Fair AI research in medical imaging, why not sign up for the FAIMI newsletter, which we will use to publicise future events?