WEDNESDAY, 21 JUNE 2023

Machine learning is a data-hungry field. In order to reliably find patterns in data, machine learning algorithms require extremely large datasets. Some cutting-edge language models, such as GPT-3, even feast on the whole of the Internet. However, despite the trend of ever-growing datasets as society becomes increasingly digitised, biomedical imaging remains an exception to this rule. Data from medical scans is not only costly to obtain, as it requires radiographers and specialist equipment, but it is also subject to strict privacy laws. As a result, some medical datasets contain at most 100-200 images; in contrast, one of the most popular image datasets, ImageNet, contains over 14 million annotated images. This is a major bottleneck for researchers who wish to use machine learning in medical imaging.
One way to circumvent the lack of data is to augment the available images to produce slightly different data that can be added to the dataset. For instance, given a dataset of faces, we can change the eye colour or rotate and crop the images. However, this modified dataset does not capture the full variability of human faces, and any model trained on it would perform poorly on faces it has not previously encountered.
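The simple augmentations described above can be sketched in a few lines. The following is a minimal illustration using only NumPy; the image, sizes, and particular transformations are placeholder assumptions, not taken from any real dataset:

```python
import numpy as np

def augment(image, rng):
    """Produce simple variants of an image array of shape (H, W, C):
    a horizontal flip, a 90-degree rotation, and a random crop."""
    variants = [image]
    variants.append(np.fliplr(image))   # mirror left-to-right
    variants.append(np.rot90(image))    # rotate by 90 degrees
    h, w = image.shape[:2]
    top = rng.integers(0, h // 4)       # random crop offset
    left = rng.integers(0, w // 4)
    variants.append(image[top:top + 3 * h // 4, left:left + 3 * w // 4])
    return variants

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))           # stand-in for a real photo
augmented = augment(img, rng)
print(len(augmented))                   # 4 images from 1 original
```

Applied across a whole dataset, transformations like these multiply the number of training images several-fold, though, as noted, they cannot create genuinely new variability.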
Alternatively, one can use generative adversarial networks (GANs). At its core, a GAN pits two neural networks, a discriminator and a generator, against each other to generate synthetic data that mimics an existing dataset. The generator network creates data which is then provided to the discriminator network. The discriminator network then has to determine whether the data is real or generated. The discriminator tries to improve its judging ability, whereas the generator tries to trick the discriminator (hence the name ‘adversarial’). Initially, the generator produces a random output and the discriminator guesses randomly, but eventually they learn from each other and the generator can produce convincing synthetic data. In an ideal scenario, the generator mimics the data so well that the discriminator can do no better than random guessing.
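To make the adversarial loop concrete, here is a deliberately tiny sketch of the idea: a two-parameter ‘generator’ and a logistic ‘discriminator’ trained against each other on one-dimensional data. All names, model shapes, and hyperparameters are illustrative assumptions, far simpler than a real GAN, but the alternating updates follow the same logic:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# 'Real' data: samples from a Gaussian with mean 3. The generator
# should learn to produce samples that look like these.
def real_batch(n):
    return 3.0 + 0.5 * rng.standard_normal(n)

w, b = 1.0, 0.0   # generator: g(z) = w*z + b maps noise to data
a, c = 0.1, 0.0   # discriminator: D(x) = sigmoid(a*x + c)

lr = 0.05
for step in range(2000):
    z = rng.standard_normal(64)
    fake = w * z + b
    real = real_batch(64)

    # Discriminator update: push D(real) -> 1 and D(fake) -> 0.
    d_real = sigmoid(a * real + c)
    d_fake = sigmoid(a * fake + c)
    g_real = -(1 - d_real)   # gradient of -log D(real) w.r.t. the logit
    g_fake = d_fake          # gradient of -log(1 - D(fake)) w.r.t. the logit
    a -= lr * np.mean(g_real * real + g_fake * fake)
    c -= lr * np.mean(g_real + g_fake)

    # Generator update: push D(fake) -> 1, i.e. trick the discriminator.
    d_fake = sigmoid(a * fake + c)
    g_logit = -(1 - d_fake)  # gradient of -log D(fake) w.r.t. the logit
    w -= lr * np.mean(g_logit * a * z)
    b -= lr * np.mean(g_logit * a)

samples = w * rng.standard_normal(10000) + b
# The mean of the generated samples should drift toward the real mean of 3.
print(float(np.mean(samples)))
```

A real GAN replaces the two affine maps with deep neural networks and the 1-D samples with images, but the alternating push-and-pull between the two loss functions is exactly this.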
But why does this battle between networks result in the generator learning the characteristics of the real data? To explain this behaviour, consider the analogy of a rookie forger of Monet paintings and a novice art critic. Initially, the forger produces blobs, and the critic cannot tell the difference between a real painting and a fake. However, the critic notices that the forger is using the wrong colour palette and gains the upper hand. As a result, the forger has to adapt and learn the correct tones, forcing the critic to find another feature to distinguish the forgeries from the real paintings. Gradually, the forger learns the defining aspects of a Monet painting, such as the colours and brush strokes, and then the more subtle aspects, such as the choice of composition. Eventually, the forger can produce work that is indistinguishable from real Monet paintings.
Although GANs were first introduced in 2014, in the space of a few years their output quality has improved to the point that they can now generate hyper-realistic human faces. GANs can even tackle abstract tasks such as converting photos into Monet paintings and turning horses into zebras. In the context of medical imaging, GANs can create various synthetic datasets such as lesion data, MRI scans, and retinal images. Furthermore, image classifiers trained on a combination of real and synthetic data tend to outperform those trained on real data alone on tasks such as tumour classification and disease diagnosis.
However, GANs are not yet a silver bullet for the problem of small datasets. Firstly, they are notoriously difficult to train. The success of a GAN relies on a delicate balance between the performance of the discriminator and the generator. If the discriminator is too good, the generator has no chance of tricking it and so never receives the feedback it needs to improve. Likewise, if the discriminator is too weak, it cannot differentiate the synthetic data from the real data, and the generator is under no pressure to improve. Another problem, known as mode collapse, is that the generator might cycle between a handful of realistic outputs, successfully tricking the discriminator but failing to reproduce the variability of the real data. Finding ways to stabilise the training of GANs is a very active area of research.
GANs are also dependent on the quality of the data they are trained on. In machine learning, the phrase ‘garbage in, garbage out’ refers to the idea that if an algorithm is trained on bad data, its output will be equally nonsensical. This applies strongly to GANs used to synthesise data. A study in 2018 showed that GANs can hallucinate features when trained poorly. The researchers trained a GAN to convert brain MRI images into CT scans, but trained it exclusively on images without tumours. The resulting algorithm created realistic CT scans but also removed any tumours from an image, which could be very dangerous if the algorithm ever saw clinical use.
Furthermore, GANs can reinforce systematic biases within datasets. A review of publicly available skin cancer image datasets in 2021 highlighted the severe lack of darker skin types in lesion datasets. Therefore, a GAN that creates sample lesion images is unlikely to adequately represent darker skin types. Indeed, studies that try to generate lesion data with GANs rarely take darker skin tones into consideration, and therefore only exacerbate the existing inequality. If this synthetic data is then used to train algorithms to diagnose skin cancer, the resulting algorithm would not have had exposure to darker skin tones during training, which would reduce its diagnostic accuracy in patients with darker skin.
Nevertheless, GANs are pushing the boundaries of machine learning in biomedical imaging. Learning reliably from small datasets is an important problem that has held back machine learning to date. With data augmentation techniques such as GANs, we can hope to see an explosion in applications of data-driven approaches to medical imaging problems in the years to come.
Shavindra Jayasekera studies maths at Trinity College. Artwork by Biliana Tchavdarova Todorova.