Sunday, February 12, 2017

The self-driving dermatologist

The application of Artificial Intelligence (AI) techniques, and in particular deep learning methods, to medical image analysis for diagnostic purposes is leading to a potential sea change in the practice of medicine. One example is the use of deep neural networks to analyze breast biopsy slides to determine whether or not they contain cancerous cells. In this classification problem, the network predicts the label (cancer or normal) of the input data (the biopsy slide image).
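As a minimal sketch of what such a classifier looks like in use (PyTorch here purely for illustration; the model choice and file names are hypothetical, not those of any published system), the trained network is just a function from an image to a probability for each label:

```python
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import models, transforms

# Hypothetical: a small CNN with a two-way output head (normal vs. cancer),
# loaded from a hypothetical weights file produced by an earlier training run.
model = models.resnet18(num_classes=2)
model.load_state_dict(torch.load("biopsy_classifier.pt"))
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

image = preprocess(Image.open("slide.png").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    probs = F.softmax(model(image), dim=1)  # [p(normal), p(cancer)]
print(f"P(cancer) = {probs[0, 1]:.3f}")
```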

Dermatology is a medical field that lends itself well to automated object recognition from images because skin ailments are easily observed and documented in photographs. Indeed, this diagnostic accessibility is what makes online dermatology services possible. Determining whether a mole (or other skin lesion) is cancerous is one of the most important responsibilities of a dermatologist. Skin cancer is on the rise due to greater sun exposure and the increased popularity of practices such as tanning beds. Melanoma, the deadliest form of skin cancer, leads to approximately 10,000 deaths per year in the United States.

Recently, a group at Stanford compared the performance of a neural network to that of human dermatologists at classifying skin lesions as cancerous or benign from images alone. Their deep learning algorithm was quite successful (The Verge):
"Researchers at Stanford University have created an AI algorithm that can identify skin cancer as well as a professional doctor. The program was trained on nearly 130,000 images of moles, rashes, and lesions using a technique known as deep learning. It was then tested head-to-head against 21 human dermatologists, where its creators say it performed with an accuracy on par with humans (“at least” 91 percent as good). In the future, they suggest it could be used to create a mobile app for spotting skin cancer at home."
The abstract of their paper provides additional details:
"Here we demonstrate classification of skin lesions using a single CNN, trained end-to-end from images directly, using only pixels and disease labels as inputs. We train a CNN using a dataset of 129,450 clinical images—two orders of magnitude larger than previous datasets — consisting of 2,032 different diseases. We test its performance against 21 board-certified dermatologists on biopsy-proven clinical images with two critical binary classification use cases: keratinocyte carcinomas versus benign seborrheic keratoses; and malignant melanomas versus benign nevi. The first case represents the identification of the most common cancers, the second represents the identification of the deadliest skin cancer. The CNN achieves performance on par with all tested experts across both tasks, demonstrating an artificial intelligence capable of classifying skin cancer with a level of competence comparable to dermatologists."
Previous automated image classification methods relied on "extensive preprocessing, lesion segmentation and extraction of domain-specific visual features before classification." In other words, a lot of "data wrangling" was required before the input data could be fed into the machine learning program. By contrast, the new system "requires no hand-crafted features; it is trained end-to-end directly from image labels and raw pixels." Once you assemble a training set of images labeled with the type of lesion, you are good to go; no time needs to be spent further analyzing or preprocessing the data.
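To make concrete how little wrangling that implies, here is a sketch (with a hypothetical directory layout) in which the folder names themselves supply the disease labels:

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Hypothetical layout: lesions/train/<disease_name>/*.jpg
# No segmentation or hand-crafted features: just labeled raw images.
transform = transforms.Compose([
    transforms.Resize((299, 299)),  # input size for Inception v3
    transforms.ToTensor(),
])
train_set = datasets.ImageFolder("lesions/train", transform=transform)
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)

print(train_set.classes)  # disease labels, inferred from the folder names
```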

Interestingly, the Stanford researchers employed a standard off-the-shelf Convolutional Neural Network (CNN) developed by Google (Inception v3) that was pre-trained for general object recognition. Transfer learning was then used to train the network on the specific skin cancer dataset. A convolutional net is a deep neural net (one containing many hidden layers) with a bio-inspired architecture (modeled after the vertebrate visual system) in which the image is tiled into overlapping receptive fields, and the convolution operation combines the information within each receptive field using learned filters. Transfer learning is the process in which the network is first trained on a general dataset so that it learns general features of object recognition (e.g. edge detection), and is then trained on a domain-specific dataset to learn features particular to the task at hand (e.g. recognizing melanoma).
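A minimal sketch of that transfer-learning setup in PyTorch (the class count and hyperparameters are illustrative assumptions, not the paper's exact configuration) might look like:

```python
import torch
import torch.nn as nn
from torchvision import models

# Step 1: start from a network pre-trained on general object recognition
# (ImageNet), so generic features such as edge detectors come for free.
model = models.inception_v3(pretrained=True)

# Step 2: replace the final classification layer to match the number of
# skin-disease classes (757 training classes in the paper; illustrative here).
num_classes = 757
model.fc = nn.Linear(model.fc.in_features, num_classes)
model.AuxLogits.fc = nn.Linear(model.AuxLogits.fc.in_features, num_classes)

# Step 3: continue training ("fine-tuning") on the lesion dataset; the paper
# fine-tuned all layers rather than freezing the pre-trained ones.
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
criterion = nn.CrossEntropyLoss()
```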

Performance was initially assessed with a cross-validation approach, in which the dataset was split into complementary subsets of training and test data. For this initial evaluation, the authors attempted a three-class prediction (normal, benign, malignant) and compared their results to those of two human dermatologists. The CNN achieved 72% overall accuracy compared to the dermatologists' 66%. Note that in this dataset the labels were assigned by experts and not necessarily verified by biopsy.
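A schematic of that validation protocol, using scikit-learn on stand-in data (the real evaluation ran the CNN over the paper's own image dataset), looks roughly like:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

# Stand-in data: in the paper, X would be lesion images and y the three-way
# expert labels (normal, benign, malignant).
rng = np.random.default_rng(0)
X = rng.random((300, 64))
y = rng.integers(0, 3, 300)

accuracies = []
for train_idx, test_idx in StratifiedKFold(n_splits=9).split(X, y):
    clf = LogisticRegression(max_iter=1000)  # stand-in for the CNN
    clf.fit(X[train_idx], y[train_idx])
    accuracies.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))

print(f"mean accuracy across folds: {np.mean(accuracies):.2f}")
```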

In the second, more conclusive validation, they tested their convolutional neural network on the gold-standard biopsy-proven images using a two-class prediction (cancer versus not cancer) over three different datasets: 1) carcinoma (e.g. basal cell carcinoma, the most common skin cancer), 2) melanoma (the deadliest skin cancer), and 3) melanoma dermoscopy images. A dermatoscope is essentially a specialized magnifying glass for obtaining an enlarged view of a skin lesion.

The results were striking: the neural network exhibited both high sensitivity and high specificity. As a reminder, a test is sensitive when most people with the disease (cancer) are detected (predicted correctly), and specific when most people without the disease are not flagged as positive. There is a tradeoff between sensitivity and specificity depending on the threshold chosen for the prediction. One can graph this tradeoff as a Receiver Operating Characteristic (ROC) curve (Figure 1). The area under the curve (AUC) summarizes overall accuracy; it ranged from 91% to 96% across the three test sets and exceeded the performance of the dermatologists, whose predictions (red dots) fell mainly below the ROC curve (blue line). The AUC can be interpreted as the probability that a randomly chosen positive case will be ranked higher than a randomly chosen negative case.
Figure 1. Neural network performance compared to human dermatologists plotted as a Receiver Operating Characteristic curve (specificity versus sensitivity); from Figure 3 of Esteva et al. Nature, 2017.
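For readers who want to compute these quantities themselves, here is a minimal scikit-learn sketch on synthetic scores (the labels and probabilities are placeholders, not the paper's data):

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Placeholders: y_true is the biopsy result (1 = malignant) and y_score is
# the network's predicted probability of malignancy for each lesion.
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, 500)
y_score = np.clip(0.6 * y_true + rng.normal(0.3, 0.25, 500), 0, 1)

# Sweeping the decision threshold traces out the ROC curve; the area under
# it (AUC) summarizes performance across all thresholds.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
sensitivity, specificity = tpr, 1 - fpr  # the axes used in Figure 1
print(f"AUC = {auc(fpr, tpr):.2f}")
```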

It is apparent from Figure 1 that the doctors are being conservative, emphasizing sensitivity over specificity (i.e. making sure that as many true positives as possible are diagnosed as positive). This caution is prudent: when in doubt, they flag a mole as potentially cancerous so it can be biopsied. The computer was able to reduce the number of false positives without sacrificing sensitivity, achieving a high level of both.
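That tradeoff can be made explicit in code: continuing the previous sketch, one can pick the most specific decision threshold that still meets a clinician-style sensitivity target (the 95% target below is an arbitrary illustration):

```python
import numpy as np
from sklearn.metrics import roc_curve

# Same synthetic placeholders as in the previous sketch.
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, 500)
y_score = np.clip(0.6 * y_true + rng.normal(0.3, 0.25, 500), 0, 1)

fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Among all operating points with sensitivity >= 95% ("when in doubt,
# biopsy"), choose the one with the highest specificity.
ok = tpr >= 0.95
best = np.argmax(1 - fpr[ok])
print(f"threshold = {thresholds[ok][best]:.2f}, "
      f"sensitivity = {tpr[ok][best]:.2f}, "
      f"specificity = {1 - fpr[ok][best]:.2f}")
```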

One intriguing aspect of deep neural networks like the CNN is that they form internal representations of the images they are classifying in the intermediate layers of the network. These internal representations encode exemplars, or archetypes, of the different types of lesions. Thus, by projecting the hidden-layer information into a lower-dimensional space, one can obtain a map showing how the different images cluster into groups (Figure 2). The informativeness of these clusters reflects the quality of the predictions.
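The paper produced its Figure 4 with the t-SNE algorithm; a minimal sketch of the same idea, with random placeholder activations standing in for the CNN's last hidden layer, is:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

# Placeholders: each row stands in for the last-hidden-layer activation
# vector the CNN produces for one lesion image (2048-d for Inception v3).
rng = np.random.default_rng(2)
activations = rng.normal(size=(500, 2048))
labels = rng.integers(0, 4, 500)  # placeholder lesion categories

# Project the high-dimensional representations down to two dimensions;
# images the network "sees" as similar should land near each other.
embedding = TSNE(n_components=2, random_state=0).fit_transform(activations)
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, s=5)
plt.savefig("lesion_clusters.png")
```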

More than any algorithmic innovation, the key advance of this work was the collection of a very large dataset of labeled images. The roughly 130,000 labeled clinical images were about two orders of magnitude more than in previous datasets. The impact of large amounts of data (i.e. Big Data) on improving prediction accuracy is a common theme in the machine learning community.

The goal is not to replace dermatologists with a self-driving version, but to complement human expertise. First, dermatologists can use the neural network as an aid for particularly difficult assessments. Second, for patients in remote areas where doctors are scarce, the tool can extend the reach of this diagnostic capability. The program can be run as an app on a smartphone: the patient takes a picture of a mole and uploads it directly into the app.

Note that training the deep CNN may require extensive computational resources (e.g. thousands of processors in the cloud), but once training is complete, the neural net can be run on the no-frills CPU of a smartphone. The time-consuming part is learning the network weights; once they are determined, the weights can be hard-coded into the app. Given the widespread proliferation of low-cost smartphones, literally billions of people could have access to this diagnostic tool.
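A sketch of that split between heavyweight training and lightweight deployment (file names hypothetical; a real app would likely use a mobile runtime such as TensorFlow Lite or PyTorch Mobile):

```python
import torch
from torchvision import models

# Expensive, one-time step (run in the cloud): train the network, then
# freeze the learned weights to a file.
model = models.inception_v3(num_classes=2, aux_logits=True)
# ... training loop over the lesion dataset runs here ...
torch.save(model.state_dict(), "skin_model.pt")  # hypothetical file name

# Cheap, repeated step (run on the phone's CPU): load the frozen weights
# and evaluate single images; no further training is needed on-device.
deployed = models.inception_v3(num_classes=2, aux_logits=True)
deployed.load_state_dict(torch.load("skin_model.pt", map_location="cpu"))
deployed.eval()
with torch.no_grad():
    logits = deployed(torch.randn(1, 3, 299, 299))  # stand-in for a photo
```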

This result is quite impressive, and I expect to see similar improvements in other areas of image-based diagnostics. Indeed, the authors conclude their paper by stating: "[t]his method is primarily constrained by data and can classify many visual conditions if sufficient training examples exist. Deep learning is agnostic to the type of image data used and could be adapted to other specialties, including ophthalmology, otolaryngology, radiology and pathology."
Figure 2. Taxonomy of skin lesions clustered by applying a dimensionality-reduction algorithm (t-SNE) to the high-dimensional space of the last hidden layer of the convolutional neural network. Similar lesions tend to cluster together in this two-dimensional plot (from Figure 4 of Esteva et al. Nature, 2017).
