Improving the classification of lung diseases in thoracic X-ray images

by | Nov 2, 2018 | Research

The volume and complexity of diagnostic imaging are increasing faster than the availability of experts who can interpret the images. This has led to a growing backlog of undiagnosed patients who may need urgent care. Artificial intelligence (AI) systems have shown great promise in classifying two-dimensional images of common diseases, and they typically rely on databases with millions of annotated images [1].

“Genematics Harvey” is an online biomedical platform developed by Genematics. The platform aims to deliver fast, accurate and in-depth insight into biological data. In collaboration with HAN BioCentre, Genematics has developed an automated software module that detects tuberculosis infection on a single thoracic X-ray image with approximately 80% correct identification.

The image classifier has already shown promising results, but there is still room for improvement. In its current state, it can only report a percentage indicating how confident the system is. Joel van Wierik, who worked on the classifier, recommends the following improvements: the classifier could be expanded to detect a variety of lung-related diseases, and it could be made more accurate by adding layers to the model and using more training data.

We expect that expanding the software module in “Genematics Harvey” to detect a variety of diseases, using a model with more layers, will improve the performance of the classifier compared with both the previous classifier and manual interpretation by a radiologist.

Materials and Methods
Retriever to collect data
For the purpose of training our classification model, the NIH Chest X-ray Dataset of 14 common thorax diseases will be used. This dataset has over 112K annotated images.
For the purpose of validating our classification model, an automated retriever script will be developed that retrieves new annotated images from Open-I. The Open-I service of the National Library of Medicine enables search and retrieval of abstracts and images from the open source literature and biomedical image collections. It is comparable to PubMed, but for images. Open-I provides access to over 7,470 chest X-rays from the Indiana University hospital network [2].
Apache Kafka is an open-source platform that can be used for building real-time data pipelines and streaming applications [3]. In our study, Kafka will be implemented for automated retrieval of new images when they become available on the Open-I website.
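The core of such a retriever is deciding which images are new since the last poll. The sketch below shows only that step, under several assumptions: the record fields (`id`), the topic name and the `publish` callback are hypothetical, and how the Open-I listing is fetched is left out entirely. In a real deployment, `publish` could wrap the send call of a Kafka producer client.

```python
from __future__ import annotations

from typing import Callable, Iterable


def publish_new_images(
    listing: Iterable[dict],
    seen_ids: set,
    publish: Callable[[str, dict], None],
    topic: str = "openi-chest-xrays",  # hypothetical topic name
) -> int:
    """Publish every record whose id has not been seen before.

    Returns the number of records published in this polling round.
    """
    published = 0
    for record in listing:
        image_id = record["id"]  # assumed field name
        if image_id in seen_ids:
            continue  # already forwarded in an earlier polling round
        publish(topic, record)
        seen_ids.add(image_id)
        published += 1
    return published


if __name__ == "__main__":
    outbox = []
    seen = set()
    collect = lambda topic, record: outbox.append((topic, record))
    publish_new_images([{"id": "img-1"}, {"id": "img-2"}], seen, collect)
    publish_new_images([{"id": "img-2"}, {"id": "img-3"}], seen, collect)
    print([record["id"] for _, record in outbox])
```

Keeping the publishing step behind a callback means the deduplication logic can be tested without a running Kafka broker.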

Determining diseases to detect
Differentiating diseases based on a thoracic X-ray scan alone is difficult. Radiologists often need more information about the patient, such as sex, age and prior medical history, to reach a diagnosis. Additionally, the physician usually has to ask the radiologist whether a specific disease can be determined; without such an indication, the radiologist cannot determine a diagnosis. Therefore, this study needs to focus on classifying diseases that are easily distinguishable.

To obtain this knowledge, I interviewed a medical expert with radiology knowledge.

Based on this interview, the following diseases have been selected to be classified by the model:

  1. Atelectasis
  2. Cardiomegaly
  3. Effusion
  4. Infiltration
  5. Mass
  6. Nodule
  7. Pneumonia
  8. Pneumothorax
  9. Consolidation
  10. Edema
  11. Emphysema
  12. Fibrosis
  13. Pleural thickening
  14. Hernia
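Because one image can show several of these findings at once, the labels are naturally represented as a 14-element multi-hot vector. The sketch below is one possible encoding, assuming pipe-separated label strings such as `Cardiomegaly|Effusion`, the spelling `Pleural_Thickening` and a `No Finding` marker for images without findings. These conventions should be checked against the actual metadata file before use.

```python
LABELS = [
    "Atelectasis", "Cardiomegaly", "Effusion", "Infiltration", "Mass",
    "Nodule", "Pneumonia", "Pneumothorax", "Consolidation", "Edema",
    "Emphysema", "Fibrosis", "Pleural_Thickening", "Hernia",
]
LABEL_INDEX = {name: position for position, name in enumerate(LABELS)}


def encode_findings(finding_labels: str) -> list:
    """Turn a pipe-separated label string into a multi-hot list of 0/1 values."""
    vector = [0] * len(LABELS)
    for name in finding_labels.split("|"):
        name = name.strip()
        if not name or name == "No Finding":
            continue  # assumed marker for images without any finding
        vector[LABEL_INDEX[name]] = 1  # raises KeyError on an unknown label
    return vector


if __name__ == "__main__":
    print(encode_findings("Cardiomegaly|Effusion"))
```

An unknown label raises a `KeyError` on purpose, so that spelling differences between data sources surface early instead of being silently dropped.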

Data storage
Retrieved data can be requested from the Kafka server through a consumer. This consumer will send the retrieved data to a master node, which will distribute the data among its worker nodes. These nodes will assign a storage directory to each image and save additional information about the image, as well as the path of the image, in Elasticsearch. Elasticsearch stores data through a RESTful API and indexes it automatically. This allows us to perform searches using JSON objects as parameters.
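To make the JSON-based storage and search concrete, the sketch below builds one metadata document and one search body as plain Python dictionaries. All field names (`image_id`, `path`, `labels`, `age`, `sex`) are hypothetical; only the `match` query shape follows the Elasticsearch query DSL. In recent Elasticsearch versions, such bodies would be sent to the index's `_doc` and `_search` endpoints respectively.

```python
import json


def build_image_document(image_id, path, labels, age=None, sex=None) -> dict:
    """Metadata stored next to the image file; field names are assumptions."""
    return {
        "image_id": image_id,
        "path": path,      # where the worker node stored the image
        "labels": labels,  # list of disease names
        "age": age,
        "sex": sex,
    }


def build_label_query(label: str, size: int = 10) -> dict:
    """Search body that matches documents carrying the given disease label."""
    return {"size": size, "query": {"match": {"labels": label}}}


if __name__ == "__main__":
    document = build_image_document(
        "img-1", "/data/xrays/img-1.png", ["Pneumonia"], age=54, sex="F"
    )
    print(json.dumps(document, indent=2))
    print(json.dumps(build_label_query("Pneumonia"), indent=2))
```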

The first step in preprocessing the images is to ensure that they all have the same size and aspect ratio. Additionally, all images have to be scaled appropriately. Medical images often have very high dimensionality: in clinical practice, radiology image matrices may vary from 64 x 64 for some nuclear medicine exams to over 4000 x 5000 for some mammogram images [14]. To normalize the pixels, we first calculate the mean and standard deviation of the pixel values. A normalized pixel is then obtained by subtracting the mean and dividing by the standard deviation. A recent study [15] showed that Zero Component Analysis (ZCA), a whitening technique used to standardize images, has the greatest positive influence as a preprocessing technique for image classification.
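The normalization step can be sketched as follows. This is a minimal illustration that works on a flat list of pixel values and computes the statistics from that list alone; whether the mean and standard deviation should be taken per image or over the whole training set is not settled here and is an assumption of the example. Resizing and ZCA whitening are not shown.

```python
from statistics import fmean, pstdev


def normalize_pixels(pixels: list) -> list:
    """Return (pixel - mean) / std for every pixel value."""
    mean = fmean(pixels)
    std = pstdev(pixels, mu=mean)  # population standard deviation
    if std == 0:
        # A constant image has no contrast; map every pixel to zero
        # instead of dividing by zero.
        return [0.0 for _ in pixels]
    return [(pixel - mean) / std for pixel in pixels]


if __name__ == "__main__":
    print(normalize_pixels([2, 4, 4, 4, 5, 5, 7, 9]))
```

After this transformation, the pixel values have a mean of zero and a standard deviation of one, which puts images from different scanners on a comparable scale.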

Training and validation
The pre-processed X-ray scans will be randomly split into a training set and a validation set at a ratio of 80:20. This ratio was chosen to leave sufficient training data while keeping enough data for validation. If the data is sufficiently labeled, we will train the model with supervised learning; otherwise, we will have to fall back on unsupervised learning.
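A reproducible random 80:20 split could look like the sketch below. The function name and the fixed seed are illustrative choices, not part of the planned pipeline.

```python
import random


def split_train_validation(items, train_fraction=0.8, seed=42):
    """Shuffle a copy of the items and cut it into training and validation parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    train, validation = split_train_validation(range(100))
    print(len(train), len(validation))
```

Fixing the seed means the same images end up in the validation set on every run, so results from different training runs remain comparable.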

For experts and researchers to obtain knowledge from our model, a suitable visualization technique has to be implemented. We will implement a heatmap that is overlaid on the X-ray image to indicate hot spots. If enough time is left, we would also like to implement instance segmentation.
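The overlay itself can be sketched independently of how the heatmap values are produced, which is not specified here. The example below assumes a grayscale image given as rows of 0-255 values and a same-shaped grid of per-pixel scores; it rescales the scores to the range 0 to 1 and blends red into the image in proportion to the score. The function name and the `alpha` parameter are hypothetical.

```python
def overlay_heatmap(gray, heat, alpha=0.4):
    """Blend a red heatmap over a grayscale image.

    gray: rows of 0-255 intensities; heat: rows of scores with the same shape.
    Returns rows of (r, g, b) tuples.
    """
    flat = [score for row in heat for score in row]
    lowest, highest = min(flat), max(flat)
    span = (highest - lowest) or 1.0  # avoid dividing by zero on a flat heatmap
    blended = []
    for gray_row, heat_row in zip(gray, heat):
        out_row = []
        for intensity, score in zip(gray_row, heat_row):
            # Weight is 0 for the coldest pixel and alpha for the hottest one.
            weight = alpha * (score - lowest) / span
            red = round((1 - weight) * intensity + weight * 255)
            other = round((1 - weight) * intensity)
            out_row.append((red, other, other))
        blended.append(out_row)
    return blended


if __name__ == "__main__":
    print(overlay_heatmap([[100, 100]], [[0.0, 1.0]]))
```

Pixels with the lowest score keep their original gray value, so the underlying anatomy stays visible outside the hot spots.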

Improved model working in Genematics Harvey
Once the model has been finalized, it still has to be integrated into Genematics Harvey for production. The output has to be clear and self-explanatory for the user. Additionally, we aim to add an export function that lets the user export their findings as JPG, PNG and PDF.