Nilesh Sarkar / Research

Deep Learning for Medical Imaging: PCOS Detection

Research Objective

A systematic deep-learning framework for automated binary PCOS detection from ovarian ultrasound images, benchmarking 18 modern vision architectures (CNNs + Vision Transformers + hybrids) under identical experimental conditions.

Status: Published in journal. Title: "A Systematic Deep Learning Framework for PCOS Detection Using Deduplicated Ultrasound Images: Comparative Analysis of CNN and Vision Transformer Models".

Model Training & Validation

A core part of my research involves training generative models to create synthetic medical images that closely resemble real scans. This expands the training data and makes the final detection models much more robust, especially when labelled data is scarce.

The synthetic generation process mimics complex features of medical imaging. Below are some of the samples my model generated during the initial training phase.

By using these techniques, I'm aiming to:

CNN vs Transformer - Architecture Comparison

For PCOS ultrasound classification I'm running a head-to-head between classical CNN stacks (progressive hierarchical features, local-to-global) and Vision Transformers (patch tokenisation + global self-attention). Both ingest the same ultrasound input, but they see the ovary very differently.
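To make "patch tokenisation" concrete, here is a minimal plain-Python sketch of how a square image is cut into non-overlapping patches before self-attention is applied. The function names `patch_grid` and `patchify` are illustrative and are not taken from the study's code; the learned linear projection and positional embeddings that a real Vision Transformer adds afterwards are omitted.

```python
def patch_grid(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping patches (tokens) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def patchify(image, patch_size):
    """Split a 2-D grayscale image (list of rows) into flattened patches.

    Assumes height and width are both divisible by patch_size.
    Patches are returned in row-major order.
    """
    height, width = len(image), len(image[0])
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            tokens.append([
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ])
    return tokens


# Token count for a 224x224 input with 16x16 patches (the ViT-Base Patch16/224 setting).
n_tokens = patch_grid(224, 16)

# Tiny 4x4 example with 2x2 patches so the result is easy to inspect by hand.
toy = [[r * 4 + c for c in range(4)] for r in range(4)]
toy_tokens = patchify(toy, 2)
```

Each flattened patch becomes one token, and self-attention then relates every token to every other token in a single layer. A CNN instead builds up its receptive field gradually through stacked local convolutions.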

Three-stage deduplication pipeline

The public PCOS-XAI ultrasound dataset (11,784 images) had substantial duplicate and label-conflict noise. A novel three-stage pipeline cleaned it:

  1. MD5 cryptographic hashing - removes byte-identical duplicates.
  2. Perceptual hashing (pHash) + Hamming distance - removes near-duplicates (re-encoded / resized copies).
  3. Cross-class duplicate removal - drops images that appear in both PCOS and non-PCOS classes (label conflicts).
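The three stages above can be sketched in standard-library Python as follows. This is an illustration under stated assumptions, not the published implementation. The `Scan` record, the function names, and the `max_distance=2` threshold are hypothetical. The perceptual hash is assumed to be precomputed as a 64-bit integer (for example with an image-hashing library). Stage 1 is keyed on label plus digest so that cross-class copies survive until stage 3, which then removes both sides of a conflict.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    name: str
    label: str   # e.g. "pcos" or "non_pcos"
    data: bytes  # raw file bytes
    phash: int   # 64-bit perceptual hash, assumed precomputed elsewhere


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def drop_exact_duplicates(scans):
    """Stage 1: keep the first scan per (label, MD5 digest)."""
    seen, kept = set(), []
    for s in scans:
        key = (s.label, hashlib.md5(s.data).hexdigest())
        if key not in seen:
            seen.add(key)
            kept.append(s)
    return kept


def drop_near_duplicates(scans, max_distance):
    """Stage 2: within each label, greedily drop scans close to an already kept one."""
    kept = []
    for s in scans:
        is_near_copy = any(
            k.label == s.label and hamming(k.phash, s.phash) <= max_distance
            for k in kept
        )
        if not is_near_copy:
            kept.append(s)
    return kept


def drop_cross_class_conflicts(scans, max_distance):
    """Stage 3: drop every scan with a near-identical counterpart in the other class."""
    conflicted = set()
    for i, a in enumerate(scans):
        for b in scans[i + 1:]:
            if a.label != b.label and hamming(a.phash, b.phash) <= max_distance:
                conflicted.update((a.name, b.name))
    return [s for s in scans if s.name not in conflicted]


def deduplicate(scans, max_distance=2):
    """Run the three stages in order (hypothetical threshold)."""
    scans = drop_exact_duplicates(scans)
    scans = drop_near_duplicates(scans, max_distance)
    return drop_cross_class_conflicts(scans, max_distance)


example = [
    Scan("a", "pcos", b"A", 0b0000),
    Scan("a2", "pcos", b"A", 0b0000),         # byte-identical copy of "a"
    Scan("b", "pcos", b"B", 0b0001),          # near-duplicate of "a"
    Scan("c", "pcos", b"C", 0xFF00),
    Scan("d", "non_pcos", b"D", 0xFF01),      # near-copy of "c" in the other class
    Scan("e", "non_pcos", b"E", 0xFFFF0000),
]
cleaned_names = [s.name for s in deduplicate(example)]
```

The pairwise comparisons in stages 2 and 3 are quadratic in the number of images. That is acceptable for a sketch, but a dataset of this size would likely benefit from bucketing hashes first.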

The pipeline removed 8,294 images (70.4%), leaving a high-quality, unambiguously labelled dataset of 3,490 images. Every architecture below was trained on the same deduplicated split for 200 epochs, using ImageNet-pretrained transfer learning.

Final test results - all 18 architectures after 200 epochs

All eighteen models (13 CNNs + 5 ViTs) were trained under standardised conditions for 200 epochs, with metrics logged at fixed intervals (epochs 50, 100, 150, and 200). Reported below are the final test-set loss, accuracy, F1-score, and AUC-ROC.

Model                   Loss    Accuracy  F1      AUC-ROC
EfficientFormer-L1      0.0183  0.9981    0.9981  0.9999
MobileViT-Small         0.0021  0.9981    0.9981  1.0000
ResNet34                0.0383  0.9962    0.9962  0.9998
DenseNet169             0.0179  0.9962    0.9962  1.0000
NextViT-Small           0.0104  0.9962    0.9962  0.9999
ResNet50                0.0318  0.9943    0.9943  1.0000
DenseNet121             0.0467  0.9943    0.9943  0.9996
EfficientNet-B0         0.1017  0.9943    0.9943  0.9975
EfficientNet-B3         0.0655  0.9943    0.9943  0.9996
ResNet18                0.0425  0.9924    0.9924  0.9995
MobileNetV2             0.0640  0.9905    0.9905  0.9998
MobileNetV3-Large       0.0624  0.9905    0.9905  0.9997
InceptionV3             0.0637  0.9905    0.9905  0.9991
Xception                0.0709  0.9905    0.9905  0.9996
VGG19                   0.0406  0.9866    0.9867  0.9986
ViT-Base Patch16/224    0.3139  0.8263    0.8123  0.8993
VGG16                   0.5334  0.7748    0.6765  0.5000
Swin Transformer Base   0.5334  0.7748    0.6765  0.5248

Convergence trajectory (ResNet34, joint-top CNN by accuracy)

To illustrate how a top-performing pure CNN converges across the 200-epoch schedule, ResNet34's training and validation metrics are shown at each checkpoint. ResNet34 ties with DenseNet169 for the highest CNN test accuracy (0.9962).

Epoch  Train Loss  Train Acc  Val Loss  Val Acc
50     0.0046      0.9984     0.0289    0.9943
100    0.0066      0.9975     0.0222    0.9962
150    0.0016      0.9992     0.0303    0.9962
200    0.0002      1.0000     0.0310    0.9962

Final test: Loss 0.0383 · Acc 0.9962 · F1 0.9962 · AUC 0.9998

Five principal findings

1. Data quality (novel deduplication pipeline)

The three-stage MD5 + pHash + cross-class pipeline cleaned the largest public PCOS-XAI ultrasound dataset (11,784 images), identifying and removing 8,294 images (70.4% of the dataset) as exact duplicates, near-duplicates, or label conflicts. This re-frames previously reported high-accuracy results and establishes the rigorously cleaned 3,490-image dataset as a new, more reliable baseline.

2. Superiority of efficient hybrid Transformer models

EfficientFormer-L1 and MobileViT-Small both reached 99.81% test accuracy (AUC-ROC 0.9999 and 1.0000 respectively), ahead of all 13 CNN architectures evaluated; the best CNNs, ResNet34 and DenseNet169, reached 99.62%. The margin is small (0.19 percentage points), but the result suggests that hybrid designs combining CNN local feature extraction with Transformer global context modelling are well suited to moderate-size medical imaging datasets.

3. Failure of standard Transformer architectures

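The pure attention models ViT-Base (82.63%) and Swin Transformer Base (77.48%) failed to converge to competitive accuracy on this dataset size. Swin's AUC-ROC of 0.5248 is close to chance, whereas ViT-Base still reached 0.8993. VGG16, a CNN, finished at the same 77.48% accuracy with an AUC-ROC of 0.5000, so this failure mode was not exclusive to attention models. Overall, the results suggest that pure attention-based models are poorly suited to smaller real-world medical datasets without massive pre-training.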
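One way to read the identical 0.7748 accuracy and 0.6765 F1 reported for Swin Transformer Base is as the signature of a model that predicts a single class for every image. The sketch below checks that arithmetic under two assumptions that are not stated in the source: that the majority class makes up 77.48% of the test set, and that the reported F1 is the support-weighted average over both classes. The function name `constant_predictor_metrics` is illustrative.

```python
def constant_predictor_metrics(majority_share: float) -> dict:
    """Metrics for a classifier that always predicts the majority class.

    majority_share is the fraction of test images in the majority class.
    """
    p = majority_share
    accuracy = p                   # every majority image is right, every minority image wrong
    f1_majority = 2 * p / (1 + p)  # precision = p, recall = 1
    f1_minority = 0.0              # the minority class is never predicted
    weighted_f1 = p * f1_majority + (1 - p) * f1_minority
    return {"accuracy": accuracy, "weighted_f1": weighted_f1}


metrics = constant_predictor_metrics(0.7748)
rounded = {name: round(value, 4) for name, value in metrics.items()}
```

Under those assumptions the weighted F1 rounds to 0.6765, matching the reported figure. That is consistent with, though not proof of, a collapse to majority-class prediction.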

4. Comprehensive architectural benchmarking

This is the first large-scale, systematic comparison of 18 architectures (13 CNNs + 5 ViTs) for PCOS detection, with all models trained and evaluated under the same standardised conditions. The result is a fair, reproducible cross-architecture baseline, which earlier studies lacked because their protocols were inconsistent.

5. Evidence-based deployment guidance

Concrete recommendations:
