DermoAI — Engineering Trustworthy Medical AI

01 — The Problem / domain shift

The Gap Between
Research and Reality

Models that shine on curated clinical photography collapse the moment they meet a real consumer device. This is the wall DermoAI was built to break.

Laboratory Conditions

Clinical Images

✓ Controlled, even lighting
✓ High-resolution dermoscopy
✓ Isolated, clean backgrounds

Real-World Conditions

Smartphone Images

✗ Motion blur & missed focus
✗ Hard shadows & glare
✗ JPEG compression artifacts
✗ Wild lighting variation

When a lab-grade model meets a phone camera

78.65% → 27.59%

A 51-point collapse in accuracy. A model validated only on clinical images can fail the moment it leaves the laboratory.

02 — Engineering Journey / five phases

From 68% to a
Deployable Model

A five-phase R&D cycle. Each phase was built to uncover and remediate a specific failure mode found during validation — moving from a fragile academic model to a deployment-ready triage platform.

Phase 01 · Baseline

ResNet-50 Baseline

A pretrained ResNet-50 established the floor. It captured broad structure but missed the fine-grained texture and micro-vascular patterns that separate malignancy from benign lookalikes.

Validation accuracy: 68.00%

Phase 02 · Architecture

ConvNeXt-Base Upgrade

The backbone moved to ConvNeXt-Base (88M parameters), a transformer-inspired design with LayerNorm and GELU that retains CNN training efficiency. Texture extraction improved sharply.

Validation accuracy: 77.00%

Phase 03 · Integrity

Patient-Level Data Leakage

Train–validation contamination was discovered: images from the same patient spanned both splits, inflating every metric. The validation pipeline was rebuilt with GroupShuffleSplit for true patient-level separation.

Result: Trustworthy evaluation
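The idea behind a patient-level split can be sketched in a few lines of standard-library Python. This is an illustrative stand-in for what scikit-learn's GroupShuffleSplit does, not the project's actual pipeline: the function name, patient IDs, and validation fraction are all hypothetical. As with GroupShuffleSplit, the fraction applies to patients (groups), not to individual images.

```python
import random
from collections import defaultdict

def patient_level_split(patient_ids, val_fraction=0.2, seed=0):
    """Return (train_idx, val_idx) such that no patient appears in both."""
    # Group image indices by the patient they belong to.
    by_patient = defaultdict(list)
    for idx, pid in enumerate(patient_ids):
        by_patient[pid].append(idx)

    # Shuffle patients, not images, so a patient is assigned as a whole.
    patients = sorted(by_patient)  # fixed order before the seeded shuffle
    random.Random(seed).shuffle(patients)

    n_val = max(1, round(len(patients) * val_fraction))
    val_patients = set(patients[:n_val])

    train_idx = [i for p in patients if p not in val_patients for i in by_patient[p]]
    val_idx = [i for p in val_patients for i in by_patient[p]]
    return sorted(train_idx), sorted(val_idx)

# Hypothetical patient IDs, one entry per image.
patient_ids = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p5", "p5", "p6"]
train_idx, val_idx = patient_level_split(patient_ids, val_fraction=0.33, seed=42)

train_patients = {patient_ids[i] for i in train_idx}
val_patients = {patient_ids[i] for i in val_idx}
print("overlap:", train_patients & val_patients)
```

An empty overlap set is the property that a naive image-level shuffle cannot guarantee.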

Phase 04 · Discovery

Smartphone Collapse

Testing on the MIDAS consumer-photo set exposed the real wall: 78.65% lab accuracy collapsed to 27.59% on phones, a 51-point drop traced to lighting, focus, resolution, and background skin texture.

Real-world accuracy: 27.59%

Phase 05 · Adaptation

Dataset Expansion & Camera Augmentations

2,308 smartphone images were folded into training alongside camera-specific augmentations — color jitter, Gaussian noise, motion blur, and perspective warps — forcing camera-invariant representations.

Smartphone accuracy recovered: 35.58%
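A rough sketch of camera-style augmentations, assuming a grayscale image stored as rows of floats in [0, 1]. It uses only the standard library to keep the mechanics visible. Brightness and contrast jitter stands in for color jitter, perspective warps are omitted, and every function name and parameter value here is hypothetical rather than taken from the project.

```python
import random

def clamp(v):
    return max(0.0, min(1.0, v))

def brightness_contrast_jitter(img, rng, max_shift=0.2, max_scale=0.2):
    """Randomly shift brightness and rescale contrast around mid-gray."""
    shift = rng.uniform(-max_shift, max_shift)
    scale = 1.0 + rng.uniform(-max_scale, max_scale)
    return [[clamp((p - 0.5) * scale + 0.5 + shift) for p in row] for row in img]

def gaussian_noise(img, rng, sigma=0.05):
    """Add zero-mean Gaussian sensor noise to every pixel."""
    return [[clamp(p + rng.gauss(0.0, sigma)) for p in row] for row in img]

def horizontal_motion_blur(img, kernel=3):
    """Average each pixel with its horizontal neighbours (edges use fewer)."""
    half = kernel // 2
    out = []
    for row in img:
        blurred = []
        for x in range(len(row)):
            window = row[max(0, x - half): x + half + 1]
            blurred.append(sum(window) / len(window))
        out.append(blurred)
    return out

def camera_augment(img, seed=None):
    """Apply jitter, optional blur, then noise, in that order."""
    rng = random.Random(seed)
    img = brightness_contrast_jitter(img, rng)
    if rng.random() < 0.5:
        img = horizontal_motion_blur(img)
    return gaussian_noise(img, rng)

# Toy 4x5 image with one bright vertical stripe.
image = [[0.0, 0.0, 1.0, 0.0, 0.0] for _ in range(4)]
blurred = horizontal_motion_blur(image)
augmented = camera_augment(image, seed=7)
```

Applying a fresh random draw per training step means the network rarely sees the same rendering of a lesion twice, which is what pushes it toward camera-invariant features.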

Phase 05 · Refinement

Supervised Contrastive Learning

Supervised contrastive learning pulled embeddings of same-class lesions together and pushed lookalikes apart, sharpening the malignant–benign decision boundary.
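For intuition, a supervised contrastive loss can be written out in plain Python. The sketch below follows the commonly used formulation in which each anchor's positives are the other samples with the same label. The temperature, embeddings, and labels are made up for illustration and say nothing about the settings used in DermoAI.

```python
import math

def l2_normalize(v):
    # Assumes v is not the zero vector.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def supcon_loss(embeddings, labels, temperature=0.1):
    """Mean supervised contrastive loss over anchors that have a positive."""
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor with no same-class partner contributes nothing
        # Temperature-scaled cosine similarity to every other sample.
        sims = {a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
                for a in range(n) if a != i}
        # Log-sum-exp over all non-anchor samples, computed stably.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0

# Hypothetical 2-D embeddings forming two tight clusters.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clustered_loss = supcon_loss(embeddings, labels=[0, 0, 1, 1])  # labels match clusters
mixed_loss = supcon_loss(embeddings, labels=[0, 1, 0, 1])      # labels cut across clusters
print(clustered_loss, mixed_loss)
```

The loss is low when same-class embeddings sit together and high when a lookalike from another class is the nearest neighbour, which is the pressure described above.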

Phase 05 · Robustness

Test-Time Augmentation & Ensembling

TTA averaged predictions over augmented views, and ensembling combined complementary backbones — stabilizing predictions on noisy inputs.
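A minimal sketch of both averaging steps, assuming each model is a function that maps an image to a list of class probabilities. The two toy models and the choice of flip views are hypothetical placeholders. A real setup would plug in trained backbones and whichever augmented views were validated.

```python
def identity(img):
    return img

def hflip(img):
    return [row[::-1] for row in img]

def vflip(img):
    return img[::-1]

def tta_predict(predict_fn, img, views):
    """Average class probabilities over several augmented views of one image."""
    probs = [predict_fn(view(img)) for view in views]
    n_classes = len(probs[0])
    return [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]

def ensemble_predict(predict_fns, img, views):
    """Average the TTA predictions of several models."""
    per_model = [tta_predict(fn, img, views) for fn in predict_fns]
    n_classes = len(per_model[0])
    return [sum(p[c] for p in per_model) / len(per_model) for c in range(n_classes)]

# Toy stand-ins for trained models: each reads one corner pixel as P(class 1).
def toy_model_a(img):
    p = img[0][0]
    return [1.0 - p, p]

def toy_model_b(img):
    p = img[-1][-1]
    return [1.0 - p, p]

image = [[0.2, 0.8],
         [0.6, 0.4]]
views = [identity, hflip, vflip]

tta_probs = tta_predict(toy_model_a, image, views)
ensemble_probs = ensemble_predict([toy_model_a, toy_model_b], image, views)
print(tta_probs, ensemble_probs)
```

Because the toy models are deliberately sensitive to orientation, the single-view outputs swing widely while the averaged outputs do not, which mirrors the stabilizing effect on noisy inputs.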

Result · Final Model

Production-Ready Triage

A safety-first ConvNeXt-Base triage engine, calibrated for real-world smartphone input, validated across six diverse datasets including consumer photography.

Final multi-class accuracy: 82.79%

03 — Engineering Challenges / what we solved

Six Walls We
Had to Break

Each card maps to a real failure mode uncovered during validation — and the engineering decision that resolved it.

Patient-Level Data Leakage

Same-patient images across train and validation silently inflated every metric. Solved with GroupShuffleSplit.

Smartphone Domain Shift

Clinical photography and phone cameras are different visual languages. Bridged with 2,308 real photos and camera-specific augmentations.

Class Imbalance

Rare malignancies were drowned by common benign lesions. Addressed with class balancing, focal loss, and melanoma-specific handling.
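To show why focal loss helps with imbalance, here is a small sketch of the standard form, -α (1 - p_t)^γ log(p_t), next to plain cross-entropy. The γ value, the optional per-class α weights, and the example probabilities are illustrative assumptions, not the project's settings.

```python
import math

def cross_entropy(probs, target):
    return -math.log(max(probs[target], 1e-12))

def focal_loss(probs, target, gamma=2.0, alpha=None):
    """Cross-entropy scaled by (1 - p_t)^gamma, with optional class weights."""
    p_t = max(probs[target], 1e-12)
    weight = alpha[target] if alpha is not None else 1.0
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)

# Hypothetical two-class predictions; the true class is index 0 in both.
easy = [0.9, 0.1]  # confidently correct, typical of a common benign lesion
hard = [0.1, 0.9]  # confidently wrong, typical of a rare class

for name, probs in (("easy", easy), ("hard", hard)):
    print(name, cross_entropy(probs, 0), focal_loss(probs, 0))
```

With γ = 2 the easy example keeps about 1% of its cross-entropy loss while the hard one keeps 81%, so the many easy benign cases stop dominating the gradient.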

GPU Memory Constraints

88M-parameter backbones and heavy augmentation pushed memory limits. Managed with gradient accumulation and mixed-precision training.
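Gradient accumulation works because a large-batch gradient is a weighted sum of micro-batch gradients. The sketch below checks that identity on a hypothetical one-parameter regression. The data, model, and function names are invented for illustration, and mixed-precision training is not shown because it depends on the deep learning framework in use.

```python
def mse_grad(w, batch):
    """Gradient of mean squared error for the model y ≈ w * x, w.r.t. w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, data, micro_batch_size):
    """Process data in small chunks and accumulate one equivalent gradient."""
    chunks = [data[i:i + micro_batch_size]
              for i in range(0, len(data), micro_batch_size)]
    grad = 0.0
    for chunk in chunks:
        # Weight each chunk by its share of samples so the running sum
        # equals the mean gradient over the full batch.
        grad += mse_grad(w, chunk) * len(chunk) / len(data)
    return grad

# Hypothetical (x, y) pairs.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2), (5.0, 9.8)]
w = 0.5

full = mse_grad(w, data)
accumulated = accumulated_grad(w, data, micro_batch_size=2)
print(full, accumulated)
```

Only one micro-batch of activations has to fit in memory at a time, yet the optimizer step sees the same gradient as the full batch would produce.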

Safety vs Precision

Missing a cancer is far worse than a false alarm. We deliberately biased toward sensitivity — 87.30% — over strict top-1 accuracy.

Model Calibration

Raw softmax confidence was overconfident. Label smoothing (ε=0.1) and a conservative triage threshold restored trustworthy probabilities.
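As a sketch of what label smoothing with ε = 0.1 does to a 12-class target, the snippet below builds the smoothed distribution and scores two predictions against it. It assumes the common convention of spreading ε uniformly over all classes, including the true one. The source does not say which variant was used, and the example predictions are hypothetical.

```python
import math

def smooth_labels(target, n_classes, eps=0.1):
    """One-hot target mixed with a uniform distribution over all classes."""
    off = eps / n_classes
    return [1.0 - eps + off if c == target else off for c in range(n_classes)]

def soft_cross_entropy(probs, soft_target):
    return -sum(t * math.log(max(p, 1e-12)) for p, t in zip(probs, soft_target))

n_classes, target = 12, 3
smoothed = smooth_labels(target, n_classes, eps=0.1)

# A near-certain prediction versus one that matches the smoothed target.
overconfident = [0.9999 if c == target else 0.0001 / 11 for c in range(n_classes)]
loss_overconfident = soft_cross_entropy(overconfident, smoothed)
loss_matched = soft_cross_entropy(smoothed, smoothed)
print(smoothed[target], loss_overconfident, loss_matched)
```

Under the smoothed target, pushing the true-class probability toward 1.0 raises the loss instead of lowering it, which discourages the overconfident softmax outputs described above.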

04 — System Architecture / end-to-end

From Photo to
Clinical Report

A modular, high-throughput pipeline — from a patient's phone to a structured, safety-calibrated clinical report.

01 · 📷 Upload Smartphone Image: patient-facing capture
02 · ⚙️ Image Preprocessing: resize 224×224 · normalize
03 · 🧠 ConvNeXt-Base Feature Extraction: 88M params · texture & boundary
04 · 🎯 Disease Classification: 12-class differential
05 · 🚩 Cancer Triage Logic: sum-probability risk threshold
06 · 📋 Clinical Report: recommendations + Grad-CAM
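The triage step (05) can be illustrated with a short sketch: add up the predicted probability of every malignant class and refer the case when that total crosses a threshold, regardless of which single class ranks first. The class names, the malignant subset, and the 0.30 threshold are hypothetical placeholders, not the deployed configuration.

```python
# Hypothetical malignant subset of the class list.
MALIGNANT = {"melanoma", "basal_cell_carcinoma", "squamous_cell_carcinoma"}

def triage(class_probs, threshold=0.30):
    """Flag a case when the summed malignant probability reaches the threshold."""
    risk = sum(p for name, p in class_probs.items() if name in MALIGNANT)
    top1 = max(class_probs, key=class_probs.get)
    return {"top1": top1, "cancer_risk": risk, "refer": risk >= threshold}

# Hypothetical classifier output for one image (abbreviated to five classes).
probs = {
    "nevus": 0.45,
    "melanoma": 0.20,
    "basal_cell_carcinoma": 0.15,
    "squamous_cell_carcinoma": 0.05,
    "seborrheic_keratosis": 0.15,
}

result = triage(probs)
print(result)
```

In this example the top-1 class is benign, yet the case is still referred because the malignant classes together hold 40% of the probability mass. That is the sense in which the design favors sensitivity over top-1 accuracy.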

05 — Results / validated on 4,555 images

Validated. Not
Just Trained.

Evaluated on a clean, patient-separated set of 4,555 images across 12 disease classes — including real-world smartphone photography.

Overall

82.79%

Multi-class accuracy across all 12 disease classes.

Triage

87.30%

Cancer sensitivity — 87 of every 100 malignancies correctly flagged.

Specificity

True-negative rate — keeps unnecessary referrals manageable.
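For reference, both triage metrics come straight from the confusion counts: sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The sketch below computes them on a small made-up label set. The numbers are purely illustrative and are not the project's results.

```python
def sensitivity_specificity(is_malignant, was_flagged):
    """Compute (sensitivity, specificity) from paired boolean lists.

    Assumes at least one malignant and one benign case are present.
    """
    pairs = list(zip(is_malignant, was_flagged))
    tp = sum(1 for m, f in pairs if m and f)          # cancers flagged
    fn = sum(1 for m, f in pairs if m and not f)      # cancers missed
    tn = sum(1 for m, f in pairs if not m and not f)  # benign, not flagged
    fp = sum(1 for m, f in pairs if not m and f)      # benign, flagged
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical ground truth and triage decisions for ten lesions.
is_malignant = [True, True, True, True, False, False, False, False, False, False]
was_flagged  = [True, True, True, False, True, False, False, False, False, False]

sensitivity, specificity = sensitivity_specificity(is_malignant, was_flagged)
print(sensitivity, specificity)
```

Lowering the triage threshold moves cases from FN to TP and from TN to FP, so sensitivity rises while specificity falls. The two numbers are best read as a pair.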

12 Disease Classes · Real Smartphone Evaluation · Robust Domain Adaptation · Patient-Level Split Integrity

06 — Live Demo / see it run

Watch It Triage
a Real Lesion

From smartphone capture to a safety-calibrated clinical report — end to end.

▶ Demo · 2 min

07 — Lessons Learned / what stuck

What This Project
Actually Taught Us

→ 01

Engineering

Architecture alone gets you to 77%. The real gains come from data integrity, augmentation, and calibration — the unglamorous work.

→ 02

Validation

Metrics without patient-level splitting are fiction. The final 82.79% counts only because it was measured after leakage was eliminated.

→ 03

Generalization

A model isn't deployed until it survives a phone camera. Domain adaptation is not optional — it is the product.

→ 04

Trust

In medicine, a missed cancer is not a metric — it's a person. Safety-first design beats leaderboard accuracy every time.

08 — Future Roadmap / what's next

The Path Forward

01

Grad-CAM

Localized attention heatmaps for clinician trust.

02

Larger Phone Set

Expand the smartphone training corpus.

03

Mobile Optimization

INT8/FP16 quantization via ONNX for on-device inference.

04

Clinical Validation

External dermatologist-supervised trials.

05

Foundation Models

Swin Transformer ensemble & self-attention backbones.
