If you have evaluated AI radiology tools in the past few years, you have seen a number like "94% accuracy" or "96% sensitivity" in marketing materials. The number sounds authoritative. It implies the tool detected findings correctly in 94 out of 100 cases. But without the surrounding context — dataset composition, size threshold, class definition, comparison baseline, operating point on the ROC curve — the number communicates almost nothing useful to a clinical buyer.
This is not unique to AI radiology. Clinical diagnostics has struggled for decades with transparency about how accuracy figures are produced. The problem is particularly acute in AI tool evaluation, though, because the numbers appear in contexts where they do not get the scrutiny they would receive in a peer-reviewed journal. A vendor slide deck is not a methods section. Radiology department heads evaluating tools deserve a clearer map of what these metrics actually mean and what questions to ask.
Accuracy as a Metric Is Often the Wrong Choice
Start with the metric itself. "Accuracy" in a classification context means: of all cases, what fraction did the model classify correctly? That is a reasonable summary when the classes are balanced, with roughly equal numbers of positive and negative cases. In pulmonary nodule detection, they are not. In a typical chest CT population, a significant fraction of scans show no actionable nodule at all, and within nodule-containing scans, clinically significant findings (nodules above the Fleischner size thresholds) represent a smaller fraction still.
In a class-imbalanced dataset, a model that predicts "no significant nodule" for every case might achieve 80–85% accuracy purely because the negative class is large. That number says nothing about the model's ability to find the nodules that matter. Sensitivity and specificity decompose the performance properly: sensitivity (the true positive rate) tells you how often the model flags a nodule that is really there; specificity (the true negative rate) tells you how often it correctly passes a scan that has nothing significant. Neither alone is sufficient.
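The always-negative case can be checked with a few lines of arithmetic. The sketch below is illustrative only: the cohort (85 scans without a significant nodule, 15 with one) and the helper name `confusion_metrics` are assumptions chosen to match the 80–85% range above, not data from any real evaluation.

```python
def confusion_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity for binary labels (1 = significant nodule)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Hypothetical cohort: 85 negative scans, 15 scans with a significant nodule.
y_true = [0] * 85 + [1] * 15

# A "model" that never flags anything.
always_negative = [0] * len(y_true)

baseline = confusion_metrics(y_true, always_negative)
print(baseline)
```

Under these assumed counts the do-nothing model scores 85% accuracy and 100% specificity while finding none of the 15 nodules, which is why the headline accuracy figure cannot be read on its own.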
The metrics a buyer should request:
- Sensitivity at a specified size threshold. Detection sensitivity for all nodules ≥2mm is a very different number from sensitivity for nodules ≥6mm. A model optimized for large nodules will look better on headline sensitivity metrics. The Fleischner Society's 2017 guidelines set different management thresholds by size — which threshold matters most depends on your clinical use case.
- False positives per scan (FPPS). How many spurious annotations does the model generate per study? A model with high sensitivity but 6 false positives per scan can create more work than it saves. FPPS is often the metric vendors are least eager to report.
- Specificity at the operating threshold used in the reported accuracy number. Models can be tuned to trade sensitivity for specificity or vice versa. A vendor who reports sensitivity without reporting the specificity at the same operating point is showing you one end of a tradeoff, not a complete performance picture.
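To make the operating-point tradeoff concrete, here is a minimal sketch that sweeps a decision threshold over a handful of made-up confidence scores. The scores, labels, and the function name `sens_spec_at` are hypothetical; a real evaluation would use the vendor's actual model outputs and many more cases.

```python
def sens_spec_at(scores, labels, threshold):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    tp = fn = tn = fp = 0
    for score, label in zip(scores, labels):
        called_positive = score >= threshold
        if label == 1:
            tp += called_positive
            fn += not called_positive
        else:
            fp += called_positive
            tn += not called_positive
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical model confidence scores and ground truth (1 = nodule present).
scores = [0.95, 0.80, 0.60, 0.40, 0.70, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

for threshold in (0.25, 0.50, 0.75):
    sens, spec = sens_spec_at(scores, labels, threshold)
    print(f"threshold={threshold:.2f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

On this toy data the same model reports sensitivity of 1.00, 0.75, or 0.50 depending only on where the threshold sits, with specificity moving in the opposite direction. A sensitivity figure quoted without its paired specificity (or FPPS) could come from any of those rows.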
Dataset Composition Determines Everything
The second critical variable is where the accuracy number came from. The best way to understand a performance claim is to understand the dataset it was measured on:
Scanner and acquisition parameters. CT chest images vary significantly by slice thickness, reconstruction kernel, tube current, and scanner manufacturer. A model trained and tested on thin-slice (0.625mm or 1.25mm) helical CT from one manufacturer may degrade meaningfully on 2.5mm or 5mm reconstructions common in older PACS archives or community hospital scanners. If the vendor's dataset was acquired entirely on high-end 256-slice scanners with thin-slice protocols, performance on a mixed community-hospital scanner population could be 5–15 percentage points lower.
Nodule prevalence and composition. Prevalence in the test set directly determines positive predictive value (PPV), even when sensitivity and specificity stay the same. A test set constructed to include 40% nodule-positive scans (to ensure statistical power in the evaluation) will produce PPV numbers that do not apply in a population where only 10–15% of chest CT studies contain actionable nodules. PPV in deployment will be lower than PPV in a prevalence-enriched test set, sometimes substantially.
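The prevalence effect follows from Bayes' rule and can be sketched in a few lines. The 90% sensitivity and 90% specificity below are assumed values for illustration, not figures from any vendor; the prevalences are the ones mentioned above. The calculation is at the scan level and ignores per-lesion effects.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from sensitivity, specificity, and prevalence."""
    true_positive_share = sensitivity * prevalence
    false_positive_share = (1 - specificity) * (1 - prevalence)
    return true_positive_share / (true_positive_share + false_positive_share)


# Assumed operating point, held fixed while prevalence changes.
SENSITIVITY = 0.90
SPECIFICITY = 0.90

for prevalence in (0.40, 0.15, 0.10):
    value = ppv(SENSITIVITY, SPECIFICITY, prevalence)
    print(f"prevalence={prevalence:.0%}  PPV={value:.2f}")
```

With these assumed inputs, PPV falls from roughly 0.86 in the 40%-prevalence test set to 0.50 at 10% prevalence, even though the model itself has not changed.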
Ground truth annotation method. Was the ground truth annotated by one radiologist, two in consensus, or three with majority voting? Radiologist-radiologist agreement on small nodule detection is imperfect. A model evaluated against a single annotator's labels will show different performance than one evaluated against a consensus standard. The LUNA16 benchmark, a common reference dataset in nodule detection research, draws on scans read by four radiologists and counts as reference nodules those ≥3mm accepted by at least three of the four. Models evaluated on LUNA16 are therefore measured against a stricter reference standard than models evaluated against single-annotator labels.
What We Report for Neurmorph — and Why We Frame It This Way
On our product page, we cite 94% annotation accuracy on chest CT. We want to be specific about what that means in our context:
The 94% figure refers to annotation agreement rate between Neurmorph's output and two-radiologist consensus review, measured on a held-out set of chest CT studies processed through a range of scanner acquisition parameters including slice thicknesses from 1.25mm to 3.0mm. The test set included both nodule-containing and nodule-negative studies. "Agreement" counts an annotation as correct when the Neurmorph bounding box overlaps the consensus-annotated region with an IoU (intersection over union) ≥ 0.4, at nodule diameter ≥ 2mm.
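For readers unfamiliar with IoU, the overlap test can be sketched as follows. This is a simplified illustration, not Neurmorph's implementation: it assumes 2D axis-aligned boxes given as `(x_min, y_min, x_max, y_max)`, and the coordinates are invented. Only the 0.4 threshold comes from the definition above.

```python
IOU_THRESHOLD = 0.4  # agreement threshold stated in the text


def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Overlap along each axis, clamped at zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    intersection = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical boxes in pixel coordinates.
predicted = (10, 10, 20, 20)
consensus = (12, 12, 22, 22)

overlap = iou(predicted, consensus)
agrees = overlap >= IOU_THRESHOLD
print(f"IoU={overlap:.3f}  counts as agreement: {agrees}")
```

Here two 10×10 boxes offset by 2 pixels in each direction share 64 of 136 total units of area, an IoU of about 0.47, so the annotation would count as agreeing under a 0.4 threshold.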
What this number does not cover: nodules smaller than 2mm (we don't claim detection below that threshold), non-nodule findings such as pleural effusion or mediastinal widening where we use separate detection logic, and performance on non-chest modalities where Neurmorph is not currently deployed.
We also publish our false positive rate: approximately 1.2 spurious annotations per study on average across the test set. That number is real and affects workflow — a radiologist reviewing a Neurmorph-annotated scan will encounter about one mark per study that doesn't correspond to a clinical finding and should be dismissed. That dismiss action takes 3–5 seconds. We think that tradeoff is acceptable given that the true positive annotations save significantly more time, but a buyer should know the false positive rate before deciding.
The Right Way to Evaluate a Nodule Detection Tool
The most reliable evaluation is a prospective pilot on your own scanner population. This is what we offer through our pilot tier — 50 deidentified chest CT studies annotated, returned with DICOM SR, reviewed by your radiologists against their own reads. The number that matters is the one your radiologists calculate on your studies, not the number in our materials.
During a pilot, the metrics worth tracking:
- True positive rate on nodules ≥6mm (the size at which the 2017 Fleischner guidelines begin recommending routine follow-up for solid nodules)
- False positive rate per study — how many annotations did the radiologist dismiss as spurious?
- Miss rate — how many nodules did the radiologist find that Neurmorph did not flag?
- Prior comparison accuracy — when prior series were available, was the growth rate delta calculated correctly?
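One way a department might tally the first three of these during a pilot is sketched below. The `Finding` record, its field names, and the sample data are all hypothetical, and the size threshold is a parameter to set to whatever value you track. Prior comparison accuracy is left out because it depends on how growth measurements are recorded locally.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    study_id: str
    diameter_mm: float             # radiologist-measured size; 0.0 for spurious marks
    flagged_by_tool: bool          # the tool annotated this location
    confirmed_by_radiologist: bool  # the radiologist considers it a real nodule


def pilot_summary(findings, n_studies, min_diameter_mm):
    """True positive rate, miss rate, and false positives per study for a pilot."""
    real = [
        f for f in findings
        if f.confirmed_by_radiologist and f.diameter_mm >= min_diameter_mm
    ]
    detected = sum(1 for f in real if f.flagged_by_tool)
    spurious = sum(
        1 for f in findings if f.flagged_by_tool and not f.confirmed_by_radiologist
    )
    tpr = detected / len(real) if real else float("nan")
    return {
        "true_positive_rate": tpr,
        "miss_rate": 1 - tpr,
        "false_positives_per_study": spurious / n_studies,
    }


# Hypothetical pilot log covering three studies.
findings = [
    Finding("s1", 8.0, True, True),    # found by both
    Finding("s1", 0.0, True, False),   # spurious mark, dismissed
    Finding("s2", 6.5, False, True),   # missed by the tool
    Finding("s2", 3.0, True, True),    # real, but below the size threshold
    Finding("s3", 7.0, True, True),    # found by both
]

SIZE_THRESHOLD_MM = 6.0  # example value; use the threshold your department tracks
summary = pilot_summary(findings, n_studies=3, min_diameter_mm=SIZE_THRESHOLD_MM)
print(summary)
```

Note that the denominator for false positives is the number of studies, not the number of findings, so studies with no annotations at all still need to be counted.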
These numbers on your population, with your scanners and your radiologist review, are more informative than any benchmark metric in a vendor data sheet. We say this as the vendor. The pilot data is what should determine whether Neurmorph is worth deploying in your department — not our claimed performance numbers.
A Note on Sensitivity vs. Specificity Tradeoffs in Clinical Context
One more thing worth stating plainly: the right sensitivity/specificity operating point for a pre-annotation tool is not the same as for a standalone diagnostic tool. Neurmorph is not making diagnostic decisions. The radiologist reviews every annotation. In that context, we can tolerate a slightly higher false positive rate in exchange for better sensitivity — a dismissed spurious annotation costs 4 seconds; a missed nodule that a radiologist then catches costs 30 seconds; a missed nodule that neither the AI nor the radiologist catches is a clinical failure.
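The time side of that argument is simple arithmetic, sketched here under stated assumptions. The 4-second and 30-second costs come from the paragraph above; the two operating points and the per-study miss counts are invented for illustration. The sketch deliberately does not price the third case, a nodule missed by both the tool and the radiologist, because that is a clinical failure and not a time cost.

```python
DISMISS_SECONDS = 4    # per spurious annotation dismissed (from the text)
RECOVER_SECONDS = 30   # per nodule the tool missed but the radiologist caught (from the text)


def review_overhead_seconds(false_positives_per_study, tool_misses_per_study):
    """Expected extra radiologist time per study from tool errors."""
    return (
        false_positives_per_study * DISMISS_SECONDS
        + tool_misses_per_study * RECOVER_SECONDS
    )


# Two hypothetical operating points for the same tool.
sensitivity_leaning = review_overhead_seconds(
    false_positives_per_study=1.2, tool_misses_per_study=0.05
)
specificity_leaning = review_overhead_seconds(
    false_positives_per_study=0.4, tool_misses_per_study=0.30
)

print(f"sensitivity-leaning: {sensitivity_leaning:.1f} s per study")
print(f"specificity-leaning: {specificity_leaning:.1f} s per study")
```

With these assumed rates, the sensitivity-leaning setting costs about 6.3 seconds of overhead per study against about 10.6 seconds for the specificity-leaning one. Different miss rates would change the result, which is the reason to measure them on your own studies.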
We are not saying high false positive rates are acceptable — that claim would be dishonest given the workflow impact. We are saying the calculus is different for a pre-annotation tool in a human-in-the-loop workflow than for a standalone automated detection system. When evaluating competing tools, that architectural distinction should inform how you weight sensitivity versus specificity in your evaluation criteria.