Most radiology AI pilots fail — or more precisely, most fail to convert into production deployments — for reasons that have nothing to do with the clinical performance of the tool. They stall on a DICOM routing configuration that nobody owns, a BAA that's been in legal review for three months, or a radiologist workflow that wasn't mapped before go-live and turns out to be incompatible with how the tool delivers annotations.
We've been through enough early evaluations to recognize the patterns. This checklist covers what needs to be verified before a pilot starts if you want the pilot to produce a decision rather than a deferral. It's written for two audiences who need to work together: clinical (section heads, lead radiologists, medical directors) and IT (PACS administrators, integration engineers, information security).
Neither audience can complete this checklist alone. The integration questions require clinical input on workflow, and the clinical questions require IT input on feasibility. If your evaluation team doesn't include both functions from the start, plan on a slower process.
Section 1: Regulatory and Compliance Baseline
Before evaluating any clinical functionality, confirm the legal and compliance posture of the tool. This work belongs to compliance and legal, but clinical leadership needs to drive it and set a deadline — otherwise it drifts.
- Business Associate Agreement (BAA). Is a BAA in place or in active negotiation? HIPAA requires a BAA before any PHI is transmitted to or processed by a third-party vendor. If the vendor processes DICOM images in its cloud infrastructure, PHI is involved. Do not start a pilot without a signed BAA. This is not a bureaucratic formality; it's a legal requirement with breach-liability implications.
- Regulatory classification. Is the tool FDA cleared or authorized (510(k) or De Novo)? If yes, what are the cleared Indications for Use, and does your intended use case fall within them? If not, what is the vendor's regulatory rationale for the classification it has chosen? Get this in writing before the pilot starts.
- Data processing agreement (DPA) and international data transfer. Where does patient data go during processing: on-premises, vendor cloud, or hybrid? If cloud-based, in which data regions? For US institutions this mainly affects HIPAA compliance; for institutions with EU patients or research collaborations, GDPR implications may also apply.
- Vendor security assessment. Most hospital IT security teams have a standard security questionnaire for third-party clinical software vendors. Complete this before, not after, the pilot. A SOC 2 Type II report from the vendor is a reasonable baseline; if the vendor doesn't have one, understand why and what compensating controls they document.
- IRB review (if applicable). If the pilot involves systematic data collection about radiologist performance or diagnostic outcomes for publication or research purposes, Institutional Review Board (IRB) review may be required. For a purely operational pilot with no research intent, this typically doesn't apply, but clarify it with your compliance office early.
Section 2: Technical Integration Prerequisites
This section is for PACS administrators and integration engineers, with input from the radiologists who will use the tool.
- DICOM routing architecture. Map how DICOM studies currently flow from modality to PACS. Where in this flow will the AI tool receive studies? Typical integration patterns are: (a) the PACS forwards a copy of completed studies to the AI processing node via a DICOM C-STORE; (b) a DICOM router intercepts studies post-reconstruction and forwards to both PACS and AI in parallel; (c) the AI system polls the PACS for new studies via DICOM C-FIND/C-MOVE. Clarify which model the vendor supports and whether your current infrastructure can implement it without a new router appliance.
- DICOM tag completeness. AI tools typically require specific DICOM tags to function correctly. At minimum, expect to need StudyInstanceUID, SeriesInstanceUID, SOPInstanceUID, Modality, PatientID, and StudyDate. Some tools also require PatientAge, PatientSex, and acquisition parameters (SliceThickness, KVP, ConvolutionKernel). Before assuming tag completeness, audit the headers of a representative sample of your studies against the vendor's required-tag list. Scanner models and RIS configurations that truncate or omit tags will cause parsing failures that look like tool errors rather than data problems.
- DICOM SR output format. AI annotation tools that return results to the PACS typically do so via DICOM Structured Reports (SR), DICOM overlays, or secondary capture images. Confirm that your PACS can receive and display DICOM SR objects from third parties; some older PACS versions have SR rendering limitations. If the tool uses a proprietary overlay format, find out whether it requires a PACS plug-in and whether that plug-in is compatible with your PACS version.
- Network and firewall configuration. If the tool processes studies on vendor infrastructure, outbound DICOM or HTTPS traffic to the vendor's endpoint needs to be permitted through your network security controls. Get the specific IP ranges and port requirements from the vendor and submit the change request before the pilot kick-off date — firewall changes at most hospital IT departments take 1-3 weeks to process.
- Processing latency requirements. Define the acceptable latency from study completion to annotation availability in the PACS. For a triage tool for suspected intracranial hemorrhage, even a 5-minute delay may be clinically meaningful. For a chest CT nodule annotation tool used in a screening context, 30-minute latency is probably acceptable. Confirm the vendor's typical processing times match your clinical use case, and understand what happens to annotations when processing is delayed (queue, fallback, alert).
- Fallback and failure handling. What happens when the AI tool is unavailable — planned maintenance, network outage, or processing failure? Does the PACS hold studies pending annotation, or does it release them to the worklist without annotation? The correct answer for almost all clinical workflows is the latter: annotation is a decision-support overlay, not a gate. Verify this is how the tool behaves.
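For the DICOM routing item above, pattern (b) is the easiest to reason about as code. This is a minimal Python sketch of the per-study decision a router makes; the AE titles (PACS_MAIN, AI_NODE), the StudyHeader type, and the chest-CT scope rule are hypothetical placeholders, not any router product's configuration syntax. The point it illustrates is that the PACS destination is unconditional, and only the parallel copy to the AI node depends on pilot scope.

```python
from dataclasses import dataclass

# Hypothetical AE titles; substitute the destinations configured on your router.
PACS_AE = "PACS_MAIN"
AI_AE = "AI_NODE"

# Hypothetical pilot scope: (Modality, BodyPartExamined) pairs sent to the AI node.
AI_SCOPE = {("CT", "CHEST")}


@dataclass(frozen=True)
class StudyHeader:
    modality: str    # Modality (0008,0060)
    body_part: str   # BodyPartExamined (0018,0015)


def route(study: StudyHeader) -> list[str]:
    """Return destination AE titles for one study under pattern (b)."""
    destinations = [PACS_AE]  # the PACS always receives the study
    if (study.modality.upper(), study.body_part.upper()) in AI_SCOPE:
        destinations.append(AI_AE)  # parallel copy for in-scope studies only
    return destinations
```

In practice BodyPartExamined is filled inconsistently across scanners, so a real scope rule may need to match on StudyDescription or procedure codes instead.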
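The tag-completeness audit described above can be sketched in a few lines of Python. To keep the example self-contained, each header is assumed to be a plain dict keyed by DICOM keyword; in practice you would populate it with a DICOM library (for example, pydicom's dcmread with stop_before_pixels=True). The required-tag list mirrors the minimum above. Grouping by ManufacturerModelName is one way, under that assumption, to see which scanner models account for the missing tags.

```python
from collections import Counter, defaultdict
from typing import Iterable, Mapping

# Minimum tags from the checklist; extend with PatientAge, PatientSex,
# SliceThickness, KVP, ConvolutionKernel, etc. if the vendor requires them.
REQUIRED_TAGS = (
    "StudyInstanceUID", "SeriesInstanceUID", "SOPInstanceUID",
    "Modality", "PatientID", "StudyDate",
)


def missing_tags(header: Mapping[str, object], required=REQUIRED_TAGS) -> list[str]:
    """Required tags that are absent or empty in one header."""
    return [tag for tag in required if header.get(tag) in (None, "")]


def audit_by_scanner(headers: Iterable[Mapping[str, object]],
                     required=REQUIRED_TAGS) -> dict[str, Counter]:
    """Map each scanner model to a count of missing required tags.

    Models whose headers are complete still appear, with an empty Counter.
    """
    result = defaultdict(Counter)
    for header in headers:
        model = str(header.get("ManufacturerModelName") or "UNKNOWN")
        result[model].update(missing_tags(header, required))
    return dict(result)
```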
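For the firewall item, once the change request has been implemented, a quick reachability check from the network segment that will originate the traffic confirms that the rule actually took effect. This Python sketch only tests whether a TCP connection can be opened. It does not prove that a DICOM association will succeed; a DICOM C-ECHO against the vendor's node is the protocol-level check. The hostname and port are placeholders; use the exact values the vendor provides.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port opens within `timeout` seconds.

    This checks the network path only; it says nothing about TLS or DICOM.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, DNS failure, unreachable network
        return False


# Placeholder endpoints; replace with the hosts and ports the vendor gives you.
VENDOR_ENDPOINTS = [("ai-gateway.vendor.example", 443)]


def report(endpoints=VENDOR_ENDPOINTS) -> dict:
    """Map each (host, port) pair to whether it is reachable."""
    return {(host, port): can_reach(host, port) for host, port in endpoints}
```

If report() still shows False after the change window closes, reopen the ticket before kickoff rather than discovering the problem during the pilot.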
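The fallback behavior described above (annotation as an overlay, not a gate) can be expressed as a small rule. The study is always released to the worklist, and the AI state only changes the label shown and whether an alert is raised. This Python sketch uses hypothetical names (AiStatus, worklist_entry, the label strings) and assumes the latency budget defined under processing latency; how your PACS actually surfaces these states depends on the vendor integration.

```python
from enum import Enum


class AiStatus(Enum):
    DONE = "done"
    PENDING = "pending"
    FAILED = "failed"


def worklist_entry(study_uid: str, ai_status: AiStatus,
                   minutes_since_complete: float,
                   latency_budget_min: float) -> dict:
    """Build a worklist entry. Release never depends on the AI result."""
    entry = {"study_uid": study_uid, "released": True, "ai_label": "", "alert": False}
    if ai_status is AiStatus.DONE:
        entry["ai_label"] = "AI annotations available"
    elif ai_status is AiStatus.FAILED:
        entry["ai_label"] = "AI unavailable"
        entry["alert"] = True   # notify IT; the radiologist still reads the study
    elif minutes_since_complete > latency_budget_min:
        entry["ai_label"] = "AI delayed"
        entry["alert"] = True   # over the latency budget: flag it, don't hold it
    else:
        entry["ai_label"] = "AI pending"
    return entry
```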
Section 3: Clinical Workflow Mapping
This section is primarily for clinical leadership, with IT input on what's technically configurable.
- Worklist presentation. Where and how will annotations appear to radiologists? In the hanging protocol? As a separate series in the study? As a floating overlay in the viewer? Map this against your actual PACS hanging protocols before the pilot. An annotation that appears in a non-standard location or interrupts an established hanging protocol will generate radiologist friction regardless of clinical value.
- Read-first vs. annotation-first workflow. Have a deliberate conversation with your lead radiologists about whether annotations should appear before the radiologist opens the primary series (annotation-first, for maximum time efficiency) or only after the primary interpretation is complete (read-first, to preserve diagnostic independence). There are genuine arguments for both. Deciding in advance prevents the workflow from defaulting to whatever is easiest to configure.
- Subspecialty applicability. Define which study types and body regions will be included in the pilot. A chest CT nodule detection tool running against your full PACS throughput will surface annotations on chest CT studies ordered for a wide range of indications: CT pulmonary angiography (CTPA), trauma, staging, and follow-up of known disease. Confirm the tool's performance has been characterized on the study types in your pilot scope, not just the primary indication for which it was validated.
- Radiologist training and orientation. Plan a minimum of 30 minutes of structured orientation for each radiologist who will use the tool during the pilot — not a vendor demo, but a walk-through of the specific workflow with your PACS in your environment. Document what the annotation interface shows, how to accept and reject annotations, and what to do when the annotation appears to be incorrect. If radiologists encounter an incorrect annotation on a Monday morning with no training context, the pilot will produce negative feedback that reflects the training gap rather than the tool's actual performance.
- Feedback mechanism. Decide how radiologists will record cases where the annotation was incorrect, missing, or clinically misleading. This can be as simple as a shared log or a structured data field in the reporting tool. Without a feedback mechanism, the pilot generates impressions rather than data.
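For the feedback mechanism, a structured log needs only a handful of fields to be analyzable at the end of the pilot. The sketch below is one hypothetical schema, written in Python with the standard csv module; the field names and the three categories are assumptions that mirror the wording above. Because the log contains study identifiers, store it wherever your institution permits PHI-linked data.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from datetime import datetime, timezone

# Hypothetical categories mirroring the checklist wording.
CATEGORIES = {"incorrect", "missing", "misleading"}


@dataclass
class FeedbackEntry:
    study_uid: str         # StudyInstanceUID of the case
    reader_id: str         # radiologist identifier
    category: str          # one of CATEGORIES
    comment: str = ""      # short free-text note
    recorded_at: str = ""  # ISO 8601 UTC timestamp; filled in if left empty

    def __post_init__(self) -> None:
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
        if not self.recorded_at:
            self.recorded_at = datetime.now(timezone.utc).isoformat(timespec="seconds")


FIELDNAMES = [f.name for f in fields(FeedbackEntry)]


def to_csv(entries) -> str:
    """Render entries as CSV text: one header row, then one row per entry."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    for entry in entries:
        writer.writerow(asdict(entry))
    return buffer.getvalue()
```

Validating the category at entry time keeps the log countable; a free-text-only log tends to become the "impressions rather than data" the checklist warns about.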
Section 4: Pilot Success Criteria — Define These Before You Start
The most common reason a pilot produces a deferral rather than a decision is that success criteria weren't defined at the start. Define them now, in writing, before the first study runs through the system.
- Pilot duration and volume. How many studies need to run through the tool, over how many weeks, before the pilot is considered complete? A meaningful chest CT nodule pilot probably needs at least 300-500 chest CT studies and 4-6 weeks of real-world use. A pilot that runs for two weeks and covers 50 studies cannot produce valid performance data.
- Primary clinical outcome metric. What will you measure to evaluate the tool's clinical value? Radiologist-reported annotation accuracy rate? Change in mean read time per study? Structured documentation completeness for incidental nodule findings? Pick one primary metric and define a threshold for what "success" looks like.
- Integration stability threshold. What percentage of studies must receive annotations within the defined latency window for the integration to be considered stable? Define this upfront — a common starting point is 95% of studies annotated within the specified latency during production hours.
- Decision gate. Who makes the go/no-go decision at the end of the pilot, on what date, based on which data? Name the decision-makers and the decision date before the pilot starts. A pilot without a defined decision date drifts.
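The volume guidance above (a 50-study pilot cannot produce valid performance data) can be checked with a confidence interval for an accuracy rate. This Python sketch uses the Wilson score interval, a standard interval for a binomial proportion. It assumes that each study yields a single accurate/not-accurate judgment and that studies are independent, which simplifies real reading data.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95% confidence)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    z2 = z * z
    denom = 1 + z2 / n
    center = (p_hat + z2 / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n))
    return center - half_width, center + half_width


# Same observed accuracy (90%) at two pilot sizes.
for successes, n in [(45, 50), (360, 400)]:
    lo, hi = wilson_interval(successes, n)
    print(f"{successes}/{n}: {lo:.1%} to {hi:.1%}")
```

With 45 of 50 studies judged accurate, the 95% interval runs from about 78.6% to 95.7%, which is too wide to tell a mediocre tool from a good one. With 360 of 400, it narrows to about 86.7% to 92.6%.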
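The integration stability threshold is straightforward to compute from two timestamps per study. This Python sketch assumes you can export, for each study, when it was completed and when its annotation became available in the PACS (None if it never did), with both timestamps recorded on the same local clock. The production window of weekdays 07:00 to 19:00 and the function names are hypothetical placeholders. A missing annotation counts as a miss, because a study that is never annotated is the worst case for stability.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple

# Hypothetical production window: weekdays, 07:00 up to (not including) 19:00.
PRODUCTION_START_HOUR = 7
PRODUCTION_END_HOUR = 19

Event = Tuple[datetime, Optional[datetime]]  # (study completed, annotation available)


def in_production_hours(ts: datetime) -> bool:
    return ts.weekday() < 5 and PRODUCTION_START_HOUR <= ts.hour < PRODUCTION_END_HOUR


def stability_rate(events: Iterable[Event], latency_window: timedelta) -> float:
    """Share of production-hours studies annotated within the latency window."""
    in_scope = [(done, ann) for done, ann in events if in_production_hours(done)]
    if not in_scope:
        raise ValueError("no studies completed during production hours")
    on_time = sum(1 for done, ann in in_scope
                  if ann is not None and ann - done <= latency_window)
    return on_time / len(in_scope)


def integration_stable(events: Iterable[Event], latency_window: timedelta,
                       threshold: float = 0.95) -> bool:
    return stability_rate(events, latency_window) >= threshold
```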
This checklist won't make a radiology AI pilot effortless; that's not what checklists are for. What it does is front-load the work that most commonly causes pilots to stall midway or conclude without a clear outcome. The technical integration questions and the compliance requirements are deterministic: they either pass or they don't, and discovering a failure during a live pilot is more expensive than discovering it before the pilot starts. The clinical workflow questions are softer, but they determine whether the tool generates useful feedback data or just two weeks of frustrated radiologists.
We try to run through a version of this checklist ourselves before committing to a formal evaluation with any department. Not because we're worried about our own integration, but because a pilot that stalls on someone else's BAA process or DICOM router configuration produces bad signal for both sides.