
Mitigating AI Risks in Healthcare: Why Local Validation Matters
Published: Jun 4, 2025
At EisnerAmper, we’ve developed a comprehensive framework to help healthcare organizations evaluate and deploy AI technologies safely and effectively. Drawing from national standards like NIST, CHAI, and TRAIN, as well as our own deep experience in healthcare risk and safety, we’ve curated over 100 best practice topics tailored to different levels of AI maturity.
This framework focuses on critical areas such as governance, technology integration, financial risk, clinical controls, model testing, transparency, and ongoing monitoring, empowering health systems to reduce AI-related risk while accelerating adoption. Model testing plays a pivotal role, providing the foundation for validating performance, resilience, and clinical relevance in the local care environment.
As artificial intelligence becomes more embedded in healthcare, from diagnostics to clinical decision support to ambient listening, health systems must take a more rigorous, structured approach to testing and evaluating AI models locally before deploying them in clinical workflows. Early use cases often focus on reducing clinicians' administrative burden by relying on model outputs, but as adoption evolves toward deeper clinical decision-making algorithms, validating the local environment in which those models operate becomes critical. Too often, vendors present strong performance claims based on datasets or conditions that don't reflect the local institution's diversity, complexity, or operational realities.
The Need for Local AI Model Evaluation in Healthcare
AI models aren’t plug-and-play. A model that performs well in a vendor's environment may underperform—or even introduce safety risks—once deployed at a specific healthcare site. For example, Epic’s Sepsis Model, implemented across many U.S. hospitals, significantly underperformed when used at Michigan Medicine, missing two-thirds of actual sepsis cases and generating a high number of false positives. This discrepancy was largely due to differences in patient populations, clinical practices, and EHR documentation. Such outcomes highlight how variations in patient demographics, data quality, care pathways, and technical infrastructure can affect model behavior, making local validation essential to maintain safety, fairness, and trust.
Four Core Areas for Healthcare AI Evaluation
To enable the safe and effective deployment of AI in healthcare, local evaluation should focus on these four areas:
Software Quality
Evaluation should go beyond model accuracy. Health systems need to confirm that the application is reliable, responsive, and integrates well with existing clinical systems. This includes checking for crashes, slowdowns, and data handling issues, and testing real-world workflows. Testing should simulate actual clinical scenarios, not just isolated model performance.
Sample Key Metrics to Monitor:
- System Uptime (%): The percentage of total operating time that the AI system is available and functional (high uptime indicates high reliability with minimal downtime).
- Response Time (Latency): The average time (in seconds) the system takes to return a result or decision. This affects real-time usability in clinical workflows (lower latency means faster responses for clinicians).
- Throughput (Cases per Hour): The number of patient cases or data instances the AI can process in an hour, reflecting the system's capacity to handle volume (higher throughput supports busy clinical environments).
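As an illustration, the minimal sketch below shows one way these operational metrics might be computed from a simple request log. The log schema, field names, and availability accounting are hypothetical assumptions for the example, not part of any vendor's monitoring API.

```python
# Minimal sketch (assumptions: a hypothetical request log with these fields;
# availability seconds are tracked separately by the monitoring system).
from dataclasses import dataclass
from statistics import mean

@dataclass
class RequestRecord:
    timestamp_s: float   # when the request was made (seconds since epoch)
    latency_s: float     # time until the AI system returned a result
    succeeded: bool      # False if the call errored or timed out

def uptime_pct(available_s: float, window_s: float) -> float:
    """System Uptime (%): share of the monitoring window the system was available."""
    return 100.0 * available_s / window_s

def avg_latency_s(records: list[RequestRecord]) -> float:
    """Response Time (Latency): mean seconds to return a result, over successful calls."""
    ok = [r.latency_s for r in records if r.succeeded]
    return mean(ok) if ok else float("nan")

def throughput_per_hour(records: list[RequestRecord], window_s: float) -> float:
    """Throughput (Cases per Hour): successfully processed cases per hour in the window."""
    completed = sum(1 for r in records if r.succeeded)
    return completed * 3600.0 / window_s
```

In practice these figures would come from the institution's own monitoring stack rather than ad hoc scripts, but tracking them against agreed thresholds is what turns "reliability" into a testable requirement.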
Pressure Testing
Assess the model's performance when faced with noisy, unusual, or compromised inputs. This “failure” or “pressure” testing goes beyond traditional cybersecurity concerns. It addresses everyday data problems like low-quality scans, incomplete patient histories, or new population characteristics. Understanding the model’s resilience to these real-world data challenges is necessary for reliable clinical use.
Sample Key Metrics to Monitor:
- Robust Accuracy: The AI model’s accuracy when tested on perturbed, noisy, or intentionally manipulated inputs indicates its ability to maintain performance under less-than-ideal or malicious conditions (robust systems continue to make accurate diagnoses despite variations or attacks).
- Failure Success Rate: The percentage of failure attempts (e.g., intentionally altered images or data designed to trick the AI) that succeed in causing incorrect outputs. A lower success rate means the AI is more resilient to manipulation (a high-resilience system rarely falls for malicious input tweaks).
- Failure Detection Rate: How often the system detects or flags suspicious inputs that might be out-of-scope. Measured as a percentage of known failure inputs correctly identified, this reflects the AI’s ability to recognize when it is being fed compromised data and potentially refuse or warn on those inputs.
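To make these ideas concrete, the sketch below perturbs held-out inputs with random noise and recomputes performance. It assumes a scikit-learn-style classifier with predict and predict_proba, numeric features, and a simple confidence threshold standing in for a real out-of-scope detector; it is an illustrative sketch, not a substitute for a clinically designed stress-test suite.

```python
# Illustrative pressure test: perturb inputs, then measure robust accuracy,
# how often perturbations flip a previously correct answer, and how often the
# model flags low-confidence outputs. All names and thresholds are assumptions.
import numpy as np

def pressure_test(model, X, y, noise_std=0.1, confidence_floor=0.6, seed=0):
    rng = np.random.default_rng(seed)
    X_noisy = X + rng.normal(0.0, noise_std, size=X.shape)  # simulated data degradation

    clean_pred = model.predict(X)
    noisy_pred = model.predict(X_noisy)

    # Robust Accuracy: accuracy on the degraded inputs.
    robust_accuracy = float(np.mean(noisy_pred == y))

    # "Failure success rate" proxy: among cases the model answered correctly on
    # clean data, the share where the perturbation induced a wrong answer.
    correct_clean = clean_pred == y
    flipped = correct_clean & (noisy_pred != y)
    failure_success_rate = float(flipped.sum() / max(int(correct_clean.sum()), 1))

    # Failure detection proxy: low-confidence predictions on degraded inputs
    # are treated as "flagged for review".
    confidence = model.predict_proba(X_noisy).max(axis=1)
    failure_detection_rate = float(np.mean(confidence < confidence_floor))

    return {
        "robust_accuracy": robust_accuracy,
        "failure_success_rate": failure_success_rate,
        "failure_detection_rate": failure_detection_rate,
    }
```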
Fairness and Bias Mitigation
An AI model trained predominantly on data from one population may exhibit biased or suboptimal performance when applied to another. Thorough evaluation must include assessing performance across demographics and comparing local population data with the vendor’s training dataset. Identifying and mitigating potential biases helps maintain equitable healthcare delivery.
Sample Key Metrics to Monitor:
- Training–Local Population Divergence: This measures how different the local patient population is from the AI’s training population. It can be quantified by statistics like the Population Stability Index (PSI) or other distribution overlap metrics. A low divergence means the vendor’s model was trained on data closely resembling the local cohort, reducing the risk of bias or performance drop when deployed.
- Demographic Performance Parity: The degree to which the AI’s accuracy or error rates are consistent across different demographic groups (e.g., age, sex, ethnicity). This can be expressed as the difference between groups’ performance – a near-zero difference signifies the model performs equitably and avoids significant performance disparities between patient subgroups.
- Sensitivity and Specificity Parity: Ensuring the model's true positive rate (sensitivity) and true negative rate (specificity) are similar for different groups. For example, if the sensitivity for one subgroup is 95% and for another is 90%, the 5% gap is the disparity; smaller gaps indicate better fairness (the AI catches diseases equally well across populations and is equally careful about false alarms).
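As a rough illustration, the sketch below computes a Population Stability Index for a single feature and per-group sensitivity, from which the parity gaps described above can be read off. The bin count, the roughly 0.2 PSI rule of thumb, and the group encoding are assumptions made for the example.

```python
# Illustrative fairness checks: PSI between the vendor's training sample and
# the local cohort for one feature, plus sensitivity broken out by group.
import numpy as np

def population_stability_index(training_values, local_values, bins=10):
    """Near-zero PSI suggests similar distributions; values above roughly 0.2
    are commonly treated as a material shift (rule of thumb, not a standard)."""
    edges = np.histogram_bin_edges(training_values, bins=bins)
    train_counts, _ = np.histogram(training_values, bins=edges)
    local_counts, _ = np.histogram(local_values, bins=edges)
    train_frac = np.clip(train_counts / train_counts.sum(), 1e-6, None)
    local_frac = np.clip(local_counts / local_counts.sum(), 1e-6, None)
    return float(np.sum((local_frac - train_frac) * np.log(local_frac / train_frac)))

def sensitivity_by_group(y_true, y_pred, groups):
    """True positive rate per demographic group; the largest pairwise gap is
    the sensitivity parity disparity described above."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    rates = {}
    for g in np.unique(groups):
        positives = (groups == g) & (y_true == 1)
        rates[g] = float(np.mean(y_pred[positives] == 1)) if positives.any() else float("nan")
    return rates
```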
Safety and Accuracy
Evaluating the model’s accuracy is fundamental but must be considered alongside clinical safety thresholds and acceptable error tolerances. How does the AI’s performance compare to that of experienced clinicians in similar tasks? How does the model behave under uncertainty – does it over-predict or under-predict? Does it “know when it doesn’t know”?
Sample Key Metrics to Monitor:
- Sensitivity (True Positive Rate): The proportion of patients with a condition that the AI correctly identifies (e.g., it finds 95 out of 100 cases of a disease, so sensitivity = 95%). High sensitivity means the AI catches most true cases, aligning with clinicians' aim not to miss diagnoses (sensitivity is among the most commonly monitored metrics for clinical AI).
- Specificity (True Negative Rate): The proportion of patients without the condition that the AI correctly clears as negative. For example, if out of 100 healthy individuals, the AI erroneously flags 5, specificity = 95%. High specificity indicates the AI avoids false alarms, so it doesn’t overburden clinicians or needlessly worry patients with incorrect positives.
- Positive Predictive Value (PPV): Also known as precision, the percentage of AI-positive results that are truly positive. For instance, if the AI labels 20 patients as high-risk and 15 truly have the condition, PPV = 75%. A higher PPV means that when the AI says "problem," it's usually correct, which is crucial for maintaining clinicians' trust and avoiding alert fatigue (see https://pmc.ncbi.nlm.nih.gov/articles/PMC11630661/).
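For completeness, a minimal sketch of how these three rates fall out of a binary confusion matrix is shown below. Labels are assumed to be coded 1 for "condition present" and 0 otherwise, and the example numbers mirror those in the text.

```python
# Illustrative computation of sensitivity, specificity, and PPV from binary
# labels (1 = condition present) and binary model predictions.
import numpy as np

def clinical_accuracy_metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true cases the AI caught
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # healthy patients correctly cleared
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false alarms
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # missed cases
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        "ppv": tp / (tp + fp) if (tp + fp) else float("nan"),
    }

# Example from the text: finding 95 of 100 true cases gives sensitivity = 0.95;
# flagging 5 of 100 healthy patients gives specificity = 0.95;
# 15 true positives out of 20 AI-positive patients gives PPV = 0.75.
```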
Setting Expectations for AI Vendor Partnerships in Healthcare
Healthcare systems should actively push for greater transparency and collaboration from AI vendors. This includes receiving:
- Clear and comprehensive model cards or documentation that describe training data, limitations, expected use cases, and potential biases.
- Detailed subgroup performance metrics that provide insights into how the model performs across different patient demographics, not just aggregate accuracy.
- Access to test the model in the health system's own setting, using its own data, before deployment.
- A genuine collaborative approach between the vendor and health system to proactively identify and effectively mitigate performance gaps or potential issues that arise during local evaluation.
Achieving Repeatable AI Testing at Scale in Healthcare
AI has the potential to support the next generation of care. However, without robust evaluation, it can amplify harm, introduce unseen biases, or create new patient risks.
The increasing volume and complexity of AI models entering healthcare necessitate scalable and repeatable evaluation frameworks and processes. By establishing standardized approaches to local testing and demanding transparency from vendors, healthcare systems can confidently implement AI solutions while safeguarding patient well-being.
Concerned about the safe and effective deployment of AI in your healthcare system?
Contact our team today for a consultation on establishing robust local AI evaluation frameworks.