August 25, 2025
Forget “fingers crossed” — it’s time to QA healthcare AI


At this point it should go without saying, but let's say it anyway: AI is transforming our world.
And nowhere is that more evident than in healthcare.
But while tools like AI scribes have already helped ease healthcare’s administrative burden and improve efficiency, all too often we fail to ask: Is the work being “done well” or simply being “done”?
We all know quality assurance (QA) is key to patient safety and experience, so why haven’t organizations applied that same level of rigor to their AI implementations?
Instead of taking a “fingers crossed” approach to AI in healthcare — implementing and hoping for the best — it’s time we dig deeper.
It’s time we do some QA on our AI.
The promise and pitfalls of AI in clinical ops
AI is already revolutionizing healthcare operations, reshaping everything from patient communications and scheduling to staff training and billing. But of all these changes, AI's impact on documentation is arguably the most visible.
AI-powered documentation tools and ambient scribes are among the most popular implementations of AI in healthcare today. These tools can not only transcribe and summarize conversations between providers and patients, but also generate drafts of clinical notes and complex documentation.
And considering the enormous burden of administrative work like documentation in healthcare, the potential benefit of automation is obvious.
So far, AI has done a great job of addressing these challenges, with studies consistently finding that AI scribes and AI-powered documentation save physicians time, improve patient interactions and boost staff satisfaction.
In fact, one recent study found that AI scribes not only significantly reduced off-hour work for providers — also called “pajama time” — but also that those who used AI scribes more frequently had more significant reductions in this work. In other words, the more frequently providers used an AI scribe, the more time they saved.
But I think one question may not be asked enough: How good of a job is AI actually doing?
While the ability of these tools to even perform such complex tasks is incredibly impressive, “done” doesn’t necessarily mean “done well.”
Beyond ensuring patient safety and minimizing organizational risk, we need to confirm the accuracy of AI outputs to ensure they are truly saving time. After all, if providers have to spend hours reviewing outputs to ensure accuracy and quality, doesn’t that defeat the purpose of automation?
AI scribes: Far from “set and forget”
So how are AI scribes doing? Let’s take a look.
One recent study on using AI scribes for clinical documentation explored how these implementations could be used to maintain the quality and accuracy of clinical documentation while also “identifying, minimizing, or mitigating potential safety risks introduced by the use of the AI scribe technology.”
While they generally performed well, AI scribes were far from perfect.
For example, one AI summary noted that a physician had performed a prostate exam when really the physician had only mentioned scheduling a prostate exam. In a different conversation, a physician discussed issues with the patient’s hands, feet, and mouth, but the AI summary said the patient had been diagnosed with hand, foot, and mouth disease!
Another study compared AI transcription hallucination rates between a control group and speakers with aphasia, who tend to speak more slowly and have a harder time expressing themselves. It found that 1.4 percent of the transcriptions reviewed contained hallucinations.
That may not sound like a lot, but in healthcare, such errors could be a matter of life or death. And when you consider the sheer volume of daily patient interactions, it adds up quickly: at a 1.4 percent rate, an organization documenting, say, 10,000 encounters a day would produce roughly 140 flawed transcripts every single day.
Even worse, researchers found that nearly 40 percent of the hallucinations were “harmful or concerning in some way (as opposed to innocuous and random),” introducing significant patient safety and provider liability risks.
Hallucinations included “harms perpetuating violence” (AI writing something that could suggest or encourage violence), “harms of inaccurate associations” (AI connecting unrelated information, leading to a false conclusion) and “harms of false authority” (AI adding or changing information such that it seemed the provider said something with more certainty than they actually did).
Researchers have also found AI scribe hallucinations can include “racial commentary, violent rhetoric and even imagined medical treatments.”
Another thing to keep in mind: So far we’ve only looked at AI hallucinations, not interactions or documentation that, even if accurate, fail to follow protocols or comply with clinical best practices, patient experience standards, regulatory guidelines or accreditation requirements.
If you ask me, it’s clear that while AI scribes do a terrific job overall, they’re far from “set and forget” tools.
Traditional QA can’t keep up
Still, with the right QA practices, can’t hallucinations, mistranscription and other errors — both human and AI — be caught and corrected?
If only.
The problem? Few healthcare organizations have robust QA programs.
In a recent survey, Verbal found that 18 percent of healthcare organizations had no QA program in place at all. Meanwhile, organizations that do QA do so infrequently and on only a small percentage of calls (the Verbal survey found that 52 percent of organizations do QA once per month or less).
The problem is even worse when AI transcriptions are an organization’s only source of truth. As one investigative reporter looking into AI scribe hallucinations detailed, some organizations will delete the original audio of patient interactions, leaving them with nothing to fact-check the transcription against. “[Obviously, this] could raise some real red flags if what the AI said transpired is really the only record that exists.”
Plus, even if organizations wanted to QA AI outputs more aggressively, they could certainly never review 100 percent of them.
AI QA: A last line of defense?
Given the limitations of traditional QA and the sheer volume of AI-generated clinical documentation, it's clear we need a new approach.
We know unmonitored AI documentation and a “fingers crossed” approach won’t cut it, and manual review by humans isn't scalable (and expecting human reviewers to catch every single error is unrealistic).
So why not take the same strategic approach to QA as many do to documentation itself?
Can we use AI to QA AI?
It’s not such a wild idea.
Imagine a system where another layer of AI is deployed specifically to audit and analyze the outputs of the initial AI scribe or documentation tool. This second AI could be trained to identify common hallucinations, inconsistencies, and deviations from best practices or regulatory requirements. It could flag potential errors and compliance concerns, highlight areas needing human review, and even suggest corrections.
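To make the idea concrete, here is a minimal sketch of what such an audit layer could look like. Everything in it is illustrative and assumed rather than drawn from any real product: the function names, the sample transcript and note, and especially the naive keyword-overlap heuristic, which stands in for the second model’s judgment so the example runs end to end. The shape of the workflow is the point: a draft note goes in, and any statement the auditor cannot support against the source transcript comes out flagged for human review.

```python
# Illustrative sketch only: a second-pass "auditor" that checks an AI-drafted
# clinical note against the source transcript and flags unsupported statements.
# In a real deployment the audit step would call a second model; here a naive
# keyword-overlap heuristic stands in so the example is self-contained.

import re


def split_sentences(text: str) -> list[str]:
    # Naive sentence splitter; a production system would use a clinical NLP library.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def audit_note(transcript: str, draft_note: str, min_support: float = 0.5) -> list[dict]:
    """Flag note sentences whose content words are poorly supported by the transcript."""
    transcript_words = set(re.findall(r"[a-z]+", transcript.lower()))
    flags = []
    for sentence in split_sentences(draft_note):
        words = re.findall(r"[a-z]+", sentence.lower())
        content = {w for w in words if len(w) > 3}  # ignore short function words
        if not content:
            continue
        support = len(content & transcript_words) / len(content)
        if support < min_support:
            flags.append({"sentence": sentence, "support": round(support, 2)})
    return flags


if __name__ == "__main__":
    # Sample data modeled on the error patterns described in the studies above.
    transcript = (
        "Patient reports sores on the hands, feet, and around the mouth. "
        "We also discussed scheduling a prostate exam at the next visit."
    )
    draft_note = (
        "Patient diagnosed with hand, foot, and mouth disease. "
        "Prostate exam performed today with unremarkable findings."
    )
    for flag in audit_note(transcript, draft_note):
        print(f"REVIEW NEEDED ({flag['support']:.0%} supported): {flag['sentence']}")
```

Even this toy version flags the two failure modes described earlier: the invented hand, foot, and mouth diagnosis and the prostate exam that was only discussed, never performed. A production system would replace the heuristic with a second model call and route the flagged sentences into the organization’s existing review queue.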
While this wouldn't eliminate 100 percent of the risk (no system can be entirely foolproof), it could serve as a vital last line of defense.
An AI QA system could significantly reduce the burden on human reviewers, allowing them to focus on the most complex and critical cases. It could also provide a level of real-time monitoring that traditional QA methods simply cannot match.
By using AI to monitor AI, we can create a layered approach to quality assurance. The initial AI tool generates the documentation, while the second AI acts as a vigilant watchdog, ensuring accuracy and flagging potential issues.
This two-tiered system could dramatically improve the reliability of AI in healthcare, moving us closer to a future where we can confidently leverage this technology without compromising patient safety.
Bottom line: AI needs QA
To truly benefit from AI in clinical operations, we need to be sure the work AI is doing for us is not just being “done” but “done well.” Given the limitations of manual QA, I think using AI to audit AI could be a terrific solution. We’ll never reach perfection, but AI can help us get much, much closer.
Note: This article originally appeared in Mexico Business News

