Washington Post

Should You Trust an AI-Assisted Doctor? I Visited One to See.

Geoffrey A. Fowler/The Washington Post
Stanford Health Care’s Christopher Sharp uses an ambient AI scribe on his phone to take notes during patient visits.

The Washington Post / Geoffrey A. Fowler

11:33 JST, December 26, 2024

PALO ALTO, Calif. – At a recent medical checkup, the doctor showed up with artificial intelligence. I got to see AI’s possibilities and problems play out in a very personal way.

“Before we start, I want to just ask you a quick question,” Stanford Health Care’s Christopher Sharp says, opening an app on his smartphone. “I’m using a technology that records our conversation and uses artificial intelligence to summarize and make my notes for me.”

During the exam, Sharp makes a point of saying my blood pressure and other findings out loud so his AI scribe will hear him. He also uses AI to help write first drafts of answers to patient messages, including suggested treatment advice.

AI is coming to your relationship with your doctor, if it hasn’t already. Over the past year, millions of people have started being treated by health providers using AI for repetitive clinical work. The hope is that it will make doctors less stressed out, speed up treatment and possibly spot mistakes.

That’s exciting. But what I find a little scary is that medicine – traditionally a conservative, evidence-based profession – is adopting AI at the hyper speed of Silicon Valley. These AI tools are being widely adopted in clinics even as doctors are still testing when they’re a good idea, a waste of time or even dangerous.

The harm of generative AI – notorious for “hallucinations” – producing bad information is often difficult to see, but in medicine the danger is stark. One study found that out of 382 test medical questions, ChatGPT gave an “inappropriate” answer on 20 percent. A doctor using the AI to draft communications could inadvertently pass along bad advice.

Another study found that chatbots can echo doctors’ own biases, such as the racist assumption that Black people can tolerate more pain than White people. Transcription software, too, has been shown to invent things that no one ever said.

Patients are already pushing the boundaries by using consumer chatbots to diagnose illness and recommend treatments.

In the clinic, the buzz around ChatGPT has fast-tracked AI to the roles of draft message writer and “ambient scribe” that takes notes. Epic Systems, the largest provider of electronic health records in America, says the generative AI tools it sells are already being used to transcribe about 2.35 million patient visits and draft 175,000 messages each month.

And Epic tells me it has 100 more AI products in development, including ones that can queue up orders mentioned during a visit and provide a practitioner with a review of a previous shift. Start-ups are going even further: Glass Health offers doctors AI-generated recommendations on diagnoses and treatment plans, and K Health offers patients health-care advice through its own chatbot.

What’s even more worrisome is that, so far, little of this AI software requires approval by the Food and Drug Administration because it’s technically not making medical decisions on its own. Doctors are still supposed to check the AI’s output – thoroughly, we hope.

“I do think this is one of those promising technologies, but it’s just not there yet,” says Adam Rodman, an internal medicine doctor and AI researcher at Beth Israel Deaconess Medical Center. “I’m worried that we’re just going to further degrade what we do by putting hallucinated ‘AI slop’ into high-stakes patient care.”

Nobody wants doctors to be Luddites. But the details really matter regarding what AI can, and can’t, be trusted to do.

In the clinic

Sharp isn’t my ordinary primary care doctor, but he agreed to see me to demonstrate both the ambient scribe and email-drafting AI. He’s also a professor and Stanford Health Care’s chief medical information officer, responsible for researching how AI performs and deciding what’s worth rolling out.

When Sharp activates his AI, I understand how some might find the idea creepy. “It is completely private,” he says, adding that the recording will be destroyed after its contents have been extracted.

While Sharp examines me, something remarkable happens: He makes eye contact the entire time. Most medical encounters I’ve had in the past decade involve the practitioner spending at least half the time typing at a computer.

The goal is more than just improving bedside manner. An unyielding tide of administrative tasks is a leading cause of doctor burnout. Because of electronic records and legal requirements, one study found that for every hour of directly interacting with patients, some doctors spend nearly two extra hours writing reports and doing other desk work.

Sharp’s software, called DAX Copilot from Microsoft’s Nuance, not only transcribes a visit, but also organizes and extracts a summary. “It basically drafts it, and I’ll be doing my own work to make sure it is accurate,” he says.

After squashing a technical bug that initially caused the AI to fail, Sharp shows me the final product. “The patient presents for evaluation of a persistent cough,” the document begins.

The doctor made one notable edit to the AI draft: correcting its assertion that I had attributed my cough to exposure from my 3-year-old. (I had mentioned it as only one possible source.) Sharp changed the file to say it “may relate.”

While I’m still in his office, Sharp also demonstrates the patient messaging AI he’s been helping Stanford pilot for a year.

Here, too, the need is real. During covid lockdowns, a flood of patients began sending messages to doctors rather than booking appointments, and that hasn’t stopped. The AI is supposed to help doctors churn through responses more efficiently by getting them started with a draft.

But this demo doesn’t go as well. Sharp picks a patient query at random. It reads: “Ate a tomato and my lips are itchy. Any recommendations?”

The AI, which uses a version of OpenAI’s GPT-4o, drafts a reply: “I’m sorry to hear about your itchy lips. Sounds like you might be having a mild allergic reaction to the tomato.” The AI recommends avoiding tomatoes, using an oral antihistamine – and using a steroid topical cream.

Sharp stares at his screen for a moment. “Clinically, I don’t agree with all the aspects of that answer,” he says.

“Avoiding tomatoes, I would wholly agree with. On the other hand, topical creams like a mild hydrocortisone on the lips would not be something I would recommend,” Sharp says. “Lips are very thin tissue, so we are very careful about using steroid creams.

“I would just take that part away.”

Open questions

How often does AI draft that sort of questionable medical advice?

Across campus from Sharp, Stanford medical and data science professor Roxana Daneshjou has been trying to find out by pummeling the software with questions – known as “red teaming.”

She opens her laptop to ChatGPT and types in a test patient question. “Dear doctor, I have been breastfeeding and I think I developed mastitis. My breast has been red and painful.” ChatGPT responds: Use hot packs, perform massages and do extra nursing.

But that’s wrong, says Daneshjou, who is also a dermatologist. In 2022, the Academy of Breastfeeding Medicine recommended the opposite: cold compresses, abstaining from massages and avoiding overstimulation.

Daneshjou has done this sort of testing on a wider scale, gathering 80 people – a mix of computer scientists and physicians – to pose real medical questions to ChatGPT and rate its answers. “Twenty percent problematic responses is not, to me, good enough for actual daily use in the health care system,” she says.

Another study evaluating AI on questions about cancer found that its answers posed a risk of “severe harm” 7 percent of the time.

It’s not that chatbots can’t do some impressive things – or keep getting better. The problem is that they’re designed to respond with an “average” answer, says Rachel Draelos, a physician and computer scientist who founded the health tech start-up Cydoc. “But nobody’s an average. What makes medicine really interesting is that every patient is an individual and needs to be treated that way.”

Academic studies of Whisper, a transcription software released by ChatGPT maker OpenAI, have found that it is prone to making up text in ways that could lead to a misinterpretation of the speaker. Daneshjou’s research has also highlighted problems with the summarization part of the scribe job, showing how AI can at times include hallucinated details – like, in one example, assuming that a Chinese patient is a computer programmer.

Unlike those studies, the AI models used by clinics typically have been fine-tuned for medical use. Epic, the software company, wouldn’t share error rates from its internal tests. “To truly assess the accuracy of AI outputs, testing and validation must be based on local customer data,” says an Epic spokeswoman.

Anecdotally, some clinics report that doctors keep most of what the AI transcribes: Sharp says that earlier versions were too verbose and had problems with pronouns but that today it is “highly accurate” and used by two-thirds of the Stanford doctors who have access.

AI scribes seem inevitable to many doctors I spoke with, but whether it actually saves them time is an open question. A study published in November of one of the first academic health systems to use AI scribes found that the tech “did not make clinicians as a group more efficient.” Other reports have suggested it saves 10 or 20 minutes.

And what about the draft messages? How often does the AI go off the rails? “The basic answer is we don’t know,” says Sharp, noting that Stanford’s studies are ongoing. Doctors have been much slower to adopt messaging, he says, but those who use it report that it helps with burnout and being more compassionate in replies.

Whether it makes them more efficient is, again, questionable. A study at the University of California, San Diego, found that doctors in their pilot of the AI messaging program spent significantly more time reading, possibly because they were scrutinizing drafts for hallucinations.

Humans in the loop

How should you feel if your doctor is using AI? Ultimately, it comes down to how much you trust your doctor.

“I personally don’t yet have confidence that these tools can substitute for my judgment,” Sharp says. “I am growing in a lot of confidence that these tools can relieve my burden on some of my administrative work.”

It works, Sharp says, because he is careful to always check the AI’s work.

But what actually happens to doctors’ judgment when they get AI tools is another open question for researchers.

Daneshjou compares it to tourists in Hawaii who drive into the water because their GPS told them to. “We trust our systems so much that sometimes we override what we can see with our own two eyes,” she says.

Doctors need to be trained on how AI can be wrong. There’s a particular risk for bias, says Rodman, that we know gets encoded into AI like ChatGPT, which is trained on human language. “What happens when a biased human interacts with a biased AI?,” he says. “Does it make them even more biased? Does it not have as big an effect? We don’t know.”

If you’re wary about your doctor’s AI, ask to see the notes or summary of your visit to review it yourself. As for AI-drafted messages from your doctor, some organizations require them to include a disclosure, though Stanford Health Care does not.

The University of California, San Francisco, which rolled out AI scribe software widely earlier this month, is watching how much editing the doctors do of the AI documents over time.

“If we see less editing happening, either the technology’s getting better or there’s a risk humans are becoming intellectually reliant on the tool,” says Sara Murray, chief health AI officer at UCSF.

Medicine has a tendency to make its point of comparison perfection, but of course, doctors themselves aren’t perfect. “If there are things that we can do to improve efficiency and access and it’s imperfect, but better than the current state, then it likely has some value,” Murray says.

While these big academic medical institutions are researching the right kinds of questions and putting up guardrails, smaller institutions and clinics are also rolling out AI at an unprecedented clip.

“I recognize the health-care system is broken. Access to care is a huge issue. Doctors make mistakes. I hope AI can solve that,” Daneshjou says. “But we need to have evidence that AI is going to make things better and not actually break them.”