Risks of automation in medicine - a review article using obstetrics as a case


Joris Fournel1, Aasa Feragen1 & Martin Tolsgaard2
The current overoptimism surrounding AI threatens to disrupt clinical environments. We discuss the risks related to invasive automation and describe how exaggerated claims of AI performance may lead to clinical errors. We also investigate cognitive harms to clinicians, automation bias and de-skilling, by exploring their relation to well-established psychological principles. Finally, we propose solutions for automation that genuinely benefits medicine: generalise model bias assessments and quality control, and ensure that AI integration does not erode clinical expertise.
Will "artificial intelligence" (AI) systems benefit or disrupt clinical practice? Is the automation of clinical decisions going to improve or worsen clinical care? Are AI systems going to make doctors more or less competent?
The current overoptimistic atmosphere is detrimental to answering these questions. Claims of extraordinary AI performance far outpace reports of negative results, which are hardly ever published [1]. On social media, positive sentiments towards AI in medical imaging are expressed five times more often than negative ones [2]. Some have described this trend as "hype" [3], whereas Sam Altman, the CEO of OpenAI, has characterised the broader AI atmosphere as a "bubble" [4].
Hype and enthusiasm are emotional states that distort our perception of reality, leading us to overlook dangers in hasty decisions. Therefore, this work proposes a sober examination of the potential dangers of automation, which is urgently needed to avoid disrupting clinical environments as fragile as the human lives they care for.
We discuss how introducing automation – a machine performing a task instead of a clinician – may harm clinical practice: automation is not justified a priori; model errors lead to clinical errors (technical risks); and clinicians' judgment is impaired by AI usage (cognitive risks).
The main concern uniquely addressed in this paper is de-skilling: expanding beyond existing work [5], we relate it to cognitive psychology and examine its longer-term consequences for clinicians.
Specifically, we focus on AI in obstetrics as a use case representing a clinical specialty that draws on imaging, surgery and decision-making. Whereas most published literature on AI in medicine has been rooted in radiology or pathology, AI and automation are increasingly used in clinical specialties such as obstetrics, cardiology and surgery.
Is automation always a good idea?
Before introducing a machine, one should have solid reasons to do so. Unjustified automation will always do more harm than good. For example, focusing on the automated task in isolation from its broader environment may cause us to overlook downstream negative impacts. Furthermore, the arguments provided when introducing automation may reinforce a false and insidious narrative about AI or doctors. Reviewing how researchers generally motivate new automation may therefore alert us to existing trends.
The body of research in obstetrics provides a good basis for analysing these arguments. Automated models have been developed for probe guidance [6], standard plane identification [7], foetal biometric measurement [8], Doppler information extraction [9], anomaly highlighting [10], gestational age prediction [11], birth weight prediction [12] and prognosis including preterm [13], intra-uterine growth restriction [14] and pre-eclampsia prediction [15]. Deep learning has also been applied in foetal cardiotocography and foetoscopic surgery [16, 17].
Here, we review and classify authors’ justifications for developing models found in this literature, ordered from what we consider the most legitimate to the most questionable, both in terms of validity and overshadowed consequences:
1. Screening and risk stratification improved by machines. Obstetric medicine occasionally relies on simple (but powerful) risk stratification parameters. Automated models can help clinicians discover and extract new risk parameters - for example, by refining baseline cervical-length thresholds in preterm birth screening [13].
2. Improve clinician training by incorporating automated support systems. For example, Lei et al. reported that a trainee cohort achieved prenatal screening quality requirements in significantly fewer training cycles when assisted by an automated system. Similarly, improved operator performance in perinatal ultrasound screening has been described [18]. This is a particularly desirable application because it strengthens clinical expertise rather than replacing it.
3. Retrieve information from healthcare records. When clinicians cannot scroll through a patient's history to find relevant information, a model can be used to query it. ChatEHR is an AI-based software currently being piloted at Stanford Medicine [19]. Such applications are certainly a net gain, provided they do not introduce tool dependency. However, AI summarisation (which ChatEHR also claims to do) differs from retrieval: it carries a risk of grounding clinical decisions in incorrect, hallucinated "information".
4. Free clinicians from "time-consuming" tasks. Some tasks are assumed by authors to be an obstacle to higher-level tasks [20]. This is also described as "reducing the workload" or "reducing the duration of the examination" [9]. For example, Matthew et al. reported an average time saving of 7.62 minutes per scan from adopting an AI-assisted approach for biometric measurements and plane detection [8]. However, some "time-consuming" tasks may be essential for clinicians to develop their skills.
5. Replace the trained clinician. For example, in a study entitled "No sonographer, no radiologist", Arroyo et al. advocate for sonographer-less obstetric care in rural and under-resourced communities [21], while Aguado et al. argue for Doppler image analysis "even for non-trained readers" [9]. Similarly, Ramirez Zegarra et al. recommend automation because ultrasound acquisition "requires years of training and extensive knowledge of foetal anatomy" [22]. To support this argument, a context of "global shortage of imaging experts" is invoked [20].
6. "Machines are intrinsically superior to humans". This justification might seem surprising initially, but it is one of the most frequently suggested. Machines are said to have reproducibility, absolute consistency over time, whereas clinicians´ "performance" does not possess those attributes [20]. Human intelligence is presented as "error-prone",, subjective and altered by fatigue, whereas machines introduce reproducibility, absolute consistency over time and never get tired [20]. Overall detection rates of fetal malformations are described as "low" due to the "human factor" [22], even when cited studies report prenatal detection "already accounting for 50% or more of critical congenital heart defects detected in many programmes" and as "increasing", with some very high rates in certain areas (87% in France). Intraobserver variability in foetal biometry measurements is described as "high" [22] when the reference study reports 3.0% to 6.0% differences. This narrative is not specific to obstetric medicine, as outperformance claims over humans have become common, as in this title: "ChatGPT with GPT-4 outperforms emergency department physicians in diagnostic accuracy: retrospective analysis" [23]. As Drogt et al. rightly note: "these outperformance claims often lack specificity, contextualisation and empirical grounding" [24]. Even so, the idea that machines surpass clinicians’ intelligence and skill is repeated and used as a major implicit argument for automation. This systematic depreciation of human performance may deter students from engaging in clinical training, affect doctors' morale and lead to unrestrained automation.
Let us now suppose that a properly justified automation reaches the clinic. The machine can still produce errors that go unnoticed and feed into clinical decision-making. Below, we review the technical and systemic factors that increase the likelihood of such occurrences and explain how some of them can be addressed.
First, this risk is documented. In a study by Matthew et al., AI tools saved a satisfactory set of 13 ultrasound views in 73% of cases, whereas manual scanning achieved a 98% success rate [8]. Another study estimating gestational age reported mean errors ranging from 1.45 to 7.73 days when using poorly segmented images [25]. Such errors can have clear and serious clinical implications, such as missed post-term pregnancies.
In particular, the likelihood of these errors increases in ultrasound imaging due to several modality-specific factors: noisy images (speckle noise); weak contrast between tissues compared with magnetic resonance imaging (MRI) or computed tomography (CT); large variations depending on the patient, foetal position, probe type, probe angle, applied pressure and coupling quality; non-isotropic pixel spacing (which varies with direction) and partial fields of view; and varying numbers of images and order of appearance of structures per session. Additionally, overlaid text and callipers increase the risk of shortcut learning, whereby machines rely on misleading cues and fail to generalise beyond the training set [26].
Second, skewed validation increases the likelihood of trusting errors in practice. More often than not, models are evaluated with unrealistic test datasets that foster performance exaggeration and unawareness of model biases [27]. This distortion is aggravated by publication bias, where studies reporting "state of the art" performance are more likely to be published, regardless of reproducibility [1, 28].
A model is biased when its performance reliably declines on certain subgroups, defined, for example, by ethnicity, body mass index (BMI), image quality or machine (ultrasound machines are replaced more frequently than other devices [29]). Without transparency about these biases, the user has no insight into which patients can be safely assessed by the model. To date, however, studies that evaluate model bias have been the exception rather than the rule [7, 30]. Detailed bias analyses should become standard practice. Then, depending on the context, the user can make an informed decision about whether the risk of error is acceptable.
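The kind of subgroup bias assessment advocated above can be sketched in a few lines. The following Python example is purely illustrative and not from any cited study: the subgroup labels, the BMI split and the 5-percentage-point tolerance are hypothetical choices made for the sake of the example.

```python
# Illustrative sketch of a subgroup bias report (hypothetical data and thresholds).
from collections import defaultdict

def subgroup_performance(records):
    """records: iterable of (subgroup_label, prediction_was_correct) pairs.
    Returns the accuracy of the model within each subgroup."""
    totals = defaultdict(lambda: [0, 0])  # subgroup -> [n_correct, n_total]
    for group, correct in records:
        totals[group][0] += int(correct)
        totals[group][1] += 1
    return {g: c / n for g, (c, n) in totals.items()}

def flag_bias(per_group, max_gap=0.05):
    """Flag subgroups whose accuracy trails the best subgroup by more than max_gap."""
    best = max(per_group.values())
    return {g: acc for g, acc in per_group.items() if best - acc > max_gap}

# Hypothetical results: the model is noticeably less accurate for high-BMI patients.
records = [("BMI<30", True)] * 90 + [("BMI<30", False)] * 10 \
        + [("BMI>=30", True)] * 70 + [("BMI>=30", False)] * 30
perf = subgroup_performance(records)
print(flag_bias(perf))  # flags the under-served subgroup
```

A report of this form, published alongside the model, would let the user decide whether the flagged subgroups make the risk of error acceptable in their own patient population.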
Other solutions for keeping model errors out of clinical workflows include automatic quality control methods that associate a quality metric with the output of a model [25] or with its input [31]. Explainable models that provide explanations along with predictions can facilitate the detection of absurd outputs [31]. Even so, clinicians have been shown to trust erroneous results even when provided with explanations [32].
Even a justified and error-free automation will be harmful if it impairs the critical judgment of its users, the clinicians. This section reviews existing evidence of this phenomenon, describes its psychological cause and suggests solutions to avoid it.
Short term: automation bias
Automation bias is the user's tendency to follow an automated system's "decision" even when it is incorrect and contradictory information is available. In a randomised clinical trial, clinicians favoured automated decision-making systems despite contradictory or clinically nonsensical information [33]. In that study, 457 clinicians were given automated support to diagnose clinical vignettes. Diagnostic accuracy increased modestly from 73% to 76% with support from a good model, but dropped from 73% to 62% with outputs from a biased model. Providing model explanations did not mitigate this harmful effect (73% to 64%). As Khera et al. rightly note, this concerning automation bias occurred in "controlled settings, without the usual pressure on time" [34]. In another study, Dratsch et al. showed that automation bias occurred regardless of the level of clinical experience [32]. Their prospective experiment asked 27 radiologists to assess 50 mammograms with AI assistance; the machine suggestion was incorrect for 12 of them. Whether the AI was correct significantly affected the percentage of correct ratings for inexperienced (80% versus 20%), moderately experienced (81% versus 25%) and even very experienced (82% versus 45%) radiologists.
Long-term effects: de-skilling and erosion of critical thinking and sense of responsibility
Erosion of clinical expertise
Until now, we have focused on the short-term effects of automation on clinicians' faculties. However, it is just as important to consider the long-term consequences of automation. These are governed by a simple principle: any knowledge or skill that is not practised is lost.
To predict the long-term impact on clinicians, examining how automation has influenced human capacities in other sectors can be helpful. For example, having digital devices remember information for us has led to digital amnesia [35] and a loss of spatial memory [36].
A notable warning comes from the experience with cockpit automation. During the 1980s, the US Congress asked NASA to investigate how automation affected pilots. First, Earl Wiener, analysing crash reports, concluded that some major accidents were caused by automation [37]. Stephen Casner from the NASA Ames Research Center then examined the issues of inattention and skill retention. By observing pilot-computer interactions in simulators, he showed that pilots' ability to make complex cognitive decisions declined measurably with automation; the more automation there was, the more pilots reported "mind wandering" or thinking about inconsequential topics [38].
More recently, researchers have begun studying AI-induced de-skilling. In a paper entitled Your Brain on ChatGPT, Kosmyna et al. followed the neural activity of essay writers for four months and observed significantly lower activity (and connectivity) in the brains of those "assisted" by ChatGPT [39]. Gerlich surveyed more than 600 people and reported a correlation coefficient of -0.68 between AI usage and critical thinking score [40].
Clinicians possess no special immunity to this long-term, insidious de-skilling [41]. AI systems only offer an illusion of thinking themselves [42] and foster the illusion of understanding in users [43]. Hence, clinicians would maintain the impression of performing the tasks themselves: de-skilling may occur without their noticing. Clinicians' cognitive perspective is far richer than the models’ correlational process [44]. This loss must be avoided.
Erosion of responsibility
Another clinical virtue that requires practice is moral decision-making, as it presupposes responsibility. This key aspect is rarely discussed [20]. Here, a well-documented psychological principle should be considered: the diffusion of responsibility, or "bystander effect" [45]. In the presence of another potential agent, individuals are less likely to assume responsibility and act autonomously, even when doing so would benefit the group as a whole [45]. The Milgram experiments showed that such blind compliance is aggravated when the other agent is perceived as an authority or an "expert" [46].
The cause is mainly psychological
Introducing a machine next to the clinician can: (1) turn a solitary worker into a perceived group of two operators; (2) offer an effortless way to reach an objective; and (3) create an apparently safe, comfortable environment in which safety nets make errors and inattention inconsequential. All three conditions have been associated with performance drops in psychology:
1. The Ringelmann effect (or social loafing). Performance and motivation have been shown to decline significantly when a perceived co-worker is added to a task, compared with working alone [47, 48].
2. The principle of least effort. This principle states that people naturally choose the path of least resistance or "effort" [49]. Humans perceive avoiding effort as gratifying [50] and reduce effort through "cognitive offloading". The principle of least effort makes people cognitive misers: "People are limited in their capacity to process information, so they take shortcuts whenever they can" [51]. For example, the mere presence of a smartphone has been shown to reduce available cognitive capacity [52]. "People look up information that they actually know or could easily learn, but are unwilling to invest the cognitive cost associated with encoding and retrieval" [53].
3. The Yerkes-Dodson law (or the necessity of pressure). This law implies that human performance declines when arousal is too low, for instance in the absence of a stimulating, competitive environment [54, 55].
Solutions
What is unacceptable is the creation of an environment that encourages idleness, mind-wandering, false security or irresponsibility; it would undermine sound medical practice. A recent review on automation in foetal ultrasound described the "ideal" sonographer-machine collaboration as one in which the system would "work in real time", and where the sonographer would utilise its outputs and identify when it fails [56]. The clinician would be the "assessor" of the machine's output. This is precisely the setting that Stephen Casner identified as undesirable for pilots: "What we're doing is using human beings as safety nets or backups to computers, and that's completely backward; it would be much better if the computing system watched us and chimed in when we do something wrong" [38, 57]. Earl Wiener reached the same conclusion regarding cockpit automation [37]. These statements align with Pascal Baltzer's reflections on automation bias in medical applications [58]. We must respect the laws of human psychology when building automated systems.
More research is needed to identify and eliminate automation bias among clinicians. The ideal configuration in which the machine actually benefits the clinician’s cognitive and moral engagement has yet to be identified. Such a result can only flow from interdisciplinary collaboration involving psychology, learning science, human-computer interaction, machine learning and medical scientific communities.
To this day, the Food and Drug Administration (FDA) leaves the door wide open to clinician de-skilling, never mentioning de-skilling or automation bias in its January 2025 guidance document [59]. Only design validation is mandatory; bias assessment and quality control are merely advised; and complete passivity of the clinician is permitted: "AI-enabled devices span a continuum of decision-making roles from more autonomous systems to supportive tools" [59].
Once researchers have identified the principles that distinguish healthy from harmful human-computer interaction, these must be translated into regulations, with the assistance of legal advisors, to ensure that integrated devices do not cause de-skilling.
This review identified risks of automation in clinical practice, using obstetrics as an illustrative example. We hope that our work will inspire meaningful reflection among clinicians and researchers. Among other undesirable outcomes, a situation in which de-skilled clinicians become completely dependent on machines owned by a few private companies must be avoided at all costs. Our main contribution is to connect established psychological principles to automation bias and de-skilling. We also highlighted the shortcomings of current regulations with regard to de-skilling.
However, a number of concerns remain. We have not discussed the crucial influence of the industrial-academic ecosystem and the associated publication bias on AI. Nor have we discussed the patient-focused risks of depersonalisation, disruption of the clinician-patient relationship, or patient deception when AI diagnostics are framed as "personalised" to overcome patients' resistance to AI.
Correspondence Joris Fournel. E-mail: jorfo@dtu.dk
Accepted 26 January 2026
Published 26 February 2026
Conflicts of interest JF reports financial support from or interest in the Novo Nordisk Foundation. AF and MT report financial support from or interest in the Novo Nordisk Foundation and Prenaital ApS. All authors have submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest. These are available together with the article at ugeskriftet.dk/DMJ.
Cite this as Dan Med J 2026;73(4):A10250852
doi 10.61409/A10250852
Open Access under Creative Commons License CC BY-NC-ND 4.0