In a groundbreaking study published in the journal npj Digital Medicine, a team of researchers led by Jesutofunmi A. Omiye, Jenna C. Lester, Simon Spichak, Veronica Rotemberg, and Roxana Daneshjou has revealed alarming evidence that large language models (LLMs) may perpetuate race-based medicine, posing significant ethical and clinical risks to healthcare. The study assessed four commercially available LLMs and found that these models produced harmful, inaccurate, race-based content in their responses to medical queries.
The study, which comes as LLMs are increasingly integrated into healthcare systems, investigated whether these models propagate harmful, debunked race-based medical ideas across eight medical scenarios. The questions were formulated by four physician experts and drew on prior work documenting race-based medical misconceptions commonly held by medical trainees.
The LLMs assessed in the study were OpenAI’s ChatGPT and GPT-4, Google’s Bard, and Anthropic’s Claude, with responses obtained from different versions of these models between May 18 and August 3, 2023.
Key Findings of the Study:
- Perpetuation of Race-Based Medicine: All four LLMs produced responses that perpetuated race-based medicine or repeated unsubstantiated claims related to race. These models were not consistent in their responses when asked the same question multiple times.
- Kidney Function and Lung Capacity: The LLMs produced concerning outputs when asked about topics such as kidney function and lung capacity. Some models promoted the use of race in clinical calculations, citing false assertions that Black individuals have different muscle mass and creatinine levels (see the sketch after this list for what such a race-based calculation looks like).
- Skin Thickness and Pain Threshold: Questions about skin thickness and pain threshold differences between Black and white patients led to all LLMs sharing erroneous information about these differences, even though scientific evidence does not support such claims.
- Brain Size: In response to questions about the size of brains in Black and white individuals, all models correctly stated that there are no differences, with some models explicitly labeling such ideas as racist and harmful.
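To make the kidney-function concern above concrete, the sketch below contrasts the older race-based CKD-EPI 2009 creatinine equation, which multiplies estimated GFR by roughly 1.16 for patients recorded as Black, with the race-free 2021 refit that removed that input. This example does not come from the study itself; the coefficients are the commonly published values and are included only for illustration, not as a clinical calculator.

```python
# Illustrative comparison of race-based vs. race-free eGFR equations.
# Coefficients are the commonly cited published values; verify against the
# original CKD-EPI publications before any real-world use.

def egfr_ckd_epi_2009(scr_mg_dl: float, age: int, female: bool, black: bool) -> float:
    """2009 CKD-EPI creatinine equation, which includes a race multiplier."""
    kappa = 0.7 if female else 0.9
    alpha = -0.329 if female else -0.411
    egfr = (141
            * min(scr_mg_dl / kappa, 1) ** alpha
            * max(scr_mg_dl / kappa, 1) ** -1.209
            * 0.993 ** age)
    if female:
        egfr *= 1.018
    if black:
        egfr *= 1.159  # the race-based adjustment the study's kidney question probes
    return egfr

def egfr_ckd_epi_2021(scr_mg_dl: float, age: int, female: bool) -> float:
    """2021 CKD-EPI creatinine refit, which removed race as an input."""
    kappa = 0.7 if female else 0.9
    alpha = -0.241 if female else -0.302
    egfr = (142
            * min(scr_mg_dl / kappa, 1) ** alpha
            * max(scr_mg_dl / kappa, 1) ** -1.200
            * 0.9938 ** age)
    if female:
        egfr *= 1.012
    return egfr

if __name__ == "__main__":
    # Same creatinine, age, and sex: the 2009 equation reports a higher eGFR
    # solely because the patient is recorded as Black.
    print(round(egfr_ckd_epi_2009(1.2, 50, female=False, black=True), 1))
    print(round(egfr_ckd_epi_2009(1.2, 50, female=False, black=False), 1))
    print(round(egfr_ckd_epi_2021(1.2, 50, female=False), 1))
```

An LLM that recommends the first, race-adjusted form of this calculation is endorsing exactly the kind of debunked race-based adjustment the study flags.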
The implications of these findings are profound, given the growing interest in implementing LLMs in healthcare settings, including connections to electronic health record systems. The study suggests that these models may inadvertently perpetuate racial biases and incorrect medical assumptions, which could influence healthcare practitioners’ decision-making, leading to biased patient care and, in some cases, harmful outcomes.
The study calls for greater transparency in the development and use of LLMs, with specific attention to potential biases. It emphasizes that LLMs need further evaluation and adjustments before they can be considered safe for clinical use or integration into the healthcare field. The researchers urge medical centers and clinicians to exercise caution when using these models for medical decision-making, medical education, or patient care.
While LLMs have shown potential across various medical specialties, this study underscores the need for thorough evaluation and a rigorous assessment of potential biases before widespread deployment in healthcare. The researchers advocate larger quantitative studies to ensure patient safety and greater transparency in how LLMs are developed.
In conclusion, as the healthcare industry continues to explore the use of LLMs, the ethical and clinical concerns highlighted in this study should serve as a wake-up call to prioritize patient safety and unbiased, evidence-based care when integrating AI into healthcare systems.