This is predictable:
Commonly used AI models fail to accurately diagnose or offer advice for many queries relating to women’s health that require urgent attention.
A group of 17 women’s health researchers, pharmacists and clinicians from the US and Europe drew up an initial list of 345 medical queries across five areas, including emergency medicine, gynaecology and neurology. These experts then reviewed the answers provided by a randomly chosen AI model for each question. Those that led to inaccurate responses were collated into a benchmarking test of AI models’ medical expertise that included 96 queries.
This test was then used to assess 13 large language models, produced by the likes of OpenAI, Google, Anthropic, Mistral AI and xAI. Across all the models, some 60 per cent of questions were answered in a way the human experts had previously said wasn’t sufficient for medical advice. GPT-5 performed best, failing on 47 per cent of queries, while Ministral 8B had the highest failure rate of 73 per cent. [“AI chatbots miss urgent issues in queries about women’s health,” Chris Stokel-Walker, NewScientist (17 January 2026, paywall).]
Just as doctors trained by grifters touting silver nitrate would be ineffective, even deadly, the same goes for AI (or, more accurately, machine-learning programs) trained on data from the Web, which is notorious for harboring those whose lust for wealth erases any inclination towards honesty or effective product development. This is just an expensive lesson in Garbage In, Garbage Out (GIGO).
It’s something to keep in mind if you’re doing your own research – doing it well is a lot harder than regurgitating the first item that pops up in a search. Getting it wrong can kill you.
So consult reputable, trained professionals.
