ChatGPT Fails at Scientific Paper Summaries, Study Finds

Summary
– The AAAS conducted a year-long study to test whether ChatGPT could summarize scientific papers in the style of its SciPak news briefs.
– ChatGPT mimicked the structure of SciPak summaries but often sacrificed accuracy for simplicity.
– The AI-generated summaries required rigorous fact-checking by human SciPak writers to ensure correctness.
– The study covered 64 papers with challenging elements, summarized using three prompts of varying detail and the newest GPT models available at the time.
– Researchers acknowledged potential human bias in evaluations, as journalists assessed a tool that could impact their core job functions.
When it comes to translating dense scientific research into accessible summaries, ChatGPT struggles to deliver accurate and reliable results, according to a recent informal study. Researchers at the American Association for the Advancement of Science (AAAS) spent a year testing whether the AI could produce summaries comparable to those written by their in-house SciPak team. These briefs are crafted to help journalists quickly grasp study premises, methods, and context. While the AI managed to mimic the basic structure of a SciPak summary, it consistently sacrificed accuracy for simplicity, requiring extensive fact-checking by human writers.
The evaluation involved selecting up to two scientific papers each week from December 2023 through December 2024. The chosen studies often contained challenging elements such as technical terminology, controversial conclusions, or innovative methodologies. Using whichever GPT model was newest at the time, first GPT-4 and later GPT-4o, the team generated summaries with three distinct prompts of varying detail. In total, 64 papers were processed and assessed.
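To make the setup concrete, the sketch below shows how a pipeline like this might be wired up, assuming the OpenAI Python SDK. The system prompt, model choice, and temperature are hypothetical stand-ins; AAAS has not published the exact prompts it used.

```python
# Illustrative sketch only: the prompt and parameters here are assumptions,
# not the ones AAAS actually used.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical stand-in for one of the study's three prompts of varying detail.
SYSTEM_PROMPT = (
    "You are a science news writer. Summarize the following research paper "
    "as a brief for journalists, covering its premise, methods, findings, "
    "and context in plain language."
)

def summarize_paper(paper_text: str, model: str = "gpt-4o") -> str:
    """Request a SciPak-style news brief for a single paper."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": paper_text},
        ],
        temperature=0.2,  # favor consistent, conservative phrasing
    )
    return response.choices[0].message.content

# Example: summary = summarize_paper(open("paper.txt").read())
```

Even with a pipeline like this in place, the study's central finding stands: every machine-written draft still needed line-by-line fact-checking by a human writer.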
Human SciPak writers, who had originally summarized those same studies, evaluated the AI-generated versions using both quantitative metrics and qualitative judgment. They found that while the structure often resembled their own work, the content was prone to errors and oversimplification. The prose tended to gloss over nuance, sometimes misrepresenting findings or omitting critical context.
Abigail Eisenstadt of AAAS noted that although these tools show promise as aids for science writers, they are not yet ready for independent use in high-stakes environments. The need for rigorous human oversight remains unavoidable. One notable limitation of the study design was its inability to fully account for potential human bias, especially given that the evaluators were assessing a tool that could one day automate aspects of their own roles.
(Source: Ars Technica)