While AI models can break down problems into structured steps, new research reveals they still fail at basic arithmetic and fact-checking—raising questions about their true reasoning abilities.
Large Language Models (LLMs) have become indispensable in natural language processing, excelling at tasks such as sentiment analysis, reading comprehension, and answering factual questions. However, their ability to perform complex, multi-step reasoning remains a significant challenge, particularly in question-answering tasks that demand logical inference rather than simple recall. This study, authored by Nick Ferguson, Liane Guillou, Alan Bundy, and Kwabena Nuamah from the University of Edinburgh and Aveni, examines the extent to which LLMs can engage in two distinct forms of reasoning: meta-level and object-level reasoning.
Understanding Meta-Level and Object-Level Reasoning
Meta-level reasoning involves high-level strategic thinking, including problem decomposition and the formulation of intermediate steps necessary to solve a question. Object-level reasoning, in contrast, refers to the execution of these steps, such as performing mathematical calculations, retrieving specific facts, or applying symbolic logic. To evaluate the capabilities of LLMs in these areas, the authors introduce FRANKLIN, a novel dataset that explicitly requires models to engage in both reasoning types. FRANKLIN is inspired by the FRANK system, a symbolic reasoning framework for question answering, and focuses on geopolitical indicators such as population trends, economic metrics, and regional comparisons. Alongside three established multi-step question-answering datasets, FRANKLIN serves as a benchmark for testing the performance of four specific LLM versions: Meta’s Llama 3.1 8B, Microsoft’s Phi 3.5 Mini, Google’s Gemma 2 9B, and OpenAI’s GPT-4o-mini. Through two human annotation studies, the researchers assess whether LLMs can successfully generate reasoned responses and whether prompting them to plan their answers before execution improves their performance.
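To make the distinction concrete, the split between planning and execution can be sketched in a few lines of Python. This is an illustrative sketch only, not the FRANK system or the authors' code: the `Step` class, the `make_plan` and `execute` functions, the country name, and the population figures are all hypothetical and invented for the example.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical figures, invented purely for illustration (not real statistics).
POPULATION = {
    ("Atlantis", 2010): 1_000_000,
    ("Atlantis", 2020): 1_200_000,
}


@dataclass
class Step:
    description: str
    run: Callable[[dict], float]  # receives the results of earlier steps


def make_plan(country: str, start: int, end: int) -> list[Step]:
    """Meta-level: decide which steps are needed, without computing anything."""
    return [
        Step(f"Retrieve population of {country} in {start}",
             lambda ctx: POPULATION[(country, start)]),
        Step(f"Retrieve population of {country} in {end}",
             lambda ctx: POPULATION[(country, end)]),
        Step("Compute percentage change between the two values",
             lambda ctx: (ctx[1] - ctx[0]) / ctx[0] * 100),
    ]


def execute(plan: list[Step]) -> float:
    """Object-level: carry out each step (fact retrieval, then arithmetic)."""
    ctx: dict[int, float] = {}
    for i, step in enumerate(plan):
        ctx[i] = step.run(ctx)
    return ctx[len(plan) - 1]


plan = make_plan("Atlantis", 2010, 2020)
answer = execute(plan)
print([s.description for s in plan])
print(f"{answer:.1f}%")
```

In this framing, a model that produces a sensible list of steps has succeeded at the meta level; whether the retrieved values and the final arithmetic are correct is a separate, object-level question.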
How LLMs Approach Reasoning Tasks
The study situates its analysis within the broader context of LLM reasoning tasks. As a cognitive function, reasoning encompasses logical deduction, belief revision, and inference-making. Common sense reasoning requires an understanding of everyday concepts and the ability to infer implicit knowledge. Mathematical reasoning demands numerical operations and logical problem-solving, while symbolic reasoning involves rule-based manipulations, such as emulating formal logic or deducing relationships between abstract entities. Multi-step reasoning is particularly significant, as it necessitates the sequential application of inference processes to arrive at a final answer. Despite their advancements, LLMs often struggle with these tasks because they rely on statistical pattern-matching rather than genuine logical deduction.
Existing techniques attempt to improve LLM performance on reasoning tasks. Fine-tuning involves additional training on domain-specific datasets to enhance accuracy in particular tasks, while prompting techniques such as Chain-of-Thought (CoT) introduce explicit reasoning steps into model responses. These approaches have demonstrated improvements, yet doubts remain as to whether LLMs are genuinely reasoning or merely imitating structured thought patterns learned from their training data. The authors propose a more structured classification of LLM reasoning, distinguishing between meta-level and object-level processes. While meta-level reasoning involves planning, selecting relevant knowledge sources, and determining the steps required to solve a problem, object-level reasoning focuses on accurate execution, including factual retrieval, numerical precision, and logical deductions.
FRANKLIN Dataset: A New Challenge for LLMs
To assess these reasoning types, the study introduces the FRANKLIN dataset, inspired by the FRANK system, which employs explicit symbolic reasoning to solve complex questions. FRANKLIN consists of complex questions requiring both meta- and object-level reasoning, particularly in the domain of geopolitical indicators. It includes scenarios involving future projections, regional comparisons, and historical trends. Each question is paired with a detailed explanation outlining the necessary reasoning steps. Unlike more straightforward fact-retrieval datasets, FRANKLIN poses a significant challenge for LLMs because it requires them not only to determine the appropriate strategy for answering a question but also to accurately retrieve and manipulate the relevant data.
How LLMs Were Evaluated: Two Human Annotation Studies
The evaluation design consists of two human annotation studies. In the first, LLMs were prompted to directly answer questions, allowing assessment of their object-level reasoning abilities. In the second, models were first asked to generate a plan before executing their reasoning steps, testing their meta-level reasoning skills. Participants rated responses based on their coherence, correctness, and the presence of structured reasoning. The study also introduced three key evaluation metrics:
- Answer Failure Rate (AFR) – the percentage of cases where an LLM provided no attempted answer.
- Rational Approach Rate (RAR) – the proportion of responses that outlined a coherent problem-solving approach.
- Plan Creation Rate (PCR) – the percentage of responses that structured their reasoning in a clear, step-by-step manner.
The results reveal a clear divergence in LLM performance between these two reasoning levels.
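As a rough illustration of how the three rates defined above could be computed from annotator judgements, consider the following sketch. The `Annotation` record, its field names, and the sample data are assumptions made for this example; the article does not describe the study's actual annotation schema or how multiple annotators' judgements were aggregated.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """One human judgement of one model response (field names are assumptions)."""
    answered: bool           # the response attempted an answer
    rational_approach: bool  # the response outlined a coherent approach
    has_plan: bool           # the reasoning was laid out step by step


def rate(flags: list[bool]) -> float:
    """Percentage of True values; 0.0 for an empty list."""
    return 100.0 * sum(flags) / len(flags) if flags else 0.0


def summarise(annotations: list[Annotation]) -> dict[str, float]:
    return {
        "AFR": rate([not a.answered for a in annotations]),      # no attempted answer
        "RAR": rate([a.rational_approach for a in annotations]),
        "PCR": rate([a.has_plan for a in annotations]),
    }


# Toy annotations, invented for illustration only.
sample = [
    Annotation(answered=True, rational_approach=True, has_plan=True),
    Annotation(answered=True, rational_approach=True, has_plan=False),
    Annotation(answered=False, rational_approach=False, has_plan=False),
    Annotation(answered=True, rational_approach=False, has_plan=False),
]
scores = summarise(sample)
print(scores)
```

Note that a lower AFR is better, while higher RAR and PCR indicate stronger meta-level behaviour.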
Key Findings: Meta-Level Strength, Object-Level Weakness
Across all datasets, LLMs consistently demonstrated strong meta-level reasoning. Responses often contained structured, step-by-step explanations that human annotators rated as rational and interpretable. Even for complex questions in FRANKLIN, models exhibited an ability to break down problems into intermediate steps and articulate a plan for solving them. However, while these responses appeared structured, the study raises concerns about whether they represent true reasoning or simply an imitation of learned patterns.
In contrast, LLMs struggled significantly with object-level reasoning. Failures were frequent, particularly when questions required numerical precision or factual recall. Even when models successfully identified the correct reasoning path, they often failed to follow through with accurate computations or fact retrieval. In FRANKLIN, for example, error patterns included:
- Fabricating numerical data (e.g., citing non-existent sources).
- Retrieving inaccurate or imprecise information (e.g., rounding values incorrectly).
- Performing incorrect calculations (even for simple arithmetic operations).
A closer analysis of errors highlights the nature of these failures. Some responses contained entirely fabricated data, where models cited non-existent sources or invented statistical figures. Others retrieved information with reduced precision, rounding values or omitting key details necessary for accurate comparisons. In mathematical tasks, models often produced incorrect calculations, even for simple operations. These findings suggest that while LLMs can structure their responses in a way that appears logical, they lack the robust execution skills necessary to reliably generate correct answers in domains requiring object-level reasoning.
Implications for LLM Development
The findings have significant implications for the development of LLMs. While prompting models to engage in meta-level reasoning improves their ability to articulate coherent strategies, it does not address their deficiencies in object-level reasoning. This suggests that future advancements must focus on integrating external symbolic reasoning components, improving factual retrieval mechanisms, and refining numerical processing capabilities. The FRANKLIN dataset serves as a critical benchmark, demonstrating that even models with strong problem-decomposition skills struggle with execution.
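One way to picture an external symbolic component is a small deterministic evaluator that recomputes any arithmetic a model states, rather than trusting the model's own result. The sketch below is an assumption about what such a component might look like, not something proposed in the study; the `evaluate` and `check_claim` functions are hypothetical names, and the evaluator only handles the four basic operations over numeric literals.

```python
import ast
import operator

# Supported binary operators; anything else is rejected.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(expr: str) -> float:
    """Evaluate +, -, *, / over numeric literals without calling eval()."""
    def walk(node: ast.AST) -> float:
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return float(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expr!r}")

    return walk(ast.parse(expr, mode="eval"))


def check_claim(expr: str, claimed: float, tol: float = 1e-6) -> bool:
    """Return True if the model's claimed result matches the recomputed value."""
    return abs(evaluate(expr) - claimed) <= tol


print(check_claim("17 * 23", 391))  # correct claim
print(check_claim("17 * 23", 381))  # arithmetic slip
```

A checker like this addresses only the arithmetic error pattern; fabricated or imprecise source values would still need a separate retrieval or verification step.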
Conclusion: The Path Forward for AI Reasoning
In conclusion, the study highlights a critical distinction in the reasoning capabilities of LLMs. While they can effectively plan and structure problem-solving approaches, their ability to execute complex reasoning tasks remains limited. The study’s findings emphasize that LLMs are proficient at mimicking reasoning structures but not necessarily reasoning in a human-like, cognitive sense. The introduction of FRANKLIN offers a new means of evaluating these deficiencies, laying the groundwork for further research into improving LLM performance in multi-step question answering. The results underscore the need for continued refinement in how LLMs handle object-level reasoning, ensuring that future iterations can move beyond surface-level imitation and towards genuine cognitive reasoning abilities.
- Preliminary scientific report. Ferguson, N., Guillou, L., Bundy, A., & Nuamah, K. (2025). Evaluating the Meta- and Object-Level Reasoning of Large Language Models for Question Answering. ArXiv. https://arxiv.org/abs/2502.10338
 