
Research - 18.12.2025 - 14:00 

New benchmark shows: AI understands finance but is often blind when searching for information

AI has great potential in financial analysis. In a recent study by the University of St. Gallen (HSG), a new benchmark was developed to test the performance of large language models in reviewing and interpreting annual reports. The study showed that current AI models are good at interpreting but poor at searching for important information.

Every year, financial analysts face the mammoth task of sifting through hundreds of complex annual reports in order to make informed investment decisions. This time-consuming and demanding task seems predestined for automation through artificial intelligence. To test how well current large language models (LLMs) perform this task, doctoral student Jan Spörer from the School of Computer Science at HSG (SCS-HSG) developed the new “Financial Touchstone” benchmark. It contains 2,878 question-answer pairs relating to 480 international annual reports. These reports were not included in the original training of the AI models, so the LLMs tested could not retrieve the answers from their trained “memory” and had to look them up in the annual reports. Jan Spörer then used this benchmark to test eleven LLMs, including reasoning models such as Google's Gemini 2.5 Pro, Anthropic's Claude Opus, and OpenAI's o3.
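The two metrics reported below, accuracy and hallucination rate, can be illustrated with a minimal scoring sketch. The grading function and the toy records are illustrative placeholders, not the study's actual evaluation protocol, which would require semantic answer matching rather than exact string comparison.

```python
# Minimal sketch: scoring a QA benchmark for accuracy and hallucination
# rate. Exact string matching and the boolean "fabricated" flag are
# simplifying assumptions, not the study's grading method.

def score(records):
    """records: list of (model_answer, gold_answer, is_fabricated) tuples."""
    n = len(records)
    correct = sum(1 for ans, gold, _ in records if ans == gold)
    hallucinated = sum(1 for _, _, fabricated in records if fabricated)
    return correct / n, hallucinated / n

# Toy example: 4 of 5 answers correct, 1 answer fabricated
records = [
    ("12.4M", "12.4M", False),
    ("3 segments", "3 segments", False),
    ("2022", "2021", True),   # model invented a figure not in the report
    ("J. Doe", "J. Doe", False),
    ("EUR", "EUR", False),
]
accuracy, hallucination_rate = score(records)
print(accuracy, hallucination_rate)  # 0.8 0.2
```

In the study itself, the corresponding figures for Gemini 2.5 Pro were 91.6% accuracy and a 3.2% hallucination rate, measured over 2,878 question-answer pairs.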

Better understanding than humans

The results of the study showed a remarkably high level of accuracy for AI. Gemini 2.5 Pro led the models with 91.6% correct answers, and it hallucinated, i.e., invented information, in only 3.2% of its answers. The language models even surpassed the measured human performance of 82.8% correct answers. However, the study revealed a crucial weakness of AI financial analysts: the main cause of their incorrect answers was not a lack of comprehension, but the initial step of gathering information. Two-thirds of all errors could be attributed to problems retrieving relevant information from the often very long annual reports. “AI has difficulty finding the needle in the haystack,” says Prof. Dr. Siegfried Handschuh, who provided technical support for the work.

Mapping annual reports

To circumvent this bottleneck, the researchers tested a more advanced approach to retrieving information in a separate follow-up study: the so-called GraphRAG method. This second study was conducted jointly by Jan Spörer, Michael Gaus, and Prof. Dr. Siegfried Handschuh. The GraphRAG method first creates a “map” for each of the extensive annual reports. A language model first extracts all the important facts from the annual report (e.g., financial figures, business areas, legal entities) and how they relate to each other. This information is then organized in a knowledge graph, with the facts represented as nodes and the relationships as edges. Special algorithms are then used to divide this graph into thematic groups, and the LLM creates comprehensive summaries for each of these groups at various levels of detail. When a complex question is asked, the system no longer has to search through the entire raw text, but instead navigates through the structured “map” and uses the summaries to compile information. This approach is particularly valuable for finding answers that are spread across different sections, such as consolidated results across multiple business segments.
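The pipeline described above can be sketched in a heavily simplified form. In this sketch, the fact triples are toy data (a real system would have an LLM extract them from the report text), connected components stand in for the more sophisticated community-detection algorithms the study refers to, and joining node names stands in for LLM-written group summaries.

```python
# Minimal GraphRAG-style sketch: facts -> knowledge graph -> thematic
# groups -> summaries -> summary-based retrieval. All data and the
# grouping/summarization steps are simplified placeholders.
from collections import defaultdict

# Step 1: facts "extracted" from an annual report as (subject, relation, object)
triples = [
    ("Segment A", "reports", "Revenue 120M"),
    ("Segment B", "reports", "Revenue 80M"),
    ("Segment A", "part_of", "Group"),
    ("Segment B", "part_of", "Group"),
    ("Legal Entity X", "domiciled_in", "Switzerland"),
    ("Legal Entity Y", "domiciled_in", "Switzerland"),
]

# Step 2: build the knowledge graph as an undirected adjacency list
graph = defaultdict(set)
for subj, rel, obj in triples:
    graph[subj].add(obj)
    graph[obj].add(subj)

# Step 3: divide the graph into thematic groups; connected components
# stand in here for real community-detection algorithms
def components(graph):
    seen, groups = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(graph[node] - group)
        seen |= group
        groups.append(group)
    return groups

# Step 4: summarize each group (an LLM would write real prose summaries;
# joining the member node names is a placeholder)
summaries = [" / ".join(sorted(g)) for g in components(graph)]

# Step 5: answer a question by scanning the summaries, not the raw text
def relevant(question_terms):
    return [s for s in summaries if any(t in s for t in question_terms)]

print(relevant(["Revenue"]))
```

Note how a question about revenue retrieves a single summary that already aggregates both business segments, which is exactly the cross-section aggregation the article highlights as GraphRAG's strength.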

Better information retrieval but more hallucinations

The results of the GraphRAG method on the Gemini 2.5 Pro model are promising: the accuracy of answering questions increased by 2.1 percentage points. This shows that the knowledge graph structure actually helps to capture complex relationships and aggregate answers across different sections of a document. However, the GraphRAG method also has a downside: the rate of hallucinations increased by 6.1 percentage points. “This suggests that the group summaries can sometimes confuse the model,” says Siegfried Handschuh.

The results of the two studies are also relevant for the latest language and reasoning models such as Gemini 3: "It is to be expected that the latest models will continue to improve in terms of pure comprehension and reasoning performance. However, the studies show that the central bottleneck in analyzing very long and complex documents is not comprehension itself, but rather the reliable finding and merging of relevant information," says Siegfried Handschuh. GraphRAG represents a promising direction for future research, as it enables more efficient, comprehensive, and accurate information retrieval.

