Hi, I’m Mario 👋! I am a PhD student in Natural Language Processing at Johannes Gutenberg University Mainz, supervised by Katharina von der Wense in the NALA lab. Previously, I completed my BSc in Computer Science and MSc in Artificial Intelligence. I have also worked as an AI engineer in the healthcare industry.
📚 Research Interests
I’m continually impressed by how large language models, trained on the seemingly “simple” task of next‑word prediction, exhibit surprising emergent capabilities far beyond their original design.
Language Modeling & Emergent Abilities 🤖
How and why do large language models acquire surprisingly complex skills?
LLM Calibration 📊
Techniques for aligning a model’s confidence with its actual correctness.
Biomedical NLP 💊
Applying LLMs to assist with drug discovery, clinical note analysis, and literature mining.
Inteligencia Artificial ’25
Credit Risk Meets Large Language Models: Building a Risk Indicator from Loan Descriptions in P2P Lending
Peer-to-peer (P2P) lending connects borrowers and lenders through online platforms but suffers from significant information asymmetry, as lenders often lack sufficient data to assess borrowers’ creditworthiness. This paper addresses this challenge by leveraging BERT, a Large Language Model (LLM) known for its ability to capture contextual nuances in text, to generate a risk score based on borrowers’ loan descriptions using a dataset from the Lending Club platform. We fine-tune BERT to distinguish between defaulted and non-defaulted loans using the loan descriptions provided by the borrowers. The resulting BERT-generated risk score is then integrated as an additional feature into an XGBoost classifier used at the loan granting stage, where decision-makers have limited information available to guide their decisions. This integration enhances predictive performance, with improvements in balanced accuracy and AUC, highlighting the value of textual features in complementing traditional inputs. Moreover, we find that the incorporation of the BERT score alters how classification models utilize traditional input variables, with these changes varying by loan purpose. These findings suggest that BERT discerns meaningful patterns in loan descriptions, encompassing borrower-specific features, specific purposes, and linguistic characteristics. However, the inherent opacity of LLMs and their potential biases underscore the need for transparent frameworks to ensure regulatory compliance and foster trust. Overall, this study demonstrates how LLM-derived insights interact with traditional features in credit risk modeling, opening new avenues to enhance the explainability and fairness of these models.
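For a rough feel of the pipeline, here is a minimal sketch assuming a BERT checkpoint already fine-tuned on defaulted vs. repaid loans; the checkpoint name, feature arrays, and example descriptions are placeholders, not the paper’s exact setup.

```python
# Minimal sketch (placeholder data, not the paper's exact pipeline):
# 1) score each loan description with a BERT default classifier,
# 2) append that score to the tabular features used by XGBoost.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from xgboost import XGBClassifier

CHECKPOINT = "bert-base-uncased"  # assumed fine-tuned on defaulted vs. repaid loans
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
bert = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)
bert.eval()

def bert_risk_score(descriptions):
    """Probability of default assigned by the text classifier to each description."""
    inputs = tokenizer(descriptions, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = bert(**inputs).logits
    return torch.softmax(logits, dim=-1)[:, 1].numpy()

# Placeholder loan-granting data: tabular features, free-text descriptions, default labels.
X_tabular = np.random.rand(4, 5)
descriptions = [
    "Consolidating two credit cards into a single monthly payment.",
    "Need funds urgently to cover unexpected expenses.",
    "Financing a small home renovation project.",
    "Paying off medical bills from last year.",
]
y = np.array([0, 1, 0, 0])

# The BERT-generated risk score becomes one extra column next to the traditional inputs.
X_augmented = np.column_stack([X_tabular, bert_risk_score(descriptions)])
clf = XGBClassifier(n_estimators=200)
clf.fit(X_augmented, y)
```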
@article{sanz-guerrero2025credit,
  title   = {Credit Risk Meets Large Language Models: Building a Risk Indicator from Loan Descriptions in P2P Lending},
  author  = {Sanz-Guerrero, Mario and Arroyo, Javier},
  year    = {2025},
  month   = mar,
  journal = {Inteligencia Artificial},
  volume  = {28},
  number  = {75},
  pages   = {220--247},
  url     = {https://journal.iberamia.org/index.php/intartif/article/view/1890},
  doi     = {10.4114/intartif.vol28iss75pp220-247},
}
NAACL’25 Workshop
Corrective In-Context Learning: Evaluating Self-Correction in Large Language Models
In-context learning (ICL) has transformed the use of large language models (LLMs) for NLP tasks, enabling few-shot learning by conditioning on labeled examples without fine-tuning. Despite its effectiveness, ICL is prone to errors, especially for challenging examples. With the goal of improving the performance of ICL, we propose *corrective in-context learning* (CICL), an approach that incorporates a model’s incorrect predictions alongside ground truth corrections into the prompt, aiming to enhance classification accuracy through self-correction. However, contrary to our hypothesis, extensive experiments on text classification tasks demonstrate that CICL consistently underperforms standard ICL, with performance degrading as the proportion of corrections in the prompt increases. Our findings indicate that CICL introduces confusion by disrupting the model’s task understanding, rather than refining its predictions. Additionally, we observe that presenting harder examples in standard ICL does not improve performance, suggesting that example difficulty alone may not be a reliable criterion for effective selection. By presenting these negative results, we provide important insights into the limitations of self-corrective mechanisms in LLMs and offer directions for future research.
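The sketch below illustrates the idea on a toy sentiment task: each demonstration carries the model’s earlier (possibly wrong) prediction together with the gold label as the correction. The template wording, labels, and examples are illustrative, not the exact prompt format from the paper.

```python
# Toy sketch of a CICL-style prompt for sentiment classification; the template
# wording, labels, and examples are illustrative, not the paper's exact format.
def cicl_prompt(demonstrations, query_text):
    """demonstrations: list of (text, model_prediction, gold_label) tuples,
    where some predictions are wrong and the gold label acts as the correction."""
    parts = []
    for text, predicted, gold in demonstrations:
        parts.append(f"Text: {text}")
        parts.append(f"Model prediction: {predicted}")
        parts.append(f"Correct label: {gold}")
        parts.append("")
    parts.append(f"Text: {query_text}")
    parts.append("Correct label:")
    return "\n".join(parts)

demonstrations = [
    ("The plot was dull and predictable.", "positive", "negative"),  # corrected example
    ("A heartfelt, beautifully acted film.", "positive", "positive"),
]
print(cicl_prompt(demonstrations, "I left the theater smiling."))
```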
@inproceedings{cicl_2025,
  title     = {Corrective In-Context Learning: Evaluating Self-Correction in Large Language Models},
  author    = {Sanz-Guerrero, Mario and {von der Wense}, Katharina},
  booktitle = {The Sixth Workshop on Insights from Negative Results in NLP},
  month     = may,
  year      = {2025},
  address   = {Albuquerque, New Mexico},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2025.insights-1.4/},
  doi       = {10.18653/v1/2025.insights-1.4},
  pages     = {24--33},
  isbn      = {979-8-89176-240-4},
}
EMNLP’25
Mind the Gap: A Closer Look at Tokenization for Multiple-Choice Question Answering with LLMs
When evaluating large language models (LLMs) with multiple-choice question answering (MCQA), it is common to end the prompt with the string "Answer:" to facilitate automated answer extraction via next-token probabilities. However, there is no consensus on how to tokenize the space following the colon, often overlooked as a trivial choice. In this paper, we uncover accuracy differences of up to 11% due to this (seemingly irrelevant) tokenization variation as well as reshuffled model rankings, raising concerns about the reliability of LLM comparisons in prior work. Surprisingly, we are able to recommend one specific strategy – tokenizing the space together with the answer letter – as we observe consistent and statistically significant performance improvements. Additionally, it improves model calibration, enhancing the reliability of the model’s confidence estimates. Our findings underscore the importance of careful evaluation design and highlight the need for standardized, transparent evaluation protocols to ensure reliable and comparable results.
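As a quick illustration, the sketch below scores the answer letters under both tokenization strategies with a small open causal LM ("gpt2" as a stand-in for the models evaluated in the paper); it is a simplified harness, not our exact evaluation code.

```python
# Simplified sketch of the two tokenization strategies for MCQA answer extraction;
# "gpt2" is only a stand-in model and this is not the paper's evaluation harness.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

QUESTION = (
    "Q: Which planet is known as the Red Planet?\n"
    "A. Venus\nB. Mars\nC. Jupiter\nD. Saturn\n"
)

def answer_probs(prompt, candidates):
    """Next-token probability of each candidate's first token given the prompt."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)
    return {c: probs[tokenizer(c, add_special_tokens=False).input_ids[0]].item()
            for c in candidates}

# Strategy recommended in the paper: the space is tokenized together with the letter.
print(answer_probs(QUESTION + "Answer:", [" A", " B", " C", " D"]))
# Alternative: the space is appended to the prompt and the letter is tokenized alone.
print(answer_probs(QUESTION + "Answer: ", ["A", "B", "C", "D"]))
```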
@misc{sanz-guerrero2025mindthegap,
  title         = {Mind the Gap: A Closer Look at Tokenization for Multiple-Choice Question Answering with LLMs},
  author        = {Sanz-Guerrero, Mario and Bui, Minh Duc and {von der Wense}, Katharina},
  year          = {2025},
  month         = nov,
  eprint        = {2509.15020},
  archiveprefix = {arXiv},
  primaryclass  = {cs.CL},
  url           = {https://arxiv.org/abs/2509.15020},
  doi           = {10.48550/arXiv.2509.15020},
}