Mario Sanz-Guerrero

Hi, I’m Mario 👋! I am a PhD student in Natural Language Processing at Johannes Gutenberg University Mainz, supervised by Katharina von der Wense in the NALA lab. Previously, I completed my BSc in Computer Science and MSc in Artificial Intelligence. I have also worked as an AI engineer in the healthcare industry.

📚 Research Interests

I’m continually impressed by how large language models, trained on the seemingly “simple” task of next‑word prediction, exhibit surprising emergent capabilities far beyond their original design. Yet this power raises pressing questions about trustworthiness – can we actually trust what these models say, and can we trace why they say it?

LLM Calibration 📊

How can we make a model’s confidence a reliable signal of its actual correctness?

Training Data Attribution in LLMs 🔍

Which training examples shape a model’s predictions and behaviors, and how can we trace their influence?

Biomedical NLP 💊

How can we leverage LLMs to accelerate drug discovery, clinical note analysis, and literature mining?

News

Feb. 2026	📄 Our paper, “Peak Attention U-Net: Enhancing ECG delineation with attention” was accepted to the journal Biomedical Signal Processing and Control!
Nov. 2025	📄 Our paper, “Mitigating Label Length Bias in Large Language Models” was accepted to AACL 2025 (Main) in Mumbai, India 🇮🇳!
Sep. 2025	📄 Two of our papers, “Mind the Gap: A Closer Look at Tokenization for Multiple-Choice Question Answering with LLMs” and “Molecular String Representation Preferences in Pretrained LLMs”, were accepted to EMNLP 2025 (Main)! See you in Suzhou, China 🇨🇳!
Aug. 2025	📄 Our paper, “Reducing leads, enhancing wearable practicality: A comparative study of 3-lead vs. 12-lead ECG classification” was accepted to the journal Medical Engineering & Physics!
Jul. 2025	I’ll be attending ACL 2025. See you in Vienna, Austria 🇦🇹!

Selected Publications

BSPC
Peak Attention U-Net: Enhancing ECG delineation with attention

Mario Sanz-Guerrero, Sergio González-Cabeza, Luis Piñuel, and 4 more authors

Biomedical Signal Processing and Control, 2026

Cardiovascular diseases are one of the leading causes of death worldwide, making accurate analysis of electrocardiograms (ECGs) critical for early diagnosis and effective treatment. ECG delineation – the precise identification of waveform boundaries and peaks – is essential for clinical assessment but remains a challenging and time-consuming task when performed manually. While previous automated approaches have achieved reasonable performance, they are often limited by their inability to detect waveform peaks, reliance on pre-segmented heartbeats, and lack of generalizability across full-length 12-lead ECGs. This paper presents Peak Attention U-Net, a novel deep learning model for automated ECG delineation with enhanced peak detection capabilities. Building upon the U-Net encoder–decoder architecture, our approach integrates attention gates to selectively focus on salient waveform features, enabling precise identification of P, QRS, and T waves, as well as their respective peaks, across full-length 12-lead ECG signals. We evaluate the model on the LUDB dataset and demonstrate that Peak Attention U-Net achieves state-of-the-art performance in fiducial point delineation, with final F1-scores of 89.26 for P peak, 99.72 for R peak, and 97.51 for T peak, representing significant improvements over the second-best model in the literature (+13.67%, +12.19%, and +11.45%, respectively). The model is lightweight, efficient, and generalizes well to diverse cardiac conditions, supporting real-time clinical applications and deployment in wearable devices. These results demonstrate the effectiveness of the proposed model and its potential to advance automated ECG analysis in biomedical signal processing.
@article{sanz-guerrero2026peakattentionunet, title = {Peak Attention U-Net: Enhancing ECG delineation with attention}, journal = {Biomedical Signal Processing and Control}, volume = {119}, pages = {109874}, year = {2026}, issn = {1746-8094}, doi = {10.1016/j.bspc.2026.109874}, url = {https://www.sciencedirect.com/science/article/pii/S1746809426004283}, author = {Sanz-Guerrero, Mario and González-Cabeza, Sergio and Piñuel, Luis and {Buelga Suárez}, Mauro Luis and {Alonso Salinas}, Gonzalo Luis and Diaz-Vicente, Marian and Recas, Joaquín}, keywords = {ECG delineation, Deep learning, U-Net, Attention mechanism, Peak detection, Cardiac diagnosis, Artificial intelligence, Biomedical signal processing}, }
AACL’25
Mitigating Label Length Bias in Large Language Models

Mario Sanz-Guerrero and Katharina von der Wense

Nov 2025

Large language models (LLMs) are powerful zero- and few-shot learners. However, when predicting over a set of candidate options, LLMs suffer from label biases, and existing calibration methods overlook biases arising from multi-token class labels. We tackle an issue we call label length bias, where labels of different lengths are treated inconsistently, even after standard length normalization. To mitigate it, we propose normalized contextual calibration (NCC), an effective method that normalizes and calibrates predictions at the full-label level. NCC achieves statistically significant improvements over prior approaches across multiple datasets and models, with gains of up to 10% F1. Moreover, NCC extends bias mitigation to broader tasks such as multiple-choice question answering. Our analysis shows that, when combined with in-context learning, NCC is less sensitive to few-shot example selection, requires fewer examples for competitive performance, and produces more reliable confidence estimates. These findings highlight the importance of mitigating full-label biases to improve the performance and robustness of LLM-based methods, particularly in real-world applications where class labels naturally consist of multiple tokens.
@misc{sanz-guerrero2025ncc, title = {Mitigating Label Length Bias in Large Language Models}, author = {Sanz-Guerrero, Mario and {von der Wense}, Katharina}, year = {2025}, month = nov, eprint = {2511.14385}, archiveprefix = {arXiv}, primaryclass = {cs.CL}, url = {https://arxiv.org/abs/2511.14385}, doi = {10.48550/arXiv.2511.14385}, }
EMNLP’25
Molecular String Representation Preferences in Pretrained LLMs: A Comparative Study in Zero- & Few-Shot Molecular Property Prediction

George Arthur Baker, Mario Sanz-Guerrero, and Katharina von der Wense

In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Nov 2025

Large Language Models (LLMs) have demonstrated capabilities for natural language formulations of molecular property prediction tasks, but little is known about how performance depends on the representation of input molecules to the model; the status quo approach is to use SMILES strings, although alternative chemical notations convey molecular information differently, each with their own strengths and weaknesses. To learn more about molecular string representation preferences in LLMs, we compare the performance of four recent models—GPT-4o, Gemini 1.5 Pro, Llama 3.1 405b, and Mistral Large 2—on molecular property prediction tasks from the MoleculeNet benchmark across five different molecular string representations: SMILES, DeepSMILES, SELFIES, InChI, and IUPAC names. We find statistically significant zero- and few-shot preferences for InChI and IUPAC names, potentially due to representation granularity, favorable tokenization, and prevalence in pretraining corpora. This contradicts previous assumptions that molecules should be presented to LLMs as SMILES strings. When these preferences are taken advantage of, few-shot performance rivals or surpasses many previous conventional approaches to property prediction, with the advantage of explainable predictions through chain-of-thought reasoning not held by task-specific models.
@inproceedings{baker2025molecular, title = {Molecular String Representation Preferences in Pretrained {LLM}s: A Comparative Study in Zero- {\&} Few-Shot Molecular Property Prediction}, author = {Baker, George Arthur and Sanz-Guerrero, Mario and {von der Wense}, Katharina}, booktitle = {Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing}, month = nov, year = {2025}, address = {Suzhou, China}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2025.emnlp-main.56/}, doi = {10.18653/v1/2025.emnlp-main.56}, pages = {1071--1085}, isbn = {979-8-89176-332-6}, }
EMNLP’25
Mind the Gap: A Closer Look at Tokenization for Multiple-Choice Question Answering with LLMs

Mario Sanz-Guerrero, Minh Duc Bui, and Katharina von der Wense

In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Nov 2025

When evaluating large language models (LLMs) with multiple-choice question answering (MCQA), it is common to end the prompt with the string “Answer:” to facilitate automated answer extraction via next-token probabilities. However, there is no consensus on how to tokenize the space following the colon, often overlooked as a trivial choice. In this paper, we uncover accuracy differences of up to 11% due to this (seemingly irrelevant) tokenization variation as well as reshuffled model rankings, raising concerns about the reliability of LLM comparisons in prior work. Surprisingly, we are able to recommend one specific strategy – tokenizing the space together with the answer letter – as we observe consistent and statistically significant performance improvements. Additionally, it improves model calibration, enhancing the reliability of the model’s confidence estimates. Our findings underscore the importance of careful evaluation design and highlight the need for standardized, transparent evaluation protocols to ensure reliable and comparable results.
@inproceedings{sanz-guerrero2025mindthegap, title = {Mind the Gap: A Closer Look at Tokenization for Multiple-Choice Question Answering with {LLM}s}, author = {Sanz-Guerrero, Mario and Bui, Minh Duc and {von der Wense}, Katharina}, booktitle = {Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing}, month = nov, year = {2025}, address = {Suzhou, China}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2025.emnlp-main.988/}, doi = {10.18653/v1/2025.emnlp-main.988}, pages = {19584--19594}, isbn = {979-8-89176-332-6}, }
NAACL’25 Workshop
Corrective In-Context Learning: Evaluating Self-Correction in Large Language Models

Mario Sanz-Guerrero and Katharina von der Wense

In The Sixth Workshop on Insights from Negative Results in NLP, May 2025

In-context learning (ICL) has transformed the use of large language models (LLMs) for NLP tasks, enabling few-shot learning by conditioning on labeled examples without finetuning. Despite its effectiveness, ICL is prone to errors, especially for challenging examples. With the goal of improving the performance of ICL, we propose corrective in-context learning (CICL), an approach that incorporates a model’s incorrect predictions alongside ground truth corrections into the prompt, aiming to enhance classification accuracy through self-correction. However, contrary to our hypothesis, extensive experiments on text classification tasks demonstrate that CICL consistently underperforms standard ICL, with performance degrading as the proportion of corrections in the prompt increases. Our findings indicate that CICL introduces confusion by disrupting the model’s task understanding, rather than refining its predictions. Additionally, we observe that presenting harder examples in standard ICL does not improve performance, suggesting that example difficulty alone may not be a reliable criterion for effective selection. By presenting these negative results, we provide important insights into the limitations of self-corrective mechanisms in LLMs and offer directions for future research.
@inproceedings{cicl_2025, title = {Corrective In-Context Learning: Evaluating Self-Correction in Large Language Models}, author = {Sanz-Guerrero, Mario and {von der Wense}, Katharina}, booktitle = {The Sixth Workshop on Insights from Negative Results in NLP}, month = may, year = {2025}, address = {Albuquerque, New Mexico}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2025.insights-1.4/}, doi = {10.18653/v1/2025.insights-1.4}, pages = {24--33}, isbn = {979-8-89176-240-4}, }
Inteligencia Artificial
Credit Risk Meets Large Language Models: Building a Risk Indicator from Loan Descriptions in P2P Lending

Mario Sanz-Guerrero and Javier Arroyo

Inteligencia Artificial, Mar 2025

Peer-to-peer (P2P) lending connects borrowers and lenders through online platforms but suffers from significant information asymmetry, as lenders often lack sufficient data to assess borrowers’ creditworthiness. This paper addresses this challenge by leveraging BERT, a Large Language Model (LLM) known for its ability to capture contextual nuances in text, to generate a risk score based on borrowers’ loan descriptions using a dataset from the Lending Club platform. We fine-tune BERT to distinguish between defaulted and non-defaulted loans using the loan descriptions provided by the borrowers. The resulting BERT-generated risk score is then integrated as an additional feature into an XGBoost classifier used at the loan granting stage, where decision-makers have limited information available to guide their decisions. This integration enhances predictive performance, with improvements in balanced accuracy and AUC, highlighting the value of textual features in complementing traditional inputs. Moreover, we find that the incorporation of the BERT score alters how classification models utilize traditional input variables, with these changes varying by loan purpose. These findings suggest that BERT discerns meaningful patterns in loan descriptions, encompassing borrower-specific features, specific purposes, and linguistic characteristics. However, the inherent opacity of LLMs and their potential biases underscore the need for transparent frameworks to ensure regulatory compliance and foster trust. Overall, this study demonstrates how LLM-derived insights interact with traditional features in credit risk modeling, opening new avenues to enhance the explainability and fairness of these models.
@article{sanz-guerrero2025credit, title = {Credit Risk Meets Large Language Models: Building a Risk Indicator from Loan Descriptions in P2P Lending}, author = {Sanz-Guerrero, Mario and Arroyo, Javier}, year = {2025}, month = mar, journal = {Inteligencia Artificial}, volume = {28}, number = {75}, pages = {220--247}, url = {https://journal.iberamia.org/index.php/intartif/article/view/1890}, doi = {10.4114/intartif.vol28iss75pp220-247}, }