Publications
More details about my publications can be found on my Google Scholar profile.
2026
- [BSPC] Performance evaluation of smartwatches: Can they match clinical standards for ECG analysis? Mauro Luis Buelga Suárez, Joaquín Recas, Sergio González-Cabeza, and 5 more authors. Biomedical Signal Processing and Control, 2026.
Smartwatches have gained popularity in health monitoring. While initially focused on general health and wellness, recent advancements have enabled these devices to acquire electrocardiogram (ECG) signals, opening up new avenues for remote cardiovascular health monitoring. Notably, they have demonstrated efficacy in detecting conditions such as atrial fibrillation. However, the accuracy of these devices in capturing a broader range of clinically relevant ECG parameters remains uncertain. This study evaluated the accuracy of four popular smartwatch models (Apple Watch Series 9, Samsung Galaxy Watch 6, Fitbit Sense 2, and Withings ScanWatch) in acquiring ECG signals using the standardized testing protocol required for medical electrocardiograph certification. A patient simulator (METRON PS-440) was employed to generate standardized ECG waveforms, which were sequentially recorded by the smartwatches and a reference electrocardiograph (Philips TC30). The devices were assessed by comparing their measurements to the reference standard for key ECG parameters, including heart rate, R-wave amplitude, ST-segment analysis, and response to different waveform types and ranges. Results indicated that all devices exhibited similar patterns to the reference ECG in normal sinus rhythm. Nevertheless, variations were observed in R-wave amplitude and J-point offset measurements, with the Withings device demonstrating the most significant deviations. The Samsung device struggled with heart rates exceeding 100 beats per minute. The Apple Watch and Fitbit Sense 2 demonstrated the most promising performance, suggesting their potential for broader clinical applications beyond basic heart rate monitoring. These devices could be useful in detecting arrhythmias and ischemic heart disease, particularly in remote or resource-constrained settings.
@article{buelga-suarez2026smartwatch, title = {Performance evaluation of smartwatches: Can they match clinical standards for ECG analysis?}, journal = {Biomedical Signal Processing and Control}, volume = {115}, pages = {109373}, year = {2026}, issn = {1746-8094}, doi = {10.1016/j.bspc.2025.109373}, url = {https://www.sciencedirect.com/science/article/pii/S1746809425018841}, author = {{Buelga Suárez}, Mauro Luis and Recas, Joaquín and González-Cabeza, Sergio and Sanz-Guerrero, Mario and Diaz-Vicente, Marian and {Rebolleda Sánchez}, Alfonso and Piñuel, Luis and {Alonso Salinas}, Gonzalo Luis}, keywords = {Electrocardiography, Smartwatch, Heart rate, ST-segment, Ischemia, Ambulatory}, }
2025
- [Inteligencia Artificial] Credit Risk Meets Large Language Models: Building a Risk Indicator from Loan Descriptions in P2P Lending. Mario Sanz-Guerrero and Javier Arroyo. Inteligencia Artificial, Mar 2025.
Peer-to-peer (P2P) lending connects borrowers and lenders through online platforms but suffers from significant information asymmetry, as lenders often lack sufficient data to assess borrowers’ creditworthiness. This paper addresses this challenge by leveraging BERT, a Large Language Model (LLM) known for its ability to capture contextual nuances in text, to generate a risk score based on borrowers’ loan descriptions using a dataset from the Lending Club platform. We fine-tune BERT to distinguish between defaulted and non-defaulted loans using the loan descriptions provided by the borrowers. The resulting BERT-generated risk score is then integrated as an additional feature into an XGBoost classifier used at the loan granting stage, where decision-makers have limited information available to guide their decisions. This integration enhances predictive performance, with improvements in balanced accuracy and AUC, highlighting the value of textual features in complementing traditional inputs. Moreover, we find that the incorporation of the BERT score alters how classification models utilize traditional input variables, with these changes varying by loan purpose. These findings suggest that BERT discerns meaningful patterns in loan descriptions, encompassing borrower-specific features, specific purposes, and linguistic characteristics. However, the inherent opacity of LLMs and their potential biases underscore the need for transparent frameworks to ensure regulatory compliance and foster trust. Overall, this study demonstrates how LLM-derived insights interact with traditional features in credit risk modeling, opening new avenues to enhance the explainability and fairness of these models.
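A minimal sketch of how such a text-derived score can be added to a tabular classifier (not the paper's code: the checkpoint, feature names, and toy data below are placeholders, and in the paper BERT is first fine-tuned on defaulted vs. non-defaulted loan descriptions):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from xgboost import XGBClassifier

# Placeholder model: the paper fine-tunes BERT on Lending Club descriptions first;
# here bert-base-uncased with an untrained 2-class head only illustrates the plumbing.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def text_risk_score(description: str) -> float:
    """Probability of the 'default' class for one loan description."""
    inputs = tok(description, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = bert(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

# Toy data: traditional loan-granting features plus the free-text description.
descriptions = ["Consolidating two credit cards into one payment.",
                "Need funds urgently to cover unexpected bills."]
tabular = [[0.35, 12000.0], [0.62, 8000.0]]   # e.g. debt-to-income, loan amount
labels = [0, 1]                               # 0 = repaid, 1 = defaulted

# The BERT score becomes one extra column next to the traditional inputs.
features = [row + [text_risk_score(d)] for row, d in zip(tabular, descriptions)]
clf = XGBClassifier(n_estimators=50, eval_metric="auc").fit(features, labels)
```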
@article{sanz-guerrero2025credit, title = {Credit Risk Meets Large Language Models: Building a Risk Indicator from Loan Descriptions in P2P Lending}, author = {Sanz-Guerrero, Mario and Arroyo, Javier}, year = {2025}, month = mar, journal = {Inteligencia Artificial}, volume = {28}, number = {75}, pages = {220--247}, url = {https://journal.iberamia.org/index.php/intartif/article/view/1890}, doi = {10.4114/intartif.vol28iss75pp220-247}, }
- [arXiv] Asking Again and Again: Exploring LLM Robustness to Repeated Questions. Sagi Shaier, Mario Sanz-Guerrero, and Katharina von der Wense. Mar 2025.
This study investigates whether repeating questions within prompts influences the performance of large language models (LLMs). We hypothesize that reiterating a question within a single prompt might enhance the model’s focus on key elements of the query. We evaluate five recent LLMs – including GPT-4o-mini, DeepSeek-V3, and smaller open-source models – on three reading comprehension datasets under different prompt settings, varying question repetition levels (1, 3, or 5 times per prompt). Our results demonstrate that question repetition can increase models’ accuracy by up to 6%. However, across all models, settings, and datasets, we do not find the result statistically significant. These findings provide insights into prompt design and LLM behavior, suggesting that repetition alone does not significantly impact output quality.
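A sketch of the kind of prompt construction being varied (the exact template used in the paper is an assumption):

```python
def build_prompt(passage: str, question: str, repetitions: int = 3) -> str:
    """Repeat the question within a single prompt; the paper compares 1, 3, and 5 repetitions."""
    repeated = "\n".join(f"Question: {question}" for _ in range(repetitions))
    return f"{passage}\n\n{repeated}\nAnswer:"

print(build_prompt("The Nile flows through eleven countries before reaching the Mediterranean.",
                   "How many countries does the Nile flow through?",
                   repetitions=3))
```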
@misc{shaier2025asking, title = {Asking Again and Again: Exploring LLM Robustness to Repeated Questions}, author = {Shaier, Sagi and Sanz-Guerrero, Mario and {von der Wense}, Katharina}, year = {2025}, month = mar, eprint = {2412.07923}, archiveprefix = {arXiv}, primaryclass = {cs.CL}, url = {https://arxiv.org/abs/2412.07923}, doi = {10.48550/arXiv.2412.07923}, }
- [NAACL’25 Workshop] Corrective In-Context Learning: Evaluating Self-Correction in Large Language Models. Mario Sanz-Guerrero and Katharina von der Wense. In The Sixth Workshop on Insights from Negative Results in NLP, May 2025.
In-context learning (ICL) has transformed the use of large language models (LLMs) for NLP tasks, enabling few-shot learning by conditioning on labeled examples without finetuning. Despite its effectiveness, ICL is prone to errors, especially for challenging examples. With the goal of improving the performance of ICL, we propose *corrective in-context learning* (CICL), an approach that incorporates a model’s incorrect predictions alongside ground truth corrections into the prompt, aiming to enhance classification accuracy through self-correction. However, contrary to our hypothesis, extensive experiments on text classification tasks demonstrate that CICL consistently underperforms standard ICL, with performance degrading as the proportion of corrections in the prompt increases. Our findings indicate that CICL introduces confusion by disrupting the model’s task understanding, rather than refining its predictions. Additionally, we observe that presenting harder examples in standard ICL does not improve performance, suggesting that example difficulty alone may not be a reliable criterion for effective selection. By presenting these negative results, we provide important insights into the limitations of self-corrective mechanisms in LLMs and offer directions for future research.
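A sketch of how a corrective demonstration could be formatted next to a standard ICL demonstration (field names and wording are illustrative, not the paper's exact template):

```python
def format_demo(text: str, label: str, wrong_prediction: str | None = None) -> str:
    """Standard ICL demo, or a 'corrective' demo that also shows the model's earlier wrong prediction."""
    if wrong_prediction is None:
        return f"Text: {text}\nLabel: {label}"
    return (f"Text: {text}\n"
            f"Model prediction (incorrect): {wrong_prediction}\n"
            f"Correct label: {label}")

demos = [
    format_demo("The service was quick and friendly.", "positive"),
    format_demo("I waited an hour and the food was cold.", "negative",
                wrong_prediction="positive"),   # corrective example
]
prompt = "\n\n".join(demos) + "\n\nText: Great value for the price.\nLabel:"
print(prompt)
```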
@inproceedings{cicl_2025, title = {Corrective In-Context Learning: Evaluating Self-Correction in Large Language Models}, author = {Sanz-Guerrero, Mario and {von der Wense}, Katharina}, booktitle = {The Sixth Workshop on Insights from Negative Results in NLP}, month = may, year = {2025}, address = {Albuquerque, New Mexico}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2025.insights-1.4/}, doi = {10.18653/v1/2025.insights-1.4}, pages = {24--33}, isbn = {979-8-89176-240-4}, }
- [Med. Eng. Phys.] Reducing leads, enhancing wearable practicality: A comparative study of 3-lead vs. 12-lead ECG classification. Sergio González-Cabeza, Mario Sanz-Guerrero, Luis Piñuel, and 4 more authors. Medical Engineering & Physics, Nov 2025.
Inspired by recent advances in clinical research and the growing adoption of wearable ECG devices, this study explores the feasibility of using reduced-lead ECGs for automated detection of heart anomalies using deep learning, providing a more accessible and cost-effective alternative to traditional 12-lead ECGs. This research adapts and evaluates a state-of-the-art 12-lead deep learning model (from Ribeiro et al. [1]) for 3-lead configurations. The 12-lead ECG model architecture was trained from scratch on the public database PTB-XL. It was then modified to use 3 leads by only changing the input layer. Despite a 75% reduction in input data, the 3-lead model showed only a subtle 3% performance drop. To address this gap, the 3-lead model was further optimized using a novel strategy that combines transfer learning and a One-vs-All classification approach. Using PTB-XL’s five-class setup (normal vs. four pathologies: myocardial infarction, ST/T change, conduction disturbance, and hypertrophy), we report the micro-averaged F1-score across all test samples. The new optimized 3-lead model achieves a global (micro-averaged) F1-score of 77% (vs. 78% for the 12-lead model). These findings highlight the potential of simplified and cost-effective reduced-lead classification models to deliver near-equivalent diagnostic accuracy. This advancement could democratize access to early cardiac diagnostics, particularly in resource-limited settings.
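A toy PyTorch sketch of the lead-reduction idea (this is not the Ribeiro et al. architecture; layer sizes are made up, and the independent sigmoid outputs stand in for the paper's One-vs-All setup):

```python
import torch
import torch.nn as nn

class ECGNet(nn.Module):
    """Toy 1-D CNN: only the stem depends on the number of leads, so going from
    12 to 3 leads means swapping the input layer and transferring the rest."""
    def __init__(self, n_leads: int, n_classes: int = 5):
        super().__init__()
        self.stem = nn.Conv1d(n_leads, 64, kernel_size=16, padding=8)
        self.body = nn.Sequential(
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=16, padding=8),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
        )
        self.heads = nn.Linear(128, n_classes)  # one independent binary output per class

    def forward(self, x):                        # x: (batch, leads, samples)
        return torch.sigmoid(self.heads(self.body(self.stem(x))))

model_12 = ECGNet(n_leads=12)                    # trained on full 12-lead records
model_3 = ECGNet(n_leads=3)                      # same network, 3-lead input layer
model_3.body.load_state_dict(model_12.body.state_dict())    # transfer learning:
model_3.heads.load_state_dict(model_12.heads.state_dict())  # reuse everything but the stem

probs = model_3(torch.randn(2, 3, 4096))         # two 3-lead records, 4096 samples each
print(probs.shape)                               # torch.Size([2, 5])
```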
@article{3_leads_2025, title = {Reducing leads, enhancing wearable practicality: A comparative study of 3-lead vs. 12-lead ECG classification}, journal = {Medical Engineering \& Physics}, volume = {145}, pages = {104419}, year = {2025}, month = nov, issn = {1350-4533}, doi = {10.1016/j.medengphy.2025.104419}, url = {https://www.sciencedirect.com/science/article/pii/S1350453325001389}, author = {González-Cabeza, Sergio and Sanz-Guerrero, Mario and Piñuel, Luis and {Buelga Suárez}, Mauro Luis and {Alonso Salinas}, Gonzalo Luis and Diaz-Vicente, Marian and Recas, Joaquín}, keywords = {Deep learning, Electrocardiography, One-vs-All classification, Reduced-lead ECG, Transfer learning}, }
- [EMNLP’25] Mind the Gap: A Closer Look at Tokenization for Multiple-Choice Question Answering with LLMs. Mario Sanz-Guerrero, Minh Duc Bui, and Katharina von der Wense. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Nov 2025.
When evaluating large language models (LLMs) with multiple-choice question answering (MCQA), it is common to end the prompt with the string “Answer:” to facilitate automated answer extraction via next-token probabilities. However, there is no consensus on how to tokenize the space following the colon, often overlooked as a trivial choice. In this paper, we uncover accuracy differences of up to 11% due to this (seemingly irrelevant) tokenization variation as well as reshuffled model rankings, raising concerns about the reliability of LLM comparisons in prior work. Surprisingly, we are able to recommend one specific strategy – tokenizing the space *together* with the answer letter – as we observe consistent and statistically significant performance improvements. Additionally, it improves model calibration, enhancing the reliability of the model’s confidence estimates. Our findings underscore the importance of careful evaluation design and highlight the need for standardized, transparent evaluation protocols to ensure reliable and comparable results.
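The two tokenization choices can be made concrete with a small scoring sketch; GPT-2 is used here only as a stand-in (it is not one of the paper's evaluated models), and the MCQA prompt wording is illustrative:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def letter_logprobs(prompt: str, letters=("A", "B", "C", "D"), prefix: str = "") -> dict:
    """Next-token log-probability of each (optionally space-prefixed) answer letter."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        next_logits = model(ids).logits[0, -1]
    logprobs = torch.log_softmax(next_logits, dim=-1)
    return {letter: logprobs[tok.encode(prefix + letter)[0]].item() for letter in letters}

mcq = ("Which planet is known as the Red Planet?\n"
       "A. Venus\nB. Mars\nC. Jupiter\nD. Saturn\n")
print(letter_logprobs(mcq + "Answer:", prefix=" "))  # space tokenized together with the letter
print(letter_logprobs(mcq + "Answer: "))             # space left at the end of the prompt
```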
@inproceedings{sanz-guerrero2025mindthegap, title = {Mind the Gap: A Closer Look at Tokenization for Multiple-Choice Question Answering with {LLM}s}, author = {Sanz-Guerrero, Mario and Bui, Minh Duc and {von der Wense}, Katharina}, booktitle = {Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing}, month = nov, year = {2025}, address = {Suzhou, China}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2025.emnlp-main.988/}, doi = {10.18653/v1/2025.emnlp-main.988}, pages = {19584--19594}, isbn = {979-8-89176-332-6}, }
- [EMNLP’25] Molecular String Representation Preferences in Pretrained LLMs: A Comparative Study in Zero- & Few-Shot Molecular Property Prediction. George Arthur Baker, Mario Sanz-Guerrero, and Katharina von der Wense. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Nov 2025.
Large Language Models (LLMs) have demonstrated capabilities for natural language formulations of molecular property prediction tasks, but little is known about how performance depends on the representation of input molecules to the model; the status quo approach is to use SMILES strings, although alternative chemical notations convey molecular information differently, each with their own strengths and weaknesses. To learn more about molecular string representation preferences in LLMs, we compare the performance of four recent models—GPT-4o, Gemini 1.5 Pro, Llama 3.1 405b, and Mistral Large 2—on molecular property prediction tasks from the MoleculeNet benchmark across five different molecular string representations: SMILES, DeepSMILES, SELFIES, InChI, and IUPAC names. We find statistically significant zero- and few-shot preferences for InChI and IUPAC names, potentially due to representation granularity, favorable tokenization, and prevalence in pretraining corpora. This contradicts previous assumptions that molecules should be presented to LLMs as SMILES strings. When these preferences are taken advantage of, few-shot performance rivals or surpasses many previous conventional approaches to property prediction, with the advantage of explainable predictions through chain-of-thought reasoning not held by task-specific models.
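For reference, one molecule rendered in several of the compared notations (requires the rdkit, selfies, and deepsmiles packages; IUPAC names come from external naming software and are only noted in a comment):

```python
from rdkit import Chem
import selfies
import deepsmiles

smiles = "CC(=O)Oc1ccccc1C(=O)O"                       # aspirin
mol = Chem.MolFromSmiles(smiles)
converter = deepsmiles.Converter(rings=True, branches=True)

print("SMILES:    ", smiles)
print("DeepSMILES:", converter.encode(smiles))
print("SELFIES:   ", selfies.encoder(smiles))
print("InChI:     ", Chem.MolToInchi(mol))
# IUPAC name ("2-acetyloxybenzoic acid") would come from a chemical naming service,
# not from RDKit.
```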
@inproceedings{baker2025molecular, title = {Molecular String Representation Preferences in Pretrained {LLM}s: A Comparative Study in Zero- {\&} Few-Shot Molecular Property Prediction}, author = {Baker, George Arthur and Sanz-Guerrero, Mario and {von der Wense}, Katharina}, booktitle = {Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing}, month = nov, year = {2025}, address = {Suzhou, China}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2025.emnlp-main.56/}, doi = {10.18653/v1/2025.emnlp-main.56}, pages = {1071--1085}, isbn = {979-8-89176-332-6}, }
- [WMT’25] JGU Mainz’s Submission to the WMT25 Shared Task on LLMs with Limited Resources for Slavic Languages: MT and QA. Hossain Shaikh Saadi, Minh Duc Bui, Mario Sanz-Guerrero, and 1 more author. In Proceedings of the Tenth Conference on Machine Translation, Nov 2025.
This paper presents the JGU Mainz submission to the WMT25 Shared Task on LLMs with Limited Resources for Slavic Languages: Machine Translation and Question Answering, focusing on Ukrainian, Upper Sorbian, and Lower Sorbian. For each language, we jointly fine-tune a Qwen2.5-3B-Instruct model for both tasks with parameter-efficient finetuning. Our pipeline integrates additional translation and multiple-choice question answering (QA) data. For Ukrainian QA, we further use retrieval-augmented generation. We also apply ensembling for QA in Upper and Lower Sorbian. Experiments show that our models outperform the baseline on both tasks.
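A rough sketch of the parameter-efficient setup described above, using LoRA via the PEFT library; the hyperparameters and example formatting are illustrative rather than the submission's actual configuration:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "Qwen/Qwen2.5-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# LoRA adapters instead of full fine-tuning (hyperparameters are placeholders).
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
model = get_peft_model(model, lora)
model.print_trainable_parameters()

# Both tasks are cast as instruction-style examples and mixed into one training set
# (hypothetical formatting):
mt_example = "Translate the following sentence into Upper Sorbian: ...\nTranslation: ..."
qa_example = "Question: ...\nA) ...\nB) ...\nC) ...\nD) ...\nAnswer: A"
```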
@inproceedings{saadi-etal-2025-jgu, title = {{JGU} Mainz{'}s Submission to the {WMT}25 Shared Task on {LLM}s with Limited Resources for {S}lavic Languages: {MT} and {QA}}, author = {Saadi, Hossain Shaikh and Bui, Minh Duc and Sanz-Guerrero, Mario and {von der Wense}, Katharina}, booktitle = {Proceedings of the Tenth Conference on Machine Translation}, month = nov, year = {2025}, address = {Suzhou, China}, publisher = {Association for Computational Linguistics}, url = {https://aclanthology.org/2025.wmt-1.89/}, doi = {10.18653/v1/2025.wmt-1.89}, pages = {1151--1157}, isbn = {979-8-89176-341-8}, }
- [AACL’25] Mitigating Label Length Bias in Large Language Models. Mario Sanz-Guerrero and Katharina von der Wense. Nov 2025.
Large language models (LLMs) are powerful zero- and few-shot learners. However, when predicting over a set of candidate options, LLMs suffer from label biases, and existing calibration methods overlook biases arising from multi-token class labels. We tackle an issue we call label length bias, where labels of different lengths are treated inconsistently, even after standard length normalization. To mitigate it, we propose normalized contextual calibration (NCC), an effective method that normalizes and calibrates predictions at the full-label level. NCC achieves statistically significant improvements over prior approaches across multiple datasets and models, with gains of up to 10% F1. Moreover, NCC extends bias mitigation to broader tasks such as multiple-choice question answering. Our analysis shows that, when combined with in-context learning, NCC is less sensitive to few-shot example selection, requires fewer examples for competitive performance, and produces more reliable confidence estimates. These findings highlight the importance of mitigating full-label biases to improve the performance and robustness of LLM-based methods, particularly in real-world applications where class labels naturally consist of multiple tokens.
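The abstract does not spell out the exact scoring rule, so the following is only a hedged sketch of the general idea with GPT-2 as a stand-in model: score each full, possibly multi-token label by a length-normalized log-probability and calibrate it against the same score under a content-free input (in the spirit of contextual calibration); the prompt template and content-free string are assumptions:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def label_score(prompt: str, label: str) -> float:
    """Average per-token log-probability of the full label string given the prompt."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    label_ids = tok(label, return_tensors="pt").input_ids
    ids = torch.cat([prompt_ids, label_ids], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)        # position t predicts token t+1
    per_token = logprobs[-label_ids.shape[1]:].gather(1, label_ids[0].unsqueeze(1))
    return per_token.mean().item()                               # length normalization

def calibrated_predict(prompt: str, content_free_prompt: str, labels: list[str]) -> str:
    """Pick the label whose score most exceeds its score under a content-free input."""
    scores = {lab: label_score(prompt, lab) - label_score(content_free_prompt, lab)
              for lab in labels}
    return max(scores, key=scores.get)

template = "Review: {}\nSentiment:"
print(calibrated_predict(template.format("The plot dragged and the acting felt wooden."),
                         template.format("N/A"),
                         [" very positive", " very negative"]))
```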
@misc{sanz-guerrero2025ncc, title = {Mitigating Label Length Bias in Large Language Models}, author = {Sanz-Guerrero, Mario and {von der Wense}, Katharina}, year = {2025}, month = nov, eprint = {2511.14385}, archiveprefix = {arXiv}, primaryclass = {cs.CL}, url = {https://arxiv.org/abs/2511.14385}, doi = {10.48550/arXiv.2511.14385}, }
- [AACL’25 Workshop] NALA_MAINZ at BLP-2025 Task 2: A Multi-agent Approach for Bangla Instruction to Python Code Generation. Hossain Shaikh Saadi, Faria Alam, Mario Sanz-Guerrero, and 3 more authors. Nov 2025.
This paper presents JGU Mainz’s winning system for the BLP-2025 Shared Task on Code Generation from Bangla Instructions. We propose a multi-agent-based pipeline. First, a code-generation agent produces an initial solution from the input instruction. The candidate program is then executed against the provided unit tests (pytest-style, assert-based). Only the failing cases are forwarded to a debugger agent, which reruns the tests, extracts error traces, and, conditioning on the error messages, the current program, and the relevant test cases, generates a revised solution. Using this approach, our submission achieved first place in the shared task with a Pass@1 score of 95.4. We also make our code public.
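A schematic of the generate, test, and debug loop described above; the two agent functions are placeholders for chat-model calls, and the prompt wording and retry budget are assumptions rather than the submission's actual settings:

```python
import pathlib
import subprocess
import tempfile

def code_agent(instruction: str) -> str:
    """Placeholder: would prompt an LLM to produce a Python solution for the Bangla instruction."""
    raise NotImplementedError

def debugger_agent(instruction: str, program: str, error_log: str, tests: str) -> str:
    """Placeholder: would prompt an LLM with the current program, failing tests, and error trace."""
    raise NotImplementedError

def run_tests(program: str, tests: str) -> tuple[bool, str]:
    """Run the assert-based unit tests with pytest and return (all_passed, combined_log)."""
    workdir = pathlib.Path(tempfile.mkdtemp())
    (workdir / "solution.py").write_text(program)
    (workdir / "test_solution.py").write_text("from solution import *\n\n" + tests)
    result = subprocess.run(["pytest", "-x", str(workdir)], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

def solve(instruction: str, tests: str, max_repairs: int = 3) -> str:
    program = code_agent(instruction)                 # first draft from the code-generation agent
    for _ in range(max_repairs):
        passed, log = run_tests(program, tests)
        if passed:
            break
        program = debugger_agent(instruction, program, log, tests)  # revise using the failures
    return program
```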
@misc{saadi2025blp, title = {NALA{\_}MAINZ at BLP-2025 Task 2: A Multi-agent Approach for Bangla Instruction to Python Code Generation}, author = {Saadi, Hossain Shaikh and Alam, Faria and Sanz-Guerrero, Mario and Bui, Minh Duc and Mager, Manuel and {von der Wense}, Katharina}, year = {2025}, month = nov, eprint = {2511.16787}, archiveprefix = {arXiv}, primaryclass = {cs.CL}, url = {https://arxiv.org/abs/2511.16787}, doi = {10.48550/arXiv.2511.16787}, }