Browsing by Author "Dunstan Escudero, Jocelyn Mariel"
- A pseudonymized corpus of occupational health narratives for clinical entity recognition in Spanish (Springer Nature, 2024)
  Dunstan Escudero, Jocelyn Mariel; Vakili, Thomas; Miranda Huerta, Luis Alberto; Villena, Fabián; Aracena, Claudio; Quiroga Curin, Tamara Nancy; Vera, Paulina; Viteri Valenzuela, Sebastián; Rocco, Victor
  Despite the high creation cost, annotated corpora are indispensable for robust natural language processing systems. In the clinical field, in addition to annotating medical entities, corpus creators must also remove personally identifiable information (PII). This has become increasingly important in the era of large language models, where unwanted memorization can occur. This paper presents a corpus annotated to anonymize personally identifiable information in 1,787 anamneses of work-related accidents and diseases in Spanish. Additionally, we applied a previously released model for Named Entity Recognition (NER) trained on referrals from primary care physicians to identify diseases, body parts, and medications in this work-related text. We analyzed the differences between the models and the gold standard curated by a physician in detail. Moreover, we compared the performance of the NER model on the original narratives, in narratives where personal information has been masked, and in texts where the personal data is replaced by another similar surrogate value (pseudonymization). Within this publication, we share the annotation guidelines and the annotated corpus.
- Automatic detection of distant metastasis mentions in radiology reports in Spanish (American Society of Clinical Oncology, 2024)
  Ahumada, Ricardo; Dunstan Escudero, Jocelyn Mariel; Rojas, Matías; Peñafiel, Sergio; Paredes, Inti; Báez, Pablo
  A critical task in oncology is extracting information related to cancer metastasis from electronic health records. Metastasis-related information is crucial for planning treatment, evaluating patient prognoses, and cancer research. However, the unstructured way in which findings of distant metastasis are often written in radiology reports makes it difficult to extract information automatically. The main aim of this study was to extract distant metastasis findings from free-text imaging and nuclear medicine reports to classify the patient status according to the presence or absence of distant metastasis. MATERIALS AND METHODS: We created a distant metastasis annotated corpus using positron emission tomography-computed tomography and computed tomography reports of patients with prostate, colorectal, and breast cancers. Entities were labeled M1 or M0 according to affirmative or negative metastasis descriptions. We used a named entity recognition model based on a bidirectional long short-term memory model and conditional random fields to identify entities. Mentions were subsequently used to classify whole reports into M1 or M0. RESULTS: The model detected distant metastasis mentions with a weighted average F1 score of 0.84. Whole reports were classified with an F1 score of 0.92 for M0 documents and 0.90 for M1 documents. CONCLUSION: These results show the usefulness of the model in detecting distant metastasis findings in three different types of cancer and the consequent classification of reports. The relevance of this study is to generate structured distant metastasis information from free-text imaging reports in Spanish. In addition, the manually annotated corpus, annotation guidelines, and code are freely released to the research community.
- Automatic knowledge-graph creation from historical documents: The Chilean dictatorship as a case study (2024)
  Díaz, Camila; Dunstan Escudero, Jocelyn Mariel; Etcheverry, Lorena; Fonck Larraín, Antonia; Grez, Alejandro; Mery Quiroz, Domingo Arturo; Reutter de la Maza, Juan Lorenzo; Rojas, Hugo
  We present our results regarding the construction of a knowledge graph from historical documents related to the Chilean dictatorship period (1973-1990). Our approach uses LLMs to automatically recognize entities and relations between them and resolve conflicts between these values. To prevent hallucination, the interaction with the LLM is grounded in a simple ontology with four types of entities and seven types of relations. To evaluate our architecture, we use a gold standard graph constructed using a small subset of the documents, and compare this to the graph obtained from our approach when processing the same set of documents. Results show that the automatic construction manages to recognize a good portion of all the entities in the gold standard and that those not recognized are explained mainly by the level of granularity in which the information is structured in the graph and not because the automatic approach misses an important entity in the graph. Looking forward, we expect this report to encourage work on other similar projects focused on enhancing research in humanities and social science. However, we remark that better evaluation metrics are needed to accurately fine-tune these types of architectures.
- Clinical analogy resolution performance for foundation language models (2024)
  Villena, Fabián; Quiroga Curin, Tamara Nancy; Dunstan Escudero, Jocelyn Mariel
  Using extensive data sources to create foundation language models has revolutionized the performance of deep learning-based architectures. This remarkable improvement has led to state-of-the-art results for various downstream NLP tasks, including clinical tasks. However, more research is needed to measure model performance intrinsically, especially in the clinical domain. We revisit the use of analogy questions as an effective method to measure the intrinsic performance of language models for the clinical domain in English. We tested multiple Transformer-based language models over analogy questions constructed from the Unified Medical Language System (UMLS), a massive knowledge graph of clinical concepts. Our results show that large language models perform significantly better at analogy resolution than small language models. Similarly, domain-specific language models perform better than general-domain language models. We also found a correlation between intrinsic and extrinsic performance, validated through the PubMedQA extrinsic task. Creating clinical-specific and language-specific language models is essential for advancing biomedical and clinical NLP and will ensure a valid application in clinical practice. Finally, given that our proposed intrinsic test is based on a term graph available in multiple languages, the dataset can be built to measure the performance of models in languages other than English.
- Developing and Validating an Automatic Support System for Tumor Coding in Pathology Reports in Spanish (2025)
  Villena, Fabián; Báez, Pablo; Peñafiel, Sergio; Rojas, Matías; Paredes, Inti; Dunstan Escudero, Jocelyn Mariel
  Pathology reports provide valuable information for cancer registries to understand, plan, and implement strategies to mitigate the impact of cancer. However, coding essential information from unstructured reports is performed by experts in a time-consuming manual process. We developed and validated a novel two-step automatic coding system that first recognizes tumor morphology and topography mentions from free text and then suggests codes from the International Classification of Diseases for Oncology (ICD-O) in Spanish. MATERIALS AND METHODS: We created an annotated corpus of tumor morphology and topography mentions consisting of 1,101 documents. We combined it with the CANTEMIST corpus (Cancer Text Mining Shared Task). Specifically, we implemented a named entity recognition (NER) model using the bidirectional long short-term memory network-conditional random field architecture enhanced with a stacked embedding layer. We applied transfer learning from state-of-the-art pretrained language models to obtain high-quality contextual representations, thus improving the detection of entities. The mentions found using this model were subsequently coded using a search engine tailored to the ICD-O codes. RESULTS: Our NER models achieved an F1 score of 0.86 and 0.90 for tumor morphology and topography, respectively. The overall performance of our automatic coding system achieved an accuracy at five suggestions of 0.72 and 0.65 for tumor morphology and topography, respectively. CONCLUSION: These results demonstrate the feasibility of implementing natural language processing tools in the routine of a cancer center to extract and code valuable information from pathology reports. Our recommender system allows reliable and transparent coding at the moment of consultation. This publication shares the annotated corpus in Spanish, annotation guidelines, and source code to reproduce our experiments.
- NLP modeling recommendations for restricted data availability in clinical settings (2025)
  Villena, Fabián; Bravo-Marquez, Felipe; Dunstan Escudero, Jocelyn Mariel
  Background: Clinical decision-making in healthcare often relies on unstructured text data, which can be challenging to analyze using traditional methods. Natural Language Processing (NLP) has emerged as a promising solution, but its application in clinical settings is hindered by restricted data availability and the need for domain-specific knowledge. Methods: We conducted an experimental analysis to evaluate the performance of various NLP modeling paradigms on multiple clinical NLP tasks in Spanish. These tasks included referral prioritization and referral specialty classification. We simulated three clinical settings with varying levels of data availability and evaluated the performance of four foundation models. Results: Clinical-specific pre-trained language models (PLMs) achieved the highest performance across tasks. For referral prioritization, clinical PLMs attained an 88.85% macro F1 score when fine-tuned. In referral specialty classification, the same models achieved a 53.79% macro F1 score, surpassing domain-agnostic models. Continuing pre-training with environment-specific data improved model performance, but the gains were marginal compared to the computational resources required. Few-shot learning with large language models (LLMs) demonstrated lower performance but showed potential in data-scarce scenarios. Conclusions: Our study provides evidence-based recommendations for clinical NLP practitioners on selecting modeling paradigms based on data availability. We highlight the importance of considering data availability, task complexity, and institutional maturity when designing and training clinical NLP models. Our findings can inform the development of effective clinical NLP solutions in real-world settings.
- Physics-informed neural networks for parameter estimation in blood flow models (2024)
  Garay Labra, Jeremías Esteban; Dunstan Escudero, Jocelyn Mariel; Uribe Arancibia, Sergio Andrés; Sahli Costábal, Francisco
  Background: Physics-informed neural networks (PINNs) have emerged as a powerful tool for solving inverse problems, especially in cases where complete information about the system is not known and only scattered measurements are available. This is especially useful in hemodynamics since the boundary information is often difficult to model, and high-quality blood flow measurements are generally hard to obtain. Methods: In this work, we use the PINNs methodology for estimating reduced-order model parameters and the full velocity field from scattered, noisy 2D measurements in the aorta. Two different flow regimes, stationary and transient, were studied. Results: We show robust and relatively accurate parameter estimations when using the method with simulated data, while the velocity reconstruction accuracy shows dependence on the measurement quality and the flow pattern complexity. Comparison with a Kalman filter approach shows similar results when the number of parameters to be estimated is low to medium. For a higher number of parameters, only PINNs were capable of achieving good results. Conclusion: The method opens a door to deep-learning-driven methods in the simulations of complex coupled physical systems.
- Response to Kempf et al on Methodological and Practical Aspects of a Distant Metastasis Detection Model (American Society of Clinical Oncology, 2024)
  Ahumada, Ricardo; Dunstan Escudero, Jocelyn Mariel; Paredes, Inti; Baez, Pablo