Automatic document screening of medical literature using word and text embeddings in an active learning setting

Carvallo, Andres; Parra, Denis; Lobel, Hans; Soto, Alvaro

dc.contributor.author	Carvallo, Andres
dc.contributor.author	Parra, Denis
dc.contributor.author	Lobel, Hans
dc.contributor.author	Soto, Alvaro
dc.date.accessioned	2024-01-10T13:44:17Z
dc.date.available	2024-01-10T13:44:17Z
dc.date.issued	2020
dc.description.abstract	Document screening is a fundamental task within Evidence-based Medicine (EBM), a practice that provides scientific evidence to support medical decisions. Several approaches have tried to reduce physicians' workload of screening and labeling vast amounts of documents to answer clinical questions. Previous works tried to semi-automate document screening and reported promising results, but their evaluation was conducted on small datasets, which hinders generalization. Moreover, recent works in natural language processing have introduced neural language models, but none have compared their performance in EBM. In this paper, we evaluate the impact of several document representations, namely TF-IDF and neural language models (BioBERT, BERT, Word2Vec, and GloVe), in an active learning-based setting for document screening in EBM. Our goal is to reduce the number of documents that physicians need to label to answer clinical questions. We evaluate these methods using both a small, challenging dataset (CLEF eHealth 2017) and a larger one that is easier to rank (Epistemonikos). Our results indicate that word embeddings as well as textual neural embeddings always outperform the traditional TF-IDF representation. When comparing among the neural embeddings, BERT and BioBERT yielded the best results on the CLEF eHealth dataset. On the larger dataset, Epistemonikos, Word2Vec and BERT were the most competitive, showing that BERT was the most consistent model across different corpora. In terms of active learning, an uncertainty sampling strategy combined with logistic regression achieved the best overall performance, above the other methods under evaluation, and in fewer iterations. Finally, we compared our best models, trained using active learning, with other authors' methods from CLEF eHealth, showing better results in terms of work saved for physicians in the document-screening task.
dc.description.funder	ANID Chile
dc.description.funder	Fondecyt Grant
dc.description.funder	Millennium Institute for Foundational Research on Data (IMFD)
dc.fechaingreso.objetodigital	2024-03-13
dc.format.extent	38 pages
dc.fuente.origen	WOS
dc.identifier.doi	10.1007/s11192-020-03648-6
dc.identifier.eissn	1588-2861
dc.identifier.issn	0138-9130
dc.identifier.uri	https://doi.org/10.1007/s11192-020-03648-6
dc.identifier.uri	https://repositorio.uc.cl/handle/11534/78881
dc.identifier.wosid	WOS:000565497800001
dc.information.autoruc	Facultad de Ingeniería; Parra Santander, Denis Alejandro; S/I; 1011554
dc.issue.numero	3
dc.language.iso	en
dc.nota.acceso	Partial content
dc.pagina.final	3084
dc.pagina.inicio	3047
dc.publisher	SPRINGER
dc.revista	SCIENTOMETRICS
dc.rights	restricted access
dc.subject	Active learning
dc.subject	Document screening
dc.subject	Natural language processing
dc.subject	SYSTEMATIC REVIEWS
dc.subject	CLASSIFICATION
dc.subject.ods	03 Good Health and Well-being
dc.subject.odspa	03 Salud y bienestar
dc.title	Automatic document screening of medical literature using word and text embeddings in an active learning setting
dc.type	article
dc.volumen	125
sipa.codpersvinculados	1011554
sipa.index	WOS
sipa.trazabilidad	Carga SIPA;09-01-2024

Files

Original bundle

Name: 2024-03-13. Automatic documents screening of medical literature using word and text embeddings in an active learning setting.pdf
Size: 123.61 KB
Format: Adobe Portable Document Format

Collections

Journal articles