Automatic document screening of medical literature using word and text embeddings in an active learning setting

dc.contributor.authorCarvallo, Andres
dc.contributor.authorParra, Denis
dc.contributor.authorLobel, Hans
dc.contributor.authorSoto, Alvaro
dc.date.accessioned2024-01-10T13:44:17Z
dc.date.available2024-01-10T13:44:17Z
dc.date.issued2020
dc.description.abstractDocument screening is a fundamental task within Evidence-based Medicine (EBM), a practice that provides scientific evidence to support medical decisions. Several approaches have tried to reduce physicians' workload of screening and labeling vast amounts of documents to answer clinical questions. Previous works tried to semi-automate document screening, reporting promising results, but their evaluation was conducted on small datasets, which hinders generalization. Moreover, recent works in natural language processing have introduced neural language models, but none have compared their performance in EBM. In this paper, we evaluate the impact of several document representations, such as TF-IDF along with neural language models (BioBERT, BERT, Word2Vec, and GloVe), in an active learning-based setting for document screening in EBM. Our goal is to reduce the number of documents that physicians need to label to answer clinical questions. We evaluate these methods using both a small, challenging dataset (CLEF eHealth 2017) and a larger one that is easier to rank (Epistemonikos). Our results indicate that word as well as text neural embeddings always outperform the traditional TF-IDF representation. Comparing among the word and text embeddings, BERT and BioBERT yielded the best results on the CLEF eHealth dataset. On the larger dataset, Epistemonikos, Word2Vec and BERT were the most competitive, showing that BERT was the most consistent model across different corpora. In terms of active learning, an uncertainty sampling strategy combined with logistic regression achieved the best performance overall, above the other methods under evaluation, and in fewer iterations. Finally, we compared our best models, trained using active learning, with other authors' methods from CLEF eHealth, showing better results in terms of work saved for physicians in the document-screening task.
dc.description.funderANID Chile
dc.description.funderFondecyt Grant
dc.description.funderMillennium Institute for Foundational Research on Data (IMFD)
dc.fechaingreso.objetodigital2024-03-13
dc.format.extent38 pages
dc.fuente.origenWOS
dc.identifier.doi10.1007/s11192-020-03648-6
dc.identifier.eissn1588-2861
dc.identifier.issn0138-9130
dc.identifier.urihttps://doi.org/10.1007/s11192-020-03648-6
dc.identifier.urihttps://repositorio.uc.cl/handle/11534/78881
dc.identifier.wosidWOS:000565497800001
dc.information.autorucFacultad de Ingeniería; Parra Santander, Denis Alejandro; S/I; 1011554
dc.issue.numero3
dc.language.isoen
dc.nota.accesoPartial content
dc.pagina.final3084
dc.pagina.inicio3047
dc.publisherSPRINGER
dc.revistaSCIENTOMETRICS
dc.rightsrestricted access
dc.subjectActive learning
dc.subjectDocument screening
dc.subjectNatural language processing
dc.subjectSYSTEMATIC REVIEWS
dc.subjectCLASSIFICATION
dc.subject.ods03 Good Health and Well-being
dc.subject.odspa03 Salud y bienestar
dc.titleAutomatic document screening of medical literature using word and text embeddings in an active learning setting
dc.typearticle
dc.volumen125
sipa.codpersvinculados1011554
sipa.indexWOS
sipa.trazabilidadCarga SIPA;09-01-2024
Files
Original bundle
Name: 2024-03-13. Automatic documents screening of medical literature using word and text embeddings in an active learning setting.pdf
Size: 123.61 KB
Format: Adobe Portable Document Format