Project NLP: Building a Benchmark Dataset for Extracting Drug Prescription from Clinical Notes

Room 129 / LERNER Team : Ivan Lerner, Julien Tourille, Arnaud Serret-Larmande, Sanjay Kamath, Nguyen Dinh-Phong, Charlotte Rudnik, Camille Petri.

Extracting information from clinical notes, for example turning “Patient 001 took medication B last night” into the structured record Patient 001 | drug_B | 8/12/2010, is a non-trivial task for NLP. Automatic extraction of drug prescriptions from clinical notes can be of great use for datasets that have no prescription table. Developing the best algorithm for such tasks is a real challenge and requires benchmark datasets such as the Thyme corpus. However, the Thyme corpus only covers oncology, and it would be interesting to have a similar dataset for the ICU.
Typically, benchmark datasets for information extraction are constructed by manually annotating thousands of clinical notes, which is time-consuming. Such a dataset makes it possible to measure the precision and recall of algorithms that extract information from text, and to compare them with human performance.
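To make the evaluation idea concrete, here is a minimal sketch of how precision and recall could be computed once a gold-standard annotation exists. The `Prescription` record and the `precision_recall` helper are hypothetical names introduced for illustration, and exact matching on (patient, drug, date) is an assumption; a real benchmark might score entities, temporal expressions and relations separately.

```python
from typing import Iterable, NamedTuple, Tuple


class Prescription(NamedTuple):
    """Hypothetical structured record: one drug intake for one patient."""
    patient_id: str
    drug: str
    date: str  # kept as a plain string, in whatever format the benchmark uses


def precision_recall(
    predicted: Iterable[Prescription], gold: Iterable[Prescription]
) -> Tuple[float, float]:
    """Score predicted records against gold records using exact matching."""
    predicted_set, gold_set = set(predicted), set(gold)
    true_positives = len(predicted_set & gold_set)
    # Guard against empty sets to avoid a division by zero.
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall


# Toy data: one correct extraction, one spurious one, one missed one.
gold = [
    Prescription("001", "drug_B", "8/12/2010"),
    Prescription("001", "drug_A", "9/12/2010"),
]
predicted = [
    Prescription("001", "drug_B", "8/12/2010"),
    Prescription("002", "drug_C", "8/12/2010"),
]

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Exact matching is the strictest choice; relaxing it (for example, accepting a date within one day) would be a design decision for the benchmark itself.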
In MIMIC, drug prescription information is available with daily temporal precision in the prescription table, and with exact temporal precision in the input_events tables. We could use the information from these tables to automatically annotate the clinical notes with the drug entity, the temporal expression, and the relation between the two. We should then manually check the quality of the automatic annotation on a small subset; if it is not satisfactory, we could still use it as a pre-annotated dataset.
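The automatic annotation step could look roughly like the sketch below. It assumes simplified, hypothetical records (`PrescriptionRow`, `DrugAnnotation`) rather than the actual MIMIC schema, matches drug names by case-insensitive string search, and only considers prescriptions dated within `window_days` of the note. It covers the drug entity only; detecting the temporal expression (such as "last night") and linking it to the drug is left out.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class PrescriptionRow:
    """Hypothetical, simplified row from a prescription table."""
    patient_id: str
    drug: str
    day: date


@dataclass(frozen=True)
class DrugAnnotation:
    """Character span of a drug mention, plus the prescription day it came from."""
    start: int
    end: int
    text: str
    drug: str
    day: date


def annotate_note(
    note_text: str,
    patient_id: str,
    note_day: date,
    prescriptions: List[PrescriptionRow],
    window_days: int = 1,
) -> List[DrugAnnotation]:
    """Pre-annotate a note with mentions of drugs prescribed around its date."""
    annotations = []
    for row in prescriptions:
        if row.patient_id != patient_id:
            continue
        # Keep only prescriptions close in time to the note (assumed heuristic).
        if abs((note_day - row.day).days) > window_days:
            continue
        pattern = re.compile(r"\b" + re.escape(row.drug) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(note_text):
            annotations.append(
                DrugAnnotation(match.start(), match.end(), match.group(0), row.drug, row.day)
            )
    return sorted(annotations, key=lambda a: a.start)


note = "Patient received heparin last night. Heparin drip continued."
prescriptions = [
    PrescriptionRow("001", "Heparin", date(2010, 12, 8)),
    PrescriptionRow("001", "Insulin", date(2010, 12, 8)),   # not mentioned in the note
    PrescriptionRow("002", "Heparin", date(2010, 12, 8)),   # different patient
]

annotations = annotate_note(note, "001", date(2010, 12, 9), prescriptions)
for a in annotations:
    print(a.start, a.end, a.text, a.day.isoformat())
```

Plain string matching will miss brand names, abbreviations and misspellings, which is one reason the manual check on a small subset matters.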
This dataset could be the largest of its kind, since it would be produced automatically from the 2 million notes in MIMIC! If we have time, we could have fun training very large models of stacked bidirectional LSTMs 🙂
