Large language models can assist in understanding clinical notes. For example, researchers used a powerful deep-learning model to extract vital information from electronic health records that could aid in tailoring therapy. Extracting the essential factors hidden in clinical notes has been a long-standing objective of the clinical NLP community. However, obstacles include the withdrawal of datasets from the public domain and the lack of general clinical corpora and annotations.

In this study, the researchers demonstrate that large language models, such as InstructGPT, perform well at zero- and few-shot information extraction from clinical text, despite not being trained specifically for the clinical domain. The researchers also provide new benchmark datasets for few-shot clinical information extraction, based on a manual re-annotation of the CASI dataset for new tasks. The GPT-3 systems significantly outperform previous zero- and few-shot baselines on clinical extraction tasks.

Large language models (LLMs) have been under development for some time, but their popularity skyrocketed after GPT-3's text-completion abilities made headlines. While smaller models, such as earlier generations of GPT or BERT, have achieved good performance at extracting medical data, this comes at the cost of substantial manual data-labelling effort.

Limitations

Large language models have limitations despite their considerable potential for clinical information extraction.

First, because clinical annotation guidelines are frequently many pages long, it currently takes extra work to direct an LLM to follow an exact schema. Even though the Resolved GPT-3 outputs were qualitatively excellent, the researchers found that they did not always match the reference annotations at the token level. For instance, when tagging durations, one Resolved GPT-3 output read "X weeks" rather than "for X weeks." Although this omission is minor, it illustrates how challenging it is to convey complex annotation rules to the model.

Second, they discovered a bias in GPT-3 that causes it to produce a non-trivial response even when the correct answer is empty. For instance, the prompt ultimately used for medication extraction asked the model to create a list of the medications mentioned and to indicate whether each is active, discontinued, or neither. Earlier experiments, however, used two separate prompts: "Create a bulleted list of active medications, if any," and "Create a bulleted list of discontinued medications, if any." When the input contained one active and one discontinued medication, both outputs were accurate. However, when the input contained two active medications and none that were discontinued, the LLM primed with the discontinuation prompt still tended to produce output, typically listing one or more of the active medications.

It is crucial to design tasks and prompts that avoid falling into this trap. This could be done, for instance, by

(i) chaining several prompts together, such as by first asking whether a specific entity type is present in the input before requesting a list (a minimal sketch of this approach follows the list), or

(ii) using an output structure such as sequence tagging.
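As one illustration of strategy (i), the sketch below chains a yes/no "gate" prompt before the list prompt so the model can legitimately answer that nothing is present. The `complete` helper and the exact prompt wordings are illustrative assumptions, not the prompts used in the paper.

```python
# Minimal sketch of strategy (i): gate the list prompt behind a yes/no prompt.
# `complete(prompt)` is a hypothetical wrapper around an LLM completion API.

def complete(prompt: str) -> str:
    """Placeholder for a call to an LLM completion endpoint."""
    raise NotImplementedError

def extract_discontinued_medications(note: str) -> list[str]:
    # Step 1: ask a yes/no question so an empty answer is a valid outcome.
    gate = complete(
        f"{note}\n\nAre any medications discontinued in the note above? "
        "Answer yes or no."
    )
    if gate.strip().lower().startswith("no"):
        return []  # nothing to list; avoid forcing the model to invent output

    # Step 2: only now request the bulleted list.
    listing = complete(
        f"{note}\n\nCreate a bulleted list of discontinued medications, if any."
    )
    return [line.lstrip("-* ").strip()
            for line in listing.splitlines()
            if line.strip()]
```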

Finally, the researchers built all tasks other than biomedical evidence extraction from the publicly available CASI dataset, because the data-use restrictions on most existing clinical datasets preclude openly sharing the data (for example, sending it to the GPT-3 APIs).

Furthermore, although the clinical text in CASI was compiled from notes from various hospitals and specialties, it is not necessarily representative of all clinical text. For instance, the CASI study states that the notes were "mainly verbally spoken and typed," a practice that is not universal. Additionally, as is regrettably common in clinical NLP, the researchers tested only in English, leaving it to future work to examine LLMs' capacity to function in other languages.

Conclusion

In this study, the authors offered new annotated datasets to demonstrate that: 

(i) large language models have tremendous potential for a variety of clinical extraction tasks, and

(ii) their generations can be guided to map onto complex output spaces with minimal post-processing.

The researchers also demonstrated how simpler, task-specific, more deployable models can be trained using weak supervision over the LLM's outputs. Important next steps include experimenting with LLMs such as OPT, for which inference can be run locally, enabling evaluation on existing benchmarks as well as fine-tuning. The scope of clinical NLP also extends beyond what was investigated here.
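As a rough illustration of that weak-supervision recipe, the sketch below labels unannotated snippets with an LLM and fits a small student classifier on the resulting noisy labels. The `llm_label` helper, the label set, and the TF-IDF plus logistic-regression pipeline are assumptions for illustration, not the paper's actual setup.

```python
# Weak supervision sketch: use LLM outputs as noisy labels, then train a
# small, locally deployable student model on those labels.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def llm_label(snippet: str) -> str:
    """Placeholder: prompt the LLM and map its answer to a label such as
    'active', 'discontinued', or 'neither'."""
    raise NotImplementedError

def train_student(snippets: list[str]):
    noisy_labels = [llm_label(s) for s in snippets]  # weak labels from the LLM
    student = make_pipeline(
        TfidfVectorizer(),                    # cheap text features
        LogisticRegression(max_iter=1000),    # small, deployable classifier
    )
    student.fit(snippets, noisy_labels)       # distill into the student model
    return student
```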

Furthermore, using the outputs of multiple prompts to identify when GPT-3 is unsure is another important direction; this improved reliability will be crucial given the high stakes of clinical information extraction. Their work suggests a new paradigm for clinical information extraction that can scale to meet the ambitious goals of clinical NLP.
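One simple way to operationalize that idea is to issue several paraphrased prompts and treat their level of agreement as a rough confidence signal. The sketch below again assumes a hypothetical `complete` wrapper around an LLM API; the paraphrases and the agreement measure are illustrative, not the authors' method.

```python
# Sketch: agreement across paraphrased prompts as a crude confidence signal.

from collections import Counter

def complete(prompt: str) -> str:
    """Placeholder for a call to an LLM completion endpoint."""
    raise NotImplementedError

def answer_with_confidence(note: str, paraphrases: list[str]) -> tuple[str, float]:
    # Query the model once per paraphrase and normalize the answers.
    answers = [complete(f"{note}\n\n{p}").strip().lower() for p in paraphrases]
    top_answer, votes = Counter(answers).most_common(1)[0]
    # Return the majority answer plus the fraction of prompts that agree.
    return top_answer, votes / len(answers)
```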
