First-Author Publications

You can find an up-to-date list on Google Scholar.

SPEER: Sentence-Level Planning of Long Clinical Summaries via Embedded Entity Retrieval

Venue: arXiv preprint, 2024

Excerpt:

We fine-tune open-source LLMs (Mistral-7B-Instruct and Zephyr-7B-β) on the task and find that they generate incomplete and unfaithful summaries. To increase entity coverage, we train a smaller, encoder-only model to predict salient entities, which are treated as content plans to guide the LLM. To encourage the LLM to focus on specific mentions in the source notes, we propose SPEER: Sentence-level Planning via Embedded Entity Retrieval. Specifically, we mark each salient entity span with special “{{ }}” boundary tags and instruct the LLM to retrieve marked spans before generating each sentence. Sentence-level planning acts as a form of state tracking in that the model explicitly records the entities it uses. We fine-tune Mistral and Zephyr variants on a large-scale, diverse dataset of ~167k in-patient hospital admissions and evaluate on three datasets. SPEER shows gains in both coverage and faithfulness metrics over non-guided and guided baselines.
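
For concreteness, here is a minimal sketch of the span-marking and sentence-level planning format described above (the helper function, example note, and exact output layout are illustrative assumptions, not the paper's code):

```python
def mark_salient_spans(note: str, salient_entities: list[str]) -> str:
    """Wrap each salient entity mention in {{ }} boundary tags."""
    for entity in salient_entities:
        note = note.replace(entity, "{{ " + entity + " }}")
    return note

source_note = "Pt admitted with acute pancreatitis. Started on IV fluids and morphine."
salient = ["acute pancreatitis", "IV fluids"]

marked_note = mark_salient_spans(source_note, salient)
# -> "Pt admitted with {{ acute pancreatitis }}. Started on {{ IV fluids }} and morphine."

# At generation time, the LLM is instructed to interleave retrieval and
# writing: before each summary sentence it first copies the marked spans
# it will use, e.g.:
#
#   {{ acute pancreatitis }} {{ IV fluids }}
#   S1: Admitted with acute pancreatitis, managed with IV fluids.
```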

From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting

Venue: EMNLP (New Frontiers in Summarization Workshop), 2023

Excerpt:

Our primary contributions are to: (1) develop a prompt-based iterative method (CoD) for making summaries increasingly entity dense; (2) conduct both human and automatic evaluation of increasingly dense summaries on CNN/DailyMail articles to better understand the trade-off between informativeness (favoring more entities) and clarity (favoring fewer entities); and (3) open-source GPT-4 summaries, annotations, and a set of 5,000 unannotated CoD summaries to be used for evaluation or distillation.
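
A minimal sketch of the CoD loop: each iteration asks the model to fold missing salient entities into the previous summary without lengthening it. The `generate` wrapper and prompt wording here are hypothetical stand-ins for the paper's actual GPT-4 prompt:

```python
def generate(prompt: str) -> str:
    """Hypothetical wrapper around any chat-style LLM call."""
    raise NotImplementedError

def chain_of_density(article: str, steps: int = 5) -> list[str]:
    summaries = []
    # Start with a sparse, entity-light summary.
    summary = generate(
        f"Write a sparse, ~80-word summary of the article below.\n\n{article}"
    )
    summaries.append(summary)
    for _ in range(steps - 1):
        # Each step adds 1-3 missing entities at fixed length,
        # so entity density strictly increases.
        summary = generate(
            "Identify 1-3 informative entities from the article that are "
            "missing from the previous summary, then rewrite the summary to "
            "include them WITHOUT increasing its length.\n\n"
            f"Article:\n{article}\n\nPrevious summary:\n{summary}"
        )
        summaries.append(summary)
    return summaries  # increasingly entity-dense summaries
```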

Generating EDU Extracts for Plan-Guided Summary Re-Ranking

Venue: ACL, 2023

Excerpt:

A standard language model (a BART LM) auto-regressively generates elementary discourse unit (EDU) content plans with an extractive copy mechanism. The top K beams from the content plan generator are then used to guide a separate LM, which produces a single abstractive candidate for each distinct plan. We apply an existing re-ranker (BRIO) to abstractive candidates generated from our method, as well as baseline decoding methods, and show improved relevance metrics (ROUGE and BERTScore) for top-ranked summaries on widely used single-document news article corpora (CNN/DailyMail, NYT, XSum). A human evaluation on CNN/DM validates these results.
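
A high-level sketch of this pipeline; the three callables are hypothetical wrappers for the fine-tuned BART plan generator, the plan-conditioned abstractive LM, and the off-the-shelf BRIO re-ranker:

```python
from typing import Callable

def summarize_with_plans(
    document: str,
    plan_generator: Callable,  # extractive EDU content-plan generator
    abstractor: Callable,      # plan-conditioned abstractive LM
    brio_rerank: Callable,     # off-the-shelf BRIO re-ranker
    k: int = 16,
) -> str:
    # 1. Auto-regressively decode K EDU content plans with beam search;
    #    each plan is a sequence of EDUs copied from the source document.
    plans = plan_generator(document, num_beams=k)

    # 2. Generate one abstractive candidate per distinct plan
    #    (dict.fromkeys dedupes while preserving beam order).
    candidates = [abstractor(document, plan=p) for p in dict.fromkeys(plans)]

    # 3. Re-rank the candidates with BRIO; return the top-ranked summary.
    return brio_rerank(document, candidates)[0]
```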

What are the Desired Characteristics of Calibration Sets? Identifying Correlates on Long Form Scientific Summarization

Venue: ACL, 2023

Excerpt:

On three diverse scientific long-form summarization datasets (spanning the biomedical, clinical, and chemical domains), we find, among other results, that faithfulness calibration is optimal when the negative sets are extractive and more likely to be generated, whereas for relevance calibration, the metric margin between ranked candidates should be maximized and surprise minimized.
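
As an illustration of those selection criteria (not the paper's actual code; the `Candidate` fields and combined scoring are simplifying assumptions):

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    extractiveness: float   # e.g., extractive fragment coverage/density
    log_likelihood: float   # length-normalized model log-likelihood
    relevance: float        # relevance-metric score (e.g., ROUGE)

def faithfulness_negatives(candidates: list[Candidate], k: int = 4) -> list[Candidate]:
    """Prefer negatives that are extractive and likely to be generated,
    i.e., hard negatives close to the model's own outputs."""
    return sorted(candidates,
                  key=lambda c: c.extractiveness + c.log_likelihood,
                  reverse=True)[:k]

def relevance_pair(candidates: list[Candidate]) -> tuple[Candidate, Candidate]:
    """Pick the positive/negative pair with the largest metric margin."""
    ranked = sorted(candidates, key=lambda c: c.relevance, reverse=True)
    return ranked[0], ranked[-1]
```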

A Meta-Evaluation of Faithfulness Metrics for Long-Form Hospital-Course Summarization

Venue: Machine Learning for Healthcare (MLHC), 2023

Excerpt:

We benchmark faithfulness metrics against fine-grained human annotations for model-generated summaries of a patient’s Brief Hospital Course. We meta-evaluate a broad set of proposed faithfulness metrics and, across metrics, explore the importance of domain adaptation (e.g., the impact of in-domain pre-training and metric fine-tuning), the use of source-summary alignments, and the effects of distilling a single metric from an ensemble of pre-existing metrics. Off-the-shelf metrics with no exposure to clinical text correlate well yet overly rely on summary extractiveness. As a practical guide to long-form clinical narrative summarization, we find that most metrics correlate best to human judgments when provided with one summary sentence at a time and a minimal set of relevant source context.
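
A minimal sketch of that practical recommendation, with `align` and `metric` as hypothetical stand-ins for any source-summary aligner and off-the-shelf faithfulness metric:

```python
from typing import Callable

def faithfulness_score(
    summary_sents: list[str],
    source_sents: list[str],
    align: Callable,   # source-summary aligner (hypothetical wrapper)
    metric: Callable,  # any off-the-shelf faithfulness metric
    top_k: int = 3,
) -> float:
    """Score one summary sentence at a time against a minimal set of
    aligned source sentences, then average to a summary-level score."""
    scores = []
    for sent in summary_sents:
        # Minimal relevant source context: top-k aligned source sentences.
        context = align(sent, source_sents, top_k=top_k)
        scores.append(metric(source=" ".join(context), summary=sent))
    return sum(scores) / len(scores)
```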

Learning to Revise References for Faithful Summarization

Venue: Findings of EMNLP, 2022

Excerpt:

To improve reference quality while retaining all data, we propose a new approach: selectively rewriting unsupported reference sentences to better reflect the source data. We automatically generate a synthetic dataset of positive and negative revisions by corrupting supported sentences, and we learn to revise reference sentences with contrastive learning. The intensity of revisions is treated as a controllable attribute so that, at inference, diverse candidates can be over-generated and then re-scored to balance faithfulness and abstraction.
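
A minimal sketch of the synthetic corruption step, assuming an entity-swap corruption (the helper and its arguments are illustrative, not the paper's pipeline):

```python
import random

def corrupt(sentence: str, mentions: list[str], distractors: list[str],
            intensity: int) -> str:
    """Build a negative revision by replacing `intensity` supported entity
    mentions with unsupported distractor entities."""
    corrupted = sentence
    for mention in random.sample(mentions, k=min(intensity, len(mentions))):
        corrupted = corrupted.replace(mention, random.choice(distractors))
    return corrupted

# Training pairs: (corrupted sentence + source) -> original supported
# sentence. Conditioning on `intensity` gives the revision model a
# controllable "how much to rewrite" attribute, used at inference to
# over-generate diverse candidates before re-scoring.
```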

What's in a Summary? Laying the Groundwork for Advances in Hospital-Course Summarization

Venue: NAACL, 2021

Excerpt:

Exploratory analyses reveal that the Brief Hospital Course section of the discharge summary is highly abstractive with some long extracted fragments; is concise yet comprehensive; differs in style and content organization from the source notes; exhibits minimal lexical cohesion; and represents silver-standard references for summarization.

Clinical Acronym Expansion via Latent Meaning Cells

Venue: Machine Learning for Health (ML4H) - NeurIPS Workshop, 2020

Excerpt:

We introduce Latent Meaning Cells (LMC), a deep latent variable model that learns contextualized representations of words by combining local lexical context and metadata. We evaluate the LMC model on the task of zero-shot clinical acronym expansion across three datasets. The LMC significantly outperforms a diverse set of baselines at a fraction of the pre-training cost and learns clinically coherent representations. We demonstrate not only that metadata itself is very helpful for the task, but also that the LMC inference algorithm provides an additional large benefit.
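
An illustrative sketch of zero-shot expansion by contextual fit; note the real LMC scores candidates via latent-variable inference over context and metadata, not the simple dot product used here, and `embed` is a hypothetical encoder:

```python
from typing import Callable
import numpy as np

def expand_acronym(
    context: str,
    metadata: str,
    candidates: list[str],
    embed: Callable[[str, str], np.ndarray],  # hypothetical (text, metadata) encoder
) -> str:
    """Rank candidate long forms by how well they fit the acronym's
    context and metadata; return the best-fitting expansion."""
    query = embed(context, metadata)
    scores = [float(query @ embed(c, metadata)) for c in candidates]
    return candidates[int(np.argmax(scores))]

# e.g., expand_acronym("pt with h/o CHF presents with SOB", "progress note",
#                      ["shortness of breath", "standards of behavior"], embed)
```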