Researchers at MIT have built a dataset to improve automatic captioning systems and help people write better chart captions.

Using this tool, researchers can train a machine-learning model to vary the complexity and type of content in a chart caption based on users' needs. They found that models trained on the dataset generated accurate, semantically rich captions that described data trends and complex patterns. Quantitative and qualitative studies show that these models captioned charts more effectively than previous auto-captioning systems.

Chart captions

Researchers can now use a new tool to build machine-learning algorithms that generate more sophisticated chart captions. Providing captions for previously uncaptioned web charts has the potential to improve accessibility for persons with visual impairments.

Chart captions that explain complex trends and patterns help readers understand and remember the data. For people who cannot see a chart, its caption is often the only way to access the information it contains. But writing captions that are clear and informative takes considerable effort. Auto-captioning can help, yet it often struggles to describe the cognitive features that give a chart deeper meaning.

VisText

The researchers aim to share the VisText dataset as a tool for others working on the difficult challenge of chart auto-captioning. Such automated systems could provide captions for uncaptioned web charts and improve accessibility for people with visual impairments. VisText was inspired by earlier work in the Visualisation Group that investigated what makes an effective chart caption. In that study, the researchers found that sighted users and blind or low-vision users had distinct preferences for the complexity of a caption's semantic content. The team wanted to bring that human-centred analysis into auto-captioning research, so they created VisText: a dataset of charts and captions that can be used to train machine-learning models to generate accurate, semantically rich, and customised captions.

Auto captioning systems

Creating effective auto-captioning systems is difficult. Existing machine-learning models often try to caption a chart the way they would a natural image, yet people and models interpret natural images differently than they do charts. Other methods skip the visual content and caption a chart from its underlying data table, but such tables are often unavailable once a chart has been published. Because of these limitations of images and data tables, VisText also represents charts as scene graphs: chart images are converted into scene graphs that retain all of the chart's data along with additional image context.
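
To make the idea concrete, the sketch below shows a toy scene-graph-style representation of a bar chart and one simple way to linearise it into text for a sequence model. The structure and field names are illustrative assumptions, not the exact VisText schema.

```python
# A minimal sketch of a scene-graph-style chart representation (illustrative,
# not the VisText format). The graph keeps the underlying data values together
# with visual context such as the mark type and axis titles.
scene_graph = {
    "chart": {
        "mark": "bar",
        "x_axis": {"title": "Year", "values": [2019, 2020, 2021]},
        "y_axis": {"title": "Revenue (USD millions)"},
        "marks": [
            {"x": 2019, "y": 120},
            {"x": 2020, "y": 95},
            {"x": 2021, "y": 140},
        ],
    }
}

def flatten(node, prefix=""):
    """Linearise the nested graph into text tokens a sequence model can read."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from flatten(value, f"{prefix}{key} ")
    elif isinstance(node, list):
        for item in node:
            yield from flatten(item, prefix)
    else:
        yield f"{prefix.strip()} {node}"

print(" | ".join(flatten(scene_graph)))
```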

Dataset

The researchers built a dataset of more than 12,000 charts. Each chart is represented as a data table, an image, and a scene graph, together with associated captions. Every chart has two kinds of caption: a low-level caption that describes the chart's construction (such as its axis ranges) and a higher-level caption that describes statistics, relationships in the data, and complex trends. The researchers generated the low-level captions with an automated system and crowdsourced the higher-level captions from human workers.
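
The sketch below illustrates how one entry in such a dataset might be organised. The field names and caption levels are hypothetical, shown only to make the structure of a chart-plus-captions record concrete.

```python
from dataclasses import dataclass

# Hypothetical layout for one chart entry; the field names are illustrative,
# not the exact VisText column names.
@dataclass
class ChartEntry:
    image_path: str    # rendered chart image
    data_table: list   # underlying data rows of the chart
    scene_graph: dict  # structured representation of the rendered chart
    caption_low: str   # low-level caption: chart construction (axes, ranges)
    caption_high: str  # higher-level caption: statistics, relations, trends

entry = ChartEntry(
    image_path="charts/0001.png",
    data_table=[{"Year": 2020, "Sales": 95}, {"Year": 2021, "Sales": 140}],
    scene_graph={"mark": "bar", "x": "Year", "y": "Sales"},
    caption_low="A bar chart of Sales by Year, with Sales ranging from 95 to 140.",
    caption_high="Sales rose sharply between 2020 and 2021.",
)
```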

Semantic prefix tuning 

After collecting the chart images and captions, the researchers used VisText to train five machine-learning auto-captioning models. They wanted to see how each representation (image, data table, and scene graph) and their combinations affected caption quality. Their results showed that models trained with scene graphs outperformed those trained with data tables, and the researchers argue that scene graphs may be a more useful representation because they are easier to extract from existing charts.

They also trained models separately on low-level and high-level captions. Using this strategy, known as semantic prefix tuning, they taught the model to vary the complexity of a caption's content. They then performed a qualitative assessment of the captions generated by their best-performing method and classified six types of common errors. A directional error, for example, occurs when a model says a trend is decreasing when it is actually increasing.
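
The sketch below shows the general idea of prefix-based control using an off-the-shelf T5-style model from the Hugging Face transformers library. The prefix strings and checkpoint name are illustrative assumptions; a model would need to be fine-tuned on chart-caption data before such prefixes actually change the output.

```python
# A minimal sketch of semantic prefix tuning with a generic seq2seq model.
# The prefixes and checkpoint are illustrative, not the exact strings or
# models used by the VisText authors.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

linearized_chart = "mark bar | x Year 2020 2021 | y Sales 95 140"

# Prepending a semantic-level prefix tells the (fine-tuned) model which kind
# of caption to produce: chart construction vs. statistics and trends.
for prefix in ("caption level L1:", "caption level L2L3:"):
    inputs = tokenizer(f"{prefix} {linearized_chart}", return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=60)
    print(prefix, tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Because the level indicator is just part of the input text, a single trained model can produce either style of caption at inference time simply by switching the prefix.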

Conclusion

This fine-grained, thorough qualitative assessment was critical for understanding how the model makes its errors. Under quantitative metrics, for example, a directional error may incur the same penalty as a repetition error, in which the model repeats the same word or phrase. However, a directional error may be far more misleading to a user than a repetition error. These errors illustrate the limitations of current models and raise ethical concerns that researchers must consider as they build auto-captioning systems.

Furthermore, generative machine-learning models such as ChatGPT can hallucinate or provide misleading information. Using these models to automatically caption existing charts has clear benefits, but if the charts are captioned incorrectly, they could spread misinformation.

