Researchers have devised a novel method for learning from unlabeled audio and visual data to improve machine-learning models for tasks such as speech recognition and object detection.

This work combines contrastive learning and masked data modelling, two self-supervised learning techniques, for the first time. Together, they allow machine-learning tasks such as event classification in single- and multi-modal data to scale without requiring annotated data.

Contrastive audio-visual masked autoencoder

The contrastive audio-visual masked autoencoder (CAV-MAE) is a neural network trained on 10-second audio and video clips from YouTube. It learns to extract meaningful latent representations from acoustic and visual data and map them into a high-dimensional space. According to the researchers, the approach is superior to earlier techniques because it explicitly models the links between auditory and visual data.
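As a rough illustration of this setup, the sketch below shows how each modality could be projected into a shared latent space by its own Transformer encoder. It is a minimal PyTorch sketch under assumed dimensions and layer counts, not the authors' implementation; the class and parameter names are illustrative.

```python
# Illustrative sketch of two modality-specific encoders (not the authors' code).
# Assumes the audio spectrogram and the video frames have already been split
# into flat patch vectors of size `patch_dim`.
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    def __init__(self, patch_dim: int, latent_dim: int = 768, depth: int = 4, heads: int = 8):
        super().__init__()
        self.embed = nn.Linear(patch_dim, latent_dim)        # project patches into the latent space
        layer = nn.TransformerEncoderLayer(latent_dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)   # modality-specific Transformer stack

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, num_patches, patch_dim) -> (batch, num_patches, latent_dim)
        return self.encoder(self.embed(patches))

audio_encoder = ModalityEncoder(patch_dim=256)   # spectrogram patches (assumed size)
video_encoder = ModalityEncoder(patch_dim=768)   # image patches from sampled frames (assumed size)
```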

Masked data modelling

In masked data modelling (also known as the prediction method), the audio waveform synchronized with the video is converted to a spectrogram, and 75% of both the video and the spectrogram is masked. After tokenization, the unmasked data is sent through independent audio and visual encoders and then a joint encoder/decoder, where the model is asked to reconstruct the missing data. The reconstruction loss is then used to train the model.

This loss measures the dissimilarity between the predicted output and the raw data it is trying to reconstruct. To illustrate, suppose we cover up parts of a video of a piano and of a spectrogram of piano music and ask the model to recover the masked inputs. On its own, this approach may not capture the association between the video and audio pair, which contrastive learning exploits; however, it does retain modality-unique information, such as the background in a video, that a purely contrastive approach could miss.
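The following sketch shows the general shape of this step: randomly hide 75% of the patch tokens, then score a reconstruction of the hidden patches with a mean-squared error. It is illustrative only; the actual CAV-MAE masking and decoder details differ, and the function names are assumptions.

```python
# Sketch of masking and reconstruction loss (illustrative, not the paper's code).
import torch
import torch.nn.functional as F

def random_mask(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """Keep a random 25% of patch tokens; return them plus the kept/hidden indices.
    For simplicity, the same mask is shared across the batch."""
    _, num_tokens, _ = tokens.shape
    num_keep = int(num_tokens * (1 - mask_ratio))
    perm = torch.randperm(num_tokens)
    keep_idx, masked_idx = perm[:num_keep], perm[num_keep:]
    return tokens[:, keep_idx], keep_idx, masked_idx

def reconstruction_loss(predicted: torch.Tensor, original: torch.Tensor, masked_idx: torch.Tensor):
    """Dissimilarity between the model's prediction and the raw patches that were hidden."""
    return F.mse_loss(predicted[:, masked_idx], original[:, masked_idx])
```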

Contrastive learning

The goal of contrastive learning is to place representations of similar data close together in the latent space. For instance, the model places matching pairs of parrot video and audio data close together, and farther away from pairs of guitar video and audio data. As with masked autoencoding, audio-visual pairs are split and sent to separate modality encoders; the audio and visual components are kept apart within the joint encoder until the model pools the data and applies the contrastive loss. In this way, contrastive learning compares two media sources and looks for the similarities between them.

If a video shows a person speaking and the accompanying audio clip also contains speech, the autoencoder can learn to associate the person's lip movements with the spoken words. It then adjusts the model's parameters so that the two representations are drawn closer together. Finally, the CAV-MAE approach integrates both methods into a single framework by employing multiple forward data streams, masking as a first step, modality-specific encoders, and layer normalization to keep the representation strengths comparable.
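A minimal sketch of how the two objectives could be combined is shown below: a symmetric InfoNCE-style contrastive loss on the pooled audio and video embeddings, added to the reconstruction loss with a weighting factor. The temperature and the weight `lam` are assumed hyperparameters rather than values taken from the paper, and the function names are illustrative.

```python
# Sketch of the contrastive objective and a combined training loss (illustrative).
import torch
import torch.nn.functional as F

def contrastive_loss(audio_vec: torch.Tensor, video_vec: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE-style loss: matching audio/video clips in a batch are pulled
    together, mismatched pairs are pushed apart."""
    a = F.normalize(audio_vec, dim=-1)
    v = F.normalize(video_vec, dim=-1)
    logits = a @ v.t() / temperature                       # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)     # the i-th audio matches the i-th video
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def combined_loss(recon_loss: torch.Tensor, audio_vec: torch.Tensor, video_vec: torch.Tensor,
                  lam: float = 0.01):
    """Total objective: reconstruction term plus a weighted contrastive term (weight assumed)."""
    return recon_loss + lam * contrastive_loss(audio_vec, video_vec)
```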

Evaluation

Using the standard AudioSet (20K and 2M) and VGGSound datasets of labelled, realistic short clips, which may contain multiple sounds, the researchers compared CAV-MAE with other state-of-the-art methods on audio-visual retrieval and audio-visual event classification tasks. Both tasks require the model to recognize a specific type of activity or sound within the data, such as a human voice singing or an engine revving.
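For intuition, audio-visual retrieval can be framed as a nearest-neighbour search in the shared embedding space: embed a query audio clip, then rank the candidate video embeddings by cosine similarity. The sketch below only illustrates the task and is not the paper's evaluation code; the function name is hypothetical.

```python
# Illustrative retrieval step: rank video embeddings by similarity to a query audio embedding.
import torch
import torch.nn.functional as F

def retrieve_videos(query_audio: torch.Tensor, video_bank: torch.Tensor, top_k: int = 5):
    """query_audio: (dim,); video_bank: (num_clips, dim). Returns indices of the top-k matches."""
    sims = F.cosine_similarity(query_audio.unsqueeze(0), video_bank, dim=-1)
    return sims.topk(top_k).indices
```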

Results

They discovered that masked data modelling and contrastive learning complement each other. CAV-MAE's event classification performance was around 2% better than that of prior methods with fully self-supervised pre-training when compared against models with equivalent compute, and it matched or even outperformed models trained with industry-level computational resources, which is quite an achievement. The group's model also performed comparably to models trained with a contrastive loss alone.

Surprisingly, including multi-modal data in CAV-MAE pre-training improves audio-only event classification once the single-modality representation is fine-tuned with supervised learning. This indicates that multi-modal information provides an additional "soft label" boost, much as it does for people, even for audio-only or visual-only tasks; for example, it helps the model decide whether it is hearing an electric or an acoustic guitar, giving it a richer supervision signal.

Conclusion

The researchers believe their contrastive audio-visual masked autoencoder (CAV-MAE) marks a significant milestone and advances the state of the art for applications that require or benefit from audio-visual fusion as they evolve from single- to multi-modality. They point to potential applications in action recognition in sports, classrooms, movies, cars, and police work, and suggest the approach may one day be extended to additional methods.

