Workshop by Neuraptic Labs.

Author: Marco D’Alessandro

Multimodal Deep Learning is one of the most exciting scientific challenges of our time. It enables machines to build rich, complex representations of multimodal stimuli by capturing the semantics shared between uni-modal components, as human brains do. In this workshop, we presented an overview of some of the most interesting multimodal architectures introduced in the past two years, reviewing the state of the art (SOTA) and the future directions of Multimodal Learning through concrete examples.



Workshop Multimodal AI SOTA (PDF) by Neuraptic Labs.

The workshop focused on both single-task and multi-task multimodal models. It first looked at some interesting ideas from DeepMind and Google Research: the Perceiver model, a modality-agnostic transformer that handles unprocessed data while also reducing its dimensionality, and the Attention Bottleneck architecture, which explicitly models multimodal neurons alongside modality-specific ones and processes shared semantics in a bottleneck layer.
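The dimensionality reduction attributed to the Perceiver above can be illustrated with a small sketch: a short array of latent vectors attends over a much longer input array, so the output has as many rows as there are latents, whatever the input length. This is a minimal illustration and not the published implementation. The sizes `M`, `N`, and `D` and the helper names are assumptions, random values stand in for learned latents and real data, and the learned query/key/value projections are omitted.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(latents, inputs):
    """Latents (N x D) act as queries over the inputs (M x D).

    Returns an N x D array: one summary vector per latent.
    Learned projections are left out to keep the sketch short.
    """
    d = len(latents[0])
    keys_t = [list(col) for col in zip(*inputs)]  # D x M
    scores = matmul(latents, keys_t)  # N x M
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, inputs)  # N x D


rng = random.Random(0)
M, N, D = 512, 8, 16  # hypothetical sizes: 512 input elements, 8 latents
inputs = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(M)]
latents = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(N)]

summary = cross_attend(latents, inputs)
print(len(summary), len(summary[0]))  # 8 16
```

Because the score matrix is N x M rather than M x M, the cost of this step grows linearly with the input length, which is what makes raw, high-dimensional inputs tractable in this kind of design.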

Next, several architectural sub-modules of three-modality transformer models were analyzed to study the main building blocks of multimodal information processing. In particular, the Audio-Visual Dual-Stream Retrieval model proposed a CLIP-like approach to video retrieval with text supervision, in effect combining the video, audio, and text modalities, while the Video-Audio-Text Transformer (VATT) model introduced the DropToken module to reduce computational complexity when dealing with high-dimensional, combined audio-video-text tokenized stimuli.
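The idea behind DropToken can be sketched in a few lines: during training, a random subset of the input tokens is discarded before the transformer sees them, and since self-attention cost grows with the square of the sequence length, the savings are more than proportional. The function name `drop_token`, the `drop_rate` parameter, and the integer stand-in tokens are assumptions made for illustration; this is not the VATT code. The sketch also assumes positional information has already been attached to each token, so keeping the survivors in their original order is only a convenience.

```python
import random


def drop_token(tokens, drop_rate, rng):
    """Keep a random subset of tokens, preserving their original order."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    if not tokens:
        return []
    # Always keep at least one token so the sequence never becomes empty.
    keep = max(1, round(len(tokens) * (1.0 - drop_rate)))
    kept_idx = sorted(rng.sample(range(len(tokens)), keep))
    return [tokens[i] for i in kept_idx]


tokens = list(range(1000))  # stand-ins for 1000 tokenized patches
kept = drop_token(tokens, drop_rate=0.5, rng=random.Random(0))

# Relative number of query-key pairs in self-attention after dropping.
attention_ratio = (len(kept) ** 2) / (len(tokens) ** 2)
print(len(kept), attention_ratio)  # 500 0.25
```

With half of the tokens dropped, the number of attention pairs falls to a quarter of the original. At inference time the full token sequence would normally be used.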

Finally, examples of multi-task multimodal models from Facebook and DeepMind research, among others, were considered. The Unified Transformer (UniT) model proposed an encoder-decoder model that jointly processes a concatenation of encoded uni-modal representations together with a latent encoded representation of the specific task to solve. The One For All (OFA) model relied on a unified text-image-object vocabulary and a simple encoder-decoder to learn to solve up to 8 tasks, both cross-modal and uni-modal. The Uni-Perceiver model proposed a shared encoder that learns latent representations of both the input and the target of an impressive number of different tasks by learning to approximate their joint probability. The most recent model, Flamingo, introduced structured input text interleaved with images, as well as conditional decoders, while letting information flow through a combined architecture of learnable and frozen modules.


Multimodal Learning models are highly modular, and concepts like modality fusion and dimensionality reduction can be handled by specific building blocks within the information-processing pipeline. Both encoder-only and encoder-decoder models can achieve impressive results in multi-task learning, leaving researchers room for flexible architectural choices when hardware or time constraints come into play.

Neuraptic Labs is the technology and research center of Neuraptic AI, the developer of ENAIA, a Multimodal Machine Learning Operations (MMLOps) platform capable of training any AI, no matter how specific the task, and able to transform any kind of input (images, text, tables), and combinations thereof, into results.

ENAIA wants any company to be able to have an Artificial Intelligence that is affordable, easy to use and fully adapted to its needs.

About the author: Marco D'Alessandro

Ph.D. in Cognitive Science, Data Scientist and postdoctoral researcher in Computational Cognitive Modeling at Neuraptic AI and the National Research Council of Italy.
