Understanding Multimodal Large Language Models: Feature Extraction and Modality-Specific Encoders

Understanding how multimodal Large Language Models (LLMs) integrate text, image, video, and audio features into a shared embedding space is key to leveraging their full potential. This blog delves into the architectural intricacies that enable these models to seamlessly process diverse data types.


Introduction

Large Language Models (LLMs) like GPT, BERT, and their multimodal counterparts have transformed the way we interact with artificial intelligence. The ability to process not only text but also images, videos, and audio opens up a new frontier of possibilities. These models, known as multimodal LLMs, integrate various data types into a single, shared embedding space. But how do they achieve this? In this blog post, we'll explore the architectural nuances behind multimodal LLMs, focusing on feature extraction, modality-specific encoders, and the process of generating tokens that reside in the same embedding space.

The Challenge of Multimodal Data

Traditional LLMs are designed to work primarily with text data. They excel at understanding and generating natural language, thanks to their ability to map words to vectors in a high-dimensional space—a process known as embedding. However, incorporating non-text data like images, videos, and audio presents unique challenges:

  • Dimensionality Differences: Text, images, videos, and audio have fundamentally different structures and dimensions. While text can be represented as a sequence of tokens, images are matrices of pixel values, videos are sequences of image frames, and audio is a waveform over time.
  • Feature Extraction: Each modality requires its own method of feature extraction to convert raw data into a form that can be integrated into a shared embedding space.
  • Alignment: The extracted features from different modalities need to be aligned so that they can coexist in the same embedding space, enabling the model to generate coherent outputs across multiple data types.

To address these challenges, multimodal LLMs employ modality-specific encoders and advanced architectural techniques.

Modality-Specific Encoders

At the heart of any multimodal LLM are modality-specific encoders. These encoders convert raw data from each modality into feature representations that can then be projected into a shared embedding space. Let’s break down how each modality is typically handled:

1. Text Encoders

Text encoding is a well-established process in the realm of natural language processing (NLP). The basic idea is to convert a sequence of words (or tokens) into vectors that represent their semantic meaning in a high-dimensional space. Common architectures include:

  • Transformer-Based Models: Models like GPT and BERT use stacked transformer layers to capture contextual information from text. The input is split into tokens, and self-attention models the relationships between these tokens to produce contextual embeddings.
  • Word Embeddings: Pre-trained embeddings like Word2Vec or GloVe provide a dense vector representation of words based on their co-occurrence in large corpora. These embeddings serve as the input to further processing layers.

The output of the text encoder is a sequence of vectors, where each vector corresponds to a token in the input text.
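
As a concrete illustration, here is a minimal sketch (assuming PyTorch and the Hugging Face transformers library are available; the model name is only an illustrative choice) of a pretrained transformer encoder turning a sentence into one embedding vector per token:

    import torch
    from transformers import AutoTokenizer, AutoModel

    # Illustrative choice of a small pretrained text encoder
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    inputs = tokenizer("A dog chasing a ball", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # One embedding vector per input token: shape (1, num_tokens, hidden_size)
    token_embeddings = outputs.last_hidden_state
    print(token_embeddings.shape)  # e.g. torch.Size([1, 7, 768])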

2. Image Encoders

Images require a completely different approach. The goal is to transform the image into a set of features that can be represented in the same space as text tokens. Common strategies include:

  • Convolutional Neural Networks (CNNs): CNNs are the go-to architecture for image processing. Layers of convolutional filters are applied to the image to extract hierarchical features—edges, textures, objects, and more.
  • Vision Transformers (ViTs): Inspired by the success of transformers in NLP, ViTs treat an image as a sequence of patches, where each patch is a small section of the image. These patches are embedded into vectors, which are then processed by transformer layers.
  • ResNet, Inception, EfficientNet: These are popular CNN architectures that are often used as the backbone for image encoders in multimodal models. They provide robust feature extraction capabilities.

The output of the image encoder is typically a set of feature vectors, each representing a specific region or patch of the image.
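
Here is a minimal sketch of this idea, assuming a recent version of torchvision; the ResNet-50 backbone and the 224x224 input size are illustrative choices. The classifier head is removed so the network yields a grid of region features rather than class scores:

    import torch
    from torchvision.models import resnet50, ResNet50_Weights

    # Illustrative backbone; any CNN (or a ViT) could play this role
    backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
    backbone = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop pooling and classifier
    backbone.eval()

    image = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed RGB image
    with torch.no_grad():
        feature_map = backbone(image)  # (1, 2048, 7, 7): one 2048-d vector per spatial region

    # Flatten the 7x7 grid into 49 region feature vectors
    region_features = feature_map.flatten(2).transpose(1, 2)  # (1, 49, 2048)
    print(region_features.shape)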

3. Video Encoders

Video data is more complex than static images because it includes the temporal dimension. Encoding video data typically involves:

  • 3D Convolutional Networks (3D CNNs): These networks extend the concept of 2D CNNs into three dimensions, allowing them to capture spatiotemporal features from video frames.
  • Recurrent Neural Networks (RNNs): RNNs, particularly LSTMs or GRUs, can be used to process sequences of frames, capturing temporal dependencies across them.
  • Transformer-Based Models: Similar to ViTs, some models use transformers to process video frames as a sequence of images. Temporal information is captured through the attention mechanism.

The output of the video encoder is a sequence of vectors, where each vector corresponds to a specific frame or a segment of frames.
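
A minimal sketch of the frame-encoder-plus-temporal-transformer pattern, using plain PyTorch with toy dimensions (the tiny CNN below stands in for a real image backbone):

    import torch
    import torch.nn as nn

    # Toy video: batch of 1 clip, 16 frames, each 3x112x112
    frames = torch.randn(1, 16, 3, 112, 112)

    # Per-frame 2D CNN (stand-in for a real image backbone)
    frame_encoder = nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),          # -> one 64-d vector per frame
    )

    b, t, c, h, w = frames.shape
    per_frame = frame_encoder(frames.view(b * t, c, h, w)).view(b, t, -1)  # (1, 16, 64)

    # Temporal transformer mixes information across frames via self-attention
    temporal = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
        num_layers=2,
    )
    video_tokens = temporal(per_frame)  # (1, 16, 64): one vector per frame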

4. Audio Encoders

Audio data is often represented as a waveform or a spectrogram (a visual representation of the frequency spectrum). Audio encoders typically involve:

  • Recurrent Neural Networks (RNNs): LSTM or GRU networks are commonly used to process sequences of audio features extracted from the waveform.
  • Convolutional Neural Networks (CNNs): CNNs can be applied to spectrograms to extract features, similar to how they process images.
  • WaveNet and Transformers: Advanced architectures like WaveNet model the audio signal directly, while transformers can be adapted to handle sequences of audio tokens.

The output of the audio encoder is a sequence of vectors that represent different aspects of the audio signal, such as pitch, tone, and rhythm.
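
The spectrogram-plus-CNN route can be sketched as follows, assuming torchaudio is installed; the sample rate, mel resolution, and tiny CNN are illustrative:

    import torch
    import torch.nn as nn
    import torchaudio

    # One second of 16 kHz audio (stand-in for a real waveform)
    waveform = torch.randn(1, 16000)

    # Convert the waveform to a mel spectrogram, then treat it like an image
    mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=64)
    spectrogram = mel(waveform).unsqueeze(1)  # (1, 1, 64, time_frames)

    audio_cnn = nn.Sequential(
        nn.Conv2d(1, 32, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d((1, None)),  # pool over the frequency axis, keep time
    )
    features = audio_cnn(spectrogram)                    # (1, 32, 1, time_frames)
    audio_tokens = features.squeeze(2).transpose(1, 2)   # (1, time_frames, 32): one vector per time step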

Aligning Multimodal Features in a Shared Embedding Space

Once the features are extracted from the different modalities, the next challenge is to align them in a shared embedding space. This step is crucial because it enables the model to understand and generate content that integrates multiple modalities seamlessly.

1. Feature Projection

After feature extraction, the resulting vectors from different encoders may exist in different spaces due to their unique dimensionality and nature. To bring them into a common space, a projection step is often required. This involves:

  • Linear Projection: A simple learned linear transformation can map features from each modality into the shared embedding space (a minimal sketch follows this list).
  • Attention Mechanisms: Attention layers can be used to align features by learning cross-modal relationships. For example, in the case of image-captioning models, attention can be used to highlight relevant image regions based on the corresponding text.
  • Cross-Modal Transformers: These are specialized transformers that process inputs from different modalities simultaneously, learning to align and fuse features through shared attention mechanisms.
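
A minimal sketch of the linear-projection approach, with illustrative feature sizes (e.g. 768-dimensional text features and 2048-dimensional image region features) standing in for real encoder outputs:

    import torch
    import torch.nn as nn

    shared_dim = 512  # illustrative size of the shared embedding space

    # Each modality gets its own projection into the shared space
    project_text  = nn.Linear(768,  shared_dim)   # e.g. BERT-sized text features
    project_image = nn.Linear(2048, shared_dim)   # e.g. ResNet-sized region features
    project_audio = nn.Linear(32,   shared_dim)   # e.g. small audio-CNN features

    text_tokens  = project_text(torch.randn(1, 7, 768))
    image_tokens = project_image(torch.randn(1, 49, 2048))
    audio_tokens = project_audio(torch.randn(1, 81, 32))

    # After projection, tokens from all modalities share the same width and can be
    # concatenated into one sequence for a downstream transformer
    multimodal_sequence = torch.cat([text_tokens, image_tokens, audio_tokens], dim=1)
    print(multimodal_sequence.shape)  # torch.Size([1, 137, 512])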

2. Embedding Space Unification

The ultimate goal is to ensure that features from all modalities coalesce into a unified embedding space. This space allows the model to treat different types of data equivalently, facilitating tasks like:

  • Cross-Modal Retrieval: Given a text query, the model can retrieve relevant images or videos, and vice versa.
  • Multimodal Generation: The model can generate text descriptions from images or synthesize images based on textual prompts.
  • Integrated Understanding: The model can reason across modalities, such as answering questions about a video or analyzing sentiment in an audio clip.

Achieving this unification often requires joint training on large multimodal datasets, where the model learns to minimize the distance between related features from different modalities in the embedding space.
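
Once the space is unified, cross-modal retrieval reduces to a nearest-neighbor search. The sketch below uses random vectors as stand-ins for already-projected embeddings and ranks candidate images against a text query by cosine similarity:

    import torch
    import torch.nn.functional as F

    # Hypothetical embeddings already projected into the shared space
    text_query = torch.randn(1, 512)    # embedding of a text query
    image_bank = torch.randn(100, 512)  # embeddings of 100 candidate images

    # Cosine similarity in the shared space ranks images against the text query
    similarity = F.cosine_similarity(text_query, image_bank, dim=-1)  # (100,)
    best_match = similarity.argmax().item()
    print(f"Most similar image index: {best_match}")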

Tokenization in Multimodal LLMs

Tokenization is the process of converting raw input data into a sequence of tokens that the model can process. In the context of multimodal LLMs, tokenization extends beyond text to include tokens for images, videos, and audio.

1. Text Tokenization

Text tokenization is straightforward, as it involves breaking down text into words or subwords (using methods like BPE or WordPiece) and then mapping these to token IDs.
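
For example, using the Hugging Face transformers library (the WordPiece tokenizer name is an illustrative choice; BPE tokenizers such as GPT-2's work analogously):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    tokens = tokenizer.tokenize("Multimodal models are powerful")
    ids = tokenizer.convert_tokens_to_ids(tokens)
    print(tokens)  # subword pieces produced by WordPiece
    print(ids)     # the corresponding integer token IDs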

2. Image Tokenization

For images, tokenization can be approached in several ways:

  • Patch Tokens: In ViTs, images are divided into fixed-size patches, and each patch is treated as a token. These tokens are embedded into vectors that the model processes (see the sketch after this list).
  • Region Tokens: In detector-based pipelines, an object detector identifies regions of interest (RoIs) in the image, and each region is treated as a token.
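
A minimal sketch of the patch-token approach, cutting a 224x224 image into 16x16 patches and embedding each one with a learned linear layer (all sizes are illustrative):

    import torch
    import torch.nn as nn

    image = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed image
    patch_size = 16

    # Split the image into non-overlapping 16x16 patches and flatten each one
    patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size * patch_size)
    print(patches.shape)  # torch.Size([1, 196, 768]): 14x14 = 196 patch tokens

    # A learned linear layer turns each flattened patch into a patch embedding
    patch_embed = nn.Linear(3 * patch_size * patch_size, 768)
    patch_tokens = patch_embed(patches)  # (1, 196, 768)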

3. Video Tokenization

Video tokenization extends image tokenization across the temporal dimension:

  • Frame Tokens: Individual frames or sequences of frames can be tokenized, similar to how patches are tokenized in images.
  • Spatiotemporal Tokens: Some models tokenize spatiotemporal regions, capturing both spatial and temporal information in each token.
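
A minimal sketch of spatiotemporal tokenization, grouping frames into short tubes and cutting each tube into patches (the clip size, tube length, and patch size are illustrative):

    import torch

    # Toy clip: 16 frames of 3x224x224 (stand-in for a decoded video)
    video = torch.randn(1, 16, 3, 224, 224)
    patch, tube = 16, 2  # spatial patch size and temporal tube length

    b, t, c, h, w = video.shape
    # Group frames into tubes of 2 and cut each tube into 16x16 patches,
    # giving one spatiotemporal token per (tube, patch) cell
    tokens = video.reshape(b, t // tube, tube, c, h // patch, patch, w // patch, patch)
    tokens = tokens.permute(0, 1, 4, 6, 2, 3, 5, 7).reshape(b, -1, tube * c * patch * patch)
    print(tokens.shape)  # torch.Size([1, 1568, 1536]): 8 tubes x 14x14 patches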

4. Audio Tokenization

Audio tokenization typically involves:

  • Frame Tokens: Audio is split into short frames (e.g., 20 ms), and each frame is treated as a token (a minimal sketch follows this list).
  • Frequency Tokens: Alternatively, the frequency components of the audio (e.g., in a spectrogram) can be tokenized.
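
A minimal sketch of the frame-token approach, cutting a raw waveform into non-overlapping 20 ms frames (the sample rate is an illustrative choice):

    import torch

    # One second of 16 kHz audio (stand-in for a real recording)
    waveform = torch.randn(1, 16000)
    sample_rate = 16000
    frame_ms = 20
    frame_len = sample_rate * frame_ms // 1000  # 320 samples per 20 ms frame

    # Cut the waveform into non-overlapping 20 ms frames; each frame becomes one token
    num_frames = waveform.shape[1] // frame_len
    frames = waveform[:, : num_frames * frame_len].reshape(1, num_frames, frame_len)
    print(frames.shape)  # torch.Size([1, 50, 320]): 50 frame tokens of 320 samples each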

Training Multimodal LLMs

Training multimodal LLMs is a complex process that requires large-scale data from various modalities and sophisticated training strategies:

1. Pretraining

Like text-only LLMs, multimodal models are often pretrained on vast amounts of data using self-supervised learning objectives. Common pretraining tasks include:

  • Masked Language Modeling (MLM): Masking parts of the text input and training the model to predict the masked tokens.
  • Masked Image Modeling (MIM): Similar to MLM, but applied to images, where parts of the image are masked and the model must predict the missing pixels.
  • Contrastive Learning: Pairs of data from different modalities (e.g., an image and its corresponding caption) are used to train the model to bring related features closer in the embedding space while pushing unrelated ones apart.
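
A minimal sketch of a CLIP-style contrastive objective over a batch of image-caption pairs, with random vectors standing in for projected embeddings and an illustrative temperature:

    import torch
    import torch.nn.functional as F

    # Hypothetical batch of 8 paired examples already projected into the shared space
    image_embeds = F.normalize(torch.randn(8, 512), dim=-1)
    text_embeds  = F.normalize(torch.randn(8, 512), dim=-1)

    # Similarity matrix: entry (i, j) compares image i with caption j
    temperature = 0.07
    logits = image_embeds @ text_embeds.t() / temperature

    # Matching pairs sit on the diagonal; the loss pulls them together and pushes
    # mismatched pairs apart, computed in both the image->text and text->image directions
    targets = torch.arange(8)
    loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
    print(loss.item())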

2. Fine-Tuning

After pretraining, multimodal LLMs are fine-tuned on specific tasks using labeled data. Fine-tuning adapts the model to the nuances of particular applications, such as image captioning, video analysis, or audio transcription.

Applications of Multimodal LLMs

The ability of multimodal LLMs to process and generate content across multiple modalities unlocks a wide range of applications:

  • Image Captioning: Automatically generating textual descriptions for images.
  • Video Understanding: Analyzing and summarizing video content, such as generating a summary of key events.
  • Audio Transcription and Synthesis: Converting speech to text or generating speech from text inputs.
  • Cross-Modal Search: Enabling users to search for images, videos, or audio clips using textual queries.

Conclusion

Multimodal Large Language Models represent a significant leap forward in the field of AI, enabling seamless integration of text, image, video, and audio data. By understanding the architectural components—feature extraction, modality-specific encoders, and the alignment of features in a shared embedding space—we can appreciate the complexity and power of these models.

As these models continue to evolve, they will undoubtedly play an increasingly important role in a wide range of applications, from content creation to advanced search and beyond. Understanding their inner workings is essential for anyone looking to leverage the full potential of multimodal LLMs.


This blog post provides an in-depth look at the architectures that power multimodal LLMs, offering readers the knowledge they need to understand and work with these cutting-edge models. By grasping the principles of feature extraction, modality-specific encoding, and embedding space alignment, developers can unlock new possibilities in AI-driven applications.