As competition in the generative AI field shifts toward multimodal models, Meta has released a preview of what could be its answer to the models released by frontier labs. Chameleon, its new family of models, has been designed to be natively multimodal rather than stitching together components with different modalities.
While Meta has not released the models yet, the reported experiments show that Chameleon achieves state-of-the-art performance in various tasks, including image captioning and visual question answering (VQA), while remaining competitive in text-only tasks.
The architecture of Chameleon can unlock new AI applications that require a deep understanding of both visual and textual information.
Early-fusion multimodal models
The common way to create multimodal foundation models is to patch together models that have been trained for different modalities. This approach is called "late fusion": the AI system receives inputs in different modalities, encodes each with a separate model, and then fuses the encodings for inference. While late fusion works well, it limits the model's ability to integrate information across modalities and to generate sequences of interleaved images and text.
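The late-fusion pattern can be sketched in a few lines. This is a minimal, hypothetical illustration (the toy encoders and the function names are invented for clarity, not taken from any real system): each modality passes through its own encoder, and the results are only combined at the end.

```python
# Hypothetical sketch of late fusion: each modality is encoded by a
# separate model, and the encodings are only combined at the very end.
# The "encoders" below are toy stand-ins, not real models.

def encode_text(text: str) -> list[float]:
    # stand-in text encoder: one feature per token (token length)
    return [float(len(tok)) for tok in text.split()]

def encode_image(pixels: list[float]) -> list[float]:
    # stand-in image encoder: a single pooled feature (mean intensity)
    return [sum(pixels) / len(pixels)]

def late_fusion(text: str, pixels: list[float]) -> list[float]:
    # modalities never interact inside the encoders;
    # fusion happens only here, by concatenating the two encodings
    return encode_text(text) + encode_image(pixels)

features = late_fusion("a red square", [0.9, 0.1, 0.1])
```

The key limitation the article describes is visible in the structure: because the text and image pathways are separate until the final concatenation, neither encoder can condition on the other modality while processing its input.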
Chameleon uses an “early-fusion token-based mixed-modal” architecture, which means it has been designed from the ground up to learn from an interleaved mixture of images, text, code and other modalities. Chameleon transforms images into discrete tokens, as language models do with words. It also uses a unified vocabulary that consists of text, code and image tokens. This makes it possible to apply the same transformer architecture to sequences that contain both image and text tokens.
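The unified-vocabulary idea can be illustrated with a toy sketch. The sizes and offset scheme below are assumptions for illustration only (Chameleon's actual tokenizer details are in the paper): discrete image codes are mapped into the same ID space as text tokens, so a single transformer can consume one flat, interleaved sequence.

```python
# Hypothetical sketch of an early-fusion unified vocabulary: text tokens
# and discrete image tokens share one ID space, so a single transformer
# can process interleaved image-text sequences. Sizes are illustrative.

TEXT_VOCAB_SIZE = 65_536      # text/code token IDs occupy 0 .. 65535
IMAGE_CODEBOOK_SIZE = 8_192   # discrete codes from an image tokenizer

def image_token(code: int) -> int:
    # image codes are offset past the text vocabulary so the two
    # modalities never collide in the shared ID space
    assert 0 <= code < IMAGE_CODEBOOK_SIZE
    return TEXT_VOCAB_SIZE + code

def interleave(text_ids: list[int], image_codes: list[int]) -> list[int]:
    # one flat sequence over the unified vocabulary; in practice text
    # and image spans can be interleaved in any order
    return list(text_ids) + [image_token(c) for c in image_codes]

seq = interleave([17, 42], [5, 8191])
# every ID in `seq` lives in a single vocabulary of size
# TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE, so one embedding table and
# one transformer handle both modalities
```

Because the model sees a single token stream, the same attention layers mix information across modalities from the first layer onward, which is what distinguishes this design from the late-fusion approach described above.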
According to the researchers, the most similar model to Chameleon is Google Gemini, which also uses an early-fusion token-based approach. However, Gemini uses separate image decoders in the generation phase, whereas Chameleon is an end-to-end model that both processes and generates tokens.
“Chameleon’s unified token space allows it to seamlessly reason over and generate interleaved image and text sequences, without the need for modality-specific components,” the researchers write.
While early fusion is very appealing, it presents significant challenges when training and scaling the model. To overcome these challenges, the researchers employed a series of architectural modifications and training techniques. In their paper, they share the details about the different experiments and their effects on the model.
The training of Chameleon takes place in two stages, with a dataset containing 4.4 trillion tokens of text, image-text pairs, and sequences of interleaved text and images. The researchers trained 7-billion- and 34-billion-parameter versions of Chameleon for more than 5 million Nvidia A100 80GB GPU hours.
Chameleon in action
According to the experiments reported in the paper, Chameleon can perform a diverse set of text-only and multimodal tasks. On visual question answering (VQA) and image captioning benchmarks, Chameleon-34B achieves state-of-the-art performance, outperforming models like Flamingo, IDEFICS and Llava-1.5.
According to the researchers, Chameleon matches the performance of other models with “much fewer in-context training examples and with smaller model sizes, in both pre-trained and fine-tuned model evaluations.”
One of the tradeoffs of multimodality is a performance drop in single-modality requests. For example, vision-language models tend to have lower performance on text-only prompts. But Chameleon remains competitive on text-only benchmarks, matching models like Mixtral 8x7B and Gemini-Pro on commonsense reasoning and reading comprehension tasks.
Interestingly, Chameleon can unlock new capabilities for mixed-modal reasoning and generation, especially when the prompts expect mixed-modal responses with text and images interleaved. Experiments with human-evaluated responses show that overall, users preferred the multimodal documents generated by Chameleon.
In the past week, both OpenAI and Google revealed new models that provide rich multimodal experiences. However, they have not released much detail on the models. If Meta follows its usual playbook and releases the weights for Chameleon, it could become an open alternative to private models.
Early fusion can also inspire new directions for research on more advanced models, especially as more modalities are added to the mix. For example, robotics startups are already experimenting with the integration of language models into robotics control systems. It will be interesting to see how early fusion can also improve robotics foundation models.
“Chameleon represents a significant step towards realizing the vision of unified foundation models capable of flexibly reasoning over and generating multimodal content,” the researchers write.