MM1 – Apple’s First Large Multimodal Model AI Breakthrough

Updated on March 18, 2024

Apple’s device-first priorities may explain why we haven’t seen much from the company in the AI space, but that seems about to change. MM1 appeared on arXiv seemingly out of nowhere, signaling that Apple is back in the race.

Apple’s researchers have introduced MM1, the company’s first large multimodal model: an AI system that understands both images and text. It represents a major leap for Apple in the artificial intelligence space. With MM1 posting top-notch performance on tasks like visual question answering and image captioning, things are looking exciting.

Our article will explore how this innovation might make your gadgets smarter and life easier.

Overview of Apple’s MM1

Apple has quietly ramped up its AI capabilities, from acquiring AI companies like DarwinAI to systematic AI research. Apple’s MM1 stands as a pioneering large multimodal model, breaking new ground with its ability to train on both text and images.

MM1 operates through a combination of large-scale multimodal pre-training, utilizing both text and images. This approach is pivotal in achieving top-notch few-shot outcomes across various benchmarks.

The model can perform multi-step reasoning over multiple input images using few-shot “chain-of-thought” prompting, demonstrating robust in-context learning. The largest MM1 variant, at 30 billion parameters, shows strong potential on notably complex tasks.

Apple’s MM1 employs advanced techniques for training large language models on text and images simultaneously, a method developed by Apple researchers. By engaging in large-scale multimodal pre-training that spans image-caption, interleaved image-text, and text-only data, MM1 achieves excellent few-shot results across different benchmarks.
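As a rough illustration of mixing those three data types during pre-training, the sketch below shows a weighted sampler over three sources. The source names, placeholder data, and the 45/45/10 mixture weights are hypothetical, not Apple’s actual recipe:

```python
import random

# Hypothetical stand-ins for the three pre-training data types
# described for MM1; contents and weights are illustrative only.
SOURCES = {
    "image_caption": {"weight": 0.45, "data": [("<img_0>", "a dog on a beach")]},
    "interleaved":   {"weight": 0.45, "data": [["Intro text", "<img_1>", "more text"]]},
    "text_only":     {"weight": 0.10, "data": ["a plain text document"]},
}

def sample_batch(batch_size, rng=random):
    """Draw a mixed batch: each example's source type is chosen
    according to the mixture weights, then an example is drawn
    uniformly from that source."""
    names = list(SOURCES)
    weights = [SOURCES[n]["weight"] for n in names]
    batch = []
    for _ in range(batch_size):
        name = rng.choices(names, weights=weights, k=1)[0]
        batch.append((name, rng.choice(SOURCES[name]["data"])))
    return batch

batch = sample_batch(8)
```

In practice the mixture ratio is a tuning knob; the MM1 paper reports that combining all three data types was key to strong few-shot performance.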

Apple’s MM1 pairs advanced AI reasoning with careful tokenization of numbers, and Apple’s researchers explored both dense models and mixture-of-experts variants alongside chain-of-thought reasoning. These techniques help the model navigate the complexities of today’s ever-evolving AI industry.

These capabilities position the MM1 at the forefront of AI development, illustrating Apple’s dedication to investing heavily in this sector.

Also Read: Apple Strengthens AI Capabilities by Acquiring DarwinAI

Key Features of MM1

Apple’s MM1 boasts multimodal capabilities, advanced AI reasoning, and careful handling of number tokenization in large language models, strengths that give it a significant footprint in the AI industry.

Multimodal Capabilities

The MM1 model excels in handling both text and images, making it a powerhouse for understanding and analyzing complex data. With its 30 billion parameters, this large multimodal model from Apple demonstrates elite performance on various benchmarks that measure multimodal abilities.

This means MM1 can interpret information from text and visuals simultaneously, offering more comprehensive insights.

This capability is crucial for applications across social media platforms, where content often combines visual elements with captions or descriptions. The model’s sensitivity to image resolution further enhances its effectiveness.

High-quality inputs lead to sharper analyses, proving that the details in imagery significantly influence MM1’s output quality. By processing these two types of data together effectively, MM1 sets new standards for AI-powered analytics and automation.

Advanced AI Reasoning

Apple’s MM1 showcases groundbreaking advancements in AI reasoning, setting a new standard for how machines understand complex tasks. With its 30 billion parameter model, MM1 demonstrates remarkable in-context learning capabilities.

It can navigate through multi-step reasoning processes with few-shot “chain-of-thought” prompting. This ability is crucial for accurately answering questions or solving problems that require understanding and analyzing multiple pieces of information simultaneously.

MM1 excels in various demanding tasks such as image captioning, visual question answering, and natural language inference by integrating these advanced AI reasoning skills. Its performance lays the groundwork for more intuitive interactions between humans and machines, transforming how we use technology daily.
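MM1’s input format is not public, but few-shot chain-of-thought prompting over multiple images generally means interleaving worked examples with the final query. The sketch below assembles such a prompt using `<image:...>` placeholder tokens; the placeholder convention, field names, and example content are assumptions for illustration:

```python
def build_cot_prompt(examples, query_images, question):
    """Assemble a few-shot chain-of-thought prompt. Each image is
    represented by a placeholder token (a common convention in
    multimodal prompting; MM1's actual format is not public)."""
    parts = []
    for ex in examples:
        imgs = " ".join(f"<image:{i}>" for i in ex["images"])
        parts.append(f"{imgs}\nQ: {ex['question']}\n"
                     f"A: Let's think step by step. {ex['reasoning']} "
                     f"So the answer is {ex['answer']}.")
    imgs = " ".join(f"<image:{i}>" for i in query_images)
    parts.append(f"{imgs}\nQ: {question}\nA: Let's think step by step.")
    return "\n\n".join(parts)

prompt = build_cot_prompt(
    examples=[{
        "images": ["receipt1.png", "receipt2.png"],
        "question": "Which receipt shows the larger total?",
        "reasoning": "The first total is $12.50 and the second is $9.80.",
        "answer": "the first receipt",
    }],
    query_images=["menu.png"],
    question="How much do two coffees cost?",
)
```

The worked example demonstrates the reasoning style the model should imitate, and the trailing “Let’s think step by step.” nudges it to spell out intermediate steps before answering.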

Tokenizing Numbers in LLMs

Tokenizing numbers in large language models (LLMs) like Apple’s MM1 involves breaking down numerical information into smaller, more manageable pieces. This method helps the model better understand and process numbers, much like it does with text data.

By simplifying complex numerical data into tokens, LLMs can perform advanced AI reasoning on a wide range of tasks that include both textual and numerical inputs.

The process ensures that MM1’s 30B dense multimodal model effectively handles calculations and numeric reasoning alongside its impressive capabilities with text-only data. Tokenization makes the integration of different types of data smoother, enabling the model to deliver high-level performance across diverse applications.

This approach is crucial for developing real-world AI systems capable of understanding context from mixed data sources in a coherent manner.
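One common way to tokenize numbers, illustrated below, is to split every number into single-digit tokens while leaving words intact. This toy tokenizer is a sketch of that general idea, not MM1’s actual tokenizer, whose details are not described here:

```python
import re

def tokenize_with_digit_splitting(text):
    """Toy tokenizer: words stay whole, but every number is split
    into single-digit tokens ("127" -> "1", "2", "7") so the model
    sees consistent, manageable numeric pieces. Illustrative only,
    not MM1's actual tokenizer."""
    # Alternation order matters: single digits are matched before
    # alphabetic runs, so digit runs are split apart.
    return re.findall(r"\d|[A-Za-z]+|[^\w\s]", text)

tokens = tokenize_with_digit_splitting("Add 127 and 35")
# -> ['Add', '1', '2', '7', 'and', '3', '5']
```

Splitting digits this way keeps the numeric vocabulary tiny (just 0 through 9) and makes place value explicit, which tends to help models with arithmetic and numeric comparison.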

Also Read: Understanding Multimodal LLM

Comparisons with Pre-existing Models

MM1 was evaluated on various benchmarks measuring multimodal capabilities, including visual question answering (VQAv2, TextVQA), scientific and mathematical reasoning (ScienceQA-IMG, MathVista), and general multimodal understanding (MMBench, SEED-Bench, LLaVA-Bench).

MM1 Performance Comparison

| Benchmark | MM1-30B | Gemini Pro | Gemini Ultra | GPT-4V | LLaVA-NeXT-34B |
|---|---|---|---|---|---|
| ScienceQA-IMG (8-shot) | 44.7% | – | – | – | 51.1% (8-shot) |
| MathVista (0-shot) | 39.4% | 45.2% | – | – | – |
| SEED-Bench (Image Part) | 72.1% | – | – | – | – |
| LLaVA-Bench (Wild) | 87.6% | – | – | – | 87.73% |

Highlighting MM1’s Performance

  • VQAv2 Excellence: MM1 shines in the VQAv2 benchmark with an 83.7% score, closely matching the performance of LLaVA-NeXT-34B and significantly surpassing both Gemini models, demonstrating its robust visual question answering capabilities.
  • TextVQA Unmatched: MM1 stands out in TextVQA with an 81% score, where comparisons aren’t available for the other models, showcasing its unique strength in text and image comprehension tasks.
  • Science and Math: In the ScienceQA-IMG benchmark, MM1 shows commendable performance with a 44.7% score in an 8-shot setting, though LLaVA-NeXT-34B leads this category. For MathVista, MM1’s 0-shot score indicates strong mathematical reasoning, with Gemini Pro showing a slight edge.
  • Multimodal Benchmarks: MM1’s performance on MMBench and its adaptability in ‘in-the-wild’ scenarios, as demonstrated in the LLaVA-Bench, highlight its general multimodal understanding and application versatility.


MM1 emerges as a competitive and versatile model in the multimodal AI domain, with standout performances in both specialized and general multimodal tasks. Its capabilities in integrating and reasoning across textual and visual inputs position it as a key player, often matching or exceeding notable counterparts like LLaVA-NeXT-34B and, on several benchmarks, the Gemini models. This underlines MM1’s balanced strength in both vision and language domains, marking it as a promising model for diverse applications in AI research and real-world scenarios.

Also Read: Mistral Large Outshines GPT4, Claude and ChatGPT


MM1 represents a significant leap by Apple researchers into advancing AI, demonstrating their commitment to bridging the gap between technology’s understanding of images and text. By staying competitive, and even excelling, on rigorous benchmarks, MM1 holds its own against strong counterparts, showcasing Apple’s growing focus on AI.

This model is not just another LLM; it reflects Apple’s focus on making its devices more intuitive and user-friendly. By integrating complex multimodal data processing capabilities, Apple researchers are paving the way for AI-enabled devices that interact in more human-like ways. MM1’s success in understanding and combining visual and textual information highlights the potential for creating smarter, more responsive technology.
