Microsoft releases Florence-2, a unified model for handling a variety of vision tasks

Today, Microsoft’s Azure AI team released Florence-2, a new vision foundation model, on Hugging Face.

Available under a permissive MIT license, the model can handle a variety of vision and vision-language tasks using a unified, prompt-based representation. It comes in two sizes, 232M and 771M parameters, and already excels at tasks such as captioning, object detection, visual grounding, and segmentation, performing on par with or better than many much larger vision models.
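In practice, that prompt-based interface is exposed through the standard Hugging Face transformers workflow. Below is a minimal sketch of loading the base checkpoint and requesting a caption; the task-token prompt ("<CAPTION>") and the post-processing helper follow our reading of the model card and should be treated as assumptions to verify, not a definitive API reference.

```python
# Minimal sketch: prompting Florence-2 via Hugging Face transformers.
# Assumes the usage pattern shown on the model card (trust_remote_code,
# task-token prompts such as "<CAPTION>"); verify against the card before use.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-base"  # the 232M checkpoint; "-large" is 771M
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("example.jpg")  # placeholder path for any RGB image

prompt = "<CAPTION>"  # swapping this token switches the task, no new weights needed
inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=256,
)
raw_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
# post_process_generation turns the raw token string into a structured result
result = processor.post_process_generation(raw_text, task=prompt, image_size=image.size)
print(result)
```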

Although the model’s real-world performance has yet to be tested, the work is expected to give enterprises a single, unified approach to handling different types of vision applications. This would save them from investing in separate, task-specific vision models that cannot go beyond their primary function without extensive fine-tuning.

What makes Florence-2 unique?

Today, large language models (LLMs) sit at the heart of enterprise operations. A single model can write summaries, produce marketing copy, and even handle customer service in many cases. That level of adaptability across domains and tasks has been remarkable. But this success has also led researchers to wonder: can vision models, which have been largely task-specific, do the same?


At their core, vision tasks are more complex than text-based natural language processing (NLP). They require a comprehensive faculty of perception. Essentially, to achieve a universal representation of a wide range of vision tasks, a model must be able to understand spatial data at different scales, from broad image-level concepts such as object location down to granular pixel details, as well as semantic granularity ranging from high-level captions to detailed descriptions.

When Microsoft tried to solve this, it identified two key obstacles: the scarcity of comprehensively annotated visual datasets and the absence of a unified pre-training framework with a single network architecture that integrates an understanding of spatial hierarchy and semantic granularity.

To address this, the company first used specialized models to generate a visual dataset called FLD-5B. It includes a total of 5.4 billion annotations for 126 million images, covering details from high-level descriptions to specific regions and objects. Then, using this data, it trained Florence-2, which uses a sequence-to-sequence architecture (a type of neural network designed for tasks involving sequential data) integrating an image encoder and a multimodal encoder-decoder. This allows the model to handle a variety of vision tasks without requiring task-specific architectural modifications.

“All annotations in the dataset, FLD-5B, are uniformly standardized into textual outputs, facilitating a unified multi-task learning approach with consistent optimization with the same loss function as the objective,” the researchers wrote in the paper describing the model. “The result is a versatile vision foundation model capable of performing a variety of tasks…all within a single model governed by a uniform set of parameters. Task activation is achieved through textual prompts, mirroring the approach used by large language models.”
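The paper describes adding location tokens to the tokenizer so that region-level annotations (boxes, regions, masks) can be emitted as plain text alongside captions. As a rough illustration only: the 1,000-bin coordinate quantization comes from the paper, but the exact serialization below is a simplified assumption, not the verified output format.

```python
# Illustrative only: serialize a box annotation into the kind of text sequence a
# unified text-to-text vision model can be trained to emit. The 1,000-bin
# coordinate quantization follows the Florence-2 paper; the token layout below
# is a simplified assumption.
def box_to_text(label, box, image_w, image_h, bins=1000):
    x1, y1, x2, y2 = box
    def quantize(value, size):
        return min(int(value / size * bins), bins - 1)
    locs = [quantize(x1, image_w), quantize(y1, image_h),
            quantize(x2, image_w), quantize(y2, image_h)]
    return label + "".join(f"<loc_{v}>" for v in locs)

# A 640x480 image with a car occupying roughly the right half:
print(box_to_text("car", (320, 120, 630, 460), 640, 480))
# -> car<loc_500><loc_250><loc_984><loc_958>
```

Because every annotation type reduces to text like this, detection, grounding, and captioning can all share the same decoder and the same loss.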

Better performance than larger models

When prompted with input images and text, Florence-2 handles a variety of tasks, including object detection, captioning, visual grounding, and visual question answering. More importantly, it delivers this at a quality equal to or better than many larger models.
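Switching between those tasks amounts to switching the task token prepended to the text input. The sketch below reuses the loading pattern from earlier; the prompt strings ("<OD>", "<CAPTION_TO_PHRASE_GROUNDING>", and so on) are taken from our reading of the model card and should be verified before relying on them.

```python
# Sketch: one set of weights, different task tokens (prompt names assumed from
# the Florence-2 model card; confirm against the card for your checkpoint).
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large"  # the 771M checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
image = Image.open("street_scene.jpg")  # placeholder path for any RGB image

def run(task_prompt: str, extra_text: str = "") -> dict:
    """Run one task by prepending its token to the (optional) text input."""
    inputs = processor(text=task_prompt + extra_text, images=image, return_tensors="pt")
    ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=512,
    )
    raw = processor.batch_decode(ids, skip_special_tokens=False)[0]
    return processor.post_process_generation(raw, task=task_prompt, image_size=image.size)

print(run("<CAPTION>"))   # short caption
print(run("<OD>"))        # object detection: labels plus bounding boxes
print(run("<CAPTION_TO_PHRASE_GROUNDING>", "a person crossing the street"))  # grounding
```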

For example, in a zero-shot captioning test on the COCO dataset, the 232M and 771M versions of Florence-2 both outperformed DeepMind’s 80B-parameter Flamingo visual language model, with scores of 133 and 135.6, respectively. They even did better than Microsoft’s own Kosmos-2, a model built specifically for visual grounding.

When fine-tuned with public human-annotated data, Florence-2, despite its compact size, was able to compete closely with several larger specialized models in tasks such as visual question answering.

“Florence-2’s pre-trained backbone improves performance on downstream tasks, e.g. COCO object detection and instance segmentation and ADE20K semantic segmentation, outperforming both supervised and self-supervised models,” the researchers note. “Compared to the pre-trained ImageNet models, ours improves the training efficiency by a factor of 4 and achieves significant improvements of 6.9, 5.5 and 5.9 points on the COCO and ADE20K datasets.”

Currently, both pretrained and fine-tuned versions of Florence-2 232M and 771M are available on Hugging Face under a permissive MIT license that allows unrestricted distribution and modification for commercial or personal use.

It will be interesting to see how developers use it and whether it removes the need for separate vision models for different tasks. Small, task-agnostic models can not only spare developers from having to juggle multiple models but also significantly reduce compute costs.
