Computers usually treat creating images and assessing them as two separate tasks. Although both belong to the field of computer vision, image generation and image recognition have traditionally been handled by different systems. Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have found a way to merge the two into a unified vision system.
Recently, the team trained a system to deduce the missing parts of an image, a task that requires studying the image on a deeper level. In finding the missing elements, the Masked Generative Encoder (MAGE) system hits two targets simultaneously: accurately identifying the images and generating new ones with an uncanny resemblance to reality.
The hybrid system supports various potential applications, such as object categorization and classification within images, creating images under specific instructions, and improving existing photos.
Unlike other systems, MAGE does not work on raw pixels. It converts images into ‘semantic tokens’, compact and abstract versions of sections of an image. Like words forming a sentence, these tokens together form an abstract version of the image that simplifies complex processing tasks while preserving the information in the original. Furthermore, the tokenization step can be trained within a self-supervised framework, which allows MAGE to pre-train on large image datasets without labels.
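To make the idea of tokens concrete, here is a minimal sketch of one common way to turn image patches into discrete tokens: match each patch to the closest entry in a codebook and keep only that entry's index. The 4x4 image, the three-entry codebook, and every function name below are hypothetical and chosen for illustration; a real tokenizer learns its codebook from data, and this is not MAGE's actual tokenizer.

```python
from typing import List, Sequence

Patch = List[float]


def split_into_patches(image: Sequence[Sequence[float]], size: int) -> List[Patch]:
    """Cut a 2D grayscale image into non-overlapping size x size patches (row-major)."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - size + 1, size):
        for left in range(0, width - size + 1, size):
            patches.append(
                [image[top + dy][left + dx] for dy in range(size) for dx in range(size)]
            )
    return patches


def nearest_code(patch: Patch, codebook: Sequence[Patch]) -> int:
    """Return the index of the codebook entry with the smallest squared distance."""
    def squared_distance(a: Patch, b: Patch) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: squared_distance(patch, codebook[i]))


def tokenize(image: Sequence[Sequence[float]], codebook: Sequence[Patch], size: int = 2) -> List[int]:
    """Replace every patch by the index of its nearest codebook entry."""
    return [nearest_code(patch, codebook) for patch in split_into_patches(image, size)]


# Hypothetical hand-written codebook for 2x2 patches (a real one would be learned).
TOY_CODEBOOK = [
    [0.0, 0.0, 0.0, 0.0],  # token 0: dark patch
    [1.0, 1.0, 1.0, 1.0],  # token 1: bright patch
    [0.0, 1.0, 0.0, 1.0],  # token 2: dark-to-bright vertical edge
]

TOY_IMAGE = [
    [0.1, 0.0, 0.9, 1.0],
    [0.0, 0.1, 1.0, 0.9],
    [0.0, 0.9, 0.1, 0.1],
    [0.1, 1.0, 0.0, 0.0],
]

tokens = tokenize(TOY_IMAGE, TOY_CODEBOOK)
print(tokens)  # [0, 1, 2, 0]
```

In this toy case, sixteen pixel values shrink to four token ids, which illustrates the sense in which tokens are compact and abstract. It also shows where information can be lost, since each patch is replaced by the nearest codebook entry rather than stored exactly.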
Taking this a step further, MAGE’s ‘masked token modeling’ randomly hides some of the tokens, producing an incomplete image, and trains a neural network to fill in the gaps. This training allows the system to learn both to understand the patterns in an image (image recognition) and to create new ones (image generation).
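The masking step can be sketched in a few lines. This is a simplified illustration that assumes tokens are plain integers; the placeholder id `MASK_ID` and the function names are hypothetical, and the neural network that would predict the hidden tokens is left out.

```python
import random
from typing import Dict, List, Tuple

MASK_ID = -1  # hypothetical placeholder id that marks a hidden token


def mask_tokens(
    tokens: List[int], mask_ratio: float, rng: random.Random
) -> Tuple[List[int], Dict[int, int]]:
    """Hide a random subset of tokens.

    Returns the masked sequence and a {position: original token} map, which is
    what a model would be trained to predict.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be between 0 and 1")
    num_masked = int(round(mask_ratio * len(tokens)))
    hidden_positions = rng.sample(range(len(tokens)), num_masked)
    masked = list(tokens)
    targets: Dict[int, int] = {}
    for position in hidden_positions:
        targets[position] = tokens[position]
        masked[position] = MASK_ID
    return masked, targets


def masked_accuracy(predictions: Dict[int, int], targets: Dict[int, int]) -> float:
    """Fraction of hidden positions whose predicted token matches the original."""
    if not targets:
        return 0.0
    correct = sum(1 for pos, token in targets.items() if predictions.get(pos) == token)
    return correct / len(targets)


rng = random.Random(0)  # fixed seed so the example is repeatable
image_tokens = [5, 3, 8, 1, 7, 2, 9, 4]
masked, targets = mask_tokens(image_tokens, mask_ratio=0.5, rng=rng)
print(masked)   # half of the entries are replaced by MASK_ID
print(targets)  # the tokens a model would have to recover
```

Only the hidden positions are scored, so the network cannot succeed by copying its input. It has to infer the missing content from the tokens that remain visible.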
According to Tianhong Li, a Ph.D. student at MIT, a CSAIL affiliate, and the study’s lead author, MAGE’s variable masking strategy during pre-training allows it to prepare for either task, image generation or recognition, within the same system.
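One way to picture a variable masking strategy is to draw a fresh masking ratio for every training example instead of fixing one value. The sketch below samples from a Gaussian truncated to a chosen range by simple rejection. The mean, spread, and bounds are illustrative assumptions, not figures from this article, and `sample_mask_ratio` is a hypothetical helper.

```python
import random
from typing import List


def sample_mask_ratio(
    rng: random.Random,
    mean: float = 0.55,  # assumed center of the distribution
    std: float = 0.25,   # assumed spread
    low: float = 0.5,    # assumed lower bound
    high: float = 1.0,   # assumed upper bound
) -> float:
    """Draw a masking ratio from a Gaussian, retrying until it falls in [low, high]."""
    while True:
        ratio = rng.gauss(mean, std)
        if low <= ratio <= high:
            return ratio


def sample_ratios(count: int, seed: int = 0) -> List[float]:
    """Draw one masking ratio per training example."""
    rng = random.Random(seed)
    return [sample_mask_ratio(rng) for _ in range(count)]


ratios = sample_ratios(5)
print([round(r, 2) for r in ratios])
```

Ratios near the top of the range hide almost every token, which resembles generating an image from very little. Lower ratios leave more of the image visible, which is closer to the setting in which a model is asked to recognize what it sees.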
“MAGE’s ability to work in the ‘token space’ rather than ‘pixel space’ results in clear, detailed, and high-quality image generation, as well as semantically rich image representations. This could hopefully pave the way for advanced and integrated computer vision models,” Li continues.
In addition to generating realistic images from scratch, MAGE enables conditional image generation. Users can specify criteria for the images they want, and the system produces an image accordingly. MAGE has set new records in image generation, performing better than previous models. It also achieved strong image recognition results on ImageNet: 80.9% accuracy in linear probing and 71.9% accuracy in a 10-shot setting, with only 10 labeled examples per class.
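Linear probing measures how useful a pre-trained model's features are: the encoder is frozen and only a simple linear classifier is trained on top of its outputs. The toy sketch below shows the shape of that procedure. The histogram "encoder", the perceptron-style update, and the six labeled samples are all hypothetical stand-ins, not MAGE's evaluation setup.

```python
from typing import List, Sequence, Tuple


def frozen_encoder(tokens: Sequence[int], vocab_size: int = 2) -> List[float]:
    """Hypothetical stand-in for a pre-trained encoder: a normalized token histogram.

    It has no trainable parameters, so it stays fixed while the probe is trained.
    """
    counts = [0.0] * vocab_size
    for token in tokens:
        counts[token] += 1.0
    return [count / len(tokens) for count in counts]


def predict(weights: Sequence[float], bias: float, features: Sequence[float]) -> int:
    """Linear classifier: class 1 if the weighted sum is positive, else class 0."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0


def train_linear_probe(
    features: Sequence[Sequence[float]], labels: Sequence[int], epochs: int = 10
) -> Tuple[List[float], float]:
    """Train only the linear layer with perceptron updates on fixed features."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            error = y - predict(weights, bias, x)  # -1, 0, or +1
            weights = [w + error * xi for w, xi in zip(weights, x)]
            bias += error
    return weights, bias


# Toy labeled data: class 0 is mostly token 0, class 1 is mostly token 1.
samples = [[0, 0, 0, 1], [0, 0, 1, 0], [0, 0, 0, 0], [1, 1, 1, 0], [1, 1, 1, 1], [1, 0, 1, 1]]
labels = [0, 0, 0, 1, 1, 1]

features = [frozen_encoder(sample) for sample in samples]  # encoder is never updated
weights, bias = train_linear_probe(features, labels)
accuracy = sum(predict(weights, bias, x) == y for x, y in zip(features, labels)) / len(labels)
print(accuracy)  # 1.0 on this linearly separable toy data
```

Because the classifier is so simple, a high linear-probing score suggests that the frozen features already separate the classes well, which is why the metric is read as a measure of representation quality.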
However, despite its dual purpose, MAGE still needs significant improvement, as converting images into tokens leads to some loss of information. MIT’s research team continues to explore ways to compress images while preserving important details. Speaking of MAGE, Huisheng Wang, a senior staff software engineer in Google’s Research and Machine Intelligence division, said, “This innovative system has wide-ranging applications and has the potential to inspire many future works in the field of computer vision.”