The research team investigates the potential of learning visual representations from synthetic images generated by text-to-image models. They are the first to demonstrate that models trained purely on synthetic images can outperform models trained on real photos at large scale.
Data is the new soil, and in this fertile ground scientists are planting more than just pixels. A team of scientists recently surpassed the results of standard "real-image" training methods by training machine learning models on synthetic images.
The critical component of the method is a system named StableRep, which creates synthetic images using popular text-to-image models such as Stable Diffusion. It is akin to building worlds out of words.
By treating multiple images generated from the same text prompt as positive pairs for one another, this method not only increases the diversity of the input but also teaches the vision system during training which images are alike and which are different. Significantly, on large datasets StableRep outperformed top-tier models trained on real photos, including SimCLR and CLIP.
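The idea of treating same-prompt images as positives can be sketched as a multi-positive contrastive loss. The snippet below is a minimal illustration under stated assumptions, not the team's implementation: the function names, the toy 2-D embeddings, and the temperature of 0.1 are all hypothetical, and plain Python lists stand in for a real encoder's outputs.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def multi_positive_contrastive_loss(embeddings, prompt_ids, temperature=0.1):
    """Average cross-entropy between a softmax over similarities and a
    uniform target over the images that share the anchor's prompt.

    embeddings: list of equal-length float vectors, one per image.
    prompt_ids: list of ints; equal ids mean "generated from the same prompt".
    temperature: assumed value, chosen only for illustration.
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        positives = [j for j in others if prompt_ids[j] == prompt_ids[i]]
        if not positives:
            continue  # an anchor with no sibling image contributes nothing
        logits = [sum(a * b for a, b in zip(z[i], z[j])) / temperature
                  for j in others]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        log_probs = {j: l - log_denom for j, l in zip(others, logits)}
        # Target distribution puts mass 1/len(positives) on each positive.
        total += -sum(log_probs[j] for j in positives) / len(positives)
        counted += 1
    if counted == 0:
        raise ValueError("no anchor has another image from the same prompt")
    return total / counted

# Toy batch: four images whose embeddings form two tight clusters.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
good = multi_positive_contrastive_loss(toy, [0, 0, 1, 1])  # ids match clusters
bad = multi_positive_contrastive_loss(toy, [0, 1, 0, 1])   # ids cut across them
```

In this toy batch the loss is lower when the prompt labels agree with the clusters (`good`) than when they cut across them (`bad`), which is the pressure that pulls same-prompt images together in embedding space.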
Gathering data has never been easy. In the 1990s, researchers had to photograph objects and faces by hand to assemble datasets. In the 2000s, they scoured the internet for data. But this raw, uncurated data often diverged from real-world scenarios and reflected societal biases, giving a skewed picture of reality. Cleaning datasets through human intervention is expensive and extremely difficult. Imagine, though, if this laborious data collection could be boiled down to something as simple as issuing a command in natural language.
Tuning the "guidance scale" in the generative model is vital to StableRep's success because it strikes a careful balance between the diversity and fidelity of the synthetic images. When the scale was tuned well, the synthetic images used to train these self-supervised models proved as effective as real images, if not more so.
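To make the guidance scale concrete, here is a minimal sketch that assumes the generator uses classifier-free guidance, the mechanism behind this setting in Stable Diffusion. At each denoising step the model produces one noise prediction without the prompt and one with it, and the scale controls how far the result is pushed toward the prompt-conditioned prediction. The function name and the toy vectors are hypothetical; real predictions are large tensors.

```python
def apply_guidance(noise_uncond, noise_cond, guidance_scale):
    """Classifier-free guidance: start from the unconditional prediction and
    move along the direction of the prompt-conditioned one.

    guidance_scale = 0 ignores the prompt, 1 uses the conditional prediction
    as-is, and larger values extrapolate past it, which generally trades
    sample diversity for closer adherence to the prompt.
    """
    return [u + guidance_scale * (c - u)
            for u, c in zip(noise_uncond, noise_cond)]

# Toy two-element "noise predictions" for illustration only.
uncond = [0.0, 1.0]
cond = [1.0, 1.0]
no_prompt = apply_guidance(uncond, cond, 0.0)
plain = apply_guidance(uncond, cond, 1.0)
pushed = apply_guidance(uncond, cond, 2.0)
```

The sketch shows only the arithmetic of the setting; which value gives the best representations is an empirical question that the team answered by tuning.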
The team went a step further by adding language supervision, producing an enhanced variant called StableRep+. Trained on 20 million synthetic images, StableRep+ not only achieved better accuracy but also held up remarkably well against CLIP models trained on 50 million real images.
The researchers are candid about several limitations: the slow pace of image generation, semantic mismatches between text prompts and the resulting images, the potential amplification of biases, and the difficulty of image attribution. All of these need to be addressed for future progress. Another limitation is that StableRep first requires real data to train the generative model.
The team acknowledges that starting with real data remains a necessity. The advantage is that once you have a good generative model, you can repurpose it for new tasks, such as training recognition models and visual representations.
Image source: Unsplash