Researchers at MIT CSAIL pre-trained a computer vision model for image classification using a large collection of simple, uncurated, synthetic image-generating programmes.

Before a machine-learning model can perform a task, such as spotting cancer in medical images, it must be trained. Training an image classification model typically involves exposing it to an extensive dataset containing millions of example pictures.

However, using real image data raises practical and ethical concerns: the photos may breach copyright rules, invade individuals' privacy, or be biased towards a particular racial or ethnic group. To avoid these issues, researchers can use image-generating programmes to produce synthetic data for model training. However, such approaches have been limited because designing image-generating software that produces useful training data requires specialised expertise.

Objective

Researchers from MIT and other institutions adopted an alternative strategy. Instead of designing a specialised image-generating programme for a specific training task, they compiled a dataset of 21,000 freely accessible programmes from the internet. They then trained a computer vision model using this extensive collection of basic image-generating routines.

These programmes generate a wide variety of images with simple colours and textures. The researchers did not edit or curate the programmes, which consisted of just a few lines of code each.
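
For a sense of what such a programme might look like, here is a minimal sketch in Python. It is an illustrative stand-in rather than one of the actual 21,000 programmes: it simply fills a canvas with sinusoidal colour patterns drawn from random parameters.

import numpy as np
from PIL import Image

def generate_image(seed, size=256):
    # Random frequencies and per-channel phases drive the pattern.
    rng = np.random.default_rng(seed)
    y, x = np.mgrid[0:size, 0:size] / size
    fx, fy = rng.uniform(1, 20, size=2)
    phases = rng.uniform(0, 2 * np.pi, size=3)
    # One sinusoid per colour channel yields simple colours and textures.
    channels = [np.sin(2 * np.pi * (fx * x + fy * y) + p) for p in phases]
    img = (np.stack(channels, axis=-1) + 1) / 2  # rescale [-1, 1] to [0, 1]
    return Image.fromarray((img * 255).astype(np.uint8))

generate_image(seed=0).save("sample.png")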

AI model

Models trained with this vast dataset of programmes classified images more accurately than other synthetically trained models. Moreover, although their models were still less accurate than models trained with real data, the researchers demonstrated that increasing the number of image programmes in the dataset improved model performance, suggesting a route to greater accuracy. Typically, machine-learning models are pre-trained: they are first trained on one dataset to build parameters that can then be applied to a different task. For example, a model for classifying X-rays might be pre-trained on a massive dataset of synthetically generated images before being trained on a much smaller dataset of real X-rays.
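
As a rough illustration of this pre-train-then-fine-tune pattern, the sketch below uses torchvision's ResNet-18 as a stand-in backbone; the architecture, class counts, and learning rate are assumptions made for illustration, not details taken from the research.

import torch
import torch.nn as nn
from torchvision.models import resnet18

# Step 1: pre-train on a large dataset (here, the synthetic images) so the
# backbone learns general-purpose visual features. Hypothetical setup.
model = resnet18(num_classes=1000)
# ... pre-training loop over the synthetic images would run here ...

# Step 2: fine-tune on the much smaller real dataset (e.g. X-rays with two
# classes): keep the pre-trained features, replace only the classifier head.
model.fc = nn.Linear(model.fc.in_features, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# ... fine-tuning loop over the real X-ray images would run here ...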

Previously, these researchers had demonstrated that a handful of image-generating programmes could produce synthetic data for model pre-training. However, those programmes had to be meticulously crafted so that the synthetic images matched specific characteristics of real photos, which made the process difficult to scale. In the new work, they instead used a massive dataset of uncurated image-generating programmes, beginning by collecting 21,000 such programmes from the internet. All are written in a simple programming language and consist of only a few lines of code, so they can produce images quickly.

Because these simple programmes execute so quickly, the researchers did not need to render the images in advance to train the model. Instead, they discovered they could generate images and train the model concurrently, which streamlined the process. They used their enormous dataset of image-generating programmes to pre-train computer vision models for both supervised and unsupervised image classification tasks. In supervised learning, the image data are labelled, whereas in unsupervised learning, the model learns to categorise images without labels. A sketch of this concurrent setup follows.
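
The sketch below shows one way generating images on the fly during training might look in Python, assuming a streaming dataset in which each programme's index doubles as a class label; both the stand-in generator and the labelling scheme are assumptions, as the article does not describe the actual pipeline.

import torch
from torch.utils.data import IterableDataset, DataLoader

class ProceduralImages(IterableDataset):
    """Yields (image, label) pairs generated on the fly, never stored on disk."""

    def __init__(self, num_programs=21000, size=64):
        self.num_programs, self.size = num_programs, size

    def __iter__(self):
        while True:
            # Pick a programme at random; its index serves as the label.
            label = int(torch.randint(self.num_programs, ()))
            g = torch.Generator().manual_seed(label)
            # Stand-in for executing the programme: a random simple texture.
            img = torch.rand(3, self.size, self.size, generator=g)
            yield img, label

loader = DataLoader(ProceduralImages(), batch_size=32)
for images, labels in loader:
    # ... one optimisation step of the vision model would run here ...
    break  # a single batch is enough for this sketch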

Conclusion

Their models were more accurate, placing images into the correct categories more often than state-of-the-art computer vision models pre-trained with other synthetic data. Although models trained on synthetic data remained less accurate than those trained on real data, their approach narrowed the performance gap by 38%. To identify the characteristics that affect model accuracy, the researchers pre-trained a model with each image-generating programme individually. They found that a model performs better when a programme produces a more diverse set of images; vibrant images with scenes filling the entire canvas improved model performance the most.

Having demonstrated the effectiveness of their pre-training strategy, the researchers now aim to apply the method to other forms of data, such as multimodal data combining text and images. They also intend to continue searching for ways to improve image classification performance.
