Generative AI has been gaining significant attention across society. As a result, many people have become interested in related resources and seek to uncover the background and the reasons behind its impressive performance. The goal of AI-generated content (AIGC) is to make the content creation process more efficient and accessible, allowing high-quality content to be produced at a faster pace. AIGC works by extracting and understanding intent information from instructions provided by humans and then generating content according to that intent and the model's knowledge.

In recent years, large-scale models have become increasingly important in AIGC, as they provide better intent extraction and thus improved generation results. Furthermore, as data and model size grow, the distribution the model can learn becomes more comprehensive and closer to reality, leading to more realistic and higher-quality content generation.

Although the generative AI sphere is currently experiencing a giant boom, it is not a recent phenomenon. Generative AI models have been around for quite some time, even if they were not widely known.

The evolution 

Generative models have a long history in artificial intelligence, dating back to the 1950s with the development of hidden Markov models (HMMs) and Gaussian mixture models (GMMs). These models generated sequential data such as speech and time series. With the advent of deep learning, generative models saw significant improvements in performance.

In natural language processing (NLP), a traditional way to generate sentences is to learn the word distribution with N-gram language modelling and then search for the best sequence. However, this approach cannot effectively model long sentences. The issue was tackled by introducing recurrent neural networks (RNNs) for language modelling tasks.
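To make the contrast concrete, here is a minimal sketch (not from the study) of a bigram language model: it learns word-pair counts from a toy corpus and then greedily searches for a likely continuation, which shows why such models only capture very local context.

```python
from collections import defaultdict, Counter

# Toy corpus; in practice the counts would come from a large text collection.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Learn the word distribution: how often each word follows the previous one.
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def generate(start, length=5):
    """Greedy search: pick the most frequent next word at every step."""
    words = [start]
    for _ in range(length):
        followers = bigram_counts.get(words[-1])
        if not followers:
            break
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)

print(generate("the"))  # only one word of context, so long-range coherence is lost
```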

The advent of generative AI models in various domains has followed different paths, but eventually an intersection emerged: the transformer architecture. Introduced by Vaswani et al. for NLP tasks in 2017, the Transformer was later applied in computer vision and subsequently became the dominant backbone for many generative models across domains.

Figure: History of generative AI in CV, NLP and VL.

Foundation models

Proposed to solve the limitations of traditional models such as RNNs, the Transformer is the backbone architecture for many state-of-the-art models, such as GPT-3, DALL-E-2, Codex, and Gopher. Since the introduction of the Transformer architecture, pre-trained language models have become the dominant choice in NLP thanks to their parallelism and learning capabilities. These transformer-based pre-trained language models are commonly classified into two types based on their training tasks, both sketched below:

  • Autoregressive language modelling
  • Masked language modelling
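As a rough illustration of these two training-task families, the sketch below uses the Hugging Face transformers library with small public checkpoints (GPT-2 and BERT); these are illustrative stand-ins, not the models analysed in the study.

```python
from transformers import pipeline

# Autoregressive language modelling (GPT-style): predict the next token left to right.
generator = pipeline("text-generation", model="gpt2")
print(generator("Generative AI makes content creation", max_new_tokens=15)[0]["generated_text"])

# Masked language modelling (BERT-style): predict tokens hidden inside the sentence.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
print(unmasker("Generative AI makes content creation more [MASK].")[0]["sequence"])
```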

Furthermore, despite being trained on large-scale data, AIGC models may not always produce output that aligns with the user's intent. To address this issue, reinforcement learning from human feedback (RLHF) has been applied to fine-tune models in various applications.
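At the core of RLHF is a reward model trained on human preference pairs, which is then used to guide fine-tuning of the generator (for example with PPO). Below is a minimal, purely illustrative PyTorch sketch of the reward-modelling step, using random placeholder embeddings instead of real model outputs.

```python
import torch
import torch.nn as nn

# Toy reward model: maps a response embedding to a scalar reward score.
reward_model = nn.Linear(16, 1)
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

for _ in range(100):
    preferred = torch.randn(8, 16)  # placeholder embeddings of human-preferred responses
    rejected = torch.randn(8, 16)   # placeholder embeddings of rejected responses
    # Pairwise ranking loss: push the preferred response to score higher.
    loss = -nn.functional.logsigmoid(
        reward_model(preferred) - reward_model(rejected)
    ).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The trained reward model would then steer fine-tuning of the generative model.
```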

Advances in computing, including enhanced hardware, distributed training, and cloud computing, have also contributed to the rise of foundation models.

Unimodal and multimodal 

Generally, generative AI (GAI) models can be categorized into unimodal models and multimodal models. Unimodal models receive instructions from the same modality as the generated content, whereas multimodal models accept cross-modal instructions and produce results in a different modality.

Generative language models (GLMs) are unimodal models trained to generate readable human language based on the patterns and structures in the input data they have been exposed to. They can be used for a wide range of NLP tasks, such as dialogue systems, translation and question answering, and include both decoder-only models and encoder-decoder models. Vision generative models are another kind of unimodal model.
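For a concrete sense of the encoder-decoder family, the sketch below runs a small public T5 checkpoint through the Hugging Face transformers pipeline for translation; the decoder-only case looks like the GPT-2 text-generation sketch earlier. The checkpoint is an illustrative stand-in, not a model singled out by the study.

```python
from transformers import pipeline

# Encoder-decoder GLM (T5): the encoder reads the input sequence and the decoder
# generates the output sequence, which suits tasks such as translation.
translator = pipeline("translation_en_to_de", model="t5-small")
result = translator("Generative AI creates content from human instructions.")
print(result[0]["translation_text"])
```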

Multimodal generation is comparatively harder to learn than unimodal generation. State-of-the-art multimodal models in vision-language generation, text-audio generation, text-graph generation and text-code generation have helped tackle this challenge.

These architectures find application in chatbots, AI art generation, music generation, AI-based programming systems, and education.

Further ahead 

Deep generative AI models based on neural networks have dominated the field of machine learning for the past decade. This trend is also seen in natural language understanding, where models like BERT and GPT-3 have been developed with very large numbers of parameters. However, the growing model footprint and complexity, as well as the cost and resources required for training and deployment, pose challenges for practical use in the real world.

Prompt learning allows language models pre-trained on large amounts of raw text to be adapted to new tasks and domains without retraining them, simply by reformulating the task as a prompt. Similarly, model compression is an effective approach to reducing model size and improving computational efficiency.
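As one concrete, purely illustrative example of model compression, post-training dynamic quantization in PyTorch converts linear-layer weights to 8-bit integers; the tiny placeholder network below stands in for a real language model.

```python
import torch
import torch.nn as nn

# Placeholder network standing in for a much larger language model.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Dynamic quantization: store Linear weights as int8, reducing memory use
# and often speeding up CPU inference at a small cost in accuracy.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

example = torch.randn(1, 512)
print(quantized(example).shape)  # torch.Size([1, 10])
```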

While AIGC has the potential to be incredibly useful in many different applications, it also raises significant concerns about security and privacy. These issues could be addressed through specialization and generalization in the choice of foundation models, the expansion of human knowledge through continual learning and reasoning, scaling up, and, finally, directly tackling social concerns.

Sources of Article

Images and content source: Study by Cornell University
