Augmented Reality (AR) connects the physical and digital worlds, supplementing or extending the former rather than replacing it as Virtual Reality (VR) does. Images and text, the two most common kinds of AR display content, have traditionally required a human to create. However, thanks to rapid advances in Artificial Intelligence (AI), media content can now be generated automatically by software. The ever-improving quality of AI-generated content (AIGC) has opened up new usage scenarios, and such content is expected to find its way into AR.

There are three main types of AR display: Spatial Augmented Reality (SAR), Head-Mounted Displays (HMD), and Hand-Held Displays (HHD). All of them offer enhanced contextual information and immersion while keeping the user's focus on the physical world. Meanwhile, recent advances in AI have improved AIGC to the point where the boundary between human-made and machine-generated content is often hard to discern.

OpenAI's GPT-3 can automatically generate text, translate between languages, and answer text-based questions. For image generation, Stable Diffusion has gained popularity because it is faster than earlier models such as Disco Diffusion and DALL·E 2, cutting generation time to an average of a few dozen seconds. Thanks to this usability and flexibility, AI-based generative models are expected to become more prevalent across applications ranging from automatic code generation to digital art creation.
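As an illustration of how such a model can be invoked, the minimal sketch below uses the Hugging Face `diffusers` library to run a Stable Diffusion 2 checkpoint. The model identifier, prompt, and step count are assumptions for the example and are not taken from the study.

```python
# Minimal sketch: generating an image with Stable Diffusion 2 via `diffusers`.
# The checkpoint name and prompt are illustrative assumptions.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",  # assumed public checkpoint
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")  # on a GPU, generation takes on the order of seconds

image = pipe("a corgi wearing a space helmet", num_inference_steps=30).images[0]
image.save("generated.png")
```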

In a study on "Exploring the Design Space of Employing AI-Generated Content for Augmented Reality Display", researchers at the University of New South Wales, Australia, deployed generative AI models in the AR display system and then conducted an empirical study based on this prototype, for summarizing and exploring the design space and potential applications.  

Implementing the prototype 

To give subsequent interviewees a more intuitive experience and to gather more design references, the researchers developed a preliminary prototype system called "GenerativeAIR", which can be seen as one instance of AIGC+AR to be explored. The system comprises both software and hardware: it generates media content from text descriptions and outputs the resulting text and images on different AR devices.

Given the popularity of smartphones, the researchers first capture the user's voice with the phone's built-in microphone. The recorded speech is then converted to text using Google's speech-to-text application programming interface (API). For the AI generation part, two text-input models are used through their interfaces: ChatGPT (GPT-3) for text-to-text generation and Stable Diffusion 2 for text-to-image generation.
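To make this pipeline concrete, here is a hedged Python sketch of its first two stages: speech-to-text via Google's Cloud Speech API and text-to-text generation with a GPT-3 completion model. The client libraries, model name, and function boundaries are illustrative assumptions, since the paper does not publish the prototype's code; the image branch would call Stable Diffusion 2 as sketched earlier.

```python
# Illustrative sketch of the speech-to-text and text-to-text stages;
# not the authors' implementation.
import openai                    # legacy openai<1.0 client, used as a stand-in
from google.cloud import speech  # Google Cloud Speech-to-Text API

openai.api_key = "YOUR_API_KEY"  # placeholder credential

def transcribe(audio_bytes: bytes) -> str:
    """Convert a phone-microphone recording to text with Google's API."""
    client = speech.SpeechClient()
    audio = speech.RecognitionAudio(content=audio_bytes)
    config = speech.RecognitionConfig(language_code="en-US")
    response = client.recognize(config=config, audio=audio)
    return " ".join(r.alternatives[0].transcript for r in response.results)

def generate_text(prompt: str) -> str:
    """Text-to-text generation with a GPT-3 completion model (assumed name)."""
    completion = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=128,
    )
    return completion.choices[0].text.strip()
```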

Image taken from the research paper (hosted on arXiv by Cornell University).

The user speaks into the microphone, and the transcribed text is fed into the two AI models to generate corresponding text and images (as shown in Figure 1 (a)). The generated media content is then transmitted over the network to different AR devices and displayed (Figure 1 (b) to (d)). To evaluate the prototype, the team conducted three focus groups with ten participants in total (G1=3; G2=3; G3=4); each session lasted approximately 80 minutes and consisted of five steps.
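The paper does not detail the transport used to push content to the AR devices; the sketch below assumes a simple WebSocket broadcast (via the third-party `websockets` package) with a JSON message schema invented for illustration.

```python
# Assumed sketch of pushing generated media to connected AR displays.
import asyncio
import base64
import json

import websockets  # third-party package: pip install websockets

CLIENTS = set()  # currently connected AR displays (HMD, HHD, SAR driver)

async def handler(ws, path=None):
    """Register an AR client and keep its connection open."""
    CLIENTS.add(ws)
    try:
        await ws.wait_closed()
    finally:
        CLIENTS.discard(ws)

def push_to_displays(text: str, image_png: bytes) -> None:
    """Broadcast the generated text and image to every connected device."""
    message = json.dumps({
        "text": text,
        "image_png_b64": base64.b64encode(image_png).decode("ascii"),
    })
    websockets.broadcast(CLIENTS, message)  # fire-and-forget send to all clients

async def main():
    async with websockets.serve(handler, "0.0.0.0", 8765):
        await asyncio.Future()  # serve until cancelled

if __name__ == "__main__":
    asyncio.run(main())
```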

Analyzing the effect 

According to the researchers, comparing AR, AIGC, and related technologies stimulates ideas for application scenarios, and further exploring their technical features and details can improve both the interactive experience and system performance.

By clustering and merging, the researchers condensed the many factors gathered from the focus group interviews into three overarching categories: "user", "function", and "environment". These categories are widely recognized as fundamental considerations in the design of interactive systems.

The AIGC+AR system is envisioned to offer diverse functions for multiple purposes. Particular attention is paid to the AI-related software in this design phase, since the AI algorithms are crucial to realizing these functions. In such a system, the feedback the environment supplies to the user depends mainly on how the AR display presents content.

Generative AI models can enable real-time creative media generation. For example, AR displays can support lifelogging by supplementing captured objects with relevant information. In addition, GenerativeAIR could address privacy and privilege classification issues in multi-user scenarios by assigning hierarchical display content based on user permissions and privacy levels via ID authentication.
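The paper only outlines this idea; the short sketch below is an assumed illustration of how hierarchical display content could be filtered by an authenticated user's permission level. The roles, levels, and layer names are all hypothetical.

```python
# Hypothetical sketch of permission-aware content assignment for multi-user AR.
from dataclasses import dataclass

# Higher numbers mean broader access; the roles and levels are assumptions.
ROLE_LEVEL = {"guest": 0, "member": 1, "owner": 2}

@dataclass
class ContentLayer:
    text: str
    min_level: int  # minimum permission level required to see this layer

def visible_layers(layers, role: str):
    """Return only the layers the authenticated role is allowed to see."""
    level = ROLE_LEVEL.get(role, 0)
    return [layer for layer in layers if level >= layer.min_level]

layers = [
    ContentLayer("Public caption generated for the scene", min_level=0),
    ContentLayer("Private notes attached by the owner", min_level=2),
]
print([l.text for l in visible_layers(layers, "guest")])  # public layer only
print([l.text for l in visible_layers(layers, "owner")])  # both layers
```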

The current prototype runs offline and lacks real-time interaction, which hinders its practical application. Future work will focus on implementing real-time functionality and integrating additional software and hardware to enrich the system's functions. The researchers have also not yet addressed the hierarchy of privacy and permissions in multi-user scenarios, a critical issue for collaborative and shared settings.

 

Sources of Article

Research paper published on arXiv (Cornell University).

Banner Image: Unsplash
