Most of you might have driven your car at 100 km/h on an expressway; not a big deal, right? But would you dare to do the same on hilly terrain? Certainly not, unless you are Dominic Toretto from the famous “Fast & Furious” film series. 

Now, let me ask you: why can’t one drive at such high speed on hilly terrain? Simple, because it’s unsafe. You don’t need to experience an accident to understand this. We humans have a unique power called ‘imagination’ that allows us to consider numerous courses of action and their outcomes without putting ourselves through real danger. This ability lets us learn about potential sources of threat without exposing ourselves or others to the risks that come with them. 

What if we can give the same ability to AI systems? 

That’s exactly what DeepMind has done with this new technique. 

Let’s cut to the chase 

Back in December 2019, DeepMind researchers introduced ReQueST, short for Reward Query Synthesis via Trajectory optimisation: an algorithm that lets AI systems learn objectives from human feedback on hypothetical behaviours. The DeepMind Safety Research team describes three main components of ReQueST in a blog post: 

  • A neural environment simulator — a dynamics model learned from trajectories generated by humans exploring the environment safely. In their work, this is a pixel-based dynamics model. 
  • A reward model learned from human feedback on videos of (hypothetical) behaviour in the learned simulator. 
  • Trajectory optimisation, which chooses the hypothetical behaviours to ask the human about, so that the reward model learns what is safe and what is not (in addition to other aspects of the task) as quickly as possible. A rough sketch of how these pieces could fit together follows below. 
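
To make the pipeline concrete, here is a minimal sketch in Python/PyTorch of how the three components might plug together. This is not DeepMind's implementation; the class names, network sizes and the query-synthesis objective below are illustrative assumptions.

```python
# Illustrative sketch only (not DeepMind's code): a learned dynamics model,
# a reward model trained from human feedback, and gradient-based trajectory
# optimisation in the learned simulator to synthesise hypothetical behaviours.
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Neural environment simulator: predicts the next observation."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, obs_dim),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

class RewardModel(nn.Module):
    """Reward model: scores observations; trained from human labels."""
    def __init__(self, obs_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs):
        return self.net(obs).squeeze(-1)

def synthesize_query(dynamics, reward, obs0, act_dim, horizon=10, steps=50):
    """Trajectory optimisation: search, entirely inside the learned simulator,
    for an action sequence whose rollout is informative to show a human."""
    actions = torch.zeros(horizon, act_dim, requires_grad=True)
    opt = torch.optim.Adam([actions], lr=0.1)
    for _ in range(steps):
        obs, total_reward = obs0, 0.0
        for t in range(horizon):
            obs = dynamics(obs, actions[t])
            total_reward = total_reward + reward(obs)
        # Here we simply maximise predicted reward; a fuller system would also
        # synthesise rollouts the reward model scores as unsafe or uncertain.
        loss = -total_reward
        opt.zero_grad()
        loss.backward()
        opt.step()
    return actions.detach()
```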

 

Image Credit: DeepMind 

Simply put, the model is presented with hypothetical situations, say, a car speeding up on a foggy morning. The human feedback on such behaviour would be negative, so the trained agent ends up behaving the way humans expect it to, avoiding the behaviours humans have indicated as unsafe.  
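
For illustration only, the feedback loop described above could look roughly like the following, reusing the hypothetical DynamicsModel, RewardModel and synthesize_query pieces from the sketch earlier; ask_human is a stand-in for whatever interface shows the rendered rollout to a person and collects a safe/unsafe score.

```python
# Illustrative sketch of one round of the human-feedback loop. No real
# environment is touched: the rollout happens in the learned simulator.
import torch

def feedback_round(dynamics, reward_model, obs0, act_dim, ask_human):
    # 1. Synthesise a hypothetical behaviour in the learned simulator.
    actions = synthesize_query(dynamics, reward_model, obs0, act_dim)

    # 2. Roll it out in the simulator to produce a hypothetical trajectory.
    obs, rollout = obs0, []
    for act in actions:
        obs = dynamics(obs, act)
        rollout.append(obs.detach())

    # 3. A human watches the rendered rollout and returns a score,
    #    e.g. -1.0 for "unsafe" (speeding in fog) and +1.0 for "fine".
    label = ask_human(rollout)

    # 4. Update the reward model so its predictions match the human's label.
    opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)
    pred = torch.stack([reward_model(o) for o in rollout]).mean()
    loss = (pred - label) ** 2
    opt.zero_grad()
    loss.backward()
    opt.step()
```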

However, there was a limitation to this model: all three components had only been tried and tested in 2D environments. The question, then, was whether the approach could scale to a complex 3D environment.

Now what? 

The success of deep reinforcement learning on a task typically hinges on the availability of a procedural reward function and a hand-built simulated environment, which keeps the research insulated from real-world challenges such as learning safe behaviour. Online reinforcement learning discovers what is unsafe only through first-hand experience, which may be fine in a simulated environment but can be extremely dangerous in the real world. 

To this end, the researchers took a step further and tested their algorithm ReQueST in a complex 3D environment. The team ran several tasks to demonstrate that ReQueST can be used to train an agent in a 3D environment with an order of magnitude fewer instances of unsafe behaviour than reinforcement learning generally requires. One can access the research paper here. The results were exciting for the team for two main reasons (as stated by the team itself): 

  • First, they demonstrate that ReQueST is plausibly a general-purpose solution to (one version of) the safe exploration problem. 
  • Second, the results hint at a future where their ability to verify agent behaviour before deployment is not bottlenecked by the availability of a simulator – the simulator can be learned from data. 

Finally, “ReQueST assumes that unsafe states (or a superset of states surrounding unsafe states) can be recognised by humans. This assumption is likely reasonable for tasks of current practical interest. However, it may be too strong an assumption looking to the future, particularly for tampering: future AI systems may find ways to tamper that are pernicious exactly because they cannot be recognised by humans. To guard against these kinds of issues, we will need further progress on scalable oversight techniques,” the team concluded while discussing whether ReQueST can be considered a realistic, scalable approach to safe exploration.  

Simply put, training RL agents in the presence of unsafe conditions is precisely the safe exploration problem, and the field requires further research to build robust AI systems in the future. 

Sources of Article

Image from Canva
