Model-based reinforcement learning (MBRL) algorithms do not typically consider environments – subject to multiple dynamics modes – where it is beneficial to avoid inoperable or undesirable dynamics modes. We present an MBRL algorithm that avoids entering such inoperable or undesirable dynamics modes, by constraining the controlled system to remain in a single dynamics mode with high probability. This is a particularly difficult problem because the mode constraint is unknown a priori. We propose to jointly infer the mode constraint, along with the underlying dynamics modes. Importantly, our method infers latent structure that our planning scheme leverages to 1) enforce the mode constraint with high probability, and to 2) target exploration where the mode constraint’s epistemic uncertainty is high. We validate our method by showing that it can navigate a simulated quadcopter – subject to a turbulent dynamics mode – to a target state, whilst remaining in the desired dynamics mode with high probability.

Publication

26th International Conference on Artificial Intelligence and Statistics

Here we show visualisations of our experiments in the simulated quadcopter navigation problem.

Experiment | Description |

The greedy exploitation strategy cannot solve our δ-mode-constrained navigation problem because it leaves the desired dynamics mode. | |

Adding the δ-mode-constraint to the greedy exploitation strategy still does not solve our δ-mode-constrained navigation problem, because the optimisation gets stuck at a local optimum induced by the constraint. | |

Our strategy successfully solves our δ-mode-constrained navigation problem by augmenting the greedy exploitation objective with an intrinsic motivation term. This term uses the epistemic uncertainty associated with the learned mode constraint to escape local optima induced by the constraint. | |

Here we show the importance of using only the epistemic uncertainty for exploration. This experiment augmented the greedy objective with the entropy of the mode indicator variable. The resulting strategy cannot escape the local optimum induced by the mode constraint because the mode indicator variable's entropy is always high at the mode boundary. This motivated formulating a dynamics model that can disentangle the sources of uncertainty in the mode constraint. | |

Finally, we motivate why our intrinsic motivation term considers the joint entropy over a trajectory, instead of summing the entropy at each time step (as is often seen in the literature). This experiment formulated the intrinsic motivation term as the sum of the gating function entropy at each time step. That is, it assumed that the time steps are independent and did not consider the information gain over an entire trajectory, i.e. the exploration is myopic (shortsighted). |
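
The last two comparisons can be illustrated numerically. The following is a minimal sketch under stated assumptions, not the implementation used in our experiments: it assumes a Gaussian belief over a scalar gating function, a probit link from the gating function to the mode indicator variable, and an RBF kernel over 1-D states. All function names (`mode_probability`, `gaussian_entropy`, `rbf_kernel`) and numerical values are hypothetical and chosen only for illustration.

```python
import math

import numpy as np


def bernoulli_entropy(p):
    """Entropy (in nats) of a Bernoulli(p) mode indicator variable."""
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def mode_probability(mean, var):
    """Pr(mode = 1) for h ~ N(mean, var) under an ASSUMED probit link."""
    return 0.5 * (1.0 + math.erf(mean / math.sqrt(2.0 * (1.0 + var))))


def gaussian_entropy(cov):
    """Differential entropy (in nats) of a Gaussian with covariance `cov`."""
    cov = np.atleast_2d(np.asarray(cov, dtype=float))
    dim = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (dim * math.log(2.0 * math.pi * math.e) + logdet)


def rbf_kernel(x, lengthscale=1.0, variance=1.0):
    """ASSUMED RBF covariance of the gating function at 1-D states `x`."""
    sq_dist = (x[:, None] - x[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)


# 1) At the mode boundary (gating mean = 0) the mode indicator's entropy stays
#    at its maximum however certain we are about the gating function, whereas
#    the gating function's own entropy shrinks with its (epistemic) variance.
for var in (1.0, 0.01):
    indicator = bernoulli_entropy(mode_probability(0.0, var))
    gating = gaussian_entropy([[var]])
    print(f"var={var}: indicator entropy={indicator:.3f}, gating entropy={gating:.3f}")

# 2) Joint entropy over a trajectory versus the sum of per-step entropies.
jitter = 1e-6 * np.eye(5)
cov_bunched = rbf_kernel(np.linspace(0.0, 0.5, 5)) + jitter  # states close together
cov_spread = rbf_kernel(np.linspace(0.0, 10.0, 5)) + jitter  # states far apart

joint_bunched = gaussian_entropy(cov_bunched)
joint_spread = gaussian_entropy(cov_spread)
summed_bunched = sum(gaussian_entropy(v) for v in np.diag(cov_bunched))
summed_spread = sum(gaussian_entropy(v) for v in np.diag(cov_spread))

print(f"bunched: joint={joint_bunched:.3f}, summed={summed_bunched:.3f}")
print(f"spread:  joint={joint_spread:.3f}, summed={summed_spread:.3f}")
```

In this toy setting the summed entropy is identical for the bunched and the spread trajectory, because it only sees the marginal variances. The joint entropy accounts for the correlation between nearby states, so it is lower for the bunched trajectory and therefore favours trajectories that gather information across the whole horizon.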