Model-based value expansion

Model-based value expansion (MVE) was first proposed by [cite/t:@feinbergModelBasedValueEstimation2018]. They expand the critic's target value by rolling out a learned deterministic dynamics model for a fixed horizon and bootstrapping with the critic at the final imagined state.
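
For concreteness, here is a sketch of the H-step MVE critic target in my own notation (the paper's exact formulation differs in some details): roll the learned reward model \(\hat{r}\) and deterministic dynamics \(\hat{f}\) forward from \((s_t, a_t)\) under the policy \(\pi\), then bootstrap with the critic at the final imagined state,

\[
\hat{Q}^{\text{MVE}}_{H}(s_t, a_t) = \sum_{k=0}^{H-1} \gamma^{k}\, \hat{r}(\hat{s}_{t+k}, \hat{a}_{t+k}) + \gamma^{H}\, Q\big(\hat{s}_{t+H}, \pi(\hat{s}_{t+H})\big),
\]

where \(\hat{s}_t = s_t\), \(\hat{a}_t = a_t\), \(\hat{a}_{t+k} = \pi(\hat{s}_{t+k})\) for \(k \geq 1\), and \(\hat{s}_{t+k+1} = \hat{f}(\hat{s}_{t+k}, \hat{a}_{t+k})\).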

Stochastic ensemble value expansion (STEVE) [cite:@buckmanSampleEfficientReinforcementLearning2018] extends MVE by learning the dynamics model as an ensemble of probabilistic neural networks. It combines value expansions of different horizon lengths by weighting each according to its inverse variance, estimated from the ensemble, so that horizons the model is unsure about are down-weighted.
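
A minimal sketch of the inverse-variance weighting, assuming we have already computed a candidate target for each horizon with each ensemble member (the array name =targets= and the numbers are made up for illustration; this is not STEVE's actual implementation, which also ensembles the reward and Q-functions):

#+begin_src python
import jax.numpy as jnp

# targets[h, e]: candidate value target for horizon h from ensemble member e.
targets = jnp.array([
    [1.00, 1.02, 0.98, 1.01],  # horizon 0 (pure model-free target)
    [1.10, 1.25, 0.95, 1.15],  # horizon 1
    [1.20, 1.60, 0.80, 1.30],  # horizon 2
])

means = targets.mean(axis=1)             # mean target per horizon
variances = targets.var(axis=1) + 1e-8   # ensemble variance per horizon

# Horizons the ensemble disagrees on receive lower weight.
weights = (1.0 / variances) / jnp.sum(1.0 / variances)
steve_target = jnp.sum(weights * means)
#+end_src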

Stochastic value gradient (SVG) [cite:@amosModelBasedStochasticValue2021] uses a deterministic dynamics model to expand the policy (actor) objective, but not the critic's target. It builds on SAC, relying on the entropy term for exploration.
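
Below is a minimal JAX sketch of what expanding the actor objective through a deterministic model looks like: the =policy=, =dynamics=, =reward=, and =critic= functions are dummy stand-ins, the SAC entropy term and the stochastic policy reparameterisation are omitted, and this is not the paper's implementation.

#+begin_src python
import jax
import jax.numpy as jnp

state_dim, action_dim, horizon, gamma = 3, 2, 5, 0.99

def policy(params, s):
    # Toy deterministic policy head (SVG uses a reparameterised stochastic policy).
    return jnp.tanh(params["W"] @ s + params["b"])

def dynamics(s, a):
    # Stand-in for the learned deterministic dynamics model s' = f(s, a).
    return s + 0.1 * jnp.concatenate([a, jnp.zeros(state_dim - action_dim)])

def reward(s, a):
    return -jnp.sum(s**2) - 0.01 * jnp.sum(a**2)

def critic(s, a):
    return -jnp.sum(s**2)  # placeholder Q-function

def actor_objective(params, s0):
    # H-step model expansion of the actor objective: model-predicted rewards
    # plus a bootstrapped critic value, differentiated through the rollout.
    s, ret, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        a = policy(params, s)
        ret = ret + discount * reward(s, a)
        s = dynamics(s, a)
        discount = discount * gamma
    return ret + discount * critic(s, policy(params, s))

params = {"W": jnp.zeros((action_dim, state_dim)), "b": jnp.zeros(action_dim)}
s0 = jnp.ones(state_dim)
# The policy gradient flows through the model rollout; the critic is trained
# separately (model-free) and only provides the terminal value here.
actor_grads = jax.grad(actor_objective)(params, s0)
#+end_src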

[cite/t:@palenicekDiminishingReturnValue2022;@palenicekRevisitingModelbasedValue2022] tried using the oracle dynamics (i.e. the simulator) for model-based value expansion. They found that even with oracle dynamics, model-based value expansion did not improve sample efficiency. They suggest that model-free value expansion (e.g. Retrace) is a strong baseline without the computational overhead of model-based methods. I can't help but think they could increase the update-to-data (UTD) ratio when using MVE to see improvements in sample efficiency.
