Publication:

GPI-tree search: algorithms for decision-time planning with the general policy improvement theorem

 
dc.contributor.author: Bagot, Louis
dc.contributor.author: D'eer, Lynn
dc.contributor.author: Latré, Steven
dc.contributor.author: De Schepper, Tom
dc.contributor.author: Mets, Kevin
dc.date.accessioned: 2026-01-08T13:17:49Z
dc.date.available: 2026-01-08T13:17:49Z
dc.date.issued: 2025
dc.description.abstract: In Reinforcement Learning, Unsupervised Skill Discovery tackles the learning of several policies for transfer to downstream tasks. Once these skills are learnt, how best to use and combine them remains an open problem. The General Policy Improvement Theorem (GPI) yields a policy stronger than any individual skill by selecting the highest-valued policy, generally evaluated with Successor Features. However, the GPI policy cannot mix and combine the skills at decision time to formulate stronger plans. In this paper, we propose to adopt a model-based setting to make such planning possible, and formally show that a forward search improves on the GPI policy and on any shallower search, up to an approximation term. We argue for decision-time planning, and design a family of algorithms, GPI-Tree Search algorithms, that use Monte Carlo Tree Search (MCTS) with GPI. These algorithms leverage the skills and Q-value priors of the GPI framework to guide and improve the search, which we back up with visual intuition for the different design choices. Our experiments show that the resulting policies are much stronger than the GPI policy alone, even under approximation; they can also improve beyond the linear constraint of Successor Features.
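
A minimal NumPy sketch of the two objects the abstract contrasts may help: the GPI policy (act greedily on the best skill's Q-values, computed from Successor Features) and a depth-1 forward search that bootstraps leaf states with the GPI value. The array names, the tabular model (P, r), and the depth-1 back-up are illustrative assumptions for exposition, not the paper's MCTS algorithms.

import numpy as np

def gpi_q(psi, w, state):
    # Q-values of the GPI policy at `state`.
    # psi: successor features, shape (n_skills, n_states, n_actions, d),
    # so that Q_i(s, a) = psi[i, s, a] @ w for task weights w of shape (d,).
    return (psi[:, state] @ w).max(axis=0)            # shape: (n_actions,)

def gpi_action(psi, w, state):
    # GPI policy: max over skills, argmax over actions.
    return int(gpi_q(psi, w, state).argmax())

def lookahead_action(psi, w, P, r, gamma, state):
    # Depth-1 forward search with a tabular model: P has shape
    # (n_states, n_actions, n_states); r has shape (n_states, n_actions).
    # Leaves are bootstrapped with the GPI value, so the backed-up policy
    # can only improve on acting with GPI directly.
    n_states = r.shape[0]
    leaf_v = np.array([gpi_q(psi, w, s2).max() for s2 in range(n_states)])
    backed_up = r[state] + gamma * (P[state] @ leaf_v)  # shape: (n_actions,)
    return int(backed_up.argmax())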
dc.identifier.doi: 10.1007/s00521-025-11304-4
dc.identifier.issn: 1433-3058
dc.identifier.uri: https://imec-publications.be/handle/20.500.12860/58627
dc.publisher: Springer
dc.source.beginpage: 11304
dc.source.endpage: 11311
dc.source.issue: 23
dc.source.journal: Neural Computing and Applications
dc.source.numberofpages: 8
dc.source.volume: 37
dc.title: GPI-tree search: algorithms for decision-time planning with the general policy improvement theorem
dc.type: Journal article
dspace.entity.type: Publication
Files

Original bundle

Name: s00521-025-11304-4.pdf
Size: 1.72 MB
Format: Adobe Portable Document Format
Description: Published