Multi-Agent Counterfactual Communication Using Difference Rewards Policy Gradients

Vanneste, Simon; Vanneste, Astrid; De Schepper, Tom; Mercelis, Siegfried; Hellinckx, Peter; Mets, Kevin

doi:10.1007/978-3-031-74650-5_5

Learning to communicate while simultaneously learning a behaviour policy is a challenging problem in multi-agent reinforcement learning. In this work, we combine the MACC (Multi-Agent Counterfactual Communication) method with the DR.PG (Difference Reward Policy Gradient) method and propose the novel DR.MACC (Difference Reward Multi-Agent Counterfactual Communication) method. The DR.MACC method creates an agent-specific difference return for both the action policy and the communication policy of each agent. Compared to using the team reward directly, this policy-specific difference return reduces the credit-assignment problem. Unlike the MACC method, the DR.MACC method does not require learning a joint Q-function; instead, it operates directly on the environment’s reward function. When the reward function is unavailable, the DRR.MACC method learns an approximation of it, trained with supervised learning on the agents’ environment interactions. In the experiments, we compare the novel DR.MACC method against the MACC method with an individual Q-function and with a joint Q-function. The results show that the DR.MACC method can outperform both MACC variants across the different environment configurations.
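
The abstract does not give the formula for the difference return, so the following is only a minimal sketch of the general difference-rewards idea it builds on: the team reward for the joint action minus the expected team reward when one agent's action is replaced by alternatives drawn from its own policy. The names `difference_reward` and `toy_reward`, the discrete action space, and the toy reward itself are assumptions made for illustration and are not taken from the paper; the counterfactual over communication messages is not covered here.

```python
from typing import Callable, Sequence, Tuple


def difference_reward(
    reward_fn: Callable[[int, Tuple[int, ...]], float],
    state: int,
    joint_action: Tuple[int, ...],
    agent: int,
    policy_probs: Sequence[float],
) -> float:
    """Team reward minus a counterfactual baseline for one agent.

    The baseline marginalises the chosen agent's action under its own
    policy while the other agents' actions stay fixed (assumed form).
    """
    actual = reward_fn(state, joint_action)
    baseline = 0.0
    for alt_action, prob in enumerate(policy_probs):
        counterfactual = list(joint_action)
        counterfactual[agent] = alt_action  # swap in the alternative action
        baseline += prob * reward_fn(state, tuple(counterfactual))
    return actual - baseline


def toy_reward(state: int, joint_action: Tuple[int, ...]) -> float:
    # Hypothetical team reward: 1 only if every agent picks the action
    # that matches the state.
    return 1.0 if all(a == state for a in joint_action) else 0.0


if __name__ == "__main__":
    uniform = [0.5, 0.5]
    # Both agents act correctly: agent 0 gets positive credit.
    print(difference_reward(toy_reward, 1, (1, 1), 0, uniform))
    # Agent 1 acts incorrectly: it receives negative credit.
    print(difference_reward(toy_reward, 1, (1, 0), 1, uniform))
```

In this toy setting the agent that breaks coordination receives a negative signal while its partner receives zero, which illustrates why an agent-specific signal can ease credit assignment compared to a shared team reward. If the true reward function were unavailable, `reward_fn` could in principle be replaced by a learned approximation, as the abstract describes for DRR.MACC.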
