Reinforcement Learning for Soccer Playing
- Source: Software project as part of Reinforcement Learning lecture
- Type: Individual work
- Language(s): Python
In this project for the Reinforcement Learning lecture at the University of Edinburgh, several reinforcement learning algorithms were implemented for simple stochastic tasks and the Half Field Offense (HFO) 2D environment. All implementations build upon provided code. First, several simple RL methods were implemented based on the pseudocode in Sutton and Barto's Reinforcement Learning: An Introduction (2018):
- The dynamic programming algorithm value iteration (page 83) sweeps through all states and repeatedly updates each state's value estimate until the estimates converge.
- On-policy first-visit Monte Carlo control with ε-soft policies (page 101) generates complete episodes following an ε-soft policy and updates the value estimate of each state-action pair from the returns observed after its first visit in each episode.
- SARSA on-policy temporal-difference control (page 130) updates its state-action value estimates after every step, using the action taken, the observed reward, the next state and the next action chosen by the current policy.
- Q-learning (page 131) performs similar temporal-difference learning, but updates its estimates off-policy by maximising over the action values of the next state.
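The difference between the on-policy SARSA update and the off-policy Q-learning update can be illustrated with a minimal tabular sketch. This is not the project's code: the function names, the dictionary-based table and the toy states and actions are assumptions made for illustration, and terminal states are assumed to keep a value of zero.

```python
from collections import defaultdict


def sarsa_update(q, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=0.99):
    """On-policy: bootstrap from the action the agent actually takes next."""
    target = reward + gamma * q[(next_state, next_action)]
    q[(state, action)] += alpha * (target - q[(state, action)])


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99):
    """Off-policy: bootstrap from the best action value in the next state."""
    target = reward + gamma * max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (target - q[(state, action)])


# Hypothetical toy example: both tables start from the same estimates.
actions = ["left", "right"]
q_sarsa = defaultdict(float)
q_sarsa[("s1", "left")] = 1.0
q_sarsa[("s1", "right")] = 2.0
q_qlearn = defaultdict(float, q_sarsa)

# The behaviour policy happens to pick the worse action "left" in s1.
sarsa_update(q_sarsa, "s0", "right", 0.0, "s1", "left", alpha=0.5, gamma=1.0)
q_learning_update(q_qlearn, "s0", "right", 0.0, "s1", actions, alpha=0.5, gamma=1.0)
```

With identical experience, SARSA moves its estimate towards the value of the action that was actually selected, whereas Q-learning moves towards the value of the greedy action, regardless of what the behaviour policy does next.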
Following these comparatively simple algorithms, 1-step asynchronous Q-learning was implemented using deep Q-networks (DQNs) for value function approximation. The implementation and optimisation of this deep learning architecture were based on PyTorch.
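As a rough illustration of the 1-step targets such a network is typically regressed towards, the sketch below computes them for a small batch. It is an assumption-laden simplification rather than the project's PyTorch code: plain Python lists stand in for tensors, `next_q_values` is assumed to come from a separate, periodically synchronised target network, and the function name is hypothetical.

```python
def one_step_targets(rewards, next_q_values, dones, gamma=0.99):
    """Return r + gamma * max_a Q_target(s', a), or just r for terminal steps."""
    targets = []
    for reward, next_q, done in zip(rewards, next_q_values, dones):
        # No bootstrapping from the next state once the episode has ended.
        bootstrap = 0.0 if done else gamma * max(next_q)
        targets.append(reward + bootstrap)
    return targets


# Hypothetical batch of two transitions; the second one ends its episode.
targets = one_step_targets(
    rewards=[0.0, 1.0],
    next_q_values=[[0.5, 2.0], [3.0, 4.0]],
    dones=[False, True],
    gamma=0.5,
)
```

In a deep Q-learning setup, the loss would then compare these targets with the online network's predictions for the actions that were taken.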
Lastly, several multi-agent RL algorithms were implemented to control two cooperating agents in the HFO environment, which attack against a single defending player:
- Independent Q-learning runs the Q-learning algorithm implemented above separately for each agent on the joint state space and (wrongly!) assumes that the agents are independent.
- Joint action learning (see table 4) also operates similarly to Q-learning, but learns values over joint actions. Additionally, a model of opponent behaviour is maintained, and updates are computed based on its predictions.
- WoLF-PHC (see tables 1 and 2) performs greedy policy updates following a hill-climbing approach to train a stochastic policy.
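To make the joint action learning idea more concrete, the following is a minimal sketch of one possible formulation, assuming a tabular setting with one other agent. The class name, the count-based model of the other agent and the uniform prior for unseen states are illustrative assumptions, not the project's implementation.

```python
from collections import Counter, defaultdict


class JointActionLearner:
    """Tabular joint action learner with a frequency model of the other agent."""

    def __init__(self, actions, other_actions, alpha=0.1, gamma=0.99):
        self.actions = actions
        self.other_actions = other_actions
        self.alpha = alpha
        self.gamma = gamma
        self.q = defaultdict(float)         # Q[(state, own_action, other_action)]
        self.counts = defaultdict(Counter)  # counts[state][other_action]

    def other_prob(self, state, other_action):
        """Empirical probability of the other agent's action in this state."""
        total = sum(self.counts[state].values())
        if total == 0:
            return 1.0 / len(self.other_actions)  # assumed uniform prior
        return self.counts[state][other_action] / total

    def action_value(self, state, action):
        """Expected value of an own action under the model of the other agent."""
        return sum(self.other_prob(state, b) * self.q[(state, action, b)]
                   for b in self.other_actions)

    def update(self, state, action, other_action, reward, next_state, done=False):
        """Update the model and the joint-action value after one step."""
        self.counts[state][other_action] += 1
        best_next = 0.0 if done else max(
            self.action_value(next_state, a) for a in self.actions)
        key = (state, action, other_action)
        self.q[key] += self.alpha * (reward + self.gamma * best_next - self.q[key])


# Hypothetical usage with made-up soccer actions.
learner = JointActionLearner(["pass", "shoot"], ["pass", "shoot"],
                             alpha=0.5, gamma=1.0)
learner.update("s0", "shoot", "pass", reward=1.0, next_state="s1", done=True)
value_shoot = learner.action_value("s0", "shoot")
```

Unlike independent Q-learning, the value table here is indexed by both agents' actions, and the predicted behaviour of the other agent is used to reduce it to a value per own action when selecting actions and bootstrapping.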