Demonstrations of Several Solutions to the Pole-Balancing Problem

by Jeff Lawson and Chris Lewis

 

Abstract
Introduction
The Problem
Adaptive Heuristic Critic
Symbiotic Adaptive Neuro-Evolution
Bibliography

Abstract

Several methods for solving the pole-balancing problem have been implemented in languages such as FORTRAN, C, and Java. However, very few are available as applets that can be run directly over the Internet. We converted several existing programs to Java applet format. The algorithms used are Symbiotic Adaptive Neuro-Evolution (SANE) and the 1- and 2-layer Adaptive Heuristic Critic (AHC). These demonstrations are available at http://www.cs.hmc.edu/~jlawson/pole.

Introduction

In our World Wide Web search for information on the pole-balancing problem, we found a good deal of technical information about the problem and some of its solutions, but very few graphical demonstrations of the theory. Specifically, while there seemed to be many Java applets demonstrating other types of neural-network problems (flocking behavior, artificial life, etc.), pole-balancing applets were not to be found. What we did find was source code for a variety of pole-balancing algorithms in languages other than Java. We therefore decided to convert these programs to Java's applet format and place them on the web.

The three programs we have updated each use a different learning technique. The Adaptive Heuristic Critic (AHC) is a temporal difference network connected in a feedback loop to a "reinforcement learner," which stores value assessments based on the state of the system. The Symbiotic Adaptive Neuro-Evolution (SANE) algorithm uses genetic methods to "breed" increasingly effective neurons. SANE generally reaches a solution at least an order of magnitude faster than AHC (Moriarty and Miikkulainen).

The Problem

All of our programs view the pole-balancing problem from the same perspective. A pole is anchored at one end to a "cart" that is free to move in one dimension. The pole is subject to gravity, which causes it to fall to one side or the other. The program is in charge of moving the cart in one direction or the other to keep the pole from falling over. A successful solution keeps the pole from tipping more than a certain number of degrees from vertical.

The current state of the system is defined by four elements: the cart's position, the cart's velocity, the pole's angle from vertical, and the pole's angular velocity.

From this information, the algorithm determines which direction to push the cart. At each propagation through the network, the program moves the cart by exerting a constant, momentary "force" (an impulse) in one direction or the other. For instance, to make the cart sit still, alternating left and right impulses would be applied. Euler's method is used to approximate the continuous physical equations of motion using discrete time intervals.
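
The sketch below illustrates one such Euler step in Java, using the standard cart-pole equations of motion. The constants (cart mass, pole mass, half-pole length, force magnitude, time step) and names are assumed values for illustration, not necessarily those used in our applets.

    // Sketch of one Euler step for the cart-pole dynamics.
    class CartPole {
        static final double GRAVITY = 9.8;     // m/s^2
        static final double MASS_CART = 1.0;   // kg
        static final double MASS_POLE = 0.1;   // kg
        static final double TOTAL_MASS = MASS_CART + MASS_POLE;
        static final double LENGTH = 0.5;      // half the pole length, m
        static final double FORCE_MAG = 10.0;  // magnitude of each impulse, N
        static final double TAU = 0.02;        // time step, s

        double x, xDot, theta, thetaDot;       // the four state variables

        /** Apply one impulse (left or right) and advance the state by one Euler step. */
        void step(boolean pushRight) {
            double force = pushRight ? FORCE_MAG : -FORCE_MAG;
            double cosTheta = Math.cos(theta);
            double sinTheta = Math.sin(theta);

            double temp = (force + MASS_POLE * LENGTH * thetaDot * thetaDot * sinTheta)
                          / TOTAL_MASS;
            double thetaAcc = (GRAVITY * sinTheta - cosTheta * temp)
                              / (LENGTH * (4.0 / 3.0 - MASS_POLE * cosTheta * cosTheta / TOTAL_MASS));
            double xAcc = temp - MASS_POLE * LENGTH * thetaAcc * cosTheta / TOTAL_MASS;

            // Euler's method: advance each variable by its current derivative times TAU.
            x        += TAU * xDot;
            xDot     += TAU * xAcc;
            theta    += TAU * thetaDot;
            thetaDot += TAU * thetaAcc;
        }
    }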

Adaptive Heuristic Critic (AHC)

The AHC method combines a temporal difference neural network with a nicely ambiguous construct called a "reinforcement learner" (see diagram). The neural network component of the AHC system can be as complex as desired (our demonstrations include 1- and 2-layer networks). In our AHC program, the reinforcement learner is simply a table-lookup system. The state space (cart position, cart velocity, pole angle, pole angular velocity) is divided up into regions. Each region has an entry in the reinforcement table which represents the "value" of states within that region. Whenever the current state falls within a certain region, that region's value is updated to equal the temporal difference network's result for the state. The reinforcement learner's value assessments are fed into the AHC component (temporal difference network) along with the current state.
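
The following Java sketch shows one way such a table-lookup reinforcement learner might be organized. The region boundaries below are illustrative (they follow the classic 162-region "boxes" division of the state space), and the class and method names are ours, not those of the applet source.

    // Sketch of the table-lookup "reinforcement learner": the continuous state
    // is mapped to a region (box) index, and each box stores one value estimate.
    class ValueTable {
        // 3 position bins x 3 velocity bins x 6 angle bins x 3 angular-velocity bins = 162
        final double[] value = new double[3 * 3 * 6 * 3];

        int boxIndex(double x, double xDot, double theta, double thetaDot) {
            int xi  = x < -0.8 ? 0 : (x < 0.8 ? 1 : 2);
            int xvi = xDot < -0.5 ? 0 : (xDot < 0.5 ? 1 : 2);
            double deg = Math.toDegrees(theta);
            int ti  = deg < -6 ? 0 : deg < -1 ? 1 : deg < 0 ? 2 : deg < 1 ? 3 : deg < 6 ? 4 : 5;
            double degDot = Math.toDegrees(thetaDot);
            int tvi = degDot < -50 ? 0 : (degDot < 50 ? 1 : 2);
            return ((xi * 3 + xvi) * 6 + ti) * 3 + tvi;
        }

        double get(double x, double xDot, double theta, double thetaDot) {
            return value[boxIndex(x, xDot, theta, thetaDot)];
        }

        void set(double x, double xDot, double theta, double thetaDot, double v) {
            value[boxIndex(x, xDot, theta, thetaDot)] = v;
        }
    }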

The "critic" component — the actual neural network itself — is trained using the TD(0) method. The network's assessed "value" for each state is updated to be closer to a theoretical "ideal" value, by the following relation:

V'(s) = V(s) + a (r + g V(s') - V(s))

where V(s) and V'(s) are the old and new value assessments for the previous state, V(s') is the value assessment for the current state, r is the instantaneous reward from the system as a result of the action just taken, a is the learning rate, and g is the discount factor. The new value assessment is theoretically more "correct" because it incorporates r, the actual reward from the system (Kaelbling).
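
In code, this update is a single relation. The sketch below uses assumed values for a and g, not the settings used in our demonstrations.

    // Minimal sketch of the TD(0) update described above.
    class TemporalDifference {
        static final double ALPHA = 0.2;   // learning rate (a), assumed value
        static final double GAMMA = 0.95;  // discount factor (g), assumed value

        /** Returns V'(s): the updated value assessment for the previous state. */
        static double update(double vPrev, double vCurrent, double reward) {
            // V'(s) = V(s) + a * (r + g * V(s') - V(s))
            return vPrev + ALPHA * (reward + GAMMA * vCurrent - vPrev);
        }
    }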

The AHC method is somewhat delicate: the critic and reinforcement learner components converge together only if the learning rates are adjusted properly. The SANE genetic algorithm, by contrast, has seen much greater success.

Symbiotic Adaptive Neuro-Evolution (SANE)

The SANE algorithm "breeds" neurons using a genetic algorithm. SANE manipulates the hidden layer of a two-layer feedforward network. A "population" of neurons is established, from which a random pool of "gene" neurons is selected. A network is created from these genes and evaluated on the task, and the resulting score is added to the "fitness" value of each participating neuron. The selection-evaluation process repeats until every neuron in the population has participated some number of times. Then the average fitness of each neuron in the population is found by dividing the neuron's total fitness by the number of trial networks it was used in. The neurons of highest average fitness are then "crossbred," and the resulting offspring are added to the population, replacing the worst-performing neurons. A small amount of random mutation (at a rate of 0.1%) is applied to the offspring in case some genetic material was not present in the original population or has been lost through crossbreeding. The whole process then repeats; each pass through the selection-evaluation-mating cycle constitutes one generation.
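
The following Java sketch outlines one generation under several simplifying assumptions: each neuron is represented only by its weight vector, and the network-evaluation step (for example, how many time steps the pole stays balanced) is passed in from outside. The structure, names, and parameter values are illustrative, not the applet's actual implementation.

    // Sketch of one SANE generation: evaluation, ranking, and crossbreeding.
    import java.util.*;
    import java.util.function.ToDoubleFunction;

    class SaneSketch {
        static final Random RNG = new Random();

        static class Neuron {
            double[] weights;            // connection weights of this hidden neuron
            double totalFitness = 0;
            int trials = 0;
            Neuron(double[] w) { weights = w; }
            double averageFitness() { return trials == 0 ? 0 : totalFitness / trials; }
        }

        /** Runs one selection-evaluation-mating cycle (one generation). */
        static void oneGeneration(List<Neuron> population, int hiddenUnits, int trialNetworks,
                                  ToDoubleFunction<List<Neuron>> evaluateNetwork) {
            // 1. Evaluation: build trial networks from random neurons and add each
            //    network's score to the fitness of every neuron that took part.
            for (Neuron n : population) { n.totalFitness = 0; n.trials = 0; }
            for (int t = 0; t < trialNetworks; t++) {
                List<Neuron> genes = new ArrayList<>();
                for (int i = 0; i < hiddenUnits; i++)
                    genes.add(population.get(RNG.nextInt(population.size())));
                double score = evaluateNetwork.applyAsDouble(genes);
                for (Neuron n : genes) { n.totalFitness += score; n.trials++; }
            }

            // 2. Rank by average fitness (total fitness / number of trial networks).
            population.sort(Comparator.comparingDouble((Neuron n) -> -n.averageFitness()));

            // 3. Crossbreed high-fitness neurons; offspring replace the worst ones.
            int elite = population.size() / 4;            // breeding pool: top quarter
            for (int i = population.size() - elite; i < population.size(); i++) {
                Neuron a = population.get(RNG.nextInt(elite));
                Neuron b = population.get(RNG.nextInt(elite));
                double[] child = new double[a.weights.length];
                int cut = RNG.nextInt(child.length);      // one-point crossover
                for (int j = 0; j < child.length; j++) {
                    child[j] = (j < cut ? a.weights[j] : b.weights[j]);
                    if (RNG.nextDouble() < 0.001)         // 0.1% random mutation
                        child[j] += RNG.nextGaussian();
                }
                population.set(i, new Neuron(child));
            }
        }
    }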

The SANE algorithm has been tested against AHC and other methods, reaching a solution 9 to 16 times faster than AHC (Moriarty and Miikkulainen).

Bibliography

Charles W. Anderson and Zhaohui Hong. "Reinforcement Learning with Modular Neural Networks for Control." Department of Computer Science, Colorado State University, Fort Collins, CO 80523.

Leslie Pack Kaelbling and Michael L. Littman. "Reinforcement Learning: A Survey." Computer Science Department, Box 1910, Brown University, Providence, RI 02912-1910 USA. http://www.cs.washington.edu/research/jair/volume4/kaelbling96a-html/rl-survey.html.

David E. Moriarty and Risto Miikkulainen. "Efficient Reinforcement Learning Through Symbiotic Evolution." Machine Learning, 22:11-32. Department of Computer Sciences, The University of Texas at Austin, 1996.