The ability of hidden units to increase the computational power of artificial neural networks was well known to Old Connectionism (McCulloch & Pitts, 1943). Its problem was that while a learning rule could be used to train networks with no hidden units (Rosenblatt, 1958, 1962), no such rule existed for multilayered networks. The reason that a learning rule did not exist for multilayered networks was because learning was defined in terms of minimizing the error of unit responses. While it was straightforward to define output unit error, no parallel definition existed for hidden unit error. A hidden unit’s error could not be defined because it was not related to any directly observable outcome (e.g., external behaviour). If a hidden unit’s error could not be defined, then Old Connectionist rules could not be used to modify its connections.
The need to define and compute hidden unit error is an example of the credit assignment problem:
In playing a complex game such as chess or checkers, or in writing a computer program, one has a definite success criterion—the game is won or lost. But in the course of play, each ultimate success (or failure) is associated with a vast number of internal decisions. If the run is successful, how can we assign credit for the success among the multitude of decisions? (Minsky, 1963, p. 432)
The credit assignment problem that faced Old Connectionism was the inability to assign the appropriate credit—or more to the point, the appropriate blame— to each hidden unit for its contribution to output unit error. Failure to solve this problem prevented Old Connectionism from discovering methods to make their most powerful networks belong to the domain of empiricism and led to its demise (Papert, 1988).
The rebirth of connectionist cognitive science in the 1980s (McClelland & Rumelhart, 1986; Rumelhart & McClelland, 1986c) was caused by the discovery of a solution to Old Connectionism’s credit assignment problem. By employing a nonlinear but continuous activation function, calculus could be used to explore changes in network behaviour (Rumelhart, Hinton, & Williams, 1986b). In particular, calculus could reveal how an overall network error was altered, by changing a component deep within the network, such as a single connection between an input unit and a hidden unit. This led to the discovery of the “backpropagation of error” learning rule, sometimes known as the generalized delta rule (Rumelhart Hinton, & Williams, 1986b). The calculus underlying the generalized delta rule revealed that hidden unit error could be defined as the sum of weighted errors being sent backwards through the network from output units to hidden units.
The generalized delta rule is an error-correcting method for training multilayered networks that shares many characteristics with the original delta rule for perceptrons (Rosenblatt, 1958, 1962; Widrow, 1962; Widrow & Hoff, 1960). A more detailed mathematical treatment of this rule, and its relationship to other connectionist learning rules, is provided by Dawson (2004). A less technical account of the rule is given below.
The generalized delta rule is used to train a multilayer perceptron to mediate a desired input-output mapping. It is a form of supervised learning, in which a finite set of input-output pairs is presented iteratively, in random order, during training. Prior to training, a network is a “pretty blank” slate; all of its connection weights, and all of the biases of its activation functions, are initialized as small, random numbers. The generalized delta rule involves repeatedly presenting input-output pairs and then modifying weights. The purpose of weight modification is to reduce overall network error.
A single presentation of an input-output pair proceeds as follows. First, the input pattern is presented, which causes signals to be sent to hidden units, which in turn activate and send signals to the output units, which finally activate to represent the network’s response to the input pattern. Second, the output unit responses are compared to the desired responses, and an error term is computed for each output unit. Third, an output unit’s error is used to modify the weights of its connections. This is accomplished by adding a weight change to the existing weight. The weight change is computed by multiplying four different numbers together: a learning rate, the derivative of the unit’s activation function, the output unit’s error, and the current activity at the input end of the connection. Up to this point, learning is functionally the same as performing gradient descent training on a perceptron (Dawson, 2004).
The fourth step differentiates the generalized delta rule from older rules: each hidden unit computes its error. This is done by treating an output unit’s error as if it were activity and sending it backwards as a signal through a connection to a hidden unit. As this signal is sent, it is multiplied by the weight of the connection. Each hidden unit computes its error by summing together all of the error signals that it receives from the output units to which it is connected. Fifth, once the hidden unit error has been computed, the weights of the hidden units can be modified using the same equation that was used to alter the weights of each of the output units.
This procedure can be repeated iteratively if there is more than one layer of hidden units. That is, the error of each hidden unit in one layer can be propagated backwards to an adjacent layer as an error signal once the hidden unit weights have been modified. Learning about this pattern stops once all of the connections have been modified. Then the next training pattern can be presented to the input units, and the learning process occurs again.
There are a variety of different ways in which the generic algorithm given above can be realized. For instance, in stochastic training, connection weights are updated after each pattern is presented (Dawson, 2004). This approach is called stochastic because each pattern is presented once per epoch of training, but the order of presentation is randomized for each epoch. Another approach, batch training, is to accumulate error over an epoch and to only update weights once at the end of the epoch, using accumulated error (Rumelhart, Hinton, & Williams, 1986a). As well, variations of the algorithm exist for different continuous activation functions. For instance, an elaborated error term is required to train units that have Gaussian activation functions, but when this is done, the underlying mathematics are essentially the same as in the original generalized delta rule (Dawson & Schopflocher, 1992b).
New Connectionism was born when the generalized delta rule was invented. Interestingly, the precise date of its birth and the names of its parents are not completely established. The algorithm was independently discovered more than once. Rumelhart, Hinton, and Williams (1986a, 1986b) are its most famous discoverers and popularizers. It was also discovered by David Parker in 1985 and by Yann LeCun in 1986 (Anderson, 1995). More than a decade earlier, the algorithm was reported in Paul Werbos’ (1974) doctoral thesis. The mathematical foundations of the generalized delta rule can be traced to an earlier decade, in a publication by Shun-Ichi Amari (1967).
In an interview (Anderson & Rosenfeld, 1998), neural network pioneer Stephen Grossberg stated that “Paul Werbos, David Parker, and Shun-Ichi Amari should have gotten credit for the backpropagation model, instead of Rumelhart, Hinton, and Williams” (pp. 179–180). Regardless of the credit assignment problem associated with the scientific history of this algorithm, it transformed cognitive science in the mid-1980s, demonstrating “how the lowly concepts of feedback and derivatives are the essential building blocks needed to understand and replicate higher-order phenomena like learning, emotion and intelligence at all levels of the human mind” (Werbos, 1994, p. 1).