# 4.4: Nonlinear Transformations

John Stuart Mill modified his father’s theory of associationism (Mill & Mill, 1869; Mill, 1848) in many ways, including proposing a mental chemistry “in which it is proper to say that the simple ideas generate, rather than . . . compose, the complex ones” (Mill, 1848, p. 533). Mill’s mental chemistry is an early example of emergence, where the properties of a whole (i.e., a complex idea) are more than the sum of the properties of the parts (i.e., a set of associated simple ideas).

*The generation of one class of mental phenomena from another, whenever it can be made out, is a highly interesting fact in mental chemistry; but it no more supersedes the necessity of an experimental study of the generated phenomenon than a knowledge of the properties of oxygen and sulphur enables us to deduce those of sulphuric acid without specific observation and experiment. (Mill, 1848, p. 534)*

Mathematically, emergence results from nonlinearity (Luce, 1999). If a system is linear, then its whole behaviour is exactly equal to the sum of the behaviours of its parts. The standard pattern associator that was illustrated in Figure 4-1 is an example of such a system. Each output unit in the standard pattern associator computes a net input, which is the sum of all of the individual signals that it receives from the input units. Output unit activity is exactly equal to net input. In other words, output activity is exactly equal to the sum of input signals in the standard pattern associator. To increase the power of this type of pattern associator, and thereby facilitate emergence, a nonlinear relationship between input and output must be introduced.
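The linearity described above can be checked with a small sketch. It assumes that each signal reaching an output unit is an input activity multiplied by a connection weight; the weight and input values are hypothetical and chosen only for illustration. The response to two patterns presented together equals the sum of the responses to each pattern presented alone.

```python
def net_input(weights, inputs):
    """Sum the weighted signals arriving at a single output unit."""
    return sum(w * x for w, x in zip(weights, inputs))


def linear_output(weights, inputs):
    """Identity activation: output activity is exactly the net input."""
    return net_input(weights, inputs)


# Hypothetical connection weights and input patterns (illustrative only).
weights = [0.5, -1.0, 2.0]
pattern_a = [1.0, 0.0, 1.0]
pattern_b = [0.0, 1.0, 1.0]
combined = [a + b for a, b in zip(pattern_a, pattern_b)]

# Response to the whole versus the sum of the responses to the parts.
whole = linear_output(weights, combined)
parts = linear_output(weights, pattern_a) + linear_output(weights, pattern_b)
print(whole, parts)
```

Because the two values always agree for a unit of this kind, nothing in its behaviour goes beyond the sum of its parts.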

Neurons demonstrate one powerful type of nonlinear processing. The inputs to a neuron are weak electrical signals, called graded potentials, which stimulate and travel through the dendrites of the receiving neuron. If enough of these weak graded potentials arrive at the neuron’s soma at roughly the same time, then their cumulative effect disrupts the neuron’s resting electrical state. This results in a massive depolarization of the membrane of the neuron’s axon, called an action potential, which is a signal of constant intensity that travels along the axon to eventually stimulate some other neuron.

A crucial property of the action potential is that it is an all-or-none phenomenon, representing a nonlinear transformation of the summed graded potentials. The neuron converts continuously varying inputs into a response that is either on (action potential generated) or off (action potential not generated). This has been called the all-or-none law (Levitan & Kaczmarek, 1991, p. 43): “The all-or-none law guarantees that once an action potential is generated it is always full size, minimizing the possibility that information will be lost along the way.” The all-or-none output of neurons is a nonlinear transformation of summed, continuously varying input, and it is the reason that the brain can be described as digital in nature (von Neumann, 1958).

The all-or-none behaviour of a neuron makes it logically equivalent to the relays or switches that were discussed in Chapter 2. This logical interpretation was exploited in an early mathematical account of neural information processing (McCulloch & Pitts, 1943). McCulloch and Pitts used the all-or-none law to justify describing neurons very abstractly as devices that made true or false logical assertions about input information:

*The all-or-none law of nervous activity is sufficient to insure that the activity of any neuron may be represented as a proposition. Physiological relations existing among nervous activities correspond, of course, to relations among the propositions; and the utility of the representation depends upon the identity of these relations with those of the logical propositions. To each reaction of any neuron there is a corresponding assertion of a simple proposition. (McCulloch & Pitts, 1943, p. 117)*

McCulloch and Pitts (1943) invented a connectionist processor, now known as the McCulloch-Pitts neuron (Quinlan, 1991), that used the all-or-none law. Like the output units in the standard pattern associator (Figure 4-1), a McCulloch-Pitts neuron first computes its net input by summing all of its incoming signals. However, it then uses a nonlinear activation function to transform net input into internal activity. The activation function used by McCulloch and Pitts was the Heaviside step function, named after nineteenth-century electrical engineer Oliver Heaviside. This function compares the net input to a threshold. If the net input is less than the threshold, the unit’s activity is equal to 0. Otherwise, the unit’s activity is equal to 1. (In other artificial neural networks [Rosenblatt, 1958, 1962], below-threshold net inputs produced activity of –1.)
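A minimal sketch of such a unit follows. The function names, the weights, and the threshold are assumptions made for illustration rather than anything taken from McCulloch and Pitts (1943). With these particular values the unit behaves like a logical AND, and its response to two signals together is not the sum of its responses to each signal alone.

```python
def heaviside_step(net_input, threshold):
    """Return 0 when net input is below the threshold, otherwise 1."""
    return 0 if net_input < threshold else 1


def mcculloch_pitts_neuron(inputs, weights, threshold):
    """Sum the weighted incoming signals, then apply the step function."""
    net_input = sum(w * x for w, x in zip(weights, inputs))
    return heaviside_step(net_input, threshold)


# Hypothetical weights and threshold (illustrative only).
weights = [1.0, 1.0]
threshold = 2.0

both = mcculloch_pitts_neuron([1, 1], weights, threshold)
left = mcculloch_pitts_neuron([1, 0], weights, threshold)
right = mcculloch_pitts_neuron([0, 1], weights, threshold)
print(both, left, right)  # prints: 1 0 0
```

Here the unit turns on only when both inputs are active, so the whole (1) differs from the sum of the parts (0 + 0). Returning -1 instead of 0 in `heaviside_step` would give the variant attributed above to Rosenblatt.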

The output units in the standard pattern associator (Figure 4-1) can be described as using the linear identity function to convert net input into activity, because output unit activity is equal to net input. If one replaced the identity function with the Heaviside step function in the standard pattern associator, it would then become a different kind of network, called a perceptron (Dawson, 2004), which was invented by Frank Rosenblatt during the era in which cognitive science was born (Rosenblatt, 1958, 1962).

Perceptrons (Rosenblatt, 1958, 1962) were artificial neural networks that could be trained to be pattern classifiers: given an input pattern, they would use their nonlinear outputs to decide whether or not the pattern belonged to a particular class. In other words, the nonlinear activation function used by perceptrons allowed them to assign perceptual predicates; standard pattern associators do not have this ability. The nature of the perceptual predicates that a perceptron could learn to assign was a central issue in an early debate between classical and connectionist cognitive science (Minsky & Papert, 1969; Papert, 1988).

The Heaviside step function is nonlinear, but it is also discontinuous. This was problematic when modern researchers sought methods to train more complex networks. Both the standard pattern associator and the perceptron are one-layer networks, meaning that they have only one layer of connections, the direct connections between input and output units (Figure 4-1). More powerful networks arise if intermediate processors, called hidden units, are used to preprocess input signals before sending them on to the output layer. However, it was not until the mid-1980s that learning rules capable of training such networks were invented (Ackley, Hinton, & Sejnowski, 1985; Rumelhart, Hinton, & Williams, 1986b). The use of calculus to derive these new learning rules became possible when the discontinuous Heaviside step function was replaced by a continuous approximation of the all-or-none law (Rumelhart, Hinton, & Williams, 1986b).

One continuous approximation of the Heaviside step function is the sigmoid-shaped logistic function. It asymptotes to a value of 0 as its net input approaches negative infinity, and asymptotes to a value of 1 as its net input approaches positive infinity. When the net input is equal to the threshold (or bias) of the logistic, activity is equal to 0.5. Because the logistic function is continuous, its derivative can be calculated, and calculus can be used as a tool to derive new learning rules (Rumelhart, Hinton, & Williams, 1986b). However, it is still nonlinear, so logistic activities can still be interpreted as truth values assigned to propositions.
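The properties just listed can be illustrated with a short sketch. It assumes the common form 1 / (1 + exp(-(net - bias))), in which the bias is subtracted from net input so that activity is 0.5 when the two are equal; this sign convention and the numerical values are assumptions, not details given in the text. The derivative uses the standard identity that the slope of the logistic equals activity multiplied by one minus activity.

```python
import math


def logistic(net_input, bias=0.0):
    """Logistic activation: a smooth approximation of a step at `bias`."""
    return 1.0 / (1.0 + math.exp(-(net_input - bias)))


def logistic_derivative(net_input, bias=0.0):
    """Slope of the logistic, written in terms of its own activity."""
    activity = logistic(net_input, bias)
    return activity * (1.0 - activity)


# Hypothetical bias value (illustrative only).
bias = 2.0

at_threshold = logistic(2.0, bias)      # net input equals the bias
far_below = logistic(-20.0, bias)       # approaches 0
far_above = logistic(20.0, bias)        # approaches 1
slope_at_threshold = logistic_derivative(2.0, bias)

print(at_threshold, slope_at_threshold)  # prints: 0.5 0.25
```

Unlike the step function, this function has a defined slope everywhere, which is what a calculus-based learning rule requires.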

Modern connectionist networks employ many different nonlinear activation functions. Processing units that employ the logistic activation function have been called integration devices (Ballard, 1986) because they take a sum (net input) and “squash” it into the range between 0 and 1. Other processing units might be tuned to generate maximum responses to a narrow range of net inputs. Ballard (1986) called such processors value units. A different nonlinear continuous function, the Gaussian equation, can be used to mathematically define a value unit, and calculus can be used to derive a learning rule for this type of artificial neural network (Dawson, 1998, 2004; Dawson & Schopflocher, 1992b).
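A value unit can be sketched in the same style. The particular Gaussian used here, exp(-pi * (net - mu)^2), is one plausible form and should be read as an assumption, as should the name `mu` for the net input that produces the maximum response. The point is only that activity peaks at one net input and falls away on both sides of it, whereas the logistic keeps rising.

```python
import math


def gaussian_value_unit(net_input, mu=0.0):
    """Gaussian activation: maximal at `mu`, falling off on either side."""
    return math.exp(-math.pi * (net_input - mu) ** 2)


# Hypothetical preferred net input (illustrative only).
mu = 1.0

peak = gaussian_value_unit(1.0, mu)    # net input equals mu
below = gaussian_value_unit(0.0, mu)   # one unit below mu
above = gaussian_value_unit(2.0, mu)   # one unit above mu

print(peak, below == above)  # prints: 1.0 True
```

A unit of this kind responds strongly only to a narrow band of net inputs, which is the tuning that Ballard's term describes.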

Many other activation functions exist. One review paper has identified 640 different activation functions employed in connectionist networks (Duch & Jankowski, 1999). One characteristic of the vast majority of these activation functions is their nonlinearity. Connectionist cognitive science is associationist, but it is also nonlinear.