# 4.16: New Powers of Old Networks

The history of artificial neural networks can be divided into two periods, Old Connectionism and New Connectionism (Medler, 1998). New Connectionism studies powerful networks that consist of multiple layers of units and whose connections are trained to perform complex tasks. Old Connectionism studied networks that belonged to one of two classes. One was powerful multilayer networks that were hand wired, not trained (McCulloch & Pitts, 1943). The other was less powerful networks that did not have hidden units but were trained (Rosenblatt, 1958, 1962; Widrow, 1962; Widrow & Hoff, 1960).

Perceptrons (Rosenblatt, 1958, 1962) belong to Old Connectionism. A perceptron is a standard pattern associator whose output units employ a nonlinear activation function. Rosenblatt’s perceptrons used the Heaviside step function to convert net input into output unit activity. Modern perceptrons use continuous nonlinear activation functions, such as the logistic or the Gaussian (Dawson, 2004, 2005, 2008; Dawson et al., 2009; Dawson et al., 2010).
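The three activation functions named above can be sketched in a few lines of Python. This is only an illustration: the function names, the `threshold`, `bias`, and `mu` parameters, the convention that the step function outputs 0 exactly at threshold, and the particular Gaussian form \(e^{-\pi(net - \mu)^2}\) are assumptions made here, not definitions given in the text.

```python
import math

def heaviside(net, threshold=0.0):
    """Step function: output 1 when net input exceeds the threshold, else 0."""
    return 1.0 if net > threshold else 0.0

def logistic(net, bias=0.0):
    """Continuous, monotonic function that rises smoothly from 0 to 1."""
    return 1.0 / (1.0 + math.exp(-(net + bias)))

def gaussian(net, mu=0.0):
    """Continuous, bell-shaped function that peaks at 1 when net input equals mu."""
    return math.exp(-math.pi * (net - mu) ** 2)
```

The step and logistic functions are monotonic, so more net input never lowers the response; the Gaussian is not, so it responds maximally only to a narrow band of net input.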

Perceptrons are trained using an error-correcting variant of Hebb-style learning (Dawson, 2004). Perceptron training associates input activity with output unit error as follows. First, a pattern is presented to the input units, producing output unit activity via the existing connection weights. Second, output unit error is computed by taking the difference between desired output unit activity and actual output unit activity for each output unit in the network. This kind of training is called supervised learning, because it requires an external trainer to provide the desired output unit activities. Third, Hebb-style learning is used to associate input unit activity with output unit error: weight change is equal to a learning rate times input unit activity times output unit error. (In modern perceptrons, this triple product can also be multiplied by the derivative of the output unit’s activation function, resulting in gradient descent learning [Dawson, 2004].)
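The three training steps can be sketched as follows for a perceptron with a single logistic output unit. The function names, the learning rate, the number of epochs, and the use of a bias term (treated as a weight from an always-on input) are assumptions made for illustration; error is taken as desired minus actual activity so that the weight change reduces error.

```python
import math

def logistic(net):
    return 1.0 / (1.0 + math.exp(-net))

def respond(weights, bias, inputs):
    """Step 1: present a pattern and compute output unit activity."""
    net = bias + sum(w * a for w, a in zip(weights, inputs))
    return logistic(net)

def train_perceptron(patterns, targets, rate=0.5, epochs=2000, use_gradient=True):
    """Train one logistic output unit with the error-correcting rule."""
    weights = [0.0] * len(patterns[0])
    bias = 0.0  # assumption: bias acts as a weight from an always-on input
    for _ in range(epochs):
        for inputs, target in zip(patterns, targets):
            output = respond(weights, bias, inputs)
            # Step 2: error is desired activity minus actual activity.
            error = target - output
            if use_gradient:
                # Optional: scale by the logistic derivative (gradient descent).
                error *= output * (1.0 - output)
            # Step 3: weight change = learning rate * input activity * error.
            for i, activity in enumerate(inputs):
                weights[i] += rate * activity * error
            bias += rate * error
    return weights, bias

# Usage: logical AND, a mapping that a perceptron can represent.
patterns = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = [0, 0, 0, 1]
weights, bias = train_perceptron(patterns, targets)
```

Note that when the error for a pattern is zero, every weight change for that pattern is zero as well, which is the sense in which learning is driven entirely by error.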

The supervised learning of a perceptron is designed to reduce output unit errors as training proceeds. Weight changes are proportional to the amount of generated error. If no errors occur, then weights are not changed. If a task’s solution can be represented by a perceptron, then repeated training using pairs of input-output stimuli is guaranteed to eventually produce zero error, as proven in Rosenblatt’s perceptron convergence theorem (Rosenblatt, 1962).

Because perceptrons are a product of Old Connectionism, there are limits to the range of input-output mappings that they can mediate. In their famous computational analyses of what perceptrons could and could not learn to compute, Minsky and Papert (1969) demonstrated that perceptrons could not learn to distinguish some basic topological properties easily discriminated by humans, such as the difference between connected and unconnected figures. As a result, interest in and funding for Old Connectionist research decreased dramatically (Medler, 1998; Papert, 1988).

However, perceptrons are still capable of providing new insights into phenomena of interest to cognitive science. The remainder of this section illustrates this by exploring the relationship between perceptron learning and classical conditioning.

The primary reason that connectionist cognitive science is related to empiricism is that the knowledge of an artificial neural network is typically acquired via experience. For instance, in supervised learning a network is presented with pairs of patterns that define an input-output mapping of interest, and a learning rule is used to adjust connection weights until the network generates the desired response to a given input pattern.

In the twentieth century, prior to the birth of artificial neural networks (McCulloch & Pitts, 1943), empiricism was the province of experimental psychology. A detailed study of classical conditioning (Pavlov, 1927) explored the subtle regularities of the law of contiguity. Pavlovian, or classical, conditioning begins with an unconditioned stimulus (US) that is capable, without training, of producing an unconditioned response (UR). Also of interest is a conditioned stimulus (CS) that, when first presented, does not produce the UR. In classical conditioning, the CS is paired with the US for a number of trials. As a result of this pairing, which places the CS in contiguity with the UR, the CS becomes capable of eliciting the UR on its own. When this occurs, the UR is then known as the conditioned response (CR).

Classical conditioning is a very basic kind of learning, but experiments revealed that the mechanisms underlying it were more complex than the simple law of contiguity. For example, one phenomenon found in classical conditioning is blocking (Kamin, 1968). Blocking involves two conditioned stimuli, \(\text{CS}_A\) and \(\text{CS}_B\). Either stimulus is capable of being conditioned to produce the CR. However, if training begins with a phase in which only \(\text{CS}_A\) is paired with the US and is then followed by a phase in which both \(\text{CS}_A\) and \(\text{CS}_B\) are paired with the US, then \(\text{CS}_B\) fails to produce the CR. The prior conditioning involving \(\text{CS}_A\) blocks the conditioning of \(\text{CS}_B\), even though in the second phase of training \(\text{CS}_B\) is contiguous with the UR.

The explanation of phenomena such as blocking required a new model of associative learning. Such a model was proposed in the early 1970s by Robert Rescorla and Allen Wagner (Rescorla & Wagner, 1972). This mathematical model of learning has been described as being cognitive, because it defines associative learning in terms of expectation. Its basic idea is that a CS is a signal about the likelihood that a US will soon occur. Thus the CS sets up expectations of future events. If these expectations are met, then no learning will occur. However, if these expectations are not met, then associations between stimuli and responses will be modified. “Certain expectations are built up about the events following a stimulus complex; expectations initiated by that complex and its component stimuli are then only modified when consequent events disagree with the composite expectation” (p. 75).

The expectation-driven learning that was formalized in the Rescorla-Wagner model explained phenomena such as blocking. In the second phase of learning in the blocking paradigm, the coming US was already signaled by \(\text{CS}_A\). Because there was no surprise, no conditioning of \(\text{CS}_B\) occurred. The Rescorla-Wagner model has had many other successes; though it is far from perfect (Miller, Barnet, & Grahame, 1995; Walkenbach & Haddad, 1980), it remains an extremely influential, if not the most influential, mathematical model of learning.
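A short simulation makes the role of surprise in blocking concrete. The sketch below uses the standard textbook form of the Rescorla-Wagner update, in which each stimulus present on a trial changes its associative strength by a salience term times a US learning-rate term times the difference between the maximum supportable strength and the summed strength of all present stimuli. The function name, the cue labels, the parameter values, and the trial counts are hypothetical choices made for illustration.

```python
def rescorla_wagner(trials, alpha=0.3, beta=1.0):
    """Return associative strengths after a sequence of trials.

    trials: list of (present_cues, lam) pairs, where lam is the maximum
    associative strength the US supports on that trial (0 if no US).
    alpha (CS salience) and beta (US learning rate) are assumed equal
    for all cues to keep the sketch small.
    """
    strengths = {}
    for cues, lam in trials:
        for cue in cues:
            strengths.setdefault(cue, 0.0)
        # Surprise: what the US supports minus what the present cues predict.
        surprise = lam - sum(strengths[cue] for cue in cues)
        for cue in cues:
            strengths[cue] += alpha * beta * surprise
    return strengths

phase_1 = [(("A",), 1.0)] * 50        # only CS_A paired with the US
phase_2 = [(("A", "B"), 1.0)] * 50    # CS_A and CS_B paired with the US

blocked = rescorla_wagner(phase_1 + phase_2)  # blocking paradigm
control = rescorla_wagner(phase_2)            # compound training only
```

In the blocking condition the strength of B stays close to zero, because A already predicts the US by the time B is introduced. In the control condition, where there is no first phase, A and B share the available associative strength and each approaches one half.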

The Rescorla-Wagner proposal that learning depends on the amount of surprise parallels the notion in supervised training of networks that learning depends on the amount of error. What is the relationship between Rescorla-Wagner learning and perceptron learning?

Proofs of the equivalence between the mathematics of Rescorla-Wagner learning and the mathematics of perceptron learning have a long history. Early proofs demonstrated that one learning rule could be translated into the other (Gluck & Bower, 1988; Sutton & Barto, 1981). However, these proofs assumed that the networks had linear activation functions. More recently, it has been proven that when it is more properly assumed that networks employ a nonlinear activation function, one can still translate Rescorla-Wagner learning into perceptron learning, and vice versa (Dawson, 2008).
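The translation is easiest to see in the linear case. The equations below use the standard textbook forms of the two rules; the symbols are conventional notation introduced here for illustration and are not taken from the text. The Rescorla-Wagner rule changes the associative strength \(V_i\) of each stimulus present on a trial by

\[ \Delta V_i = \alpha_i \beta \Big( \lambda - \sum_{j} V_j \Big), \]

where \(\alpha_i\) is the salience of the stimulus, \(\beta\) is a learning rate associated with the US, \(\lambda\) is the maximum associative strength that the US supports, and the sum runs over the stimuli present on the trial. The error-correcting rule for an output unit with a linear activation function changes each weight \(w_i\) by

\[ \Delta w_i = \eta \, a_i \, (t - o), \qquad o = \sum_{j} w_j a_j, \]

where \(\eta\) is the learning rate, \(a_i\) is input unit activity, \(t\) is desired output, and \(o\) is actual output. If \(a_i = 1\) when a stimulus is present and \(a_i = 0\) when it is absent, then setting \(w_i = V_i\), \(t = \lambda\), and \(\eta = \alpha_i \beta\) (assuming, for simplicity, a salience shared by all stimuli) makes the two rules identical term by term: absent stimuli contribute nothing to \(o\) and receive no weight change.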

One would imagine that the existence of proofs of the computational equivalence between Rescorla-Wagner learning and perceptron learning would mean that perceptrons would not be able to provide any new insights into classical conditioning. However, this is not correct. Dawson (2008) has shown that if one puts aside the formal comparison of the two types of learning and uses perceptrons to simulate a wide variety of different classical conditioning paradigms, then some puzzling results occur. On the one hand, perceptrons generate the same results as the Rescorla-Wagner model for many different paradigms. Given the formal equivalence between the two types of learning, this is not surprising. On the other hand, for some paradigms, perceptrons generate different results than those predicted from the Rescorla-Wagner model (Dawson, 2008, Chapter 7). Furthermore, in many cases these differences represent improvements over Rescorla-Wagner learning. If the two types of learning are formally equivalent, then how is it possible for such differences to occur?

Dawson (2008) used this perceptron paradox to motivate a more detailed comparison between Rescorla-Wagner learning and perceptron learning. He found that while these two models of learning were equivalent at the computational level of investigation, there were crucial differences between them at the algorithmic level. To train a perceptron, the network must first behave (i.e., respond to an input pattern) so that error can be computed and used to determine weight changes. In contrast, Dawson showed that the Rescorla-Wagner model defines learning in such a way that behaviour is not required!

Dawson’s (2008) algorithmic analysis of Rescorla-Wagner learning is consistent with Rescorla and Wagner’s (1972) own understanding of their model: “Independent assumptions will necessarily have to be made about the mapping of associative strengths into responding in any particular situation” (p. 75). Later, they make this same point much more explicitly:

We need to provide some mapping of [associative] values into behavior. We are not prepared to make detailed assumptions in this instance. In fact, we would assume that any such mapping would necessarily be peculiar to each experimental situation, and depend upon a large number of ‘performance’ variables. (Rescorla & Wagner, 1972, p. 77)

Some knowledge is tacit: we can know more than we can tell (Polanyi, 1966). Dawson (2008) noted that the Rescorla-Wagner model presents an interesting variant of this theme: if there is no explicit need for a behavioural theory, then there is no need to state one. Researchers can ignore Rescorla and Wagner’s (1972) call for explicit models to convert associative strengths into behaviour and instead assume unstated, tacit theories such as “strong associations produce stronger, or more intense, or faster behavior.” Researchers evaluate the Rescorla-Wagner model (Miller, Barnet, & Grahame, 1995; Walkenbach & Haddad, 1980) by agreeing that associations will eventually lead to behaviour, without actually stating how this is done. In the Rescorla-Wagner model, learning comes first and behaviour comes later, maybe.

Using perceptrons to study classical conditioning paradigms contributes to the psychological understanding of such learning in three ways. First, at the computational level, it demonstrates equivalences between independent work on learning conducted in computer science, electrical engineering, and psychology (Dawson, 2008; Gluck & Bower, 1988; Sutton & Barto, 1981).

Second, the results of training perceptrons in these paradigms raise issues that lead to a more sophisticated understanding of learning theories. For instance, the perceptron paradox led to the realization that, as the Rescorla-Wagner model is typically used, accounts of how associations are converted into behaviour are left unspecified. Recall that one of the advantages of computer simulation research is exposing tacit assumptions (Lewandowsky, 1993).

Third, the activation functions that are a required property of a perceptron serve as explicit theories of behaviour to be incorporated into the Rescorla-Wagner model. More precisely, changes in activation function result in changes to how the perceptron responds to stimuli, indicating the importance of choosing a particular architecture (Dawson & Spetch, 2005). The wide variety of activation functions that are available for artificial neural networks (Duch & Jankowski, 1999) offers a great opportunity to explore how changing theories of behaviour, or altering architectures, affect the nature of associative learning.
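A small example suggests how an activation function acts as a theory of behaviour. In the sketch below, the same pair of associative strengths is converted into responding by two different functions. The weights, the cue labels, and the Gaussian form and its mean `mu` are hypothetical values chosen for illustration, not results from the cited studies.

```python
import math

def logistic(net):
    """Monotonic theory of behaviour: more association, more responding."""
    return 1.0 / (1.0 + math.exp(-net))

def gaussian(net, mu=1.0):
    """Tuned theory of behaviour: responding peaks when net input equals mu."""
    return math.exp(-math.pi * (net - mu) ** 2)

# Hypothetical associative strengths, identical for both theories.
weights = {"A": 1.0, "B": 1.0}

def net_input(cues):
    return sum(weights[cue] for cue in cues)

single = net_input(["A"])         # one CS presented alone
compound = net_input(["A", "B"])  # both CSs presented together

logistic_prefers_compound = logistic(compound) > logistic(single)
gaussian_prefers_single = gaussian(single) > gaussian(compound)
```

Under the logistic function the compound stimulus produces the stronger response, while under the Gaussian it produces the weaker one, even though the underlying associations are identical in both cases.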

The preceding paragraphs have shown how the perceptron can be used to inform theories of a very old psychological phenomenon, classical conditioning. We now consider how perceptrons can play a role in exploring a more modern topic, reorientation, which was described from a classical perspective in Chapter 3 (Section 3.12).