4.13: A Parallel Distributed Production System

One of the prototypical architectures for classical cognitive science is the production system (Anderson, 1983; Kieras & Meyer, 1997; Meyer et al., 2001; Meyer & Kieras, 1997a, 1997b; Newell, 1973, 1990; Newell & Simon, 1972). A production system is a set of condition-action pairs. Each production works in parallel, scanning working memory for a pattern that matches its condition. If a production finds such a match, then it takes control, momentarily disabling the other productions, and performs its action, which typically involves adding, deleting, copying, or moving symbols in the working memory.
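The match-and-fire cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: working memory is modelled as a set of strings, the conflict-resolution policy (the first matching production takes control) is a simplification chosen here, and the tea-making productions are hypothetical rather than drawn from any of the cited architectures.

```python
from typing import Callable, List, Set, Tuple

Memory = Set[str]
# A production is (name, condition, action); all three are illustrative.
Production = Tuple[str, Callable[[Memory], bool], Callable[[Memory], None]]


def run(productions: List[Production], memory: Memory, max_cycles: int = 100) -> List[str]:
    """Repeatedly match every production against memory and fire one of them."""
    fired = []
    for _ in range(max_cycles):
        # Conceptually, all productions scan working memory in parallel.
        matches = [p for p in productions if p[1](memory)]
        if not matches:
            break  # no condition is satisfied, so the system halts
        # Assumed conflict resolution: the first matching production takes control.
        name, _, action = matches[0]
        action(memory)
        fired.append(name)
    return fired


productions: List[Production] = [
    ("boil", lambda m: "want tea" in m and "hot water" not in m,
     lambda m: m.add("hot water")),
    ("steep", lambda m: "hot water" in m and "tea" not in m,
     lambda m: m.add("tea")),
    ("done", lambda m: "tea" in m and "want tea" in m,
     lambda m: m.discard("want tea")),
]

memory: Memory = {"want tea"}
fired = run(productions, memory)
print(fired)   # order in which productions took control
print(memory)  # final contents of working memory
```

Each action here only adds or deletes a symbol, which matches the kinds of working-memory operations mentioned above; a fuller system would also copy and move symbols.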

Production systems have been proposed as a lingua franca for cognitive science, capable of describing any connectionist or embodied cognitive science theory and therefore of subsuming such theories under the umbrella of classical cognitive science (Vera & Simon, 1993). This is because Vera and Simon (1993) argued that any situation-action pairing can be represented either as a single production in a production system or, for complicated situations, as a set of productions. “Productions provide an essentially neutral language for describing the linkages between information and action at any desired (sufficiently high) level of aggregation” (p. 42). Other philosophers of cognitive science have endorsed similar positions. For instance, von Eckardt (1995) suggested that if one considers distributed representations in artificial neural networks as being “higher-level” representations, then connectionist networks can be viewed as being analogous to classical architectures. This is because when examined at this level, connectionist networks have the capacity to input and output represented information, to store represented information, and to manipulate represented information. In other words, the symbolic properties of classical architectures may emerge from what are known as the subsymbolic properties of networks (Smolensky, 1988).

However, the view that artificial neural networks are classical in general or examples of production systems in particular is not accepted by all connectionists. It has been claimed that connectionism represents a Kuhnian paradigm shift away from classical cognitive science (Schneider, 1987). With respect to Vera and Simon’s (1993) particular analysis, their definition of symbol has been deemed too liberal by some neural network researchers (Touretzky & Pomerleau, 1994). Touretzky and Pomerleau (1994) claimed of a particular neural network discussed by Vera and Simon, ALVINN (Pomerleau, 1991), that its hidden unit “patterns are not arbitrarily shaped symbols, and they are not combinatorial. Its hidden unit feature detectors are tuned filters” (Touretzky & Pomerleau, 1994, p. 348). Others have viewed ALVINN from a position of compromise, noting that “some of the processes are symbolic and some are not” (Greeno & Moore, 1993, p. 54).

Are artificial neural networks equivalent to production systems? In the philosophy of science, if two apparently different theories are in fact identical, then one theory can be translated into the other. This is called intertheoretic reduction (Churchland, 1985, 1988; Hooker, 1979, 1981). The widely accepted view that classical and connectionist cognitive science are fundamentally different (Schneider, 1987) amounts to the claim that intertheoretic reduction between a symbolic model and a connectionist network is impossible. One research project (Dawson et al., 2000) directly examined this issue by investigating whether a production system model could be translated into an artificial neural network.

Dawson et al. (2000) investigated intertheoretic reduction using a benchmark problem in the machine learning literature: classifying a very large number (8,124) of mushrooms as being either edible or poisonous on the basis of 21 different features (Schlimmer, 1987). Dawson et al. (2000) used a standard machine learning technique, the ID3 algorithm (Quinlan, 1986), to induce a decision tree for the mushroom problem. A decision tree is a set of tests that are performed in sequence to classify patterns. After performing a test, one either reaches a terminal branch of the tree, at which point the pattern being tested can be classified, or a node of the decision tree, which is to say another test that must be performed. The decision tree is complete for a pattern set if every pattern eventually leads the user to a terminal branch. Dawson et al. (2000) discovered that a decision tree consisting of only five different tests could solve the Schlimmer mushroom classification task. Their decision tree is provided in Table \(\PageIndex{1}\).

**Table** \(\PageIndex{1}\). *Dawson et al.'s (2000) five-step decision tree for classifying mushrooms. Decision points in this tree where mushrooms are classified (e.g., "Rule 1 Edible") are given in bold.*

| Step | Tests and Decision Points |
|---|---|
| 1 | What is the mushroom’s odour? If it is almond or anise then it is edible. **(Rule 1 Edible)** If it is creosote or fishy or foul or musty or pungent or spicy then it is poisonous. **(Rule 1 Poisonous)** If it has no odour then proceed to Step 2. |
| 2 | Obtain the spore print of the mushroom. If the spore print is black or brown or buff or chocolate or orange or yellow then it is edible. **(Rule 2 Edible)** If the spore print is green or purple then it is poisonous. **(Rule 2 Poisonous)** If the spore print is white then proceed to Step 3. |
| 3 | Examine the gill size of the mushroom. If the gill size is broad, then it is edible. **(Rule 3 Edible)** If the gill size is narrow, then proceed to Step 4. |
| 4 | Examine the stalk surface above the mushroom’s ring. If the surface is fibrous then it is edible. **(Rule 4 Edible)** If the surface is silky or scaly then it is poisonous. **(Rule 4 Poisonous)** If the surface is smooth then proceed to Step 5. |
| 5 | Examine the mushroom for bruises. If it has no bruises then it is edible. **(Rule 5 Edible)** If it has bruises then it is poisonous. **(Rule 5 Poisonous)** |
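
The five tests in Table \(\PageIndex{1}\) can be read directly as a short program. The sketch below is one possible rendering, not Dawson et al.'s code: the dictionary keys (`odour`, `spore_print`, `gill_size`, `stalk_surface`, `bruises`) and the function name `classify` are assumptions made for illustration, while the feature values and decision points come from the table.

```python
from typing import Dict, Tuple


def classify(mushroom: Dict[str, str]) -> Tuple[str, str]:
    """Apply the five tests in order; return (label, decision point)."""
    # Step 1: odour
    odour = mushroom["odour"]
    if odour in {"almond", "anise"}:
        return "edible", "Rule 1 Edible"
    if odour in {"creosote", "fishy", "foul", "musty", "pungent", "spicy"}:
        return "poisonous", "Rule 1 Poisonous"
    if odour != "none":
        raise ValueError(f"unknown odour: {odour}")

    # Step 2: spore print colour
    spore = mushroom["spore_print"]
    if spore in {"black", "brown", "buff", "chocolate", "orange", "yellow"}:
        return "edible", "Rule 2 Edible"
    if spore in {"green", "purple"}:
        return "poisonous", "Rule 2 Poisonous"
    if spore != "white":
        raise ValueError(f"unknown spore print: {spore}")

    # Step 3: gill size
    gill = mushroom["gill_size"]
    if gill == "broad":
        return "edible", "Rule 3 Edible"
    if gill != "narrow":
        raise ValueError(f"unknown gill size: {gill}")

    # Step 4: stalk surface above the ring
    surface = mushroom["stalk_surface"]
    if surface == "fibrous":
        return "edible", "Rule 4 Edible"
    if surface in {"silky", "scaly"}:
        return "poisonous", "Rule 4 Poisonous"
    if surface != "smooth":
        raise ValueError(f"unknown stalk surface: {surface}")

    # Step 5: bruises
    bruises = mushroom["bruises"]
    if bruises == "no":
        return "edible", "Rule 5 Edible"
    if bruises == "yes":
        return "poisonous", "Rule 5 Poisonous"
    raise ValueError(f"unknown bruises value: {bruises}")


example = {"odour": "none", "spore_print": "white", "gill_size": "narrow",
           "stalk_surface": "smooth", "bruises": "yes"}
print(classify(example))
```

Because each test is reached only when the earlier tests fail to classify the mushroom, a pattern that is settled at Step 1 never needs its later features to be inspected.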

The decision tree provided in Table \(\PageIndex{1}\) is a classical theory of how mushrooms can be classified. It is not surprising, then, that one can translate this decision tree into the lingua franca: Dawson et al. (2000) rewrote the decision tree as an equivalent set of production rules. They did so by using the features of mushrooms that must be true at each terminal branch of the decision tree as the conditions for a production. The action of this production is to classify the mushroom (i.e., to assert that a mushroom is either edible or poisonous). For instance, at the Rule 1 Edible decision point in Table \(\PageIndex{1}\), one could create the following production rule: “If the odour is anise or almond, then the mushroom is edible.” Similar productions can be created for later decision points in the algorithm; these productions will involve a longer list of mushroom features. The complete set of productions that were created for the decision tree algorithm is provided in Table \(\PageIndex{2}\).

Dawson et al. (2000) trained a network of value units to solve the mushroom classification problem and to determine whether a classical model (such as the decision tree from Table \(\PageIndex{1}\) or the production system from Table \(\PageIndex{2}\)) could be translated into a network. Their network used 21 input units to encode mushroom features, 5 hidden value units, and 10 output value units. One output unit encoded the edible/poisonous classification: if a mushroom was edible, this unit was trained to turn on; otherwise it was trained to turn off.

| Decision Point From Table \(\PageIndex{1}\) | Equivalent Production | Network Cluster |
|---|---|---|
| Rule 1 Edible | P1: if (odor=anise) \(\lor\) (odor=almond) → edible | 2 or 3 |
| Rule 1 Poisonous | P2: if (odor\(\neq\)anise) \(\land\) (odor\(\neq\)almond) \(\land\) (odor\(\neq\)none) → not edible | 1 |
| Rule 2 Edible | P3: if (odor=none) \(\land\) (spore print color\(\neq\)green) \(\land\) (spore print color\(\neq\)purple) \(\land\) (spore print color\(\neq\)white) → edible | 9 |
| Rule 2 Poisonous | P4: if (odor=none) \(\land\) ((spore print color=green) \(\lor\) (spore print color=purple)) → not edible | 6 |
| Rule 3 Edible | P5: if (odor=none) \(\land\) (spore print color=white) \(\land\) (gill size=broad) → edible | 4 |
| Rule 4 Edible | P6: if (odor=none) \(\land\) (spore print color=white) \(\land\) (gill size=narrow) \(\land\) (stalk surface above ring=fibrous) → edible | 7 or 11 |
| Rule 4 Poisonous | P7: if (odor=none) \(\land\) (spore print color=white) \(\land\) (gill size=narrow) \(\land\) ((stalk surface above ring=silky) \(\lor\) (stalk surface above ring=scaly)) → not edible | 5 |
| Rule 5 Edible | P8: if (odor=none) \(\land\) (spore print color=white) \(\land\) (gill size=narrow) \(\land\) (stalk surface above ring=smooth) \(\land\) (bruises=no) → edible | 8 or 12 |
| Rule 5 Poisonous | P9: if (odor=none) \(\land\) (spore print color=white) \(\land\) (gill size=narrow) \(\land\) (stalk surface above ring=smooth) \(\land\) (bruises=yes) → not edible | 10 |

Table \(\PageIndex{2}\). Dawson et al.’s (2000) production system translation of Table \(\PageIndex{1}\). Conditions are given as sets of features. The Network Cluster column pertains to their artificial neural network trained on the mushroom problem and is described later in the text.
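
One way to see that a set of nine productions can stand in for the decision tree is to write each production as a condition paired with an action, and to check that exactly one condition matches any mushroom. The following sketch assumes a hypothetical dictionary encoding (keys `odor`, `spore`, `gill`, `surface`, `bruises`) and writes each condition to agree with the corresponding decision point of Table \(\PageIndex{1}\); it is an illustration, not the authors' implementation.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Mushroom = Dict[str, str]


def nwn(m: Mushroom) -> bool:
    """Shared prefix of P6 to P9: no odor, white spore print, narrow gills."""
    return m["odor"] == "none" and m["spore"] == "white" and m["gill"] == "narrow"


# (name, condition, action); the action simply asserts a label.
PRODUCTIONS: List[Tuple[str, Callable[[Mushroom], bool], str]] = [
    ("P1", lambda m: m["odor"] in ("anise", "almond"), "edible"),
    ("P2", lambda m: m["odor"] not in ("anise", "almond", "none"), "not edible"),
    ("P3", lambda m: m["odor"] == "none"
        and m["spore"] not in ("green", "purple", "white"), "edible"),
    ("P4", lambda m: m["odor"] == "none"
        and m["spore"] in ("green", "purple"), "not edible"),
    ("P5", lambda m: m["odor"] == "none" and m["spore"] == "white"
        and m["gill"] == "broad", "edible"),
    ("P6", lambda m: nwn(m) and m["surface"] == "fibrous", "edible"),
    ("P7", lambda m: nwn(m) and m["surface"] in ("silky", "scaly"), "not edible"),
    ("P8", lambda m: nwn(m) and m["surface"] == "smooth"
        and m["bruises"] == "no", "edible"),
    ("P9", lambda m: nwn(m) and m["surface"] == "smooth"
        and m["bruises"] == "yes", "not edible"),
]


def matching(m: Mushroom) -> List[Tuple[str, str]]:
    """Return (name, label) for every production whose condition holds."""
    return [(name, label) for name, cond, label in PRODUCTIONS if cond(m)]


# Feature values mentioned in Table 1 (only the five tested features).
VALUES = {
    "odor": ["almond", "anise", "creosote", "fishy", "foul", "musty",
             "pungent", "spicy", "none"],
    "spore": ["black", "brown", "buff", "chocolate", "orange", "yellow",
              "green", "purple", "white"],
    "gill": ["broad", "narrow"],
    "surface": ["fibrous", "silky", "scaly", "smooth"],
    "bruises": ["no", "yes"],
}

combos = [dict(zip(VALUES, vals)) for vals in product(*VALUES.values())]
exactly_one = all(len(matching(m)) == 1 for m in combos)
print(len(combos), exactly_one)
```

Because exactly one production matches each combination of feature values, no conflict resolution is needed: the order in which the productions are listed does not affect the classification.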

The other nine output units were used to provide extra output learning, which was the technique employed to insert a classical theory into the network. Normally, a pattern classification system is only provided with information about what correct pattern labels to assign. For instance, in the mushroom problem, the system would typically only be taught to generate the label edible or the label poisonous. However, more information about the pattern classification task is frequently available. In particular, it is often known why an input pattern belongs to one class or another. It is possible to incorporate this information into the pattern classification problem by teaching the system not only to assign a pattern to a class (e.g., “edible”, “poisonous”) but also to generate a reason for making this classification (e.g., “passed Rule 1”, “failed Rule 4”). Elaborating a classification task along such lines is called the injection of hints or extra output learning (Abu-Mostafa, 1990; Suddarth & Kergosien, 1990).

Dawson et al. (2000) hypothesized that extra output learning could be used to insert the decision tree from Table \(\PageIndex{1}\) into a network. Table \(\PageIndex{1}\) provides nine different terminal branches of the decision tree at which mushrooms are assigned to categories (“Rule 1 edible”, “Rule 1 poisonous”, “Rule 2 edible”, etc.). The network learned to “explain” why it classified an input pattern in a particular way by turning on one of the nine extra output units to indicate which terminal branch of the decision tree was involved. In other words, the network (which required 8,699 epochs of training on the 8,124 different input patterns!) classified mushrooms “for the same reasons” as would the decision tree. This is why Dawson et al. hoped that this classical theory would literally be translated into the network.
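
The training targets implied by extra output learning can be sketched as a ten-element vector: one unit for the edible/poisonous label and nine "hint" units, one per terminal branch. The ordering of the hint units and the use of 1.0 and 0.0 as "on" and "off" values are assumptions made here for illustration; the source does not specify them.

```python
from typing import List

# Assumed ordering of the nine terminal branches of the decision tree.
BRANCHES = [
    "Rule 1 Edible", "Rule 1 Poisonous",
    "Rule 2 Edible", "Rule 2 Poisonous",
    "Rule 3 Edible",
    "Rule 4 Edible", "Rule 4 Poisonous",
    "Rule 5 Edible", "Rule 5 Poisonous",
]


def target_vector(branch: str) -> List[float]:
    """Build the 10-unit target: [edible unit] + nine one-hot hint units."""
    if branch not in BRANCHES:
        raise ValueError(f"unknown terminal branch: {branch}")
    edible = 1.0 if branch.endswith("Edible") else 0.0
    hints = [1.0 if b == branch else 0.0 for b in BRANCHES]
    return [edible] + hints


print(target_vector("Rule 3 Edible"))
print(target_vector("Rule 4 Poisonous"))
```

Without the hints, the target would be the first element alone; the remaining nine elements are what force the network to distinguish mushrooms that share a label but reach it through different branches.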

Apart from the output unit behavior, how could one support the claim that a classical theory had been translated into a connectionist network? Dawson et al. (2000) interpreted the internal structure of the network in an attempt to see whether such a network analysis would reveal an internal representation of the classical algorithm. If this were the case, then standard training practices would have succeeded in translating the classical algorithm into a PDP network.

One method that Dawson et al. (2000) used to interpret the trained network was a multivariate analysis of the network’s hidden unit space. They represented each mushroom as the vector of five hidden unit activation values that it produced when presented to the network. They then performed a k-means clustering of this data. The k-means clustering is an iterative procedure that assigns data points to k different clusters in such a way that each member of a cluster is closer to the centroid of that cluster than to the centroid of any other cluster to which other data points have been assigned.
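
The iterative procedure can be sketched as follows. This is a generic version of k-means (Lloyd's iteration) written for illustration; the initial centroids are passed in by the caller, and nothing here is claimed to match the particular implementation or initialization that Dawson et al. used. The two-dimensional points in the example are made up and stand in for five-dimensional hidden unit vectors.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest(point: Vector, centroids: List[List[float]]) -> int:
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def k_means(points: List[Vector], initial: List[Vector],
            max_iter: int = 100) -> Tuple[List[int], List[List[float]]]:
    """Alternate assignment and centroid update until assignments stop changing."""
    centroids = [list(c) for c in initial]
    labels: List[int] = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest centroid.
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for i in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # an empty cluster keeps its old centroid
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


points = [(0, 0), (0, 1), (5, 5), (6, 5)]
labels, centroids = k_means(points, initial=[points[0], points[2]])
print(labels)
print(centroids)
```

At convergence every point is closer to its own cluster's centroid than to any other centroid, which is the property described above. The result can depend on the initial centroids, so the procedure finds a locally good partition rather than a guaranteed best one.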

However, whenever cluster analysis is performed, one question that must be answered is “How many clusters should be used?” In other words, what should the value of k be? An answer to this question is called a stopping rule. Unfortunately, no single stopping rule has been agreed upon (Aldenderfer & Blashfield, 1984; Everitt, 1980). As a result, there exist many different types of methods for determining k (Milligan & Cooper, 1985).

While no general method exists for determining the optimal number of clusters, one can take advantage of heuristic information concerning the domain being clustered in order to come up with a satisfactory stopping rule for this domain. Dawson et al. (2000) argued that when the hidden unit activities of a trained network are being clustered, there must be a correct mapping from these activities to output responses, because the trained network itself has discovered one such mapping. They used this position to create the following stopping rule: “Extract the smallest number of clusters such that every hidden unit activity vector assigned to the same cluster produces the same output response in the network.” They used this rule to determine that the k-means analysis of the network’s hidden unit activity patterns required the use of 12 different clusters.
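
The stopping rule can be sketched as a search over increasing values of k that halts at the first value for which every cluster is "pure", meaning that all of its members produce the same output response. The code below is an illustration under assumptions: the k-means routine is a plain version seeded with the first k points, the one-dimensional activity vectors are invented, and responses are represented as simple labels rather than full output vectors.

```python
import math
from typing import Dict, Hashable, List, Sequence

Vector = Sequence[float]


def k_means(points: List[Vector], k: int, max_iter: int = 100) -> List[int]:
    """Plain k-means seeded with the first k points (an assumed initialization)."""
    centroids = [list(p) for p in points[:k]]
    labels: List[int] = []
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda i: math.dist(p, centroids[i])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def is_pure(labels: List[int], responses: List[Hashable]) -> bool:
    """True if all vectors in the same cluster share one output response."""
    seen: Dict[int, Hashable] = {}
    for lab, resp in zip(labels, responses):
        if seen.setdefault(lab, resp) != resp:
            return False
    return True


def smallest_pure_k(points: List[Vector], responses: List[Hashable]) -> int:
    """Try k = 1, 2, ... and return the first k whose clusters are all pure."""
    for k in range(1, len(points) + 1):
        if is_pure(k_means(points, k), responses):
            return k
    raise ValueError("identical vectors produce different responses")


# Invented 1-D "hidden unit activities" and the responses they produce.
activities = [(0.0,), (0.1,), (1.0,), (1.1,), (5.0,), (5.1,)]
responses = ["edible", "edible", "poisonous", "poisonous", "edible", "edible"]
k = smallest_pure_k(activities, responses)
print(k)
```

In this toy example two of the resulting clusters share the same response, which illustrates why the rule can yield more clusters than there are distinct responses. Because k-means depends on its starting centroids, the value returned is the smallest k found under this particular initialization.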

Dawson et al. (2000) then proceeded to examine the mushroom patterns that belonged to each cluster in order to determine what they had in common. For each cluster, they determined the set of descriptive features that each mushroom shared. They realized that each set of shared features they identified could be thought of as a condition, represented internally by the network as a vector of hidden unit activities, which results in the network producing a particular action, namely the edible/poisonous judgment represented by the first output unit.

For example, mushrooms that were assigned to Cluster 2 had an odour that was either almond or anise, which is represented by the network’s five hidden units adopting a particular vector of activities. These activities serve as a condition that causes the network to assert that the mushroom is edible.

By interpreting a hidden unit vector in terms of condition features that are prerequisites to network responses, Dawson et al. (2000) discovered an amazing relationship between the clusters and the set of productions in Table \(\PageIndex{2}\). They determined that each distinct class of hidden unit activities (i.e., each cluster) corresponded to one, and only one, of the productions listed in the table. This mapping is provided in the last column of Table \(\PageIndex{2}\). In other words, when one describes the network as generating a response because its hidden units are in one state of activity, one can translate this into the claim that the network is executing a particular production. This shows that the extra output learning translated the classical algorithm into a network model.
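
The claim that each cluster corresponds to one, and only one, production can be checked mechanically from the last column of Table \(\PageIndex{2}\). The sketch below copies that column into a dictionary (the variable names are illustrative) and inverts it, so that looking up a cluster returns the production the network can be said to be executing.

```python
from typing import Dict, List

# Last column of Table 2: production -> network cluster(s).
PRODUCTION_CLUSTERS: Dict[str, List[int]] = {
    "P1": [2, 3], "P2": [1], "P3": [9], "P4": [6], "P5": [4],
    "P6": [7, 11], "P7": [5], "P8": [8, 12], "P9": [10],
}

# Invert the table; fail loudly if any cluster were claimed by two productions.
cluster_to_production: Dict[int, str] = {}
for production, clusters in PRODUCTION_CLUSTERS.items():
    for cluster in clusters:
        if cluster in cluster_to_production:
            raise ValueError(f"cluster {cluster} maps to two productions")
        cluster_to_production[cluster] = production

print(sorted(cluster_to_production))  # the clusters that appear in the table
print(cluster_to_production[2])       # production for Cluster 2
```

The mapping is many-to-one: twelve clusters cover nine productions, so some productions (P1, P6, and P8) are realized by two different classes of hidden unit activity.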

The translation of a network into a production system, or vice versa, is an example of new wave reductionism (Bickle, 1996; Endicott, 1998). In new wave reductionism, one does not reduce a secondary theory directly to a primary theory. Instead, one takes the primary theory and constructs from it a structure that is analogous to the secondary theory, but which is created in the vocabulary of the primary theory. Theory reduction involves constructing a mapping between the secondary theory and its image constructed from the primary theory. “The older theory, accordingly, is never deduced; it is just the target of a relevantly adequate mimicry” (Churchland, 1985, p. 10).

Dawson et al.’s (2000) interpretation is a new wave intertheoretic reduction because the production system of Table \(\PageIndex{2}\) represents the intermediate structure that is analogous to the decision tree of Table \(\PageIndex{1}\). “Adequate mimicry” was established by mapping different classes of hidden unit states to the execution of particular productions. In turn, there is a direct mapping from any of the productions back to the decision tree algorithm. Dawson et al. concluded that they had provided an exact translation of a classical algorithm into a network of value units.

The relationship between hidden unit activities and productions in Dawson et al.’s (2000) mushroom network is in essence an example of equivalence between symbolic and subsymbolic accounts. This implies that one cannot assume that classical models and connectionist networks are fundamentally different at the algorithmic level, because one type of model can be translated into the other. It is possible to have a classical model that is exactly equivalent to a PDP network.

This result provides very strong support for the position proposed by Vera and Simon (1993). The detailed analysis provided by Dawson et al. (2000) permitted them to make claims of the type “Network State \(x\) is equivalent to Production \(y\).” Of course, this one result cannot by itself validate Vera and Simon’s argument. For instance, can any classical theory be translated into a network? This is one type of algorithmic-level issue that requires a great deal of additional research. As well, the translation works both ways: perhaps artificial neural networks provide a biologically plausible lingua franca for classical architectures!