Pylyshyn’s theory of visual cognition began in the late 1970s with his interest in explaining how diagrams were used in reasoning (Pylyshyn, 2007). Pylyshyn and his colleagues attempted to investigate this issue by building a computer simulation that would build and inspect diagrams as part of deriving proofs in plane geometry.
From the beginning, the plans for this computer simulation made contact with two of the key characteristics of embodied cognitive science. First, the diagrams created and used by the computer simulation were intended to be external to it and to scaffold the program’s geometric reasoning.
Since we wanted the system to be as psychologically realistic as possible we did not want all aspects of the diagram to be ‘in its head’ but, as in real geometry problem-solving, remain on the diagram it was drawing and examining. (Pylyshyn, 2007, p. 10)
Second, the visual system of the computer was also assumed to be psychologically realistic in terms of its embodiment. In particular, the visual system was presumed to be a moving fovea that was of limited order: it could only examine the diagram in parts, rather than all at once.
We also did not want to assume that all properties of the entire diagram were available at once, but rather that they had to be noticed over time as the diagram was being drawn and examined. If the diagram were being inspected by moving the eyes, then the properties should be within the scope of the moving fovea. (Pylyshyn, 2007, p. 10)
These two intersections with embodied cognitive science—a scaffolding visual world and a limited order embodiment—immediately raised a fundamental information processing problem. As different lines or vertices were added to a diagram, or as these components were scanned by the visual system, their different identities had to be maintained or tracked over time. In order to function as intended, the program had to be able to assert, for example, that “this line observed here” is the same as “that line observed there” when the diagram is being scanned. In short, in considering how to create this particular system, Pylyshyn recognized that it required two core abilities: to be able to individuate visual entities, and to be able to track or maintain the identities of visual entities over time.
To maintain the identities of individuated elements over time is to solve the correspondence problem. How does one keep track of the identities of different entities perceived in different glances? According to Pylyshyn (2003b, 2007), the classical answer to this question must appeal to the contents of representations. To assert that some entity seen in a later glance was the same as one observed earlier, the descriptions of the current and earlier entities must be compared. If the descriptions matched, then the entities should be deemed to be the same. This is called the image matching solution to the correspondence problem, which also dictates how entities must be individuated: each must be uniquely described, when observed, as a set of properties that can be represented as a mental description and compared to other descriptions.
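As an illustration, the image matching solution can be sketched in a few lines of code (the function names and property sets below are invented for exposition, not drawn from Pylyshyn's work):

```python
def describe(entity):
    """Reduce an entity to a mental description: a set of observed properties."""
    return frozenset(entity.items())

def match_glances(earlier, later):
    """Pair entities across two glances whose descriptions match exactly."""
    pairs = []
    unmatched = list(later)
    for old in earlier:
        for new in unmatched:
            if describe(old) == describe(new):
                pairs.append((old, new))  # same description, so same entity
                unmatched.remove(new)
                break
    return pairs

glance1 = [{"shape": "line", "orientation": "vertical"}]
glance2 = [{"shape": "line", "orientation": "vertical"}]
matches = match_glances(glance1, glance2)  # the line is deemed the same line
```

The sketch also exposes the weakness Pylyshyn exploits: when several entities share an identical description, the matcher has no principled way to decide which earlier entity corresponds to which later one.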
Pylyshyn rejects the classical image matching solution to the correspondence problem for several reasons. First, multiple objects can be tracked as they move to different locations, even if they are identical in appearance (Pylyshyn & Storm, 1988). In fact, multiple objects can be tracked as their properties change, even when their location is constant and shared (Blaser, Pylyshyn, & Holcombe, 2000). These results pose problems for image matching, because it is difficult to individuate and track identical objects by using their descriptions!
Second, the poverty of the stimulus in a dynamic world poses severe challenges to image matching. As objects move in the world or as we (or our eyes) change position, a distal object’s projection as a proximal stimulus will change properties, even though the object remains the same. “If objects can change their properties, we don’t know under what description the object was last stored” (Pylyshyn, 2003b, p. 205).
A third reason to reject image matching comes from the study of apparent motion, which requires the correspondence problem to be solved before the illusion of movement between locations can be added (Dawson, 1991; Wright & Dawson, 1994). Studies of apparent motion have shown that motion correspondence is mostly insensitive to manipulations of figural properties, such as shape, colour, or spatial frequency (Baro & Levinson, 1988; Cavanagh, Arguin, & von Grunau, 1989; Dawson, 1989; Goodman, 1978; Kolers, 1972; Kolers & Green, 1984; Kolers & Pomerantz, 1971; Kolers & von Grunau, 1976; Krumhansl, 1984; Navon, 1976; Victor & Conte, 1990). This insensitivity to form led Nelson Goodman (1978, p. 78) to conclude that “plainly the visual system is persistent, inventive, and sometimes rather perverse in building a world according to its own lights.” One reason for this perverseness may be that the neural circuits for processing motion are largely independent of those for processing form (Botez, 1975; Livingstone & Hubel, 1988; Maunsell & Newsome, 1987; Ungerleider & Mishkin, 1982).
A fourth reason to reject image matching is that it is a purely cognitive approach to individuating and tracking entities. “Philosophers typically assume that in order to individuate something we must conceptualize its relevant properties. In other words, we must first represent (or cognize or conceptualize) the relevant conditions of individuation” (Pylyshyn, 2007, p. 31). Pylyshyn rejected this approach because it suffers from the same core problem as the New Look: it lacks causal links to the world.
Pylyshyn’s initial exploration of how diagrams aided reasoning led to his realization that the individuation and tracking of visual entities are central to an account of how vision links us to the world. For the reasons just presented, he rejected a purely classical approach—mental descriptions of entities—for providing these fundamental abilities. He proposed instead a theory that parallels the structure of the examples of visual cognition described earlier. That is, Pylyshyn’s (2003b, 2007) theory of visual cognition includes a non-cognitive component (early vision), which delivers representations that can be accessed by visual attention (visual cognition), which in turn delivers representations that can be linked to general knowledge of the world (cognition).
On the one hand, the early vision component of Pylyshyn’s (2003b, 2007) theory of visual cognition is compatible with natural computation accounts of perception (Ballard, 1997; Marr, 1982). For Pylyshyn, the role of early vision is to provide causal links between the world and the perceiving agent without invoking cognition or inference:
Only a highly constrained set of properties can be selected by early vision, or can be directly ‘picked up.’ Roughly, these are what I have elsewhere referred to as ‘transducable’ properties. These are the properties whose detection does not require accessing memory and drawing inferences. (Pylyshyn, 2003b, p. 163)
The use of natural constraints to deliver representations such as the primal sketch and the 2½-D sketch is consistent with Pylyshyn’s view.
On the other hand, Pylyshyn (2003b, 2007) added innovations to traditional natural computation theories that have enormous implications for explanations of seeing and visualizing. First, Pylyshyn argued that one of the primitive processes of early vision is individuation—the picking out of an entity as being distinct from others. Second, he used evidence from feature integration theory and cognitive neuroscience to claim that individuation picks out objects, but not on the basis of their locations. That is, preattentive processes can detect elements or entities via primitive features without simultaneously delivering the locations of those features, as is the case in pop-out. Third, Pylyshyn argued that an individuated entity—a visual object—is preattentively tagged by an index, called a FINST (for “finger instantiation”), which can only be used to access an individuated object (e.g., to retrieve its properties when needed). Furthermore, only a limited number (four) of FINSTs are available. Fourth, once assigned to an object, a FINST remains attached to it even as the object changes its location or other properties. Thus a primitive component of early vision is the solution of the correspondence problem, where the role of this solution is to maintain the link between FINSTs and dynamic, individuated objects.
The revolutionary aspect of FINSTs is that they are presumed to individuate and track visual objects without delivering a description of them and without fixing their location. Pylyshyn (2007) argued that this is the visual equivalent of the use of indexicals or demonstratives in language: “Think of demonstratives in natural language—typically words like this or that. Such words allow us to refer to things without specifying what they are or what properties they have” (p. 18). FINSTs are visual indices that operate in exactly this way. They are analogous to placing a finger on an object in the world, and, while not looking, keeping the finger in contact with it as the object moved or changed—thus the term finger instantiation. As long as the finger is in place, the object can be referenced (“this thing that I am pointing to now”), even though the finger does not deliver any visual properties.
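The indexical character of FINSTs can be conveyed with a toy sketch (a speculative illustration; the class and function names are invented for exposition). An index is a bare reference to an object that stores none of its properties, so the object remains accessible under the same index even after it moves or changes:

```python
MAX_FINSTS = 4  # only a small, fixed number of indices is available

class VisualObject:
    """A visual object whose location and features may change over time."""
    def __init__(self, x, y, colour):
        self.x, self.y, self.colour = x, y, colour

finsts = {}  # index number -> object reference; no descriptions are stored

def assign_finst(obj):
    """Bind a free index to an object; fails once all four are in use."""
    if len(finsts) >= MAX_FINSTS:
        return None
    idx = len(finsts) + 1
    finsts[idx] = obj  # a bare pointer, not a stored description
    return idx

target = VisualObject(0, 0, "red")
i = assign_finst(target)
target.x, target.colour = 10, "green"  # the object moves and changes
# finsts[i] still picks out "that thing", with no description to go stale
```

Because the index records nothing about the object, there is no stored description that could fail to match when the object's appearance changes.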
There is a growing literature that provides empirical support for Pylyshyn’s FINST hypothesis. Many of these experiments involve the multiple object tracking paradigm (Flombaum, Scholl, & Pylyshyn, 2008; Franconeri et al., 2008; Pylyshyn, 2006; Pylyshyn & Annan, 2006; Pylyshyn et al., 2008; Pylyshyn & Storm, 1988; Scholl, Pylyshyn, & Feldman, 2001; Sears & Pylyshyn, 2000). In the original version of this paradigm (Pylyshyn & Storm, 1988), subjects were shown a static display made up of a number of objects of identical appearance. A subset of these objects blinked for a short period of time, indicating that they were to-be-tracked targets. Then the blinking stopped, and all objects in the display began to move independently and randomly for a period of about ten seconds. Subjects had the task of tracking the targets, with attention only; a monitor ended trials in which eye movements were detected. At the end of a trial, one object blinked and subjects had to indicate whether or not it was a target.
The results of this study (see Pylyshyn & Storm, 1988) indicated that subjects could simultaneously track up to four independently moving targets with high accuracy. Multiple object tracking results are explained by arguing that FINSTs are allocated to the flashing targets prior to movement, and objects are tracked by the primitive mechanism that maintains the link from visual object to FINST. This link permits subjects to judge targethood at the end of a trial.
The multiple object tracking paradigm has been used to explore some of the basic properties of the FINST mechanism. Analyses indicate that this process is parallel, because up to four objects can be tracked, and tracking results cannot be explained by a model that shifts a spotlight of attention serially from target to target (Pylyshyn & Storm, 1988). However, the fact that no more than four targets can be tracked also shows that this processing has limited capacity. FINSTs are assigned to objects, and not locations; objects can be tracked through a location-less feature space (Blaser, Pylyshyn, & Holcombe, 2000). Using features to make the objects distinguishable from one another does not aid tracking, and object properties can actually change during tracking without subjects being aware of the changes (Bahrami, 2003; Pylyshyn, 2007). Thus FINSTs individuate and track visual objects but do not deliver descriptions of the properties of the objects that they index.
Another source of empirical support for the FINST hypothesis comes from studies of subitizing (Trick & Pylyshyn, 1993, 1994). Subitizing is a phenomenon in which the number of items in a set of objects (the cardinality of the set) can be effortlessly and rapidly detected if the set has four or fewer items (Jensen, Reese, & Reese, 1950; Kaufman et al., 1949). Larger sets cannot be subitized; their elements must instead be enumerated by a much slower serial counting process. Subitizing necessarily requires that the items to be counted are individuated from one another. Trick and Pylyshyn (1993, 1994) hypothesized that subitizing could be accomplished by the FINST mechanism; elements are preattentively individuated by being indexed, and counting simply requires accessing the number of indices that have been allocated.
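The computational claim behind this hypothesis is very simple, and can be sketched as follows (an illustrative simplification, not Trick and Pylyshyn's actual model):

```python
MAX_FINSTS = 4  # the number of available visual indices

def subitize(items):
    """Report cardinality by counting allocated indices, or signal failure.

    If every item can grab its own FINST, the count is simply the number
    of indices in use; otherwise slow serial counting is required, which
    this sketch signals by returning None.
    """
    if len(items) <= MAX_FINSTS:
        allocated = items[:MAX_FINSTS]  # each item is preattentively indexed
        return len(allocated)           # effortless: read off the index count
    return None                         # beyond capacity: must count serially

small_set = subitize(["a", "b", "c"])   # subitized: 3
large_set = subitize(list(range(7)))    # not subitizable: None
```

On this view the four-item limit of subitizing is not a fact about counting at all, but a direct reflection of the fixed number of indices.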
Trick and Pylyshyn (1993, 1994) tested this hypothesis by examining subitizing in conditions in which visual indexing was not possible. For instance, if the objects in a set are defined by conjunctions of features, then they cannot be preattentively FINSTed. Importantly, they also cannot be subitized. In general, subitizing does not occur when the elements of a set that are being counted are defined by properties that require serial, attentive processing in order to be detected (e.g., sets of concentric contours that have to be traced in order to be individuated; or sets of elements defined by being on the same contour, which also require tracing to be identified).
At the core of Pylyshyn’s (2003b, 2007) theory of visual cognition is the claim that visual objects can be preattentively individuated and indexed. Empirical support for this account of early vision comes from studies of multiple object tracking and of subitizing. The need for such early visual processing comes from the goal of providing causal links between the world and classical representations, and from embodying vision in such a way that information can only be gleaned a glimpse at a time. Thus Pylyshyn’s theory of visual cognition, as described to this point, has characteristics of both classical and embodied cognitive science. How does the theory make contact with connectionist cognitive science? The answer to this question comes from examining Pylyshyn’s (2003b, 2007) proposals concerning preattentive mechanisms for individuating visual objects and tracking them. The mechanisms that Pylyshyn proposed are artificial neural networks.
For instance, Pylyshyn (2000, 2003b) noted that a particular type of artificial neural network, called a winner-take-all network (Feldman & Ballard, 1982), is ideally suited for preattentive individuation. Many versions of such a network have been proposed to explain how attention can be automatically drawn to an object or to a distinctive feature (Fukushima, 1986; Gerrissen, 1991; Grossberg, 1980; Koch & Ullman, 1985; LaBerge, Carter, & Brown, 1992; Sandon, 1992). In a winner-take-all network, an array of processing units is assigned to different objects or to feature locations. For instance, these processors could be distributed across the preattentive feature maps in feature integration theory (Treisman, 1988; Treisman & Gelade, 1980). Typically, a processor will have an excitatory connection to itself and will have inhibitory connections to its neighbouring processors. This pattern of connectivity results in the processor that receives the most distinctive input becoming activated and at the same time turning off its neighbours.
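The dynamics of such a network can be conveyed with a small simulation (the weights and iteration count below are illustrative assumptions, not parameters from any published model):

```python
def winner_take_all(inputs, self_excite=1.1, inhibit=0.2, steps=30):
    """Iterate self-excitation plus lateral inhibition until one unit wins."""
    acts = list(inputs)
    for _ in range(steps):
        total = sum(acts)
        new_acts = []
        for a in acts:
            others = total - a  # summed activity of all competing units
            # each unit excites itself and is inhibited by its competitors
            new_acts.append(max(0.0, self_excite * a - inhibit * others))
        acts = new_acts
    return acts

# three units receiving nearly identical inputs; the middle one is strongest
acts = winner_take_all([0.50, 0.52, 0.48])
```

After a handful of iterations only the unit receiving the strongest input remains active, while its competitors are driven to zero—just the behaviour needed to individuate the display's most distinctive element.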
That such mechanisms might be involved in individuation is supported by results that show that the time course of visual search can be altered by visual manipulations that affect the inhibitory processing of such networks (Dawson & Thibodeau, 1998). Pylyshyn endorses a modified winner-take-all network as a mechanism for individuation; the modification permits an object indexed by the network to be interrogated in order to retrieve its properties (Pylyshyn, 2000).
Another intersection between Pylyshyn’s (2003b, 2007) theory of visual cognition and connectionist cognitive science comes from his proposals about preattentive tracking. How can such tracking be accomplished without the use of image matching? Again, Pylyshyn noted that artificial neural networks, such as those that have been proposed for solving the motion correspondence problem (Dawson, 1991; Dawson, Nevin-Meadows, & Wright, 1994; Dawson & Pylyshyn, 1988; Dawson & Wright, 1994), would serve as tracking mechanisms. This is because such models belong to the natural computation approach and have shown how tracking can proceed preattentively via the exploitation of natural constraints that are implemented as patterns of connectivity amongst processing units.
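The flavour of a constraint-based solution can be suggested with a deliberately simplified sketch (a greedy serial approximation for exposition only; the actual models implement constraints such as the nearest-neighbour principle as patterns of connectivity and relax them in parallel):

```python
import math

def nearest_neighbour_matches(frame1, frame2):
    """Pair each element of frame1 with the closest unclaimed element of frame2.

    The pairing uses only positions, never descriptions, so elements of
    identical appearance pose no difficulty for tracking.
    """
    matches, free = [], list(frame2)
    for p in frame1:
        q = min(free, key=lambda r: math.dist(p, r))  # nearest-neighbour constraint
        matches.append((p, q))
        free.remove(q)  # each element claims at most one correspondent
    return matches

# two identical dots take a short step to the right between frames
m = nearest_neighbour_matches([(0, 0), (10, 0)], [(1, 0), (11, 0)])
```

Because the constraint is defined over positions rather than stored descriptions, this style of mechanism tracks without image matching, as the FINST hypothesis requires.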
Furthermore, Dawson (1991) has argued that many of the regularities that govern solutions to the motion correspondence problem are consistent with the hypothesis that solving this problem is equivalent to tracking assigned visual tags. For example, consider some observations concerning the location of motion correspondence processing and attentional tracking processes in the brain. Dawson argued that motion correspondence processing is most likely performed by neurons located in Area 7 of the parietal cortex, on the basis of motion signals transmitted from earlier areas, such as the motion-sensitive area MT. Area 7 of the parietal cortex is also a good candidate for the locus of tracking of individuated entities.
First, many researchers have observed cells that appear to mediate object tracking in Area 7, such as visual fixation neurons and visual tracking neurons. Such cells are not evident earlier in the visual pathway (Goldberg & Bruce, 1985; Hyvarinen & Poranen, 1974; Lynch et al., 1977; Motter & Mountcastle, 1981; Robinson, Goldberg, & Stanton, 1978; Sakata et al., 1985).
Second, cells in this area are also governed by extraretinal (i.e., attentional) influences—they respond to attended targets, but not to unattended targets, even when both are equally visible (Robinson, Goldberg, & Stanton, 1978). This is required of mechanisms that can pick out and track targets from identically shaped distractors, as in a multiple object tracking task.
Third, Area 7 cells that appear to be involved in tracking seem able to do so across sensory modalities. For instance, hand projection neurons respond to targets to which hand movements are to be directed and do not respond when either the reach or the target is present alone (Robinson, Goldberg, & Stanton, 1978). Similarly, there exist many Area 7 cells that respond during manual reaching, tracking, or manipulation, and which also have a preferred direction of reaching (Hyvarinen & Poranen, 1974). Such cross-modal coordination of tracking is critical because, as we see in the next section, Pylyshyn’s (2003b, 2007) theory of visual cognition assumes that indices can be applied, and tracked, in different sensory modalities, permitting seeing agents to point at objects that have been visually individuated.
The key innovation and contribution of Pylyshyn’s (2003b, 2007) theory of visual cognition is the proposal of preattentive individuation and tracking. This proposal can be seamlessly interfaced with related proposals concerning visual cognition. For instance, once objects have been tagged by FINSTs, they can be operated on by visual routines (Ullman, 1984, 2000). Pylyshyn (2003b) pointed out that in order to execute, visual routines require such individuation:
The visual system must have some mechanism for picking out and referring to particular elements in a display in order to decide whether two or more such elements form a pattern, such as being collinear, or being inside, on, or part of another element, and so on. (Pylyshyn, 2003b, pp. 206–207)
In other words, visual cognition can direct attentional resources to FINSTed entities.
Pylyshyn’s (2003b, 2007) theory of visual cognition also makes contact with classical cognition. He noted that once objects have been tagged, the visual system can examine their spatial properties by applying visual routines or using focal attention to retrieve visual features. The point of such activities by visual cognition would be to update descriptions of objects stored as object files (Kahneman, Treisman, & Gibbs, 1992). The object file descriptions can then be used to make contact with the semantic categories of classical cognition. Thus the theory of visual indexing provides a causal grounding of visual concepts:
Indexes may serve as the basis for real individuation of physical objects. While it is clear that you cannot individuate objects in the full-blooded sense without a conceptual apparatus, it is also clear that you cannot individuate them with only a conceptual apparatus. Sooner or later concepts must be grounded in a primitive causal connection between thoughts and things. (Pylyshyn, 2001, p. 154)
It is the need for such grounding that has led Pylyshyn to propose a theory of visual cognition that includes characteristics of classical, connectionist, and embodied cognitive science.