8.6: Vision, Cognition, and Visual Cognition


It was argued earlier that the classical approach to underdetermination, unconscious inference, suffered from the fact that it did not include any causal links between the world and internal representations. The natural computation approach does not suffer from this problem, because its theories treat vision as a data-driven or bottom-up process. That is, visual information from the world comes into contact with visual modules—special purpose machines—that automatically apply natural constraints and deliver uniquely determined representations. How complex are the representations that can be delivered by data-driven processing? To what extent could a pure bottom-up theory of perception succeed?

On the one hand, the bottom-up theories are capable of delivering a variety of rich representations of the visual world (Marr, 1982). These include the primal sketch, which represents the proximal stimulus as an array of visual primitives, such as oriented bars, edges, and terminators (Marr, 1976). Another is the 2½-D sketch, which makes explicit the properties of visible surfaces in viewer-centred coordinates, including their depth, colour, texture, and orientation (Marr & Nishihara, 1978). The information made explicit in the 2½-D sketch is available because data-driven processes can solve a number of problems of underdetermination, often called “shape from” problems, by using natural constraints to determine the three-dimensional shapes and distances of visible elements. These include structure from motion (Hildreth, 1983; Horn & Schunck, 1981; Ullman, 1979; Vidal & Hartley, 2008), shape from shading (Horn & Brooks, 1989), depth from binocular disparity (Marr, Palm, & Poggio, 1978; Marr & Poggio, 1979), and shape from texture (Lobay & Forsyth, 2006; Witkin, 1981).

It would not be a great exaggeration to say that early vision—part of visual processing that is prior to access to general knowledge—computes just about everything that might be called a ‘visual appearance’ of the world except the identities and names of the objects. (Pylyshyn, 2003b, p. 51)

On the other hand, despite impressive attempts (Biederman, 1987), it is generally acknowledged that the processes proposed by natural computationalists cannot deliver representations rich enough to make full contact with semantic knowledge of the world. This is because object recognition (assigning visual information to semantic categories) requires identifying object parts and determining spatial relationships amongst these parts (Hoffman & Singh, 1997; Singh & Hoffman, 1997). However, this in turn requires directing attention to specific entities in visual representations (i.e., individuating the critical parts) and using serial processes to determine spatial relations amongst the individuated entities (Pylyshyn, 1999, 2001, 2003c, 2007; Ullman, 1984). The data-driven, parallel computations that characterize natural computation theories of vision are poor candidates for computing relationships between individuated objects or their parts. As a result, what early vision “does not do is identify the things we are looking at, in the sense of relating them to things we have seen before, the contents of our memory. And it does not make judgments about how things really are” (Pylyshyn, 2003b, p. 51).

Thus it appears that a pure, bottom-up natural computation theory of vision will not suffice. Similarly, it was argued earlier that a pure, top-down cognitive theory of vision is also insufficient. A complete theory of vision requires co-operative interactions between both data-driven and top-down processes. As philosopher Jerry Fodor (1985, p. 2) has noted, “perception is smart like cognition in that it is typically inferential, it is nevertheless dumb like reflexes in that it is typically encapsulated.” This leads to what Pylyshyn calls the independence hypothesis: the proposal that some visual processing must be independent of cognition. However, because we are consciously aware of visual information, a corollary of the independence hypothesis is that there must be some interface between visual processing that is not cognitive and visual processing that is.

This interface is called visual cognition (Enns, 2004; Humphreys & Bruce, 1989; Jacob & Jeannerod, 2003; Ullman, 2000), because it involves visual attention (Wright, 1998). Theories in visual cognition about both object identification (Treisman, 1988; Ullman, 2000) and the interpretation of motion (Wright & Dawson, 1994) typically describe three stages of processing: the precognitive delivery of visual information, the attentional analysis of this visual information, and the linking of the results of these analyses to general knowledge of the world.

One example theory in visual cognition is called feature integration theory (Treisman, 1986, 1988; Treisman & Gelade, 1980). Feature integration theory arose from two basic experimental findings. The first concerned search latency functions, which represent the time required to detect the presence or absence of a target as a function of the total number of display elements in a visual search task. Pioneering work on visual search discovered the so-called “pop-out effect”: for some targets, the search latency function is essentially flat. This indicated that the time to find a target is independent of the number of distractor elements in the display. This result was found for targets defined by a unique visual feature (e.g., colour, contrast, orientation, movement), which seemed to pop out of a display, automatically drawing attention to the target (Treisman & Gelade, 1980). In contrast, the time to detect a target defined by a unique combination of features generally increases with the number of distractor items, producing search latency functions with positive slopes.
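The two kinds of search latency function described above can be summarized with a simple linear sketch. The linear form and the symbols \(a\), \(b\), and \(N\) are illustrative assumptions introduced here, not notation taken from the cited studies:

\[ RT(N) = a + bN \]

Here \(N\) is the number of display elements, \(a\) is a baseline latency, and \(b\) is the slope of the function. On this sketch, pop-out corresponds to \(b \approx 0\) (a flat function), while search for a target defined by a conjunction of features corresponds to \(b > 0\).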

The second experimental finding that led to feature integration theory was the discovery of illusory conjunctions (Treisman & Schmidt, 1982). Illusory conjunctions occur when features are mistakenly combined. For instance, subjects might be presented with a red triangle and a green circle in a visual display but experience an illusory conjunction: a green triangle and a red circle.

Feature integration theory arose to explain different kinds of search latency functions and illusory conjunctions. It assumes that vision begins with a first, noncognitive stage of feature detection in which separate maps for a small number of basic features, such as colour, orientation, size, or movement, record the presence and location of detected properties. If a target is uniquely defined in terms of possessing one of these features, then it will be the only source of activity in that feature map and will therefore pop out, explaining some of the visual search results.

A second stage of processing belongs properly to visual cognition. In this stage, a spotlight of attention is volitionally directed to a particular spot on a master map of locations. This attentional spotlight enables the visual system to integrate features by bringing into register different feature maps at the location of interest. Different features present at that location can be conjoined in a temporary object representation called an object file (Kahneman, Treisman, & Gibbs, 1992; Treisman, Kahneman, & Burkell, 1983). Thus in feature integration theory, searching for objects defined by unique combinations of features requires a serial scan of the attentional spotlight from location to location, explaining the nature of search latency functions for such objects. This stage of processing also explains illusory conjunctions, which usually occur when attentional processing is divided, impairing the ability to combine features correctly into object files.
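The first two stages can be illustrated with a toy sketch. It is not Treisman's model: the `Item` class, the function names, the use of integer locations, and the count of attention shifts as a stand-in for search latency are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """A display element (hypothetical representation)."""
    location: int
    colour: str
    shape: str


def build_feature_maps(display):
    """Stage 1: one map per feature value, recording where it was detected."""
    maps = {}
    for item in display:
        for value in (item.colour, item.shape):
            maps.setdefault(value, set()).add(item.location)
    return maps


def feature_search(display, feature):
    """Pop-out: a target that is the only activity in its feature map
    is found without moving the spotlight (zero attention shifts)."""
    locations = build_feature_maps(display).get(feature, set())
    if len(locations) == 1:
        return next(iter(locations)), 0
    return None, 0


def conjunction_search(display, colour, shape):
    """Stage 2: move the spotlight over the master map of locations,
    one location at a time, binding the features found there into an
    object file. Returns (location, number of attention shifts)."""
    maps = build_feature_maps(display)
    master_map = sorted(item.location for item in display)
    for shifts, location in enumerate(master_map, start=1):
        object_file = {value for value, locs in maps.items() if location in locs}
        if colour in object_file and shape in object_file:
            return location, shifts
    return None, len(master_map)


# Example displays (made up for illustration).
FEATURE_DISPLAY = [
    Item(0, "green", "circle"),
    Item(1, "green", "circle"),
    Item(2, "red", "circle"),
]
CONJUNCTION_DISPLAY = [
    Item(0, "green", "circle"),
    Item(1, "red", "triangle"),
    Item(2, "green", "circle"),
    Item(3, "red", "circle"),
]
```

In this sketch, `feature_search(FEATURE_DISPLAY, "red")` finds the target with no attention shifts regardless of how many green distractors are added, whereas `conjunction_search(CONJUNCTION_DISPLAY, "red", "circle")` must visit locations one by one, so its cost grows with display size. The scan order is fixed here for simplicity, which places the target last and gives the worst case.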

A third stage of processing belongs to higher-order cognition. It involves using information about detected objects (i.e., features united in object files) as links to general knowledge of the world.

Conscious perception depends on temporary object representations in which the different features are collected from the dimensional modules and inter-related, then matched to stored descriptions in a long-term visual memory to allow recognition. (Treisman, 1988, p. 204)

Another proposal that relies on the notion of visual cognition concerns visual routines (Ullman, 1984). Ullman (1984) noted that the perception of spatial relations is central to visual processing. However, many spatial relations cannot be directly delivered by the parallel, data-driven processes postulated by natural computationalists, because these relations are not defined over entire scenes, but are instead defined over particular entities in scenes (i.e., objects or their parts). Furthermore, many of these relations must be computed using serial processing of the sort that is not proposed to be part of the networks that propagate natural constraints.

For example, consider determining whether some point x is inside a contour y. Ullman (1984) pointed out that little is known about how the relation inside(x, y) is actually computed, and argued that it most likely requires serial processing in which activation begins at x and spreads outward. It can be concluded that x is inside y if the spreading activation is contained by y. Furthermore, before inside(x, y) can be computed, the two entities, x and y, have to be individuated and selected; inside makes no sense to compute without their specification. “What the visual system needs is a way to refer to individual elements qua token individuals” (Pylyshyn, 2003b, p. 207).
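The spreading-activation idea can be sketched as a flood fill over a small grid. This is only an illustration under stated assumptions, not Ullman's implementation: the grid encoding (`#` for contour cells, `.` for empty cells), the four-neighbour spreading rule, and the function name `inside` are all choices made here.

```python
from collections import deque


def inside(grid, start):
    """Return True if activation spreading from `start` is contained
    by contour cells ('#'), and False if it escapes the image.

    grid  : list of equal-length strings, '#' = contour, '.' = empty
    start : (row, col) of the individuated point x
    """
    rows, cols = len(grid), len(grid[0])
    r0, c0 = start
    if grid[r0][c0] == "#":
        return False  # assumption: a point on the contour is not inside it

    activated = {start}
    frontier = deque([start])
    while frontier:
        r, c = frontier.popleft()
        # Spread to the four neighbouring cells, one step at a time.
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if not (0 <= nr < rows and 0 <= nc < cols):
                return False  # activation left the image: not contained
            if grid[nr][nc] != "#" and (nr, nc) not in activated:
                activated.add((nr, nc))
                frontier.append((nr, nc))
    return True  # activation ran out of places to spread: contained


# Example contours (made up for illustration).
CLOSED = [
    ".......",
    ".#####.",
    ".#...#.",
    ".#...#.",
    ".#####.",
    ".......",
]
OPEN = [
    ".......",
    ".##.##.",
    ".#...#.",
    ".#####.",
    ".......",
]
```

With these grids, activation started at `(2, 3)` is contained by the closed contour but leaks through the gap in the open one. Note that the function takes an already selected starting point as input, which reflects the observation above that x and y must be individuated before the relation can be computed.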

With such considerations in mind, Ullman (1984) developed a theory of visual routines that shares many of the general features of feature integration theory. In an initial stage of processing, data-driven processes deliver early representations of the visual scene. In the second stage, visual cognition executes visual routines at specified locations in the representations delivered by the first stage of processing. Visual routines are built from a set of elemental operations and used to establish spatial relations and shape properties. Candidate elemental operations include indexing a salient item, spreading activation over a region, and tracing boundaries. A visual routine is thus a program, assembled out of elemental operations, which is activated when needed to compute a necessary spatial property. Visual routines are part of visual cognition because attention is used to select a necessary routine (and possibly create a new one), and to direct the routine to a specific location of interest. However, once the routine is activated, it can deliver its spatial judgment without requiring additional higher-order resources.

In the third stage, the spatial relations computed by visual cognition are linked, as in feature integration theory, to higher-order cognitive processes. Thus Ullman (1984) sees visual routines as providing an interface between the representations created by data-driven visual modules and the content-based, top-down processing of cognition. Such an interface permits data-driven and theory-driven processes to be combined, overcoming the limitations that such processes would face on their own.

Visual routines operate in the middle ground that, unlike the bottom-up creation of the base representations, is a part of the top-down processing and yet is independent of object-specific knowledge. Their study therefore has the advantage of going beyond the base representations while avoiding many of the additional complications associated with higher level components of the system. (Ullman, 1984, p. 119)

The example theories of visual cognition presented above are hybrid theories in the sense that they include both bottom-up and top-down processes, and they invoke attentional mechanisms as a link between the two. In the next section we see that Pylyshyn’s (2003b, 2007) theory of visual indexing is similar in spirit to these theories and thus exhibits their hybrid characteristics. However, Pylyshyn’s theory of visual cognition is hybrid in another important sense: it makes contact with classical, connectionist, and embodied cognitive science.

Pylyshyn’s theory of visual cognition is classical because one of the main problems that it attempts to solve is how to identify or re-identify individuated entities. Classical processing is invoked as a result, because “individuating and reidentifying in general require the heavy machinery of concepts and descriptions” (Pylyshyn, 2007, p. 32). Part of Pylyshyn’s theory of visual cognition is also connectionist, because he appeals to non-classical mechanisms to deliver visual representations (i.e., natural computation), as well as to connectionist networks (in particular, to winner-take-all mechanisms; see Feldman & Ballard, 1982) to track entities after they have been individuated with attentional tags (Pylyshyn, 2001, 2003c). Finally, parts of Pylyshyn’s theory of visual cognition draw on embodied cognitive science. For instance, tracking element identities (solving the correspondence problem) is critical because Pylyshyn assumes a particular embodiment of the visual apparatus, a limited-order retina that cannot take in all information in a glance. Similarly, Pylyshyn uses the notion of cognitive scaffolding to account for the spatial properties of mental images.