8.6: How do we deal with the difficulties of computational models?


    Core Recognition

    Humans and animals can recognize objects in their visual environments with remarkable speed and accuracy, and this ability is supported by empirical evidence. For example, humans can recognize a briefly presented image in as little as 350 ms (Rousselet, Fabre-Thorpe, & Thorpe, 2002; Thorpe, Fize, & Marlot, 1996), and monkeys can do so in 250 ms (Fabre-Thorpe, Richard, & Thorpe, 1998). Event-related potential (ERP) experiments have found that the complex visual processing underlying object recognition is achieved within 150 ms (Thorpe et al., 1996). This capacity of primates to perceive and classify visually presented objects quickly and accurately is referred to as "core recognition" (DiCarlo, & Cox, 2007).

    Invariance problem

    Object identity is preserved across image transformations that produce tremendous variation (DiCarlo, Zoccolan, & Rust, 2012), which is why one can recognize objects from their two-dimensional projections on the retina (DiCarlo, & Cox, 2007). Perceptual constancy is the key mechanism that keeps object recognition untroubled by changes in lighting, size, and background (Grill-Spector et al., 2001). That is, variability in both the world and the recognizer yields an enormous number of images of each object, all of which must be assigned to the same category (e.g., "dog") (DiCarlo et al., 2012; Grill-Spector et al., 2001). Human perception and classification of objects are thus not impeded by enormous variability in position, scale, pose, illumination, and clutter (DiCarlo et al., 2012; Grill-Spector et al., 2001).

    In addition, one can recognize each part of an object (e.g., the eyes and legs of a dog) as well as the object as a whole (e.g., a dog), and one can assign new exemplars to an existing category such as "cats," "apartments," or "bicycles"; this is called intraclass variability (Grill-Spector et al., 2001). Although each object appears in visually infinite variants, the visual system treats all of these different patterns of an object as equivalent, without confusing them with images of any other possible object. This human ability to recognize objects regardless of their particular visual appearance is referred to as "cue-invariance" (Grill-Spector et al., 2001).

    Although this invariance problem can be an impediment for computational models attempting to fully reproduce human recognition, especially when items appear in infinite variants, both behavioral (Thorpe, Fize, & Marlot, 1996) and physiological (Hung, Kreiman, Poggio, & DiCarlo, 2005) findings suggest that the ventral visual stream offers a clue to how the invariance problem is solved so rapidly (DiCarlo, & Cox, 2007; DiCarlo et al., 2012; Grill-Spector, & Malach, 2001; Grill-Spector et al., 2001). For example, Grill-Spector et al. (1998) reported that when subjects were presented with objects that varied in their visual cues, object-selective brain regions, particularly the lateral occipital complex (LOC), were actively stimulated, implying that cue-invariance is a property of our visual recognition system. Kourtzi and Kanwisher (2000) investigated activity levels in the LOC and found that responses to whole grayscale objects were stronger than responses to line drawings. They also showed that LOC activity was similar for pairs of identical images, and that similar responses occurred even when subjects passively viewed pairs of stimuli depicting the same kind of object in different forms (e.g., two different golden retrievers) (Kourtzi and Kanwisher, 2000). Moreover, Grill-Spector and Malach (2001) studied the invariant properties of the LOC using functional magnetic resonance adaptation (fMR-A) and reported that this area supports object recognition regardless of variability in an object's size and position. Together, these results demonstrate cue-invariance in the visual object system and emphasize the crucial role of the LOC in object recognition.

    In the same vein, fMRI studies revealed that IT sub-populations also show cue-invariance (Majaj et al., 2012), and this property is not limited to the recognition of non-face objects (Freiwald, & Tsao, 2010): similar recognition processing has been observed when individuals recognize other people's faces (Freiwald, & Tsao, 2010). Strikingly, IT neuronal populations support theories of human pattern recognition far more strongly than neuronal populations in earlier visual areas do (Freiwald, & Tsao, 2010; Hung et al., 2005; Rust, & DiCarlo, 2010). IT neuronal populations respond readily even to intricate visual forms (Brincat, & Connor, 2004; Desimone et al., 1984; Perrett et al., 1982; Rust, & DiCarlo, 2010; Tanaka, 1996), and they continue to respond across relatively minor changes to an object, such as variation in position and size (Brincat, & Connor, 2004; Ito et al., 1995; Li et al., 2009; Rust, & DiCarlo, 2010; Tovee et al., 1994), pose (Logothetis et al., 1994), lighting (Vogels, & Biederman, 2002), and clutter (Li et al., 2009; Missal et al., 1999; Zoccolan et al., 2005). Thus, the visual object representations constructed in IT neuronal populations solve the invariance problem (DiCarlo et al., 2012).

    How IT neuronal populations explain object recognition

    The transmission of visual object information in the brain is hierarchical: visual information arrives first at the retina, is transmitted to the lateral geniculate nucleus (LGN) of the thalamus, and then proceeds through the occipital lobe, from area V1 to V2 to V4 to IT (Felleman, & Van Essen, 1991). Traditionally, biologically inspired computational models have tried to reproduce this processing of 2D images, reflecting the idea that 2D visual information is transmitted from early visual areas (V2 and V4) to the final stage (IT) of the ventral visual stream (Anzai, Peng, & Van Essen, 2007; Gallant, Braun, & Van Essen, 1993; Pasupathy, & Connor, 2001). Although neurons at each stage are tuned for component-level shape, IT can acquire holistic shape tuning through learning (Baker, Behrmann, & Olson, 2002). A large body of literature has worked out the mechanisms of early visual areas such as V1 (Lennie, & Movshon, 2005), but we do not yet clearly understand the mechanisms of the final stage, IT (Hung et al., 2005; Rust, & DiCarlo, 2010). However, several relatively recent studies have revealed that the activity of IT populations is clear and stable enough to support recognition of objects, as well as faces, across variability ranging from position to background (Hung et al., 2005; Rust, & DiCarlo, 2010). Moreover, neuronal analyses showed that IT populations were activated when subjects performed face recognition tasks, confirming cue-invariance in the visual system (Freiwald, & Tsao, 2010). The mechanisms of IT neuronal activity can thus explain human invariant object recognition behavior (Majaj et al., 2012). These studies also demonstrate that explanations of object recognition based on analysis of IT populations account for our visual functions more clearly than those motivated by the early visual stream (Freiwald, & Tsao, 2010; Hung et al., 2005; Rust, & DiCarlo, 2010).

    DiCarlo et al. (2012) summarized the neurophysiological evidence on IT neuronal populations as follows. First, IT populations decode and transfer visual representations within 50 ms. Furthermore, after an image is presented, the decoded visual information becomes accessible in under 100 ms. Additionally, IT populations decode visual representations into neuronal formats while preserving the cue-invariance of the objects (DiCarlo et al., 2012). Finally, these simple weighted-summation codes are observed when subjects are presented with objects without any prior training on the image set (Hung et al., 2005). Taken together, the decoding performed by IT neuronal populations at the final stage of the ventral visual stream can not only be applied to computational pattern-recognition models (Pinto et al., 2010), but can also account for human object recognition behavior (Hung et al., 2005; Rust, & DiCarlo, 2010).
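    The "simple weighted-summation code" mentioned above can be sketched in code. The following is a hypothetical illustration, not the analysis from any of the cited studies: it simulates a small population of category-tuned "IT neurons" and learns one weight per neuron with a perceptron, so that object category is read out as a weighted sum of firing rates. All names, tuning values, and noise levels are invented for the example.

```python
import random

random.seed(0)

N_NEURONS = 50  # invented population size

def population_response(category):
    """Simulate noisy firing rates: neurons 0-24 prefer category 0,
    neurons 25-49 prefer category 1 (toy tuning, not real data)."""
    rates = []
    for i in range(N_NEURONS):
        prefers = 0 if i < N_NEURONS // 2 else 1
        base = 1.0 if prefers == category else 0.2
        rates.append(base + random.gauss(0, 0.3))
    return rates

def train_perceptron(trials, epochs=20, lr=0.1):
    """Learn one weight per neuron so a weighted sum separates categories."""
    w = [0.0] * N_NEURONS
    b = 0.0
    for _ in range(epochs):
        for rates, label in trials:
            pred = 1 if sum(wi * r for wi, r in zip(w, rates)) + b > 0 else 0
            err = label - pred
            w = [wi + lr * err * r for wi, r in zip(w, rates)]
            b += lr * err
    return w, b

train_set = [(population_response(c), c) for c in [0, 1] * 100]
test_set = [(population_response(c), c) for c in [0, 1] * 50]
w, b = train_perceptron(train_set)

correct = sum(
    (1 if sum(wi * r for wi, r in zip(w, rates)) + b > 0 else 0) == label
    for rates, label in test_set
)
accuracy = correct / len(test_set)
print("decoder accuracy:", round(accuracy, 2))
```

Because the simulated population carries a strong category signal, the linear readout classifies held-out "trials" nearly perfectly, which is the spirit of the finding that a simple weighted sum over IT responses suffices for decoding.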

    Shape similarity vs. semantic category information in IT neuronal populations

    Efforts to understand human object recognition behavior have led to the development of several computational models. In particular, artificial systems based on IT neuronal population representations offer better explanations of object recognition performance. However, it is still debated how visual information is represented and organized in the visual areas. There are two main hypotheses: shape similarity vs. semantic category (Baldassi et al., 2013; Huth et al., 2012; Khaligh-Razavi, & Kriegeskorte, 2014). On the shape similarity view, IT is a visual representation: objects are segmented and clustered into visually similar groups, so the visual features of objects are the crucial organizing factor. On the semantic category view, semantic information is the significant criterion, and IT is better thought of as a visuo-semantic representation (Khaligh-Razavi, & Kriegeskorte, 2014).

    Several studies claim that simple and intricate visual components alike are coded by IT neurons that serve as visual representations organized by shape similarity (Brincat, & Connor, 2004; Kayaert, Biederman, & Vogels, 2003; Kayaert, Biederman, Beeck, & Vogels, 2005; Yamane et al., 2008; Zoccolan et al., 2007; Zoccolan et al., 2005). For example, Yamane et al. (2008) report that IT population representations provide evidence for the important role of shape similarity in constructing visual neural signals. An fMRI study of monkey IT showed that the representations of animate objects overlapped with those of inanimate objects (Freedman et al., 2006). Importantly, in IT neuronal populations, unlike in the prefrontal cortex, activity levels for visually similar objects did not differ significantly from those for objects clustered by semantic category (e.g., between cat-like and dog-like stimuli) (Freedman et al., 2006). Overall, these studies support the shape similarity account of visual representation, on which IT neurons play a significant role in reproducing object identity (Kourtzi, & Connor, 2011).

    Although biological evidence indicates that IT neurons can be considered a visual representation, several recent studies argue that visual objects are coded in IT populations according to semantic category information (Huth et al., 2012). For example, visual representations cluster into semantic categories such as animals, inanimate objects, and faces more strongly than into clusters based on visual feature similarity (Huth et al., 2012). Although no single general semantic area for visual representation has been observed in the brain, individual regions provide evidence that visual representations are organized by semantically related categories (Connolly et al., 2012; Just et al., 2010; Konkle and Oliva, 2012; Kriegeskorte et al., 2008; Naselaris et al., 2009; O'Toole et al., 2005). In studies of monkey IT, groups of visual representations were divided by semantic category, and similar patterns appeared in the human brain; in particular, the groups for inanimate objects were clearly segregated from those for animate objects (Connolly et al., 2012; Just et al., 2010; Konkle and Oliva, 2012; Kriegeskorte et al., 2008; Naselaris et al., 2009; O'Toole et al., 2005). Moreover, fMRI studies found that semantics-based metrics can explain the representational patterns of monkey IT populations (Bell et al., 2009), consistent with several studies showing that semantically segregated visual representations can meaningfully support human object recognition (Downing, Jiang, Shuman, & Kanwisher, 2001; Kanwisher, 2010; Kanwisher, McDermott, & Chun, 1997; Mahon et al., 2007; Mahon, & Caramazza, 2009; Naselaris et al., 2009).

    The semantic category theory could, in principle, be tested against the shape similarity theory by studies that construct object-defining visual features and contrast the explanatory power of the two accounts. However, a semantic category-based model cannot explain the functions of IT populations without a shape similarity-based model: visual representations in IT are segmented by visual similarity as well as by semantic category (Kriegeskorte et al., 2008; Connolly et al., 2012; Huth et al., 2012; Carlson et al., 2013). Reproducing representational metrics similar to those of IT requires semantic information even for explicit images with unintended property variation (Cadieu et al., 2014; Yamins et al., 2014). In fact, an efficient way to account for visual representations in IT is to note that visual similarity of appearance does not preclude semantic features of objects: the two are correlated with each other (Khaligh-Razavi, & Kriegeskorte, 2014).

    Computational models accounting for the IT representation

    Computational frameworks have been developed that partly resemble the similarity patterns observed in primate IT cortex (Khaligh-Razavi, 2014). However, artificial models cannot yet perform object recognition as sensibly as individuals do. The development of computational pattern-recognition models has made it possible to test realistic theories and has provided an effective measure for accounting for primate visual object recognition (Pinto et al., 2008). This raises the question of whether current computational recognition models can fully support explanations of IT neuronal populations and of recognition behavior. Here, the current paper compares several computational approaches motivated by biological processes with other artificial approaches and discusses whether those models can capture the visual object representations in primate IT.

    Khaligh-Razavi and Kriegeskorte (2014) divided the models into two main classes, each with subordinate categories. (1) Models not supervised with category labels: (a) biologically inspired object-vision models (e.g., HMAX, VisNet, the Stable model, sparse localized features (SLF), the biological transform (BT), and convolutional networks) (Ghodrati et al., 2012; Ghodrati et al., 2014; Hinton, 2012; Jarrett et al., 2009; LeCun, & Bengio, 1995; Riesenhuber, & Poggio, 1999; Serre, Oliva, & Poggio, 2007; Sountsov, Santucci, & Lisman, 2011; Wallis, & Rolls, 1997); and (b) computer-vision models (e.g., GIST, SIFT, PHOG, PHOW, self-similarity features, and geometric blur) (Bosch, Zisserman, & Munoz, 2007; Deselaers, & Ferrari, 2010; Lazebnik, Schmid, & Ponce, 2006; Lowe, 1999; Ojala, Pietikainen, & Maenpaa, 2001; Oliva, & Torralba, 2001). (2) Models supervised with category labels: (a) biologically inspired object-vision models such as GMAX and supervised HMAX (Ghodrati et al., 2012), which can discriminate animate from inanimate objects owing to training on a set of 884 images; and (b) the deep supervised convolutional neural network (DNN) (Krizhevsky et al., 2012), which performs object recognition by learning from a large set of semantically categorized images from ImageNet (Deng et al., 2009). Whereas computer-vision models compute various local image descriptors, biologically inspired models are hierarchical, consisting of a sequence of transforms that build an invariant representation of the input image in a neurally plausible way (Khaligh-Razavi, & Kriegeskorte, 2014).

    Khaligh-Razavi and Kriegeskorte (2014) investigated whether 37 computational approaches could account for the visual representations of human IT. A set of objects that did not physically overlap with one another was used, and the artificial systems' features were reweighted and remixed (Khaligh-Razavi, & Kriegeskorte, 2014). They found that the HMAX model and several computer-vision models predicted early visual cortex responses well. Furthermore, most of the models could discriminate the representational patterns of IT from those of other visual areas (Khaligh-Razavi, & Kriegeskorte, 2014). Several models produced a categorical division between animal and human faces, consistent with findings from human and monkey IT populations that human faces cluster separately from animal faces in IT (Khaligh-Razavi, & Kriegeskorte, 2014).

    While several supervised models performed the recognition tasks well, none of the unsupervised models succeeded in distinguishing human from non-human faces, and the unsupervised models also failed to emulate the animate/inanimate division of IT populations (Khaligh-Razavi, & Kriegeskorte, 2014). Computational models are still limited in reproducing the semantic categorizations commonly found in human IT. Nevertheless, these results strongly suggest that computational pattern-recognition models trained with categorically labeled image sets can efficiently account for the visual object representations of IT (Khaligh-Razavi, & Kriegeskorte, 2014).
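    The comparison logic behind such studies, representational similarity analysis, can be sketched as follows. This is a simplified illustration on invented toy data, not the published analysis: each system's responses to the same stimuli are turned into a representational dissimilarity matrix (RDM, one minus the correlation between response patterns), and the model RDM is then compared with the brain RDM by rank correlation.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rdm(features):
    """Dissimilarity (1 - correlation) between responses to each stimulus pair."""
    n = len(features)
    return [[1 - pearson(features[i], features[j]) for j in range(n)]
            for i in range(n)]

def upper(m):
    """Flatten the upper triangle (each stimulus pair once)."""
    return [m[i][j] for i in range(len(m)) for j in range(i + 1, len(m))]

def ranks(v):
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    for rank, i in enumerate(order):
        r[i] = float(rank)
    return r

def spearman(a, b):
    """Rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

# Invented responses (rows: 4 stimuli; columns: model units / "voxels").
# Stimuli 1-2 and 3-4 are made similar in both systems.
model_features = [[0.9, 0.1, 0.2], [0.7, 0.3, 0.1],
                  [0.1, 0.8, 0.9], [0.3, 0.6, 0.8]]
it_features = [[1.0, 0.2, 0.1], [0.8, 0.1, 0.3],
               [0.2, 0.9, 1.0], [0.1, 0.7, 0.9]]

score = spearman(upper(rdm(model_features)), upper(rdm(it_features)))
print("model-IT RDM correlation:", round(score, 2))  # positive: shared structure
```

The score is positive here because both "systems" group the same stimulus pairs together; in the real analyses, a higher RDM correlation means the model's representational geometry better matches IT's.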

    Deep Neural Networks

    As deep neural networks have developed alongside progress in effective learning algorithms (Hinton, Osindero, & Teh, 2006; Krizhevsky, Sutskever, & Hinton, 2012; LeCun, & Bengio, 1995), learning from enormous datasets has enabled these models to classify and recognize visually presented objects. More recent studies report that, compared with other computational models, the new deep models better predict IT-like population responses, and better reproduce the categorical divisions observed in the visual areas of both humans and monkeys (Cadieu et al., 2014; Khaligh-Razavi, & Kriegeskorte, 2014).

    The new deep neural network approaches share several components inspired by the primate visual system (Khaligh-Razavi, & Kriegeskorte, 2014). First, a feedforward hierarchical structure is common to all of them: each stage transforms the visual information passed on by the preceding stage. Next, each stage applies linear filters to the output of the previous stage followed by a nonlinearity, so the overall computation cannot be reduced to a single linear transformation (Khaligh-Razavi, & Kriegeskorte, 2014). In addition, the linear filtering at each stage is applied convolutionally, with the same filters shared across spatial positions; this keeps the number of parameters efficient while allowing the computed representations to carry visual information with cue-invariance (LeCun, & Bengio, 1995). Furthermore, from stage to stage, visual representations come to occupy a space in which they are clustered by shape similarity or by semantic category information. Moreover, these models include four or more layers of representation (Bengio, 2009); deep networks can capture accurate visual information with fewer units than shallow networks would need for comparably complex patterns. Finally, the new deep neural networks can be trained by supervision with a large number of categorically labeled images (e.g., more than a million) (Krizhevsky, Sutskever, & Hinton, 2012). Thus, it may be that the more similarly to IT a computational model performs, the better it processes object recognition.
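    The three shared ingredients named above, linear convolutional filtering, a nonlinearity, and pooling, can be shown in a minimal sketch. This is a toy one-dimensional illustration (an invented "edge" filter on an invented "bar" signal), not a fragment of any real network; pooling makes the stage's output tolerant to a small shift of the input.

```python
def conv1d(signal, kernel):
    """Linear filtering with one shared kernel slid across all positions."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def relu(xs):
    """Pointwise nonlinearity: without it, stacked stages collapse
    into a single linear transformation."""
    return [max(0.0, x) for x in xs]

def max_pool(xs, size=3):
    """Local max over a sliding window: tolerates small shifts."""
    return [max(xs[i:i + size]) for i in range(len(xs) - size + 1)]

def stage(signal, kernel):
    """One feedforward stage: convolve, rectify, pool."""
    return max_pool(relu(conv1d(signal, kernel)))

edge = [1.0, -1.0]             # toy edge-detecting filter (shared weights)
a = [0, 0, 1, 1, 0, 0, 0, 0]   # a "bar" ...
b = [0, 0, 0, 1, 1, 0, 0, 0]   # ... the same bar shifted one position

out_a = stage(a, edge)
out_b = stage(b, edge)
print(out_a)
print(out_b)  # overlaps heavily with out_a despite the shift
```

Stacking several such stages, with learned filters and many channels per stage, is the feedforward convolutional hierarchy the paragraph describes.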

    The advantages of hierarchical features in computational models

    Unlike computer-vision models, biologically inspired object-vision models have a hierarchical structure in which the complexity of visual feature representations increases from lower to higher stages (Poggio, & Ullman, 2013). Such hierarchical visual models are powerful tools for replicating the visual object recognition performed by IT populations.

    Hierarchical structures for visual representation offer several possible advantages. First, hierarchical computational approaches can exploit the response properties of IT neurons, which achieve efficient and robust recognition even though objects produce infinitely many appearances under varying illumination, position, and viewpoint (Logothetis et al., 1994; Logothetis, & Sheinberg, 1996). Although computer-vision systems can fairly easily achieve scale and position invariance simply by matching target objects against a dataset containing images at different scales and positions (Valentin, & Abdi, 1996), this methodology is inadequate as a realistic theory of recognition (Poggio, & Ullman, 2013). Second, hierarchical approaches offer efficiency in both speed and use of resources (Poggio, & Ullman, 2013); trained on over a million images, hierarchical models can recognize even complex objects through learning (Poggio, & Ullman, 2013). Finally, hierarchies can identify and classify parts of objects (e.g., the ears and tail of a dog) as well as objects as wholes (e.g., a dog or a cat) (Epshtein, Lifshitz, & Ullman, 2008).
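    The brute-force alternative mentioned above, achieving position invariance by exhaustive matching rather than by a hierarchy, can be made concrete. This is a hypothetical sketch with an invented binary "image" and template: the template is scored at every position, so invariance is bought at the cost of searching all positions (and, for scale invariance, all stored scales as well).

```python
def match_score(image, template, top, left):
    """Fraction of template pixels matching the image patch at (top, left)."""
    h, w = len(template), len(template[0])
    return sum(
        1 for r in range(h) for c in range(w)
        if image[top + r][left + c] == template[r][c]
    ) / (h * w)

def best_match(image, template):
    """Exhaustively score every placement and return the best one."""
    h, w = len(template), len(template[0])
    candidates = [
        (match_score(image, template, r, c), (r, c))
        for r in range(len(image) - h + 1)
        for c in range(len(image[0]) - w + 1)
    ]
    return max(candidates)

image = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
template = [[1, 1],
            [1, 0]]

score, pos = best_match(image, template)
print(score, pos)  # perfect match (1.0) at row 1, column 1
```

The search cost grows with the number of positions and scales stored, which is one reason this approach is considered unrealistic as a theory of primate recognition, in contrast to hierarchies that build invariance in gradually.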


    This page titled 8.6: How do we deal with the difficulties of computational models? is shared under a not declared license and was authored, remixed, and/or curated by Matthew J. C. Crump via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
