How are humans able to recognize an object when it is seen at an orientation in depth where it projects an image substantially different from that projected at its original pose? Recent theorizing has coalesced around two general theoretical positions. According to "view-based" template theories, an object is represented as a set of 2D templates, one for each pose (1). Such accounts hold that humans must have already experienced an object, or one just like it, from a new viewpoint if they are to recognize it from that viewpoint with little cost in time or accuracy. These accounts posit that slight rotations in depth can be compensated for by direct generalization from a 2D template of the object, but that greater orientation disparities incur a cost because a deliberative mental operation, such as "mental rotation," must be engaged to achieve recognition. The primary observation motivating these theories is an increase in reaction times (RTs) and error rates in recognizing a novel object viewed at a different orientation in depth from that previously experienced.
An alternative view holds that humans can exploit certain viewpoint-invariant properties (VIPs) for recognition of novel objects at novel orientations in depth (2). VIPs are properties of objects that are relatively unaffected by rotation in depth, such as whether a given contour is straight or curved, whether pairs of such contours are parallel or not, or whether the type of vertex formed by the cotermination of contours is an L, a Y, or an arrow (3). VIPs can be distinguished from metric properties (MPs) that are affected by rotation in depth, such as the length (or aspect ratio) of a part or its degree of curvature. According to one version of such a theory (4), as long as the same parts are in view, little or no cost in recognition should be apparent if the parts can be readily distinguished by VIPs. View-based theorists admit the use of VIPs only with a known and restricted set of objects but, in general, assign no special status to VIPs as compared to MPs.
There is great interest in how objects can be recognized from a new orientation in depth. Transformations of position, size, or rotation in the plane can readily be compensated by standard algorithms based on two-dimensional spatial filtering of the image. Rotation in depth, in contrast, alters the two-dimensional spatial components of the image, so standard techniques lose their capacity to distinguish one object from another at novel depth orientations. Yet people seem to be able to recognize familiar objects at new orientations in depth with little or no cost in recognition speed and accuracy (2). Is this capacity based on prior experience with the particular objects or does it reflect a general capacity to exploit viewpoint-invariant properties, even for novel objects?
The purpose of the present investigation was twofold. The first was a determination of whether depth invariance could be revealed in a first-time encounter with an object. The second was a comparison of VIPs and MPs with respect to their utility for allowing recognition under depth rotation. A unique feature of the design was the psychophysical calibration of VIP and MP differences at the same (unrotated) orientation so that the effect of depth rotation could be studied unconfounded with the saliency of original viewpoint differences.
Subjects judged whether two sequentially-presented images of novel objects, at different orientations in depth, were physically identical or not. They were not given any training with the objects. When the objects differed they could differ in a VIP or in an MP. The magnitude of the VIP and MP differences in shape were selected so as to be equally detectable when the object was at the same orientation.
The results provided strong support for immediate viewpoint invariance over depth rotation and a striking distinction between the employment of VIPs and MPs: When the objects differed in a VIP there was no effect of depth rotation on either reaction times or error rates. In contrast, error rates for detecting differences in MPs of objects at different orientations in depth were dramatically higher than those for VIPs, with accuracy falling well below chance, and were accompanied by a marked increase in RTs. These results provide a strong challenge to view-based, template theories and their accompanying analyses of object recognition.
A set of 12 rendered two-part nonsense objects, as depicted in Fig. 1, comprised the original set of objects (5). An arbitrary viewpoint for each object was chosen that clearly revealed both parts of the object. Two variations for each object were employed (Fig. 2): one was a change in an MP of a part, e.g., a change in the curvature of the axis of the cylinder; and the other was a change in a VIP of a part that would produce a different geon, e.g., a change from a curved to a straight axis of the cylinder. In the scaling phase of the experiment, both variations were made to be equally detectable from this arbitrary viewpoint.

Fig. 1. The set of original three-dimensional novel objects.

Fig. 2. An example of possible variations from the original version in the calibration and rotation phases. The VIP change is one of a curved to straight cylinder. The MP change is a change in the degree of curvature of the cylinder. In the calibration phase (0 deg) the objects are depicted from identical orientations, with the magnitude of the VIP and MP changes selected to yield equal detectability, as shown by the 0 deg values in Fig. 4. In the rotation phase, objects were rotated (an average of) 57 deg. The difference in surface lightness at 57 deg is a consequence of the single light source used in the rendering (which provided a potential cue as to the degree of rotation).
The sequence of events on a trial is illustrated in Fig. 3. Following a press of the mouse button, a fixation dot appeared for 500 ms, followed by a 400 ms presentation of the object, which was then immediately followed by a mask consisting of a combination of different gray-level objects presented for 500 ms. A second object image was then presented for 300 ms, followed by a second 500 ms mask. The second stimulus was translated randomly over nine possible positions on the screen, specified by a three by three matrix with adjacent horizontal or vertical centers separated by 6.8 deg. Thus, the second image could be above or below, and/or to the right or to the left of, the first image, which was always centered. The translation was incorporated into the presentation sequence to dissuade employment of local "hot trace" strategies in which small locations could be monitored for any change and, hopefully, to engage the ventral cortical visual system, which is presumed to mediate object recognition and which reveals translation invariance (6).

Fig. 3. Sequence of events on an experimental trial. An illustration of a VIP DIFFERENT trial in the rotation phase of the experiment. The sequence would be the same in the calibration phase.
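The trial structure described above can be summarized in a short sketch. It uses only the durations and the 6.8 deg grid spacing stated in the text. The function names, the event labels, and the choice to express positions as offsets from the screen center are illustrative assumptions, not part of the original experimental software.

```python
import random

# Event durations in ms, in presentation order, as stated in the text.
TRIAL_EVENTS = [
    ("fixation", 500),
    ("first_image", 400),
    ("first_mask", 500),
    ("second_image", 300),
    ("second_mask", 500),
]

GRID_STEP_DEG = 6.8  # separation of adjacent grid centers, from the text


def grid_positions(step=GRID_STEP_DEG):
    """Return the nine (x, y) centers of the 3 x 3 grid as offsets from screen center (deg)."""
    return [(col * step, row * step) for row in (-1, 0, 1) for col in (-1, 0, 1)]


def event_onsets(events=TRIAL_EVENTS):
    """Return (name, onset_ms) pairs, with time 0 at fixation onset."""
    onsets, t = [], 0
    for name, duration in events:
        onsets.append((name, t))
        t += duration
    return onsets


def make_trial(rng):
    """Build one trial: the first image is centered, the second falls on a random grid cell."""
    return {
        "first_position": (0.0, 0.0),
        "second_position": rng.choice(grid_positions()),
        "onsets": event_onsets(),
    }


if __name__ == "__main__":
    trial = make_trial(random.Random(1))  # seeded only to make the example repeatable
    print(trial["second_position"], trial["onsets"])
```

Under these assumptions the second image appears 1,400 ms after fixation onset, and the center cell is one of the nine positions the second image can occupy.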
Subjects were instructed to ignore the intervening mask, and when the second image appeared, to press as quickly as possible a microswitch labeled "same" if the object depicted in the second image was identical to the first, and a microswitch labeled "different" if they were images of different objects (differing in an MP or a VIP). Subjects could not anticipate whether there was going to be a stimulus change nor, if there was a change, whether it would be of an MP or a VIP, or what particular part would undergo the change or what type of MP change or VIP change would be present. No feedback was provided during the experiment. The design was balanced for order such that the mean serial presentation position for all the objects was equivalent across subjects.
The experiment was run in two phases: (a) a calibration phase in which MP and VIP differences between a pair of stimuli, shown at the same orientation (0 deg rotation), were selected so as to be equally detectable, and (b) a rotation phase in which subjects attempted to detect the same MP or VIP differences but where the objects were rotated in depth. Five of the 10 subjects took the calibration phase first; the other five took the rotation phase first. No differences in the results were apparent between the two groups of subjects.
Calibration (same orientation) phase. Subjects judged as same or different objects that were presented at identical orientations. On 40% of the trials, the stimuli were identical and the subjects were to respond with depression of a key marked "SAME." On 60% of the trials the stimuli differed either in a VIP or an MP (with equal probability). Stimulus values of VIP and MP changes were selected so that identical performance levels were achieved. For these stimuli, the mean RT was 775 ms and the mean error rate was 20.8% (as shown in the 0 deg condition in Fig. 4).
Experimental (rotation) phase. In the rotation phase of the experiment, the first and second images were depicted from different orientations in depth, differing by an average of 57 deg (range 20 deg to 120 deg) (7). The order of trials was balanced so that the mean serial presentation position for all sequences of pairs of stimuli was equivalent across subjects. In the rotation phase, half the trials were SAME and half were DIFFERENT (8).
Error rates and RTs are shown in Fig. 4. There was virtually no effect of rotation in depth on the detection of VIP differences (an increase of 4.2% in error rates and a decrease of 15 msec in reaction times). However, for the detection of MP differences, rotation produced a dramatic increase of 50% in error rates, so that accuracy fell well below the level expected by chance, and an increase of 138 msec in RTs (9). These results have been replicated with two groups of 10 subjects, one run on the calibration task, the other on the rotation task (10).
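The pattern in the replication can be made concrete with a small calculation on the means reported in note 10. The sketch below is illustrative only: the dictionary keys and function name are hypothetical, and the 50% chance level is an assumption based on the two-alternative same/different response. Because the replication used separate groups for the two tasks, the "costs" it computes are between-group differences rather than within-subject changes.

```python
# (mean RT in msec, error rate in %) as reported in note 10; the labels are illustrative.
REPLICATION = {
    "MP different": {"0 deg": (774, 20.0), "rotated": (962, 64.0)},
    "VIP different": {"0 deg": (774, 20.0), "rotated": (829, 18.5)},
    "Same": {"0 deg": (797, 20.0), "rotated": (837, 22.0)},
}

CHANCE_ERROR_PERCENT = 50.0  # assumed chance level for a two-choice response


def rotation_cost(condition, data=REPLICATION):
    """Return the rotated-minus-unrotated difference in RT and error rate for one condition."""
    rt_0, err_0 = data[condition]["0 deg"]
    rt_rot, err_rot = data[condition]["rotated"]
    return {
        "rt_cost_msec": rt_rot - rt_0,
        "error_cost_points": err_rot - err_0,
        "accuracy_below_chance": err_rot > CHANCE_ERROR_PERCENT,
    }


if __name__ == "__main__":
    for name in REPLICATION:
        print(name, rotation_cost(name))
```

On these values only the MP different condition ends with more errors than correct responses after rotation, while the VIP different error rate is slightly lower in the rotation group than in the calibration group.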
The main questions addressed by this investigation were: (a) Are VIP differences more detectable than differences in MPs in the matching of depth-rotated objects when the two types of differences are made equally detectable at 0 deg? and (b) When objects differ in a VIP, is viewpoint invariance over rotation in depth possible? The answer to both questions is yes. The VIP changes were substantially easier to detect from novel viewpoints, and matching performance in this condition was viewpoint-invariant. That is, human subjects can immediately exploit viewpoint-invariant information without prior familiarity with the objects or any anticipation that the objects will actually differ in a VIP or what that VIP will be.
In the present study only modest VIP changes were employed in that only the identity of one of the parts was changed, the relations remained intact, and the overall variation was calibrated to be equivalent to a metric change. These VIP differences would more closely resemble typical subordinate-level classifications, such as distinguishing a round table from a square one or whether a handle of an object was curved or straight, rather than basic-level classifications. Variation among two-part objects from different basic level classes could be approximated, perhaps, if the VIP Different trials used different objects from Fig. 1. Given that viewpoint invariance was achieved with the modest VIP variation in the present investigation, we would expect it to be even more strongly evident when distinguishing VIP variation that approximated what is encountered when distinguishing among objects at a basic level.


Fig. 4. Mean error rates (upper panel) and mean correct reaction times (lower panel) for 10 subjects as a function of differences in orientation and the type of difference (Same, MP Different, and VIP Different). RTs greater than 2,000 msec were counted as errors. Error bars are the S.E.s when variance attributable to main effects of subjects and objects is removed.
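The scoring rule stated in the caption of Fig. 4 (responses slower than 2,000 msec count as errors, and mean RTs are computed over correct responses only) can be sketched as follows. The function name and the trial representation as (RT, correct) pairs are assumptions made for illustration, and the example data are invented, not taken from the experiment.

```python
RT_CUTOFF_MSEC = 2000  # responses slower than this are scored as errors (Fig. 4 caption)


def score_trials(trials, cutoff_msec=RT_CUTOFF_MSEC):
    """Score a list of (rt_msec, is_correct) pairs.

    Returns (error_rate_percent, mean_correct_rt_msec). A trial counts as
    correct only if the response was right and no slower than the cutoff.
    The mean RT is None when no trial counts as correct.
    """
    if not trials:
        raise ValueError("no trials to score")
    correct_rts = [rt for rt, is_correct in trials if is_correct and rt <= cutoff_msec]
    error_rate = 100.0 * (len(trials) - len(correct_rts)) / len(trials)
    mean_rt = sum(correct_rts) / len(correct_rts) if correct_rts else None
    return error_rate, mean_rt


if __name__ == "__main__":
    # Invented example: two fast correct responses, one slow correct response, one wrong response.
    example = [(700, True), (850, True), (2300, True), (900, False)]
    print(score_trials(example))
```

In the invented example the slow response is rescored as an error, so two of the four trials count as errors and the mean correct RT is taken over the remaining two.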
REFERENCES AND NOTES
1. e.g., M. J. Tarr, H. H. Bülthoff, J. Exp. Psychol., 21, 1494 (1995); T. Poggio, S. Edelman, Nature, 343, 263 (1990); N. K. Logothetis, J. Pauls, H. H. Bülthoff, T. Poggio, Current Biol., 4, 401 (1994); I. Rock, J. DiVita, Cognitive Psychology, 19, 280 (1987).
2. e.g., I. Biederman, Psychol. Rev. 94, 115 (1987); I. Biederman, P. C. Gerhardstein, J. Exp. Psychol., 19, 1161 (1993); I. Biederman, P. C. Gerhardstein, J. Exp. Psychol., 21, 1506 (1995); S. Dickinson, S. Pentland, A. Rosenfeld, Comp. Vis. Image. Underst., 55, 130 (1993).
3. D. Lowe, Perceptual Organization and Visual Recognition (Kluwer, Boston, 1985).
4. I. Biederman, P. C. Gerhardstein, (1993), ibid.
5. The objects were drawn on an SGI Indigo2 work station and rendered from a single light source. For each object, all three versions, original, VIP, and MP changes, were pictured from the same arbitrarily chosen viewpoint, designated as 0 deg. In the experiment, the images were displayed on a Macintosh screen with each image subtending an average visual angle of 8.2 deg.
6. L. G. Ungerleider, M. Mishkin, in Analysis of Visual Behavior, D. J. Ingle, M. A. Goodale, R. J. W. Mansfield, Eds. (MIT, Cambridge, MA, 1982), pp. 549-586. That human object recognition is, in fact, translationally invariant was demonstrated by I. Biederman, E. E. Cooper, Perception 20, 585 (1991).
7. The angles were selected to be the maximum rotations that would allow the same parts to remain readily apparent in the two orientations. The similarity of the MP and VIP differences as gray scale images between the two orientations was also evaluated by a model of similarity based on activation values of a 10 X 10 lattice of columns of Gabor filters centered over the image. Each column had sine and cosine filters at 5 scales and 8 orientations (= 80 kernels) at each of the 70 locations. The model was developed by M. Lades, J. C. Vorbrüggen, J. Buhmann, J. Lange, C. von der Malsburg, R. P. Würtz, W. Konen, IEEE Trans. Computers 42, 300 (1993). This model provides an excellent representation for human face recognition, and the variation in its similarity values as a function of rotation in depth and expression changes correlates highly with human performance in judging whether two faces are the same (P. Kalocsai, I. Biederman, Poster, Assoc. Res. Vis. Ophthal., Sarasota, May, 1994; J. Fiser, I. Biederman, E. E. Cooper, Assoc. Res. Vis. Ophthal., Sarasota, May, 1994). The model does not make either VIPs or parts (or edges) explicit and consequently can be employed to assess image similarity independent of these image characteristics. When scaled by this system, the VIP and MP images were equivalent with respect to each of their differences from the original at 0 deg and their similarity to each other (i.e., the MP images at 0 deg and at 57 deg compared to the VIP images at 0 deg and 57 deg).
8. In the calibration phase (143 trials), each original stimulus underwent two or three MP and VIP variations. The particular MP and VIP differences used in the rotation phase were selected from the calibration phase to yield equal performance. In the rotation phase (196 trials), each of the three possible versions of each object, Original, MP change, and VIP change, at 0 deg and (the average rotation of) 57 deg, was followed by a different version at a different orientation. One of these stimuli was always an original image and the other the MP changed or VIP changed version. For each forward order of stimuli, the reversed order was also included. The order of stimuli was reversed across different pairs of subjects. An analysis of the data by the first exposure of the stimuli for a given subject revealed the same pattern of results as shown in Fig. 4.
9. The relatively high error rates (20%) in the Same and VIP differing trials, even at 0 deg, were likely a consequence of the general unfamiliarity of the subjects with the task.
10. In the replication, at 0 deg, the VIP and MP changed stimuli again had identical mean RTs and error rates (in parentheses) of 774 msec (20%). These values were 797 msec (20%) for Same trials. In the rotation phase the results were: MP Differences, 962 msec (64%); VIP Differences, 829 msec (18.5%); Same, 837 msec (22%). As in the results shown in Fig. 4, accuracy in the detection of MP differences in rotated stimuli was well below chance. The slight elevation in RTs for the VIP difference from 0 deg to 57 deg could simply represent a group difference (slower responding in the rotation group) or a more conservative criterion, in that error rates for this condition actually declined slightly (by 1.5%) compared to that condition for the 0 deg rotation group.
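The filter bank described in note 7 can be illustrated with a minimal sketch that enumerates the kernels in one column and compares two images by their column activations. The count of 5 scales, 8 orientations, and sine and cosine phases comes from the note. Everything else is an assumption made for illustration: the function names, the use of a normalized dot product (cosine similarity) between activation vectors, and the averaging over lattice locations. The sketch does not compute Gabor responses from pixels and is not the model's actual implementation.

```python
import math

N_SCALES = 5                # from note 7
N_ORIENTATIONS = 8          # from note 7
PHASES = ("sine", "cosine")  # from note 7


def column_kernels():
    """List (scale_index, orientation_rad, phase) for every kernel in one column."""
    return [
        (scale, orientation * math.pi / N_ORIENTATIONS, phase)
        for scale in range(N_SCALES)
        for orientation in range(N_ORIENTATIONS)
        for phase in PHASES
    ]


def column_similarity(a, b):
    """Cosine similarity of two activation vectors (an assumed similarity measure)."""
    if len(a) != len(b):
        raise ValueError("activation vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def image_similarity(columns_a, columns_b):
    """Mean column similarity over corresponding lattice locations (assumed aggregation)."""
    if not columns_a or len(columns_a) != len(columns_b):
        raise ValueError("need the same non-zero number of lattice locations in both images")
    total = sum(column_similarity(a, b) for a, b in zip(columns_a, columns_b))
    return total / len(columns_a)


if __name__ == "__main__":
    print(len(column_kernels()))  # number of kernels per lattice location
```

The number of lattice locations is deliberately left to the caller, since the sketch only needs the two images to be sampled at the same locations.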