Re: Schyns Comms 01-03 abdi benson braisby

From: HARNAD Stevan (harnad@coglit.soton.ac.uk)
Date: Thu Mar 12 1998 - 19:16:32 GMT


> From: Whitehouse Chantal <cw495@psy.soton.ac.uk>
>
> Commentary 1- Abdi, Valentin, and Edelman
>
> > the principal component approach used on objects described by a low
> > level code (i.e., pixels, Gabor jets).
>
> but I haven't a clue what that is.

A pixel code is one that represents an image as a grid of dark and
light dots, like a newspaper photo blown up enormously (pixel =
picture element). You can think of it as a matrix of dots the width
and height of the image. Each pixel would be a dimension. So if the
matrix is 100 x 100 pixels, it could be stretched out into a
10,000-dimensional "vector," and each image becomes a "point" in
pixel space.

A simple 2-D example is this: If you have a 3 x 3 matrix of pixels,
let's assume each is either a zero (= nothing) or has a * in it.
Visualise a cross in this matrix, consisting of a vertical column and a
horizontal row of pixels:

0*0
***
0*0

That's the 3x3 pixel representation; then the stretched out
9-bit vector would be:

0*0***0*0

Replacing each * with a 1 gives

010111010

You can now think of ANY pattern in the image as being some point in
9-dimensional space, much the way you can think of any point in real
space as a point in 3D X x Y x Z space (X for height, Y for width, Z
for depth), except that each dimension in this simple space has only
two values, 0 and 1. If the values could vary in lightness from 0% to
100%, then you'd get a space more like the one you are used to, but
in 9 dimensions instead of 3.
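
(If it helps to make the "stretching out" concrete, here is a tiny
sketch in Python/numpy; the variable names are just made up for
illustration:)

    import numpy as np

    # The 3x3 cross as a pixel matrix (1 = *, 0 = blank)
    cross = np.array([[0, 1, 0],
                      [1, 1, 1],
                      [0, 1, 0]])

    # "Stretching out" the matrix row by row gives the 9-dimensional vector
    vector = cross.flatten()
    print(vector)        # [0 1 0 1 1 1 0 1 0]
    print(vector.shape)  # (9,) -- one point in 9-dimensional pixel space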

So every possible pattern of *'s in a 3x3 pixel image becomes a point in
9-dimensional pixel space. Think of the *'s as points of light hitting
your retina, and think of the pixels as the cones in your retina. If
the cone is hit by light, it fires; if not, it doesn't. So each cone
is a light-detector (= a very simple feature detector), and the 9-D
vector is a complex feature detector consisting of the firing or
not-firing of each of the nine cones in that little 3x3 patch of your
retina.

Now 9 dimensions of fire/not-fire is not a lot, but imagine your whole
retina now, and that your cones do not just fire vs. not fire, but can
fire to any degree, from 0% to 100%. The matrix there would be even
bigger than 1000 x 1000, and the stretched vector would be at least a
million-dimensional. And the possible patterns that could fall on your
retina would be not just patterns of 0's and 1's (light vs. no light;
fire vs. no fire) but all combinations of degrees of firing, anywhere
from 0% to 100%.

Points in a million-dimensional space are just too much for any
system to handle and recognise patterns in. Something has to be done
to reduce the number of dimensions to something small enough that
features, and hence patterns, can be recognised.

The simplest feature is an on/off pixel (cone). In the 3x3 cross,

0*0
***
0*0

a useful higher-order feature would be 3 *'s in a horizontal row, and
three *'s in a vertical column. As you know, there is evidence for the
existence of such higher-order detectors of lines -- not just vertical
and horizontal, but all angles.

With a vertical and horizontal line detector you can reduce the
representation of the cross from the full 9 dimensions
010111010
to just two dimensions
11
in a two-dimensional SUBspace of the full 9D space. That subspace
consists of only the patterns that have either a horizontal or a
vertical row of three ON pixels. (How many of those are there?).

But just an ON from the vertical line detector and an ON from the
horizontal line detector still isn't enough to pick out a cross,
because you would get the same pattern for

*00
***
*00

as well. So you need another detector for when the vertical and the
horizontal share their MIDDLE bit, as in:

 *
***
 *

Those three detectors together would be able to recognise crosses in
this simple space, and would reduce the representation from 9
dimensions to just 3. (There are of course other ways of doing this
simple pattern recognition problem.)
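
(To make that 9-D to 3-D reduction concrete, here is a rough sketch of
those three detectors in Python; the particular way I've coded them is
just one of the many possible ways, and only meant as an illustration:)

    import numpy as np

    def vertical(p):       # fires if some column is three ON pixels
        return int(any(p[:, j].all() for j in range(3)))

    def horizontal(p):     # fires if some row is three ON pixels
        return int(any(p[i, :].all() for i in range(3)))

    def centre_intersection(p):     # fires only if the middle row and the
        return int(p[1, :].all() and p[:, 1].all())  # middle column both fire

    cross = np.array([[0, 1, 0],
                      [1, 1, 1],
                      [0, 1, 0]])

    other = np.array([[1, 0, 0],    # the earlier non-cross pattern
                      [1, 1, 1],
                      [1, 0, 0]])

    print(vertical(cross), horizontal(cross), centre_intersection(cross))  # 1 1 1
    print(vertical(other), horizontal(other), centre_intersection(other))  # 1 1 0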

In such a higher-order vertical/horizontal/centre-intersection detector
space, the representation of the cross would be 111, because the
vertical detector, the horizontal detector, and the intersection
detector would all fire. The representation of any other shape with
both a horizontal and a vertical but no centre intersection would be
110. That would not be enough, of course, if you needed to recognise a
T too:

***
0*0
0*0

A lot of the simple geometric shapes would need to have detectors of
their own, including even "H":

*0*
***
*0*

which such a small array could not distinguish from "N" (try it).

So you would need a bigger matrix, plus detectors for lines at other
angles, and in other combinations. The good news, though, is that once
you have all your local feature detectors, they can generalise or "scale
up" to much bigger images -- maybe even the 1000 x 1000 one.

The reason they scale up easily is that an "N" is an N no matter where it
appears in your visual field. It's also an N whether it's small or
large.

So GABOR filters make use of this: a family of them can detect lines
at a particular angle anywhere in the visual field, and at any scale,
whether the lines are thin and narrow or thick and wide.

So think of the Gabor filtered representation as something that reduces
the 1000 x 1000 pixel matrix of your retina and the associated one
million dimensional vector space down to a much smaller number of
SUBspaces, based on higher-order detectors for lines and angles
irrespective of their size or position on your retina.
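
(Just to take some of the mystery out of "Gabor": a Gabor filter is a
little patch of light/dark stripes at one orientation and one
stripe-width, faded out by a Gaussian window. Here is a rough
numerical sketch, not the specific filters any particular model uses:)

    import numpy as np
    from scipy.signal import convolve2d

    def gabor_kernel(theta, wavelength=4.0, sigma=2.0, size=9):
        # A Gaussian envelope times a cosine wave; theta (radians) sets
        # the orientation of the stripes, wavelength their width.
        half = size // 2
        y, x = np.mgrid[-half:half + 1, -half:half + 1]
        x_rot = x * np.cos(theta) + y * np.sin(theta)
        envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
        return envelope * np.cos(2 * np.pi * x_rot / wavelength)

    # A toy "retina": one bright vertical bar in a 32 x 32 image
    image = np.zeros((32, 32))
    image[:, 15] = 1.0

    # theta = 0 makes the stripes vertical (tuned to vertical bars);
    # theta = pi/2 makes them horizontal.
    vertical_tuned = convolve2d(image, gabor_kernel(0.0), mode='same')
    horizontal_tuned = convolve2d(image, gabor_kernel(np.pi / 2), mode='same')

    print(np.abs(vertical_tuned).max())    # large response to the bar
    print(np.abs(horizontal_tuned).max())  # much smaller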

Now PCA (principal components analysis) is a statistical technique for
taking a big matrix and reducing its dimensionality down to only
the subspaces and dimensions that matter. It is similar to factor
analysis, in which a lot of mental tests are analysed, and if people's
scores on a lot of them are correlated, then it is assumed that to some
extent they measure the same thing. So that's the first "factor." That
correlation is partialled out (in a very simple and natural way you
might remember from stats -- by using the "residuals" [ask me if you
want that explained]) and then you see whether what's left over, after
you have partialled out that first correlation, has any correlations
left. If yes, you partial those out too, for the second factor, and so
on until there is very little variation left.

(Usually 2-3 factors underlie a huge collection of test scores, and
instead of being given a score on every test, you can be given a score
on the two or three factors or "subscales," such as the verbal, the
quantitative and the spatial subscales; conversely, the tests
themselves can be seen as having high or low "loadings" on each of the
factors, so some tests are more verbal and less spatial, etc.)

That was just an analogy, but dimensionality is reduced in a similar way
in both cases.
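
(If you want to see that test-score analogy in numbers, here is a toy
version in Python; the scores are entirely made up, just to show
correlated tests collapsing into a couple of factors:)

    import numpy as np

    rng = np.random.default_rng(0)
    n = 500

    verbal = rng.normal(size=n)       # two hidden "abilities"...
    spatial = rng.normal(size=n)

    def noise():
        return 0.3 * rng.normal(size=n)

    # ...generate six observed test scores (three mostly verbal, three
    # mostly spatial), so many of the tests are highly intercorrelated
    tests = np.column_stack([verbal + noise(), verbal + noise(),
                             verbal + noise(), spatial + noise(),
                             spatial + noise(), spatial + noise()])

    # Eigenvalues of the covariance matrix = the variance "explained"
    # by each successive factor/component
    eigenvalues = np.linalg.eigvalsh(np.cov(tests, rowvar=False))[::-1]
    print(np.round(eigenvalues / eigenvalues.sum(), 2))
    # roughly two large proportions and four tiny ones: two factors
    # account for nearly all the variation in six test scores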

Suppose for some reason we did not know that in the world points appear
in 3D space. So we tag them not only with their height, width, and
depth, but with the time of day, the temperature, the humidity, the
sound level, and the date. Points would then be in an 8D space instead
of a 3D space. But we notice that they have certain properties --
spatial, geometric properties, actually -- for which 5 of the
dimensions do not matter: the shapes are INVARIANT under any changes in
the 5 irrelevant dimensions. (A cube is a cube, regardless of the
temperature or time of day). So we could collapse all those irrelevant
parts of the 8D space and reduce it all to the 3D subspace that is
relevant for geometric shape recognition.

PCA works a bit like that, except that the way you reduce the
dimensionality is based, as with the mental tests, on correlation.
The correlated dimensions are collapsed, and the uncorrelated ones
are left. As with tests, if you have several that are highly correlated,
you don't need them all, because they are redundant. The same is true
with dimensions. PCA finds the few dimensions that matter. The
correlation between them should be near zero, because they are really
all giving us independent, relevant information. In this case, they
would turn out to be the 3D space of X/Y/Z: height, width, depth, which
are at right angles (orthogonal) to one another, so could not possibly
be more uncorrelated!

If you actually did the experiment of using the 8D data (including
temperature) to classify geometric shapes, a PCA would quickly find that
3 dimensions are enough, and would collapse the 5 irrelevant directions.
But of course 3D is still a lot, and even the 2D projection of all the
3D objects in the world is a lot, especially once it is stretched into
the million-dimensional vector. But PCA can be used to reduce that
million-dimensional vector to a smaller and more tractable size, again
by rooting out what is correlated and redundant or irrelevant in
the shadows cast on our retinas by the things we need to recognise.
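
(The 8-D thought-experiment itself is easy to simulate; here's a sketch
with made-up data, using the singular value decomposition, one standard
way of computing a PCA:)

    import numpy as np

    rng = np.random.default_rng(1)
    n = 1000

    xyz = rng.normal(size=(n, 3))            # height, width, depth really vary
    other5 = 0.01 * rng.normal(size=(n, 5))  # temperature, time, humidity,
                                             # sound, date: barely vary at all
    data = np.hstack([xyz, other5])          # the 8-D description of each point

    centred = data - data.mean(axis=0)
    s = np.linalg.svd(centred, compute_uv=False)   # PCA via the SVD
    print(np.round(s**2 / (s**2).sum(), 3))
    # the first 3 components carry essentially all the variance; the
    # other 5 directions can be collapsed away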

This qualitative tour of dimension reduction on which I have just taken
you is really all you need in order to understand the Schyns paper and
commentaries.

> > The pca approach represents faces by their projections on a set of
> > orthogonal features (principal components, eigenvectors, "eigenfaces")
> > epitomizing the statistical structure of the set of faces from which
> > they are extracted. These orthogonal features are ordered according to
> > the amount of variance (or eigenvalue) they explain, and are often
> > referred to as "macro-features" (Anderson & Mozer, 1981) or
> > eigenfeatures by opposition with the high level features traditionally
> > used to describe a face (e.g., nose, eyes, mouth).
>
> Unfortunately this doesn't help me much. How are these eigenfeatures
> represented? Are they a set of codes, written descriptions of features,
> or an actual visual display of facial features?

You start with the shape of a face on the retina. Then you stretch that
out into a long vector specifying a point in a high-dimensional vector
space. Then you do PCA dimension reduction to reduce the dimensionality
of the representation from a point in the huge space to a point in a
much smaller feature subspace. One of the features might be the distance
between two circular parts (the eyes), just as the features of the
crosses I described were vertical and horizontal lines. The PCA can find
a minimal set of features that will sort the faces as they need to be
sorted (by facial expression? by identity? by age? by gender? by family?
by race?).

Features are really parts of the original, raw shadow that the face
cast onto your retina (matrix). The feature detector picks them out, and
ignores all the rest of the variation in the rest of the dimensions: It
is just interested in what is INVARIANT in the category you are trying
to pick out.

Yes, they are a kind of code; not a written code -- though they can
usually be described verbally. I think the feature detectors are best
thought of as FILTERS, through which the retinal shadow passes, losing
most of its details and preserving only the invariant ones (for whatever
categories need to be sorted).
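
(In code, the whole pipeline looks roughly like this; I'm using random
numbers in place of real face images, and the sizes are just for
illustration. Note that each "eigenface" is itself an image-sized
vector, so it can be displayed as a ghostly face-like picture:)

    import numpy as np

    rng = np.random.default_rng(2)
    faces = rng.random((100, 50, 50))   # pretend: 100 faces of 50 x 50 pixels

    vectors = faces.reshape(100, -1)    # stretch each face into 2500 dimensions

    mean_face = vectors.mean(axis=0)    # centre, then take the principal
    centred = vectors - mean_face       # components ("eigenfaces") via the SVD
    U, S, Vt = np.linalg.svd(centred, full_matrices=False)
    eigenfaces = Vt[:20]                # keep only the first 20 components

    codes = centred @ eigenfaces.T      # each face: 20 numbers instead of 2500
    print(codes.shape)                  # (100, 20)

    first_eigenface_image = eigenfaces[0].reshape(50, 50)  # displayable as an image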

> > Because they are optimal for the set of faces from which they are
> > extracted, eigenfeatures are less efficient for representing faces
> > from a different population and thus generate class-specific effects
> > such as the other race effect

This is the issue of generalisation and scaling: The features extracted
by PCA from this set of faces may or may not work for another set of
faces.

> This appears to be saying that if we see lots of Caucasian
> faces and not many Japanese faces then we won't be able to
> efficiently represent Japanese faces and so label them as "other
> race" faces. However earlier in the commentary they said that,

Yes, and I think that's true: It's what's behind the joke (which
in China they of course tell exactly in reverse) about the Caucasian
customer in the Chinese restaurant who asks how come all Chinese waiters
look the same, whereas all Caucasian ones are so easy to tell apart...

> > Eigenfeatures are flexible in that they evolve with the faces
> > encountered (Valentin, Abdi, & Edelman, 1996).
>
> This seems to imply that if a person sees more Japanese faces over time
> then they will be able to represent them more efficiently and be less
> likely to label them as "other race" faces. This doesn't seem to make
> sense logically. If we see lots of Japanese faces we don't start to
> think of them as more Caucasian.

No, but we do begin to find the invariant features that are more
characteristic of Japanese faces than Caucasian. We either develop a
specific set of Japanese feature detectors, or, more likely, we enlarge
or refine our existing set of facial feature detectors till they are
able to distinguish Japanese faces as well as Caucasian ones.

> Commentary 2- Benson
>
> > Assuming primary visual cortex (V1) is necessary for object
> > recognition strongly suggests the geniculostriate pathway is
> > fundamental in bootstrapping the dimensionality reduction process.
>
> The dimensionality reduction process is the idea that our environment
> is made up of hundreds of dimensions that we need to condense in some
> way in order to make sense of the, "blooming, buzzing confusion".
> Benson is saying that this condensation process is done in part by the
> actual visual process. i.e. as the information is being taken from the
> retina to the visual cortex, some sort of coding is occurring which
> allows the information to be condensed.

Exactly.

> > For every relevant (detected) feature of a homogeneous class,
> > experience dictates either continuous or discrete measurement. In the
> > former, this leads naturally to a feature vector which includes
> > population sample variance information (variance may be asymmetric
> > about the mean). Identification of a discrete feature immediately
> > enhances categorisability.
>
> A feature can be given either a continuous or a discrete measurement,
> e.g. either a value from the continuous scale 1-100, or an either or
> value such as 0 or 1.

Correct.

> Benson is saying that a discrete value for a
> feature helps categorization of an object made from many such features
> because it already has a discrete category itself. But maybe the degree
> of a feature is important for categorization. Imagine someone was
> describing two different animals to you in terms of features such as
> whether it had fur or not. One animal is very furry and the other has
> little fur. If you are giving features discrete measurements (with 1
> being "fur" and 0 being "no fur") then both animals would be given 1
> for the fur feature. If you were using continuous values, the very furry
> animal could be given 80, and the animal with little fur could be given
> the value 10 for the fur feature. The second case would help you to
> categorise the animals more easily.

You're right. 0/1 furry would be too coarse-grained to sort these animals.
But notice that the highly furry and the minimally furry values along
the continuous dimension are still pretty far apart; the dimension
would not be much use if both kinds of animal varied in furriness
across the whole range. Either you would have to get very good at
telling apart tiny differences in furriness at the boundary, or you
would need to use other features instead.
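
(Your example in miniature, just to spell it out:)

    # Furriness measured continuously (some 0-100 scale)
    very_furry = 80
    barely_furry = 10

    # The same feature measured discretely: 1 = "has fur", 0 = "no fur"
    def has_fur(furriness):
        return 1 if furriness > 0 else 0

    print(has_fur(very_furry), has_fur(barely_furry))  # 1 1  -- indistinguishable
    print(very_furry, barely_furry)                    # 80 10 -- easy to tell apart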

> Commentary 3- Braisby and Bradley
>
> > Schyns et al. argue that flexibility in categorisation implies
> > 'feature creation'. We argue that this notion is flawed, that
> > flexibility can be explained by combinations over fixed feature sets,
> > and that 'feature creation' would anyway fail to explain
> > categorisation. We suggest that flexibility in categorisation is due
> > to pragmatic factors influencing feature combination, rendering
> > 'feature creation' unnecessary.
>
> > Schyns et al. argue that fixed feature sets limit the representational
> > (and classificatory) capacity of a conceptual system. However, they
> > incorrectly claim that "Any functionally important difference between
> > objects must be representable as differences in their building blocks"
> > (Section 1.1, paragraph 3). However, this ignores the modes of
> > combination of those building blocks
>
> True. As we know we are born with the ability to identify and make
> sense of certain features such as those that make up the human face.
> It seems to make more sense that we are born with a fixed set of
> features which we learn to combine in different ways to make sense of
> new things rather than somehow actually learn new features. Why
> should we not come equipped with all the necessary building blocks?

Perhaps we do. But do you remember the dimensionality problem? There
might (just might -- I'm not saying it's so) be too many possible ways
to categorise things to make it economical to be born with the features
for doing all of them. B & B add that we can always make new
combinations of fixed features, and that may be all the extra
flexibility we need, and they may be right. But here there is room
for two possibilities: If features are added together the way they are
in a rule expressed in a sentence: "It's round and green and bigger than
a breadbox" then we are really just explicitly combining features we
already have detectors for. But it's possible that using a combination
of detectors eventually creates a unitary detector that no longer needs
to do it by explicitly combining simpler detectors. It may become
automatised just as the simple detectors are, so that it picks out
"round-green-bigger-than-a-breadbox" things, lets call them "Ragbatabs"
as quickly and directly as it picks out green things.

One way of interpreting Schyns, Goldstone & Thibaut's "created" features
is as just this: an automatised combination of prior features that has been
put together in the service of a new categorisation task. But it could
become more "creative" than that: We may have detectors for lines and
angles and even squares, circles and triangles, but we certainly don't
and can't have detectors for every possible shape of blob. Yet some
specific blob shapes might turn out to be very important to identify
(say, in cancer screening). If we can construct dedicated detectors to
pick out and identify those quickly, reliably and automatically, would
that too just be a combination of prior fixed features?

> > Fodor argues that systems cannot increase
> > their logical power (acquire wholly new features) via learning: the
> > system's vocabulary and mechanisms must already be able to express the
> > 'new' feature, and so that feature has not been 'created'.
>
> The whole idea of creating new features provides such a puzzle. It
> appears to be much simpler for the system to come ready prepared
> with the necessary features and a flexible set of rules for combining
> these features. It would be easier and make more sense for the rules
> to be developed rather than the actual features.

You're right if (1) the fixed set out of which everything else can be
built is not too big (i.e., if there is no dimensionality reduction
problem) and if (2) all features we ever need in a life of
categorisation are just combinations of those fixed ones; but would a
dedicated blob-detector -- one that could be constructed for any possible
2-dimensional shape it might turn out to be important to identify -- just
be a combination of prior features? Could we be born with fixed
detectors for all possible blobs?

And at what point does putting together special combinations of
dimensions become such a demanding (creative?) task that it deserves to
be called feature creation?

> > Despite this being a critical problem, Schyns et al. fail to address
> > it properly. They state that "...categorisations, rather than being
> > based on existing perceptual features, also determine the features
> > that enter the representation of objects" (Section 1.2.4, paragraph
> > 1). Their position appears circular, since they employ 'feature
> > creation' to explain categorisation, but claim that categorisation
> > itself determines 'feature creation'.

What they mean is that how we approach a new categorisation problem is
determined by the repertoire of feature detectors we already have --
not just the fixed ones, but the "created" ones too. First we will try
to see new things with our existing feature detectors. If this works,
fine; if not, if the prior detectors produce a "bias" that does not
result in correct sorting, then we may have to "create" a new detector.
It, in turn, will influence our future categorisations... No
circularity, just a cycle of (1) try to fit everything with existing
features, (2) succeed? fine; fail? (3) create new feature-detectors,
(4) go to (1)...


