Schyns 3.4. - 4.

From: Walker-Hall James (
Date: Fri Mar 06 1998 - 15:31:19 GMT

3.4. Formal Models of Feature Extraction

> Mathematically, an object is often expressed as an n-dimensional
> feature vector. Each component of the vector encodes the presence
> vs. absence, or the values of the n attributes describing the object
> (e.g., its parts, their shapes, colors and textures). Geometrically,
> different points in n-dimensional space encode different objects,
> and categories of similar objects form clouds of points.

I think this means that mathematical representations of categories work
in the following way: each object falls at a point in "n-dimensional
space", as determined by its properties. Thus, similar objects will
tend to "form clouds" in this space because they share lots of
properties. On the other hand, objects that are very different will
not be so close in this space because the values of their properties
will not lie near one another on the various dimensions that constitute
n-dimensional space.
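
As a rough illustration of the "points in n-dimensional space" idea,
here is a minimal Python sketch. The animals and their feature values
are made up for illustration; the attribute names are borrowed from the
examples the paper itself lists (number_of_legs, has_wings, has_fur,
has_feathers, hibernates). Euclidean distance is just one possible
choice of similarity measure, not something the paper prescribes.

```python
import math

# Hypothetical feature order:
# [number_of_legs, has_wings, has_fur, has_feathers, hibernates]
objects = {
    "sparrow": [2, 1, 0, 1, 0],
    "robin":   [2, 1, 0, 1, 0],
    "bat":     [2, 1, 1, 0, 1],
    "dog":     [4, 0, 1, 0, 0],
    "bear":    [4, 0, 1, 0, 1],
}

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(name):
    """Name of the object whose vector lies closest to the given one."""
    others = (other for other in objects if other != name)
    return min(others, key=lambda o: distance(objects[name], objects[o]))

for name in objects:
    print(name, "is closest to", nearest(name))
```

Objects that share most properties (the two birds, or the dog and the
bear) end up close together, which is all that "forming a cloud" means
here.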

> There are many ways to encode objects, ranging from the raw pixel
> intensities of digitized pictures, to sophisticated properties that
> are known to be diagnostic for classification--e.g., number_of_legs,
> has_wings, has_fur, has_feathers, and hibernates.

This makes the point that there are various levels of analysis at which the
features of objects could be plotted in n-dimensional space. For
example, if you take a photograph of a bridge, you could plot its
values in n-dimensional space in terms of the brightness of each dot on
the photo, or at a higher level you could plot the values of the lines
and curves that make up the image, or at a higher level still you could
describe the bridge in civil engineering terminology, perhaps.

> Our proposal for functional feature creation concerns the extraction
> of new structures from perceptual data.

Schyns' proposed theory of feature creation works at a perceptual level...

> Many models of concept learning have successfully shown that
> category representations can be learned from exemplars when they are
> composed of a small, prespecified feature set (e.g., Gluck & Bower,
> 1988; Kruschke, 1992; Rumelhart, Williams & Hinton, 1986; Widrow &
> Hoff, 1960); the task is not to discover the feature set from
> high-dimensional raw data.

...but traditionally models of categories are concerned with data at a
"lower-dimensional" level, for example a limited set of verbal
descriptors such as size, colour, etc.

> Standard concept learning models operating in low-dimensional spaces
> could simply be scaled up to operate in high-dimensional spaces.

However, this may not present such a problem, because we can "convert"
low-dimensional features (e.g., has brown wings) into high-dimensional
ones (form, colour, brightness, etc.).

> One of the problems with this idea is that high-dimensional spaces
> are mostly empty. To illustrate, imagine discretizing a line, a
> squared plane, a cube and a hypercube with tiles of equal size
> (e.g., 10 tiles per side). There is a geometric increase (in this
> example, 10^1, 10^2, 10^3, 10^4) in the number of tiles that cover the
> objects. If each tile is represented by an n-dimensional data point,
> the example shows that one needs approximately 10^n tiles to cover an
> n-dimensional space. If the input distribution varies along many
> degrees of freedom, a learning problem in high-dimensional space may
> require an unrealistically large training set to discover robust
> features, even if an asymptotic solution exists in principle.

This may (?) be saying that a model concerned with high-dimensional
properties (like the one proposed here) is inefficient, because 3-D
objects that are similar in form, but vary in size, will differ a lot
in their representation in n-dimensional space. This would surely
imply that we would find it difficult to categorize a large cube and a
small cube as "cubes". But I think I am probably going wrong here...
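
One possible reading of the tiling argument in the quoted passage is
purely about counting: with 10 tiles per side, the number of tiles
grows as 10 to the power of the number of dimensions, so a fixed number
of training examples can occupy only a vanishing fraction of them. The
Python sketch below just does that arithmetic; the function names and
the sample size of 1000 are my own illustrative assumptions.

```python
def tiles_needed(tiles_per_side, n_dims):
    """Number of equal-sized tiles covering an n-dimensional hypercube."""
    return tiles_per_side ** n_dims

def max_coverage(n_samples, tiles_per_side, n_dims):
    """Upper bound on the fraction of tiles that n_samples points can
    occupy (each point falls in at most one tile)."""
    return min(1.0, n_samples / tiles_needed(tiles_per_side, n_dims))

# Line, square, cube, 4-D hypercube, and a 10-dimensional space.
for n in (1, 2, 3, 4, 10):
    print(n, "dims:", tiles_needed(10, n), "tiles;",
          "1000 samples fill at most", max_coverage(1000, 10, n))
```

On this reading the problem is not that a large cube and a small cube
land far apart, but that most of the space never receives any training
example at all.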

> the bias/variance dilemma (Geman et al., 1992).
> Networks make a bias error when they are dedicated to a class of
> solutions that is not appropriate for the categorizations at hand.

If you have a system that's good at categorizing a limited set of
classes, then if you expose it to stimuli that are outside what it can
do, it will be biased towards the sets that it's predisposed to.

> However, low bias comes at the cost of high variance, the second
> component of the error (where variance means the discrepancy between
> the correct categorization and the categorization of the network).

If you address the bias by making the "machine" (or whatever it is we
are talking about here... I'm not particularly sure myself...) more
general, you are then going to get more error (or high variance) in the
output (the categorizations that the machine makes).

> There is high variance because a flexible system is too sensitive to
> the data: It learns many idiosyncrasies of the exemplars (e.g.,
> differences in lighting conditions, rotation in depth, translation
> in the plane, and so forth) before learning the invariants of a
> category. Consequently, experience with many exemplars is necessary
> for the network to "forget" idiosyncrasies and learn relevant
> abstractions. Only with great experience is the system able to
> categorize accurately (keep the variance low).

So the way to get around this bias/variance trade-off is to expose the
system to lots and lots of different types of stimuli. If you were to
expose this thing to a series of objects that are similar in most
dimensions, then the machine would learn to focus on the limited set of
dimensions that vary between those stimuli. In future encounters with
other sorts of stimuli, it will still focus on this limited set of
dimensions it has used in the past, and this would not be conducive to
successful categorizations. An example off the top of my head would be
to train the machine in human face discrimination (make the example
even more potent by saying they are sibling faces), and then require it
to discriminate between aircraft (or something else very different
from human faces... you know what I mean).

I would imagine this process is analogous to the one that children go
through in the "critical" years of development, and that the biased
machines may be analogous to the cats that can only see vertical lines
after being raised in a room that only has vertical lines on the wall.
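
To make the bias/variance trade-off a bit more concrete, here is a
small Python simulation. It uses the standard statistical sense of the
terms (bias: how far the average prediction is from the truth;
variance: how much the prediction changes from one training set to the
next), which is my assumption about what the passage intends. The
target function, the noise level and the two toy "models" are all
hypothetical choices for illustration, not anything from the paper.

```python
import math
import random
import statistics

def true_fn(x):
    """The regularity the learner is supposed to pick up."""
    return math.sin(2 * math.pi * x)

def make_training_set(rng, n=20, noise=0.3):
    """n noisy exemplars (x, y) with x drawn uniformly from [0, 1)."""
    xs = [rng.random() for _ in range(n)]
    return [(x, true_fn(x) + rng.gauss(0, noise)) for x in xs]

def rigid_model(train, x0):
    """Inflexible learner: ignores x0 and predicts the mean of all y."""
    return statistics.mean([y for _, y in train])

def flexible_model(train, x0):
    """Very flexible learner: copies the y of the nearest exemplar."""
    return min(train, key=lambda p: abs(p[0] - x0))[1]

def bias_and_variance(model, x0=0.25, trials=500, seed=0):
    """Squared bias and variance of the model's prediction at x0,
    estimated over many independently drawn training sets."""
    rng = random.Random(seed)
    preds = [model(make_training_set(rng), x0) for _ in range(trials)]
    bias_sq = (statistics.mean(preds) - true_fn(x0)) ** 2
    return bias_sq, statistics.pvariance(preds)

for model in (rigid_model, flexible_model):
    bias_sq, var = bias_and_variance(model)
    print(model.__name__, "bias^2 =", round(bias_sq, 3),
          "variance =", round(var, 3))
```

The rigid learner is wrong in the same way every time (high bias, low
variance), while the flexible one is right on average but copies the
noise of whichever exemplar happens to be nearest (low bias, high
variance).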

> 3.4.2. Dimensionality reduction

> Complex supervised categorization problems in high-dimensional
> spaces would be simplified if it were possible to reduce the
> dimensionality of the input. Several linear and nonlinear
> dimensionality reduction techniques have been designed to achieve
> this goal. Underlying dimensionality reduction is the idea that
> information processing is divided into two distinct stages. A first
> stage constructs a representation of the environment and a second
> stage uses this representation for higher-level cognition such as
> categorization and object recognition. It is hoped that the
> constructed representation in a smaller dimensional space is more
> useful than the raw input representation.

I don't get this, as I thought Schyns was arguing for categorization at
a high-dimensional level, and now he is trying to reduce it. Anyway,
it then says that human information processing starts off with raw
visual input that becomes more and more "chunked" and "organized" as it
gets to higher stages of visual processing, and then into even
higher, more general processes (such as remembering).

> To illustrate, consider the popular technique called Principal
> Components Analysis. If redundancies exist in the input data, there
> should be fewer sources of variation than there are dimensions
> (i.e., p << n). PCA finds the first k orthogonal directions... etc...

I think this is just a mathematical method for reducing data in
high-dimensional space. When we are analyzing objects in such a
high-dimensional space, there will obviously be lots of empty space, and
this just cuts that out.
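
For what it is worth, PCA is usually described as finding the
directions along which the data vary most, rather than as trimming
empty space directly. The Python sketch below finds only the first
principal component, using power iteration on the covariance matrix;
that is one simple way to do it, chosen here because it needs no
libraries. The data (points lying almost on the line y = 2x) are made
up, and all function names are my own.

```python
import math

def mean_center(points):
    """Subtract the per-dimension mean from every point."""
    n, dims = len(points), len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dims)]
    return [[p[d] - means[d] for d in range(dims)] for p in points]

def covariance_matrix(centered):
    """Population covariance matrix of already-centered points."""
    n, dims = len(centered), len(centered[0])
    return [[sum(p[i] * p[j] for p in centered) / n for j in range(dims)]
            for i in range(dims)]

def mat_vec(m, v):
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

def first_principal_component(points, iterations=200):
    """Return (unit direction of greatest variance, fraction of the
    total variance that lies along that direction)."""
    cov = covariance_matrix(mean_center(points))
    v = [1.0] * len(cov)
    for _ in range(iterations):          # power iteration
        w = mat_vec(cov, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(cov, v)))
    total_variance = sum(cov[i][i] for i in range(len(cov)))
    return v, eigenvalue / total_variance

# Two measured dimensions, but really only one source of variation.
points = [(i / 10, 2 * i / 10 + 0.01 * (-1) ** i) for i in range(50)]
direction, explained = first_principal_component(points)
print("direction:", direction, "variance explained:", explained)
```

Here two dimensions can be replaced by a single coordinate along the
recovered direction with almost no loss, which is the sense in which
redundancy in the input allows fewer dimensions.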

> Other dimensionality reduction techniques aim at reproducing the
> intrinsic structure of the input space... etc

> Dimensionality reduction techniques also need to give up generality
> for biases, at the expense of possibly missing "important"
> structures in the data. Nevertheless, the existence of
> low-dimensional somatosensory maps in cortex clearly demonstrates
> that brain structures are particularly adept at reducing
> high-dimensional inputs to lower-dimensional representations (see
> Kaas, 1995, for a review).

Dimension reduction should only separate the "wheat from the chaff". This
is of course possible, bearing in mind that humans do it every day.

> In analogy to the functional (re)organization of somatosensory maps,
> we would like the formal definition of "important lower-dimensional
> structures" to be closer to the categorization task the system needs
> to solve.

So, if we are to reduce to lower-dimensional models, the reduction
should not take place blindly (i.e., simply cutting out all redundant
info); instead, reduction should be done with the aims of the
categorization in mind...

> Recent approaches to dimensionality reduction have incorporated
> measures of "feature goodness" in the algorithm for determining good
> dimensions of recoding. For example, Intrator (1994; Intrator and
> Gold, 1993) discusses a technique in which input data are projected
> onto dimensions that have many distinct clusters of data points
> (multimodal distributions).

...and this is apparently what Intrator did. But won't this take us
back to the bias/variance tradeoff, specifically that bias toward the
"distinct clusters" will result? (Or again, I could have missed
something here...)

> In the reviewed dimensionality reduction techniques, the feature
> extraction stage operates independently of higher-level processes;
> thus there is no guarantee that the extracted features will be
> useful for higher-level processes (Mozer, 1994).

I think this is similar to the point I just made. If the machine
reduces data at a high-dimensional level, it should do it bearing in
mind some sort of context (the higher-level processes), for example
current goals and motivations.
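
The earlier idea of projecting onto dimensions that show distinct
clusters can be sketched roughly as follows. To be clear, this is NOT
Intrator's actual technique: as a stand-in I score each candidate
direction by the kurtosis of the projected data (a two-clump projection
tends to have low kurtosis), and I simply try a grid of angles. The
data are synthetic: two clusters along x, and one wide unclustered
spread along y.

```python
import math
import random

def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def kurtosis(values):
    """Fourth standardized moment; about 3 for a bell-shaped spread,
    close to 1 for two tight, equally sized clumps."""
    m = sum(values) / len(values)
    m4 = sum((v - m) ** 4 for v in values) / len(values)
    return m4 / variance(values) ** 2

def project(points, angle_deg):
    """Project 2-D points onto the direction at angle_deg from the x-axis."""
    a = math.radians(angle_deg)
    return [x * math.cos(a) + y * math.sin(a) for x, y in points]

rng = random.Random(0)
points = []
for i in range(200):
    centre = -2.0 if i % 2 == 0 else 2.0      # two clusters along x
    points.append((centre + rng.gauss(0, 0.3), rng.gauss(0, 3.0)))

angles = range(0, 180, 10)
most_clustered = min(angles, key=lambda a: kurtosis(project(points, a)))
most_variable = max(angles, key=lambda a: variance(project(points, a)))
print("angle with the most clustered projection:", most_clustered)
print("angle with the largest variance:", most_variable)
```

Note that neither criterion looks at category labels, which is the
worry raised in the quoted passage: the extracted dimension may or may
not be the one the categorization actually needs.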

> The functionality principle suggests that the categorizations being
> learned should influence the features that are extracted. In other
> words, top-down information should constrain the search for
> relevant dimensions/features of categorization.

When learning a new category, we are told the defining features of that
category. This should therefore determine the dimensions and features
that we are concerned with initially. As we get better at
discriminating the category concerned, the process becomes more
refined, in that we become more specific on what data is relevant and
what can be reduced.

> Thus, we believe the serial process of (1) projecting high dimension
> space onto a new lower dimension space, then (2) determining
> categorization with new dimensions, will have to be modified such
> that the second process informs the first (see also Intrator, 1993).

I think this is saying that objects are initially analyzed at a very
detailed level (in high-dimensional space) and are then reduced to
more manageable representations. The process takes advantage of a kind
of feedback loop, so that the defining low-dimensional features
of the category become more and more refined.
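
A toy way to see what "the second process informs the first" could
mean: compare picking a dimension blindly (largest variance) with
picking it using the category labels. The label-based score below is a
Fisher-style ratio (separation between the two category means divided
by the spread within categories). That criterion is my own stand-in,
not the mechanism proposed in the paper, and the exemplars are made up.

```python
def mean(values):
    return sum(values) / len(values)

def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def fisher_score(values, labels):
    """Squared distance between the two category means, divided by the
    summed within-category variances (two categories, labelled 0 and 1)."""
    a = [v for v, lab in zip(values, labels) if lab == 0]
    b = [v for v, lab in zip(values, labels) if lab == 1]
    return (mean(a) - mean(b)) ** 2 / (variance(a) + variance(b))

# Hypothetical exemplars (feature_0, feature_1) with category labels.
# feature_0 varies a lot but is unrelated to category;
# feature_1 varies little but separates the categories.
exemplars = [(-9.0, 1.0), (8.0, 1.2), (-7.0, 0.9), (10.0, 1.1),
             (9.0, 2.0), (-8.0, 2.2), (7.0, 1.9), (-10.0, 2.1)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

columns = list(zip(*exemplars))
blind_choice = max(range(2), key=lambda d: variance(columns[d]))
informed_choice = max(range(2), key=lambda d: fisher_score(columns[d], labels))
print("dimension kept by variance alone:", blind_choice)
print("dimension kept when labels inform the choice:", informed_choice)
```

The blind reduction keeps the noisy dimension and throws away the one
that matters for the task, whereas letting the categorization feed back
into the reduction keeps the diagnostic one.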

Walker-Hall James
