Text authentication, plagiarism, and degree-of-authorship

From: Stevan Harnad <harnad_at_ecs.soton.ac.uk>
Date: Tue, 23 Jul 2002 16:57:57 +0100

On Tue, 23 Jul 2002, Alan Story wrote:

> So the phrase " text authorship" is the solution, is it? And "authorship",
> unlike " property" is merely a neutral word with none of its own baggage?
> Too bad Michel Foucault is not a member of this list.

Alan has a valid point. Authorship is a slippery slope.

Again, this is not my area of expertise, nor do I believe it has much
bearing on the purpose of this Forum, because the authorship of texts
in peer-reviewed journals is, with perhaps a few exceptions, not
problematic or disputed. (The only problem is getting authors to
free access to those texts, at last, by self-archiving them.)

However, it is clear that in the digital world, where plagiarizing will be
so much easier to do, we will need digitometric tools to assess
authorship. The slippery slope is obvious: If I take your text and add
or subtract one word, does that make it my text? No? Well then, how different
does it have to be to be a different text? (With this, as Shaw quipped
about another p-word, "Madame, we have already established your profession;
we are merely haggling over the price").

There are already some digitometric tools being developed (e.g.,
Latent Semantic Indexing http://www.cs.utk.edu/~lsi/) to detect
plagiarism (mostly student plagiarism and software plagiarism), but it
seems obvious, language being the recombinatory symbolic skill it is,
that the differences between any pair of texts, whether by the same or
different authors, cannot be absolute or all-or-none but just a matter of
degree.

So we will no doubt develop quantitative norms for the degree of
textual disparity that there is between (1) two arbitrary independent
texts, (2) two texts by different authors on the same topic, (3) two
texts by the same author, etc. Authorship will not only be a matter of
degree, but statistical: I doubt that the degree of quantitative overlap
in these various categories will be continuous and linear. The average
overlap between arbitrary texts may hover at or below 20% on some measure
(I am merely inventing here), between texts by different authors on
the same topic it may jump to 60% and between different texts by the
same author on the same topic it might be 90%. If the variance of these
values is fairly tight, it will be fairly easy to assign them to their
appropriate categories, leaving the range of, say, 95% and above safely
inhabited only by variants of the same text, written by the same author
(or plagiarized by someone else).
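
To make the idea of banded overlap concrete, here is a minimal sketch in
Python. It is not Latent Semantic Indexing, and it is not a proposal for the
actual measure: it simply uses Jaccard overlap of word triples as a stand-in
for "some measure". The function names and the cut-off values are
assumptions of this sketch; the cut-offs are placed roughly midway between
the invented figures above (20%, 60%, 90%) and would have to be replaced by
empirically derived norms.

```python
import re


def shingles(text, n=3):
    """Return the set of n-word sequences (shingles) in text, case-folded."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < n:
        # Too short for a full shingle: treat the whole text as one unit.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap(a, b, n=3):
    """Jaccard overlap of the two shingle sets, as a fraction in [0, 1]."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 1.0  # two empty texts are trivially identical
    return len(sa & sb) / len(sa | sb)


def categorize(score):
    """Assign an overlap score to a band (hypothetical cut-offs)."""
    if score >= 0.95:
        return "variant of the same text"
    if score >= 0.75:
        return "same author, same topic"
    if score >= 0.40:
        return "different authors, same topic"
    return "independent texts"


original = "one two three four five six seven eight nine ten"
altered = original + " eleven"  # the "add one word" case
score = overlap(original, altered)
print(round(score, 3), categorize(score))
```

On a text this short, one added word already moves the score well away from
1.0; on a full-length paper the same change would barely register, which is
why the bands would need to be calibrated on real corpora and why the
variance within each category matters.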

To take a plagiarized text and drive the overlap down to 60% or lower,
so as to put it into the range of texts by different authors on the same
topic, and have it still intelligible and informative, would (I
conjecture) require more intelligence than to write the paper for
oneself.

At least to meet peer-review standards. For student chicanery I plead
nolo contendere.

Stevan Harnad
Received on Tue Jul 23 2002 - 16:57:57 BST