Re: France's HAL, OAI interoperability, and Central vs Institutional Repositories

From: Stevan Harnad <harnad_at_ecs.soton.ac.uk>
Date: Wed, 4 Oct 2006 00:35:52 +0100

On Tue, 3 Oct 2006, Franck Laloe wrote:

> At 18:39 02/10/2006, Stevan Harnad wrote:
> > (2) why do the deposits need to be directly in HAL, rather than in each
> > author's own Institutional Repository (IR), then harvested by HAL?
>
> There are several reasons for this:
>
> * Hal requires a structure of metadata where each
> author is linked to one (or several) laboratories,
> which in turn are linked to one (or several
> institutions).

But I don't understand: Can these complex affiliation metadata not be
provided (rather trivially)?

> This structure of metadata in Hal is richer than
> OAI-PMH (in its present state).

I understand: HAL does richer tagging than the OAI-PMH IRs. But that
tagging has to be added in any case: So why not add to the metadata
harvested from the distributed deposits in each author's own IR?
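To make "add to the harvested metadata" concrete, here is a minimal sketch in Python. It is not HAL's actual data model: the lookup tables, names and record fields are all hypothetical, and a real affiliation registry would be far larger. It only shows that the author -> laboratory -> institution chain can be attached to a record after it has been harvested from an IR.

```python
from dataclasses import dataclass, field

# Hypothetical lookup tables (assumptions for illustration only); a real
# registry of affiliations would be much larger and would change over time.
AUTHOR_LABS = {"A. Dupont": ["Lab X"]}
LAB_INSTITUTIONS = {"Lab X": ["Institution 1", "Institution 2"]}


@dataclass
class Record:
    identifier: str
    title: str
    authors: list
    # author -> {lab -> [institutions]}, filled in after harvesting
    affiliations: dict = field(default_factory=dict)


def enrich(record):
    """Attach the author -> lab -> institution chain to a harvested record."""
    for author in record.authors:
        labs = AUTHOR_LABS.get(author, [])
        record.affiliations[author] = {
            lab: LAB_INSTITUTIONS.get(lab, []) for lab in labs
        }
    return record


# A record as it might arrive from an IR, with no affiliation detail yet.
harvested = Record("oai:ir.example.org:1", "A paper", ["A. Dupont", "B. Martin"])
enriched = enrich(harvested)
```

Authors missing from the registry (here "B. Martin") simply get an empty entry, which marks the record as needing follow-up without blocking the harvest.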

> If we harvested all OAI-PMH compliant archives, this would
> introduce redundancy

So what? And there are ways to either remove or coalesce duplicates.
(The problem today -- we cannot remind ourselves often enough -- is too
*few* deposits [15%], not too many, nor too many duplicates!)
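One simple way to coalesce duplicates, sketched in Python under loud assumptions: records are plain dictionaries with hypothetical "title" and "source" fields, and two records count as the same paper when their normalised titles match. A production harvester would compare more than titles (authors, dates, identifiers); this only illustrates the idea.

```python
import re


def title_key(title):
    """Normalise a title: lower-case, punctuation and extra spaces removed."""
    return re.sub(r"[\W_]+", " ", title.lower()).strip()


def coalesce(records):
    """Merge records that share a normalised title, keeping every source."""
    merged = {}
    for rec in records:
        key = title_key(rec["title"])
        if key not in merged:
            # First sighting: keep this title and start a list of sources.
            merged[key] = {"title": rec["title"], "sources": []}
        merged[key]["sources"].append(rec["source"])
    return list(merged.values())


# Hypothetical input: the same paper harvested from two repositories.
harvested = [
    {"title": "Quantum  Measurement: A Review", "source": "ir-a"},
    {"title": "quantum measurement - a review.", "source": "ir-b"},
    {"title": "Another Paper", "source": "ir-a"},
]
unique = coalesce(harvested)
```

Nothing is thrown away: each merged record remembers every repository it was found in.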

> document with incomplete metadata (most institutional repositories only
> mention their institution, not the contribution of others, etc..).

We've agreed that if HAL aspires to have richer metadata than IRs with
their OAI-PMH then the extra tags have to be added; this is not an
answer to the question of why HAL insists on direct deposits, rather
than harvesting from IRs.

And you already mentioned earlier the baroque intricacies of institutional
affiliation -- but not why you think this trivial problem cannot
be handled easily by software, with the help of IR affiliations, lists
of co-authors and (why not) central authoritative lists chronicling
all French authors' current (and past!) complexities of affiliation,
turning them into explicit metadata tags...

> * Many OAI repositories do not guarantee sufficient quality,
> and even access to the full text.

The 1st, 2nd and Nth immediate problem today is *lack of content*
not low-quality metadata: The texts [85% of them] are not deposited at
all. The OA movement, and OA self-archiving mandates, are endeavouring
to get that content deposited. Authors' own IRs are the natural place
to deposit it, and to mandate depositing it.

Then (as agreed above) HAL can, if it wishes, harvest that content, and
improve its metadata. (Again, this is no argument against harvesting,
or in favour of direct deposit in HAL.)

As to texts that are deposited in IRs but not made OA: I *wish* that
were the only remaining problem, for I guarantee that if it were, it
would solve itself in short order. (The OA metadata would elicit email
eprint requests, and authors would soon tire of emailing eprints and
would instead set access to their deposited full-texts as OA instead of
Closed Access.)

But we do not have all or most or even much of the target literature --
the peer-reviewed research corpus -- deposited in IRs in Closed Access,
with only their (low-quality) metadata accessible: At least 85% of
OA's target content is not deposited at all.

So it seems to me HAL would benefit as much as everyone else from a
self-archiving mandate that would get all that content deposited;
so the only question is *who* will mandate it to be deposited,
and *where*?

So far, the two natural candidate mandaters are the researchers'
own institutions and funders. Clearly institutions have an interest in
mandating that the deposit should be in their own IRs (for institutional
visibility, prestige, and record-keeping). Funders (although some
unthinkingly insist on central deposits today, e.g., in PubMed Central)
are mostly indifferent to where their funded research is deposited,
as long as it is OAI-compliant and OA. So many mandate depositing in
the researcher's own IR too. And PubMed Central should be asking itself
the same questions I am asking you about HAL: Why not deposit in each
researcher's own OAI-compliant IR and simply *harvest* from there?

Institutions have a direct institutional interest in their own IRs;
they are the ones that can best monitor and reward compliance with
self-archiving mandates; and the spectrum of disciplines at research
institutions (mostly universities) effectively covers all of OA's target
content space (whereas central disciplinary and multidisciplinary
repositories do not).

A national repository like HAL is a very good idea, but unless the
problem of the means of mandating and monitoring direct self-archiving
in HAL by all French researchers has an immediate solution, at the very
least a hybrid deposit system would seem to be optimal: *either*
researchers deposit in their own IRs (subsequently harvested and
enhanced by HAL) *or* directly in HAL; but deposit they must.

> * Hal includes certification procedures (which we
> call "stamps") which do not exist in other open archives.

That's fine, but the non-existence that is the immediate problem is not
certificates but deposits! At least 85% of French research output is not
being self-archived at all. Institutional and funder self-archiving
mandates can remedy this, but are all or most of France's research
institutions more likely to agree to mandate and monitor depositing all
of their own output in HAL, or in their own IRs?

The "stamps" could come either way, either via harvesting from IRs or
via direct deposits.

> * In brief, we want Hal to be an homogeneous
> system, really usable by the reader, and by labs
> (even if they belong to several institutions) and
> institutions - all this through a single entry
> into the system.

If HAL can become a direct entry point for all French research
institutions, and they all agree to a means of mandating and monitoring
compliance, nolo contendere!

But what is sure is that a central repository and central depositing
is *not* the only way to get an OA corpus usable by all (authors
and users), in France and worldwide. On the contrary, the nature of
the Internet, the Web, the OAI protocol and any other richer metadata
tagging schemes is such that distributed interoperability -- rather than a
central locus and central management -- is far more likely to prove to be
the successful means of generating and using the OA corpus, in France
and worldwide.

> For instance, my lab belongs to 4 institutions, we do not want to put our
> articles into four open archives; one is enough.

First, if the 4 institutions don't want or need their research output to
be deposited in their own IRs, there is no need to do it. Perhaps the lab
itself will want to have its own IR. Moreover, harvesting works N ways:
Once a paper and its (OAI) metadata are deposited in one IR, other IRs
(as well as Central Repositories like HAL or PubMed Central or Arxiv)
can harvest it; or the author can import/export it to his multiple IR
affiliations. (And let us not forget that even direct deposit takes less
than ten minutes worth of keystrokes!)

In other words, with OAI harvestability, yes, one deposit is enough.
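For readers who have not seen it, a harvest is just an HTTP request to the repository's OAI-PMH endpoint. The sketch below, in Python, only builds the request URL and makes no network call. The base URL is a made-up example and the helper name is hypothetical; "ListRecords", "metadataPrefix", "from" and "set" are the protocol's own request parameters, and "oai_dc" is the Dublin Core format every OAI-PMH repository must support.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL (no network access)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        # Selective harvesting: only records changed since this date.
        params["from"] = from_date
    if set_spec:
        # Optional: restrict the harvest to one set defined by the repository.
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Hypothetical IR endpoint, used here purely for illustration.
url = list_records_url("https://ir.example.org/oai", from_date="2006-10-01")
```

Any number of harvesters, central or institutional, can issue the same request against the same single deposit.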

> I am just explaining what we do, and the strategy
> we chose (after much discussion!). I am not
> claiming that it is the best in the world, or
> even superior to others; actually, I know that
> you do not approve it, Stevan. But I personally
> believe in it, because I feel that it meets the
> quality that is necessary to build a real tool for research.

Franck, it has nothing to do with approval or disapproval. *Whatever*
system results in 100% of French research output being made OA (soon!)
-- whether by mandating direct deposit in HAL, or mandating local IR
deposit, or even (mirabile dictu) by having all journals convert to OA
publishing -- realises the goal of the OA movement: 100% OA for
peer-reviewed research output, now.

But is HAL's policy of central deposit and metadata enhancement
sufficient to generate that 100% self-archiving? For if not, then
whatever other desiderata it may be providing, it is not providing OA's
target content.

> * one can easily extract a local institutional
> repository from Hal, and even import all the data locally, if useful.

I don't doubt it. But you have not yet told me how you propose to get all
that content deposited in HAL in the first place, so that institutions
can then harvest back their own content from it: On the face of it,
it would seem that the institutions should be depositing their own
content in their own IRs directly, and HAL should be harvesting it, not
vice versa. But if you do have a plan for a national mandate to deposit
directly in HAL, I would say all this discussion is moot. Without such
a plan, however, this discussion is beside the point (at least insofar
as OA is concerned).

> * one can also transfer documents to Hal from
> local systems using the so called "webservice"
> techniques. In other words you can load documents
> into Hal from your local system for electronic
> documents, without knowing anything about Hal,
> *provided that* your metadata are Hal compatible.
> This is what several institutions are now doing in France.

The French institutions that have already succeeded in getting their
research output into their own IRs -- whether merely OAI-compliant IRs
or HAL-compliant IRs -- have already succeeded in solving the problem we
(or at least I!) am discussing here, for whatever contents they have
succeeded in getting deposited. My guess is that if these deposits are
unmandated, then they represent about 15% of those institutions' annual
research output, and we are back where we started.

The issue, au fond, is not *where* papers are deposited, but *whether*
they are deposited. The only reason I keep harping on institutional IR
depositing rather than central depositing is that institutions are the
primary content providers, in all research disciplines, and hence their
own IRs are the natural place to require their own researchers to
self-archive their own research output. Moreover, institutions cover
all research disciplines, hence all of OA's target content space.

It is virtually certain that the only way to attain 100% OA self-archiving
is via self-archiving mandates from researchers' institutions and
funders. Hence the only real question about IR deposit vs. HAL deposit
in France is whether the probability of a successful pandisciplinary,
paninstitutional national mandate to deposit in HAL is greater in
France than the probability of institutional and funder mandates to
self-archive institutionally. What is best for France is whichever of
these is in fact more likely.

> Let me finally add that Hal has been conceived to
> combine the advantage of disciplinary open
> archives (what scientists want)

I think that what you wanted to say, Franck, was that (many) scientists
want to be able to search and access all and only the relevant research in
their own disciplines. That they want it all to be in one discipline-based
"archive," and that that archive must have been deposited in directly
rather than harvested -- and even that the realisation of these wishes
requires the full richness of HAL's proposed metadata -- is rather a
theoretical assumption on the part of some, rather than an objective
statement of "what scientists want"...

> and institutional archives (which are indispensable if we want
> institutions to push scientists to deposit their
> [research output]).

This, I think, is closer to assumption-free objectivity: Institutions
*do* want their own output in their own IRs and not just in some
external discipline-based collective database. But here I would agree
with you: Harvesting could work in either direction, to give everyone
what they want.

But harvesting will not get undeposited content deposited; only mandates
will. So the question is whether institutions (and funders) are more
likely to be pushed to push their researchers to deposit their research
output in (1) international disciplinary archives like Arxiv or PubMed
Central, (2) national omnibus archives like HAL, or (3) their own
institutional IRs?

> You can create portals of Hal that
> are institutional, with the logo, words, etc.. of
> the institution, for both upload and download.

I agree completely that harvesting can go either way, so if, mirabile
dictu, HAL succeeded in getting all or most of French research output in
all disciplines directly deposited in HAL, then it would be trivial to
generate virtual IRs for each institution via back-harvesting.
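A minimal sketch of such back-harvesting, in Python, assuming each central record already carries a list of institutions (the field names and sample data are hypothetical, not HAL's schema). A paper from a lab shared by several institutions simply appears in each institution's view.

```python
from collections import defaultdict


def virtual_irs(records):
    """Group record identifiers by institution to form per-institution views."""
    by_institution = defaultdict(list)
    for rec in records:
        for institution in rec["institutions"]:
            by_institution[institution].append(rec["identifier"])
    return dict(by_institution)


# Hypothetical central records, already tagged with institutions.
central = [
    {"identifier": "rec-1", "institutions": ["Institution 1", "Institution 2"]},
    {"identifier": "rec-2", "institutions": ["Institution 2"]},
]
views = virtual_irs(central)
```

The central record is stored once; the institutional "repositories" are only filtered views over it.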

But how do you propose to get the content deposited in HAL in the first
place? You seem to be focussed on centrality and metadata enrichment,
but we need to hear about how you plan to get the content (and how much
of it): The target is 100% of French research output. The baseline today
is 15% spontaneous self-archiving: How do you plan to get from 15% to
100%, and when?

> But at the same time all the documents go to the
> same data base. This is technically possible, but
> requires the solid structure of metadata that I described above.

It requires something much harder to get than the solid metadata structure:
it requires 100% of the target content!

> I hope that I have explained the situation clearly.

As Fermat (or the hopeful builder of the perpetuum mobile) would have
conceded: there are still a few little details missing. In this case,
the detail concerns how you plan to get HAL filled. For without that, we
are talking about raising the quality standards and price for a product
that does not yet have any customers (apart from the 15% spontaneous
baseline)...

Stevan Harnad
American Scientist Open Access Forum
http://amsci-forum.amsci.org/archives/American-Scientist-Open-Access-Forum.html

Chaire de recherche du Canada
Ctr. de neuroscience de la cognition
Université du Québec à Montréal
Montréal, Québec
Canada H3C 3P8
http://www.crsc.uqam.ca/

Professor of Cognitive Science
Dpt. Electronics & Computer Science
University of Southampton
Highfield, Southampton
SO17 1BJ United Kingdom
http://www.ecs.soton.ac.uk/~harnad/
Received on Wed Oct 04 2006 - 00:42:39 BST