
Bibliographic data, part 2: Dublin Core’s dirty little secret | The Reinvigorate...

source link: https://reprog.wordpress.com/2010/09/03/bibliographic-data-part-2-dublin-cores-dirty-little-secret/

[This is part two in a series — you should read part 1 first for context and then you might go on to part 3.]

The Dublin Core — metadata made dumb

Just when librarians were in despair of ever getting their data out to the world in a form it could understand, along came the Dublin Core (DC for short) — a simple set of fifteen metadata elements (contributor, coverage, creator, date, description, format, identifier, language, publisher, relation, rights, source, subject, title, and type) that could be used to describe “document-like objects” such as books, journal articles and web pages.

Everyone in the library world got really excited about the Dublin Core for about three weeks in 1999, before realising that you can’t actually do anything with those elements beyond expressing author (called “creator”), title and date. Everything else was too vague to be of any use: coverage, anyone? Relation? Format?

If you don’t believe me, try translating a reference to a journal article into DC — for example, this one that we used in the previous article:

Taylor, Michael P. and Darren Naish. 2007. An unusual new neosauropod dinosaur from the Lower Cretaceous Hastings Beds Group of East Sussex, England. Palaeontology 50 (6): 1547-1564. doi:10.1111/j.1475-4983.2007.00728.x

I can easily see how to map the author, date and articleTitle, but not the journalTitle, volume, issue, startPage, endPage or DOI.  So one third of the elements are representable.
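To make that tally concrete, here is a minimal sketch. The field names are the ones used above; the `TO_SIMPLE_DC` table is my own assumption about which fields map cleanly, not anything the Dublin Core documents prescribe.

```python
# Fields of the sample citation, named as in the text above.
citation = {
    "author": ["Taylor, Michael P.", "Naish, Darren"],
    "date": "2007",
    "articleTitle": "An unusual new neosauropod dinosaur from the Lower "
                    "Cretaceous Hastings Beds Group of East Sussex, England",
    "journalTitle": "Palaeontology",
    "volume": "50",
    "issue": "6",
    "startPage": "1547",
    "endPage": "1564",
    "DOI": "10.1111/j.1475-4983.2007.00728.x",
}

# Assumed mapping: only the fields the text says map easily onto core DC.
TO_SIMPLE_DC = {"author": "creator", "date": "date", "articleTitle": "title"}

# Split the citation into what survives the mapping and what is left over.
mapped = {TO_SIMPLE_DC[k]: v for k, v in citation.items() if k in TO_SIMPLE_DC}
unmapped = [k for k in citation if k not in TO_SIMPLE_DC]

print(f"{len(mapped)} of {len(citation)} fields mapped")
print("left over:", ", ".join(unmapped))
```

Three of nine fields survive; the other six have nowhere to go.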

The Dublin Core people quickly realised that while the fifteen core elements are OK for describing web-pages (which is frankly what they were designed for, despite all the “cross-domain” rhetoric), they were not much use for describing, well, anything else. Not beyond the absolute basics, anyway.

[Oh, and by the way: there was, and is, no standard XML format for Dublin Core, merely guidelines for how to roll your own. Just in case you were wondering. There are standard element names to use (e.g. <dc:title>) but no standard wrapper element to represent the record as a whole.]

Qualified Dublin Core — metadata made slightly less dumb

The solution to the paucity of Dublin Core elements was this thing called “qualified Dublin Core” (although that term doesn’t seem to be used much any more), in which the fifteen core elements are qualified to make them more specific: for example, dateAccepted, dateSubmitted and dateCopyrighted are refinements of the core element date. According to the Dublin Core’s own dumb down principle, “a client should be able to ignore any qualifier and use the value as if it were unqualified […] Qualification is therefore supposed only to refine, not extend the semantic scope of an Element.” Sounds good, right?

Except:

  • There is still no canonical XML representation for Dublin Core records, only canonical XML element names for Dublin Core elements.
  • The XML representation of dateAccepted is not, as you might expect, <dc:date type="accepted"> but <dcterms:dateAccepted>, which means you can’t implement the dumb down principle just by discarding qualifiers; you need to encode specific knowledge of how to map “qualified” to core DC elements in your application. In other words, “qualified Dublin Core” is not qualified at all.
  • The dcterms namespace has its own instances of the fifteen core elements, so when you want to add a contributor, you have to choose (how?) between <dc:contributor> and <dcterms:contributor>.
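What “encoding specific knowledge” might look like in practice is sketched below. The `REFINES` table and the `dumb_down` function are hypothetical names of my own, and the table lists only a handful of refinements; a real client would need an entry for every dcterms term it might meet.

```python
# Assumed lookup table: a few dcterms refinements and the core element
# each one refines. This is deliberately incomplete.
REFINES = {
    "dateAccepted": "date",
    "dateSubmitted": "date",
    "dateCopyrighted": "date",
    "issued": "date",
    "isPartOf": "relation",
    "bibliographicCitation": "identifier",
}

# The fifteen core element names, which exist in both namespaces.
CORE = {
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
}


def dumb_down(prefix, name):
    """Return the core element for a (prefix, local name) pair, or None."""
    if name in CORE:
        # dc:contributor and dcterms:contributor both end up here.
        return name
    if prefix == "dcterms":
        # Nothing in the element name says what it refines; only the table does.
        return REFINES.get(name)
    return None


print(dumb_down("dcterms", "dateAccepted"))
print(dumb_down("dcterms", "someFutureTerm"))
```

The second call returns None: a term the table has never heard of cannot be dumbed down at all, which is exactly what a `type` attribute would have avoided.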

All of this, inexplicable though it may appear, would perhaps be tolerable, were it not for the core incompetence of the Dublin Core model. And here at last we come to the promised Dirty Little Secret …

Even qualified Dublin Core can’t describe a journal article

When I first heard this, I flatly refused to believe it. It seemed impossible that anyone could design a metadata element set for describing documents and have it not able to describe a journal article. But it is, amazingly, quite true. When I made my best effort to render the reference above into Qualified Dublin Core, I found that I was able to represent only one additional field (the DOI, and that not very well) beyond the three basic elements (authors, date, title) that basic Dublin Core allowed me.

Critics, with the exception of Oscar Wilde, seem mostly to agree that the death of Little Nell (in Dickens’s Old Curiosity Shop) is one of the saddest passages in literature. Personally, I lean more towards the separation of Rose and Doctor at the end of Doomsday (you know, before the They Can Never See Each Other Again Because The Path Between Universes Has Closed Forever thing got downgraded to She Can’t Appear Again Until The Fourth Season Due To Other Work Commitments). Others may cite the ending of Old Yeller or the departure of the ring-bearers to the Undying Lands after the scouring of the Shire. But for me, the most tragic document ever written is Guidelines for Encoding Bibliographic Citation Information in Dublin Core Metadata: four and a half thousand words of desperate flailing that could have been summarised as “Don’t even bother trying, it just doesn’t work”.

Turns out that the Qualified Dublin Core solution to the problem of citing journal articles was to add — get ready for this — a bibliographicCitation element. Oh, joy! And so the introduction of the Guidelines document concludes with the observation:

Before the introduction of the Dublin Core term ‘bibliographicCitation’ it was not obvious how to describe fully a journal article using Dublin Core metadata. There was no suitable Dublin Core property to capture the journal title, as distinct from the article title, or the volume, issue and page details, other than as part of a general description.

Thank heavens that’s changed! Now, instead of shoving the journal title, volume, issue and page details into an undifferentiated lump of text in the description field, we can shove the journal title, volume, issue and page details into an undifferentiated lump of text in the bibliographicCitation field!

This, let me remind you, in a specification that includes SEVENTY data elements — the original fifteen core elements, plus 55 added in Qualified DC, of which 15 are duplicates of the originals. And in those 70 elements they couldn’t make room for journal title? Seriously?

The official, sanctioned, allegedly interoperable encoding of my perfectly simple article citation into Dublin Core

Here it is, folks, based on the Guide. Read it and weep:

<mikesMadeUpNamespace:article
    xmlns:mikesMadeUpNamespace="whatever"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/">
  <dc:creator>Michael P. Taylor</dc:creator>
  <dc:creator>Darren Naish</dc:creator>
  <dcterms:issued>2007</dcterms:issued>
  <dc:title>An unusual new neosauropod dinosaur
    from the Lower Cretaceous Hastings Beds Group
    of East Sussex, England.</dc:title>
  <dcterms:isPartOf>urn:ISSN:0031-0239</dcterms:isPartOf>
  <dc:publisher>Blackwell</dc:publisher>
  <dc:type>Text</dc:type>
  <dcterms:bibliographicCitation>
    Palaeontology 50(6), 1547-1564. (2007)
  </dcterms:bibliographicCitation>
  <dc:identifier>info:doi:10.1111/j.1475-4983.2007.00728.x</dc:identifier>
</mikesMadeUpNamespace:article>

It makes me want to cry.

Note that:

  • There is still no standard XML format for Dublin Core records, so I had to make up my own wrapper element (which of course can’t be in either of the two DC namespaces).
  • For the actual elements, I am supposed to use a mixture of elements from dc and dcterms namespaces.
  • The element containing the publication date is not called publicationDate or datePublished, nor even issuedDate or dateIssued, but just issued — unlike, for example, dateSubmitted or dateAccepted.
  • The best I can do by way of trying to express the journal title is to use the dcterms:isPartOf element and give it the ISSN of the journal (wrapped up as a URI), in the hope that whoever uses this record will go and look that ISSN up to find out what journal it pertains to.
  • The publisher is considered an important part of the citation (unlike, say, the journal title, volume, issue or page-range) despite the fact that journal-article citations never include the publisher.
  • It’s considered important to state that the type of the referenced item is Text.
  • The type “Text” is drawn from a vocabulary whose URI is known (I got it from the Guide) but I couldn’t figure out what XML attribute I am supposed to use to point to that URI.
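To see what a client actually gets out of such a record, here is a small sketch using Python’s standard xml.etree.ElementTree. The record is a trimmed, namespace-complete copy of the one above: the `m:article` wrapper and its “whatever” URI are made up, as before, while the dc and dcterms namespace URIs are the usual published ones.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map used for lookups; the prefixes here are local choices.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Trimmed copy of the record; the wrapper element is invented.
RECORD = """\
<m:article xmlns:m="whatever"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/">
  <dc:creator>Michael P. Taylor</dc:creator>
  <dc:creator>Darren Naish</dc:creator>
  <dcterms:issued>2007</dcterms:issued>
  <dcterms:isPartOf>urn:ISSN:0031-0239</dcterms:isPartOf>
  <dcterms:bibliographicCitation>
    Palaeontology 50(6), 1547-1564. (2007)
  </dcterms:bibliographicCitation>
</m:article>
"""

root = ET.fromstring(RECORD)

# These come out as clean, separate values.
creators = [e.text for e in root.findall("dc:creator", NS)]
issued = root.findtext("dcterms:issued", namespaces=NS)

# Journal title, volume, issue and pages all arrive as one string.
blob = root.findtext("dcterms:bibliographicCitation", namespaces=NS).strip()

print(creators)
print(issued)
print(repr(blob))
```

The creators and the date can be pulled out by element name. Everything that identifies where the article was published comes back as a single string that the client still has to pick apart.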

And of course all of this is on top of the utterly baffling brain-damage that is the bibliographicCitation element. And by the way, if the sample bibliographicCitation above doesn’t seem too dreadful to you, then consider this sample Big Undifferentiated Blob Of Text, straight from the Guide:

Proceedings of the International Conference on Dublin Core and metadata for e-communities, 2002; DC-2002: Metadata for e-Communities: Supporting Diversity and Convergence, Florence, Italy, 13-17 October 2002, pp 71-80

bibliographicCitation format: the pain, the glory, the other pain

But at least the client software can reliably parse the journal title, volume, issue, start-page and end-page out of the bibliographicCitation, right?  I mean, it must be in a standard format, right?

Right?

Viewers of a sensitive disposition might wish to look away now.

Here’s what section 2.2 of the Guide says:

Plain text citations may be according to a recognised citation style. Several styles were reviewed by the DCMI Citation Working Group, and are listed on a Citation Styles page, but there is no particular recommendation for choice of style.

And indeed the two bibliographicCitation samples above are in noticeably different formats, even allowing that one is for a journal article and the other for a paper in a proceedings volume: for example, the date is parenthesised in the former but not in the latter.

Oh, and from section 2.1:

Other details of the resource, such as its title and creators, will be described using the usual Dublin Core properties. Optionally, but redundantly, these details may be included in the citation as well.

In other words, any old crap can be shoved in the bibliographicCitation field.

So let’s review: the official way to represent journal title, volume number, issue number, start page and end page in the 70-element Qualified Dublin Core set is: jam them, and quite possibly some other data you happen to have lying around, together into a text blob in any format you happen to feel like.

Of course, for a computer reading the XML to make any use of this information, it will need to parse the bibliographicCitation to figure out what the journal title, etc., are. But since that field can contain any combination of elements in any format, any parser will need to try all sorts of heuristics to match the format and figure out which bits represent what data. Which of course is exactly what you’d have to do if all you had to work with was the plain-text citation that we started this article with, long, long ago.
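Here is a sketch of where that heuristic road begins. The `JOURNAL_STYLE` pattern is hypothetical: I wrote it to fit my own sample citation, and nothing in the Guide says any citation has to look like this.

```python
import re

# Hypothetical pattern for the shape "Journal VOL(ISSUE), START-END. (YEAR)".
JOURNAL_STYLE = re.compile(
    r"^(?P<journal>.+?)\s+(?P<volume>\d+)\s*\((?P<issue>\d+)\),\s*"
    r"(?P<start>\d+)-(?P<end>\d+)\.\s*\((?P<year>\d{4})\)$"
)


def parse_citation(text):
    """Return the fields if the text fits the one known style, else None."""
    m = JOURNAL_STYLE.match(text.strip())
    return m.groupdict() if m else None


samples = [
    "Palaeontology 50(6), 1547-1564. (2007)",
    "Proceedings of the International Conference on Dublin Core and metadata "
    "for e-communities, 2002; DC-2002: Metadata for e-Communities: Supporting "
    "Diversity and Convergence, Florence, Italy, 13-17 October 2002, pp 71-80",
]

for s in samples:
    print(parse_citation(s))
```

The first sample parses into journal, volume, issue, pages and year. The second, taken straight from the Guide, returns None, so the parser needs another pattern, and then another for the next style someone happens to feel like using.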

TO SUMMARISE: QUALIFIED DUBLIN CORE, WITH ITS 70 FIELDS, IS NO MORE USEFUL FOR EXPRESSING JOURNAL-ARTICLE CITATIONS THAN PLAIN TEXT.

Oh, am I shouting? Sorry.

Appendix. Don’t even get me started on the use of the OpenURL 1.0 (ANSI/NISO Z39.88) ContextObject KEV format as an alternative for the content of the <bibliographicCitation> field

Having written that heading, I feel no need to expand further on it.

OK, I’m out of here. I need to take a shower.

Tune in next time for yet more pain.

