  1. DSpace
  2. DS-2599

DIM2DataCite crosswalk errors and annotation

    Details

    • Type: Bug
    • Status: Volunteer Needed
    • Priority: Trivial
    • Resolution: Unresolved
    • Affects Version/s: 4.0, 4.1, 4.2, 4.3, 5.0, 5.2
    • Fix Version/s: None
    • Component/s: DSpace API
    • Attachments: 0
    • Comments: 3
    • Documentation Status: Not Required

      Description

      I just started setting up DOIs for our repository and noticed some issues with the crosswalk.

      At the moment the supported version of the DataCite Kernel is 2.2; see
      http://schema.datacite.org/ for available versions.

      In this version the mandatory fields are 1-5, i.e. Identifier, Creator, Title, Publisher and PublicationYear, so we should address these first.

      For the rest, I think we should not ship DSpace with mappings, as much depends on how people handle their metadata. We might give examples (commented out) of how to handle additional fields.

      Here are my notes on the DIM2DataCite Crosswalk:
      https://github.com/DSpace/DSpace/blob/dspace-4_x/dspace/config/crosswalks/DIM2DataCite.xsl

      DataCite (1) - Template Call for DOI identifier.

      The cardinality for this element is 1, and its value should match the DOI to be registered, updated or deleted by this DSpace instance.
      Documents might already carry one or more DOIs, and the template below matches all of them in dc.identifier and dc.identifier.*, so more than one identifier can be emitted:
      <xsl:template match="dspace:field[@mdschema='dc' and @element='identifier' and starts-with(., 'http://dx.doi.org/')]">
        <identifier identifierType="DOI">
          <xsl:value-of select="substring(., 19)"/>
        </identifier>
      </xsl:template>
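
      One way to keep the cardinality at 1 would be to emit only the first DOI that falls under this instance's own prefix. This is a minimal sketch, not the shipped crosswalk: the `prefix` parameter, its placeholder value and the `doi-identifier` template name are assumptions of mine, and the `xsl` and `dspace` prefixes are assumed to be declared as in DIM2DataCite.xsl.

      ```xml
      <!-- Hypothetical parameter: the DOI prefix assigned to this DSpace instance (placeholder value). -->
      <xsl:param name="prefix">10.5072/</xsl:param>

      <!-- Named template (hypothetical name): emits at most one <identifier>. -->
      <xsl:template name="doi-identifier">
        <!-- First dc.identifier(.*) value, in document order, that is a DOI under our own prefix. -->
        <xsl:variable name="own-doi"
            select="(//dspace:field[@mdschema='dc' and @element='identifier'
                     and starts-with(., concat('http://dx.doi.org/', $prefix))])[1]"/>
        <xsl:if test="$own-doi">
          <identifier identifierType="DOI">
            <!-- 'http://dx.doi.org/' is 18 characters long, so the DOI starts at position 19. -->
            <xsl:value-of select="substring($own-doi, 19)"/>
          </identifier>
        </xsl:if>
      </xsl:template>
      ```

      It would be invoked with <xsl:call-template name="doi-identifier"/> at the point where the identifier is written, replacing the match template above.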

      DataCite (4) - Publisher
      Should take existing dc.publisher into account and only fall back on the default defined variable $publisher if no publisher is given.
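
      A possible shape for this fallback, as a sketch only: it assumes the fragment sits where the crosswalk currently writes the publisher element, and that `$publisher` is the default variable mentioned above.

      ```xml
      <publisher>
        <xsl:choose>
          <!-- Prefer the item's own dc.publisher when there is one. -->
          <xsl:when test="//dspace:field[@mdschema='dc' and @element='publisher']">
            <xsl:value-of select="(//dspace:field[@mdschema='dc' and @element='publisher'])[1]"/>
          </xsl:when>
          <!-- Otherwise fall back on the configured default. -->
          <xsl:otherwise>
            <xsl:value-of select="$publisher"/>
          </xsl:otherwise>
        </xsl:choose>
      </publisher>
      ```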

      DataCite (5) - Publication Year

      The crosswalk tests for dc.date.available but disseminates dc.date.issued. This results in an empty publicationYear, because dc.date.available is only taken into account if no dc.date.issued is given:
      <xsl:when test="//dspace:field[@mdschema='dc' and @element='date' and @qualifier='available']">
        <xsl:value-of select="substring(//dspace:field[@mdschema='dc' and @element='date' and @qualifier='issued'], 1, 4)" />
      </xsl:when>
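
      A corrected version could test and disseminate the same field in each branch. The sketch below assumes the branches live inside a publicationYear element, as the surrounding xsl:choose in the crosswalk suggests; only the fields named above are used.

      ```xml
      <publicationYear>
        <xsl:choose>
          <!-- Use dc.date.issued when it exists ... -->
          <xsl:when test="//dspace:field[@mdschema='dc' and @element='date' and @qualifier='issued']">
            <xsl:value-of select="substring((//dspace:field[@mdschema='dc' and @element='date' and @qualifier='issued'])[1], 1, 4)"/>
          </xsl:when>
          <!-- ... and only then fall back on dc.date.available, reading the field that was tested. -->
          <xsl:when test="//dspace:field[@mdschema='dc' and @element='date' and @qualifier='available']">
            <xsl:value-of select="substring((//dspace:field[@mdschema='dc' and @element='date' and @qualifier='available'])[1], 1, 4)"/>
          </xsl:when>
        </xsl:choose>
      </publicationYear>
      ```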

      DataCite (9) - Adds Language information

      The DataCite metadata schema (v2.2) allows ISO 639-2/B (legacy) or, better, ISO 639-3 three-letter language codes.
      DSpace populates dc.language.iso with ISO 639-1 two-letter codes, see
      https://github.com/DSpace/DSpace/blob/dspace-4_x/dspace/config/input-forms.xml#L354
      The DIM2DataCite crosswalk does not transform the language codes, it just replaces "_" with "-":
      <xsl:template match="//dspace:field[@mdschema='dc' and @element='language' and (@qualifier='iso' or @qualifier='rfc3066')]">
        <xsl:for-each select=".">
          <xsl:element name="language">
            <xsl:choose>
              <xsl:when test="contains(string(text()), '_')">
                <xsl:value-of select="translate(string(text()), '_', '-')"/>
              </xsl:when>
              <xsl:otherwise>
                <xsl:value-of select="string(text())"/>
              </xsl:otherwise>
            </xsl:choose>
          </xsl:element>
        </xsl:for-each>
      </xsl:template>
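
      If the crosswalk is to deliver three-letter codes, it needs an explicit lookup. The following is a rough sketch covering only four languages; the list is illustrative and would have to be extended to whatever the local input form offers. Values without a mapping produce no language element at all.

      ```xml
      <xsl:template match="dspace:field[@mdschema='dc' and @element='language' and @qualifier='iso']">
        <!-- Strip any region suffix: "en_US" and "en-US" both become "en". -->
        <xsl:variable name="code"
            select="substring-before(concat(translate(., '_', '-'), '-'), '-')"/>
        <!-- Illustrative ISO 639-1 to ISO 639-3 lookup; extend as needed. -->
        <xsl:variable name="iso3">
          <xsl:choose>
            <xsl:when test="$code='en'">eng</xsl:when>
            <xsl:when test="$code='de'">deu</xsl:when>
            <xsl:when test="$code='fr'">fra</xsl:when>
            <xsl:when test="$code='es'">spa</xsl:when>
          </xsl:choose>
        </xsl:variable>
        <!-- Emit nothing when the code is unknown. -->
        <xsl:if test="string($iso3) != ''">
          <language>
            <xsl:value-of select="$iso3"/>
          </language>
        </xsl:if>
      </xsl:template>
      ```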

      DataCite (10), DataCite (10.1) - Adds resourceType and resourceTypeGeneral information

      DSpace uses the DCMI Type vocabulary to populate the field dc.type, see
      https://github.com/DSpace/DSpace/blob/dspace-4_x/dspace/config/input-forms.xml#L263

      a) DIM2DataCite matches dc.type.* to the resource types
      <xsl:template match="//dspace:field[@mdschema='dc' and @element='type']">
      If people have implemented other type vocabularies and provide them as dc.type.XX, these will be mapped too.

      b) The detailed mappings are not really a fit
      Actually I think one cannot map these. It depends on the data people have, and at this point the crosswalk tries to map publication types to formats.
      A publication type can have any format: a recorded lecture might be sound, whereas its script is text ...

      <xsl:when test="string(text())='Animation'">Image</xsl:when>
      <xsl:when test="string(text())='Article'">Text</xsl:when>
      <xsl:when test="string(text())='Book'">Text</xsl:when>
      <xsl:when test="string(text())='Book chapter'">Text</xsl:when>
      <xsl:when test="string(text())='Learning Object'">InteractiveResource</xsl:when>
      <xsl:when test="string(text())='Map'">Image</xsl:when>
      <xsl:when test="string(text())='Musical Score'">Sound</xsl:when>
      <xsl:when test="string(text())='Plan or blueprint'">Image</xsl:when>
      <xsl:when test="string(text())='Preprint'">Text</xsl:when>
      <xsl:when test="string(text())='Presentation'">Image</xsl:when>
      <xsl:when test="string(text())='Technical Report'">Text</xsl:when>
      <xsl:when test="string(text())='Thesis'">Text</xsl:when>
      <xsl:when test="string(text())='Working Paper'">Text</xsl:when>

      DIM2DataCite maps dc.type="Other" and anything unmatchable to "Collection". If something cannot be mapped, it should be omitted.
      Note that DataCite Kernel v3.1, see https://github.com/datacite/schema/blob/master/www/meta/kernel-3.1/include/datacite-resourceType-v3.xsd,
      allows more values.
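
      To illustrate "omit instead of Collection", here is a sketch that restricts the match to unqualified dc.type and leaves out the element when no mapping applies. The three mappings are taken from the list above purely as examples; whether any mapping should ship at all is the open question of this issue.

      ```xml
      <!-- not(@qualifier): locally defined dc.type.XX vocabularies are left alone. -->
      <xsl:template match="dspace:field[@mdschema='dc' and @element='type' and not(@qualifier)]">
        <xsl:variable name="general">
          <xsl:choose>
            <xsl:when test=".='Article'">Text</xsl:when>
            <xsl:when test=".='Book'">Text</xsl:when>
            <xsl:when test=".='Thesis'">Text</xsl:when>
            <!-- Deliberately no xsl:otherwise: unmapped types leave $general empty. -->
          </xsl:choose>
        </xsl:variable>
        <!-- Only write resourceType when a mapping was found. -->
        <xsl:if test="string($general) != ''">
          <resourceType resourceTypeGeneral="{$general}">
            <xsl:value-of select="."/>
          </resourceType>
        </xsl:if>
      </xsl:template>
      ```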

      DataCite (13) - Adds Size information
      and
      DataCite (14) - Adds Format information

      Since DSpace 1.4.1, dc.format.extent and dc.format.mimetype are no longer added automatically (this is bitstream metadata, not item metadata, and there was no way of knowing which metadata belonged to which bitstream), although the metadata fields still exist in the metadata registry.
      Thus these fields should not be used. There might be some instances which have legacy dc.format.extent metadata.

      DataCite (17) - Description
      The allowed values are:
      https://github.com/datacite/schema/blob/master/www/meta/kernel-2.2/include/datacite-descriptionType-v2.xsd
      DIM2DataCite only matches dc.description.abstract to "Abstract" and anything else to "Other", whereas the default metadata schema in DSpace also has
      dc.description.tableofcontents, and holds the actual series information in dc.relation.ispartofseries.
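
      A sketch of a finer-grained mapping, assuming the descriptionType list linked above includes "TableOfContents" and "SeriesInformation" (please check the XSD), and assuming both templates are applied from inside the descriptions wrapper element:

      ```xml
      <xsl:template match="dspace:field[@mdschema='dc' and @element='description']">
        <description>
          <xsl:attribute name="descriptionType">
            <xsl:choose>
              <xsl:when test="@qualifier='abstract'">Abstract</xsl:when>
              <xsl:when test="@qualifier='tableofcontents'">TableOfContents</xsl:when>
              <xsl:otherwise>Other</xsl:otherwise>
            </xsl:choose>
          </xsl:attribute>
          <xsl:value-of select="."/>
        </description>
      </xsl:template>

      <!-- Series information lives in dc.relation.ispartofseries, not in dc.description.* -->
      <xsl:template match="dspace:field[@mdschema='dc' and @element='relation' and @qualifier='ispartofseries']">
        <description descriptionType="SeriesInformation">
          <xsl:value-of select="."/>
        </description>
      </xsl:template>
      ```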

              People

              • Assignee: Unassigned
              • Reporter: Claudia Jürgen (cjuergen)
              • Votes: 1
              • Watchers: 4
