DSpace / DS-1226

Batch import from basic bibliographic formats (Endnote, BibTex, RIS, TSV, CSV)

    Details

    • Attachments:
      4
    • Comments:
      36

      Description

      This proposed extension (implemented by the National Documentation Centre/EKT - http://www.ekt.gr) allows batch import of metadata (and/or bitstreams) into DSpace using the import script and the Biblio-Transformation-Engine tool. The input can be in any bibliographic format; this patch includes support for the Endnote, RIS, BibTex, TSV and CSV formats.

      The Biblio-Transformation-Engine (http://code.google.com/p/biblio-transformation-engine/) is an open source Java framework developed by the Hellenic National Documentation Centre (EKT, www.ekt.gr). It consists of programmatic APIs for filtering and modifying records retrieved from various types of data sources (e.g. databases, files, legacy data sources) and for outputting them in appropriate standard formats (e.g. database files, txt, xml, Excel). The framework includes independent abstract modules that are executed separately, in many cases offering the user alternative choices depending on the input data set, the transformation workflow to be executed and the output format to be generated.

      Thus, the attached patch adds support for using the Biblio-Transformation-Engine in the DSpace batch import procedure, where the user only needs to specify the mapping between the input metadata and DSpace metadata. Default mappings are also provided for the default DSpace Dublin Core metadata schema.


      USEFULNESS
      ---------------------
      Suppose a researcher at your institute provides you with a file of his/her publications that you need to import into the repository. If the file is in one of the following formats: CSV, TSV, Endnote, BibTex, RIS (formats commonly used for bibliographic metadata), a single command imports all the records into the DSpace repository, while configuration files control which metadata is imported and which DC field (or field of any other schema in the DSpace repository) each value maps to.

      For those familiar with the Biblio-Transformation-Engine, this extension is powerful because they can write their own DataLoaders to support more input formats. Filtering records and modifying metadata is also possible with very little effort (using the Biblio-Transformation-Engine's filters and modifiers). The same applies to adding bitstreams to the records.


      CONFIGURATION FILES
      ---------------------------------------
      Since the Biblio-Transformation-Engine supports Spring, the only configuration files the user must work with are the Spring XML files for dependency injection. These files are located in the "config" directory, and in them the user can specify the mapping between the input metadata and the DSpace Dublin Core schema (or any other schema in their repository).


      EXTERNAL LIBRARIES
      -----------------------------------
      This extension makes use of three external Java libraries:
      a) jbibtex, a Java library for reading BibTex files (under the BSD licence - http://www.linfo.org/bsdlicense.html)
      b) opencsv, a Java library for reading CSV files (under the Apache License V2.0 - http://www.apache.org/licenses/LICENSE-2.0)
      c) biblio-transformation-engine, a Java library for metadata transformation, filtering and modification (under the European Union Public Licence (EUPL), http://www.osor.eu/eupl/european-union-public-licence-eupl-v.1.1)


      HOW TO RUN
      ----------------------

      In the import script, there is a new option (-b) to import using the Biblio-Transformation-Engine and an option (-i) to declare the input format. All other options are unchanged, except that -s now points to the input data file (and not to a directory, as it used to).

      Thus, to import metadata from the various input formats, use the following commands:

      for BibTex input: ./dspace import -b -m mapFile -e example@email.com -c 123456789/1 -s /DATA/export-bibtex -i bibtex
      for csv input: ./dspace import -b -m mapFile -e example@email.com -c 123456789/1 -s /DATA/export-csv -i csv
      for tsv input: ./dspace import -b -m mapFile -e example@email.com -c 123456789/1 -s /DATA/export-tsv -i tsv
      for ris input: ./dspace import -b -m mapFile -e example@email.com -c 123456789/1 -s /DATA/export-ris -i ris
      for endnote input: ./dspace import -b -m mapFile -e example@email.com -c 123456789/1 -s /DATA/export-endnote -i endnote

      (-e must be the email address of a valid DSpace user and -c must be the handle of the collection into which the items will be imported)
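
      The five commands above differ only in the input file and the -i value, so they can be generated in a loop. The following is a minimal dry-run sketch that only prints the commands; it does not run the import. The variable names (EPERSON, COLLECTION, DATA_DIR) and the function build_import_cmd are illustrative and not part of the patch, and the values are the placeholders used in the examples above. You may prefer a separate map file per run instead of the single "mapFile" shown.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print one import command per supported input format.
# All values below are placeholders copied from the examples in this issue.

EPERSON="example@email.com"   # must be the email of an existing DSpace user
COLLECTION="123456789/1"      # handle of the target collection
DATA_DIR="/DATA"              # directory that holds the export files

build_import_cmd() {
  local format="$1"
  # -b selects the Biblio-Transformation-Engine import, -i names the input
  # format, and -s points at the input file (a file, not a directory).
  printf './dspace import -b -m mapFile -e %s -c %s -s %s/export-%s -i %s\n' \
    "$EPERSON" "$COLLECTION" "$DATA_DIR" "$format" "$format"
}

# Print the command for each format supported by the patch.
for format in bibtex csv tsv ris endnote; do
  build_import_cmd "$format"
done
```

      Review the printed lines and run the one you need from the DSpace bin directory.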

      Before you run the commands, feel free to change the configuration files (config/spring-bibtex2dspace.xml, config/spring-csv2dspace.xml, config/spring-tsv2dspace.xml, config/spring-ris2dspace.xml, config/spring-endnote2dspace.xml) in order to specify the mapping of the input format to the DC metadata schema of DSpace.
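
      The configuration file names listed above follow one pattern (config/spring-<format>2dspace.xml), so it is easy to check that the file for a given format is present before editing it or running the import. This is a small sketch under that assumption; DSPACE_DIR and config_for_format are hypothetical names introduced here, not part of the patch.

```shell
#!/usr/bin/env bash
# Sketch: resolve the Spring mapping file for an input format and report
# whether it exists. DSPACE_DIR is a hypothetical variable that should point
# at the directory containing the "config" directory.

DSPACE_DIR="${DSPACE_DIR:-.}"

config_for_format() {
  case "$1" in
    bibtex|csv|tsv|ris|endnote)
      # File names as listed in the issue description.
      printf '%s/config/spring-%s2dspace.xml\n' "$DSPACE_DIR" "$1"
      ;;
    *)
      echo "unsupported format: $1" >&2
      return 1
      ;;
  esac
}

for format in bibtex csv tsv ris endnote; do
  file="$(config_for_format "$format")"
  if [ -f "$file" ]; then
    echo "found:   $file"
  else
    echo "missing: $file"
  fi
done
```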
      1. import-patch.diff
        40 kB
        Kostas Stamatis
      2. README.txt
        5 kB
        Kostas Stamatis

          Activity

          Robin Taylor added a comment -
          Hi Kostas,

          Sorry to be a pain but when I try and build the code Maven is giving a warning that concerns me a little...

          "[WARNING] POM for 'gr.ekt:biblio-transformation-engine:pom:0.81:compile' is invalid.
          Its dependencies (if any) will NOT be available to the current build."

          Looking at the POM I notice that there is one odd dependency...

          <systemPath>${basedir}/lib/marc4j.jar</systemPath>

          Is that a true dependency or can it be removed ? If it is a true dependency then it needs to be available from a Maven Repository.

          Cheers, Robin.
          Kostas Stamatis added a comment -
          Dear Robin,

          thank you for pointing out this problem.

          The dependency you mention is deprecated; however, I forgot to remove it before uploading BTE to Maven Central. I just uploaded a new version of BTE (v0.82 - https://oss.sonatype.org/content/repositories/releases/gr/ekt/biblio-transformation-engine/0.82/), which should be synced with Maven Central shortly.
          When this is done, I will update the GitHub code (and the pull request) to use the new BTE version in the Maven dependency. This should resolve the warning you get when you build the code.


          Regards,

          Kostas
          Kostas Stamatis added a comment -
          Dear Robin,

          It should be OK now, I just updated the pom dependency with the new version of BTE (v0.82) which doesn't include the marc4j dependency. Pull request is updated accordingly.

          Should you have any other problem, I would be more than happy to help you.

          Regards,

          Kostas
          Robin Taylor added a comment -
          Pull request merged by Mark (Diggory).

          First draft of documentation here https://wiki.duraspace.org/display/DSDOC3x/Importing+Items+via+basic+bibliographic+formats+%28Endnote%2C+BibTex%2C+RIS%2C+TSV%2C+CSV%29. Feel free to comment or improve.

          Thanks.
          Kostas Stamatis added a comment -
          Dear Robin,

          thanks a lot for the documentation you have added to the Wiki, and please forgive my delay in answering this post.

          I have written more detailed documentation for this functionality. I couldn't find a way to edit the draft documentation in the wiki, so please find it at the following link: https://docs.google.com/document/d/1fbv0Nm-OxdCS84RyNJUowl6yWjIcGYqEHhuKu-xh8nE/edit

          Hope it helps and is detailed enough for someone who needs to use this functionality.

          Thanks,

          Kostas

            People

            • Assignee:
              Robin Taylor
              Reporter:
              Kostas Stamatis
            • Votes:
              0
              Watchers:
              7
