  1. DSpace
  2. DS-1226

Batch import from basic bibliographic formats (Endnote, BibTex, RIS, TSV, CSV)




      This proposed extension (implemented by the National Documentation Centre/EKT - http://www.ekt.gr) allows batch import of metadata (and/or bitstreams) into DSpace using the import script and the Biblio-Transformation-Engine tool. The input can be in any bibliographic format; this patch includes support for the Endnote, RIS, BibTex, TSV and CSV formats.

      The Biblio-Transformation-Engine (http://code.google.com/p/biblio-transformation-engine/) is an open source Java framework developed by the Hellenic National Documentation Centre (EKT, www.ekt.gr). It consists of programmatic APIs for filtering and modifying records retrieved from various types of data sources (e.g. databases, files, legacy data sources), as well as for outputting them in appropriate standard formats (e.g. database files, txt, xml, Excel). The framework includes independent abstract modules that are executed separately, in many cases offering the user alternative choices depending on the input data set, the transformation workflow to be executed and the output format to be generated.

      Thus, the attached patch adds support for using the Biblio-Transformation-Engine in the DSpace batch import procedure: the user only needs to specify the mapping between the input metadata and the DSpace metadata. Default mappings are also provided for the default DSpace Dublin Core metadata schema.

      Suppose a researcher at your institute provides you with a file of his/her publications that you need to import into the repository. If the file is in one of the following formats: CSV, TSV, Endnote, BibTex, RIS (formats that are commonly used for bibliographic metadata), you can import all the records into the DSpace repository with a single command. At the same time, configuration files control which metadata is imported and which DC field (or field of any other schema in the DSpace repository) each value maps to.

      For those who know the Biblio-Transformation-Engine well, this extension is powerful because they can write their own DataLoaders to support more input formats. Filtering records and modifying metadata is also possible with very little effort (using the Biblio-Transformation-Engine's filters and modifiers). The same applies to adding bitstreams to the records.

      Since the Biblio-Transformation-Engine supports Spring, the only configuration the user must work with is the set of Spring XML files for dependency injection. These files are located in the "config" directory, and in them the user can specify the mapping between the input metadata and the DSpace Dublin Core schema (or any other schema in the repository).

      This extension makes use of three external Java libraries:
      a) jbibtex, a Java library for reading BibTex files (under the BSD licence - http://www.linfo.org/bsdlicense.html)
      b) opencsv, a Java library for reading CSV files (under the Apache License V2.0 - http://www.apache.org/licenses/LICENSE-2.0)
      c) biblio-transformation-engine, a Java library for metadata transformation, filtering and modification (under the European Union Public Licence (EUPL) - http://www.osor.eu/eupl/european-union-public-licence-eupl-v.1.1)

      HOW TO RUN

      The import script has a new option (-b) to import using the Biblio-Transformation-Engine and an option (-i) to declare the input format. All the other options are the same. Option -s now points to the input data file (and not to a directory, as it did before).

      Thus, to import metadata from the various input formats, use the following commands:

      for BibTex input: ./dspace import -b -m mapFile -e example@email.com -c 123456789/1 -s /DATA/export-bibtex -i bibtex
      for csv input: ./dspace import -b -m mapFile -e example@email.com -c 123456789/1 -s /DATA/export-csv -i csv
      for tsv input: ./dspace import -b -m mapFile -e example@email.com -c 123456789/1 -s /DATA/export-tsv -i tsv
      for ris input: ./dspace import -b -m mapFile -e example@email.com -c 123456789/1 -s /DATA/export-ris -i ris
      for endnote input: ./dspace import -b -m mapFile -e example@email.com -c 123456789/1 -s /DATA/export-endnote -i endnote

      (-e must be the valid email address of a DSpace user and -c must be the handle of the collection into which the items will be imported)
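
      The five commands above differ only in the source file and the -i value, so they can be generated by a small wrapper. The following is a minimal sketch, not part of the patch: the function name build_import_cmd and its argument order are hypothetical. It only prints the command, after checking that the format is one of those listed above and that the -s argument is a regular file rather than a directory.

```shell
#!/bin/sh
# Hypothetical helper (not part of DSpace or the patch): prints, but does not
# run, a Biblio-Transformation-Engine import command for one input file.

# Values accepted by -i according to the commands listed in this issue.
SUPPORTED_FORMATS="bibtex csv tsv ris endnote"

build_import_cmd() {
    format=$1
    source_file=$2
    eperson=$3
    collection=$4
    map_file=$5

    # Reject any format that the issue description does not list.
    case " $SUPPORTED_FORMATS " in
        *" $format "*) ;;
        *) echo "unsupported format: $format" >&2; return 1 ;;
    esac

    # With -b, option -s must point to a file, not a directory.
    if [ ! -f "$source_file" ]; then
        echo "not a regular file: $source_file" >&2
        return 2
    fi

    echo "./dspace import -b -m $map_file -e $eperson -c $collection -s $source_file -i $format"
}
```

      For example, build_import_cmd ris /DATA/export-ris example@email.com 123456789/1 mapFile prints the RIS command shown above, provided that /DATA/export-ris exists as a file.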

      Before you run the commands, feel free to change the configuration files (config/spring-bibtex2dspace.xml, config/spring-csv2dspace.xml, config/spring-tsv2dspace.xml, config/spring-ris2dspace.xml, config/spring-endnote2dspace.xml) in order to specify the mapping of the input format to the DC metadata schema of DSpace.
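
      The configuration file names listed above follow one pattern, config/spring-<format>2dspace.xml. As a small sketch under that assumption (the function name mapping_config_for is hypothetical and not part of the patch), the file to edit for a given -i value can be derived like this:

```shell
#!/bin/sh
# Hypothetical helper (not part of DSpace or the patch): derive the Spring
# mapping file for an input format from the naming pattern of the five
# configuration files named in this issue.
mapping_config_for() {
    case "$1" in
        bibtex|csv|tsv|ris|endnote)
            echo "config/spring-${1}2dspace.xml" ;;
        *)
            echo "no known mapping file for format: $1" >&2
            return 1 ;;
    esac
}
```

      For example, mapping_config_for ris prints config/spring-ris2dspace.xml, the file that holds the RIS to Dublin Core mapping.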





              • Assignee:
                robintaylor Robin Taylor
                kstamatis Kostas Stamatis

