  1. DSpace
  2. DS-1523

detection of duplicate items during import and submission

    Details

    • Attachments:
      0
    • Comments:
      1
    • Documentation Status:
      Needed

      Description

      Users expressed the need for DSpace to detect whether an item they're about to import/submit already exists in the repository. This issue is trying to capture the requirements for this feature.

      The central question is how to define a duplicate. Some users already have a strict definition, e.g. an equal value in a single metadata field (dc.identifier.uuid). Others may rely on the similarity of several metadata fields (e.g. dc.title, dc.issn), which can be measured by Levenshtein distance, while the remaining fields may differ (e.g. different values in dc.contributor.author).
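To make the two kinds of definition concrete, here is a minimal sketch. It assumes items are plain dictionaries with one value per metadata field; the function names and the `max_distance` threshold are illustrative assumptions, not part of DSpace.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (dynamic programming, row by row)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def is_exact_duplicate(item, other, field="dc.identifier.uuid"):
    """Strict definition: both items carry the same value in one field."""
    value = item.get(field)
    return value is not None and value == other.get(field)


def is_similar_duplicate(item, other, fields=("dc.title", "dc.issn"),
                         max_distance=2):
    """Loose definition: every listed field is within max_distance edits.

    Fields not listed (e.g. dc.contributor.author) are ignored and may differ.
    """
    for field in fields:
        a, b = item.get(field), other.get(field)
        if a is None or b is None:
            return False
        if levenshtein(a.casefold(), b.casefold()) > max_distance:
            return False
    return True
```

With this sketch, two records whose titles differ by a single typo and whose ISSNs match would count as duplicates even if their author fields differ.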

      This leads me to the conclusion that we need to let users define their own comparison method by means of a plugin. The disadvantage is that checking each imported item against all existing items with a user-defined (and possibly slow) method may slow down import, so the feature needs to be opt-in. Of course we should provide implementations for some common cases, like those mentioned above. The input to the comparison method should be the item DSO (so that its metadata and bitstreams can be read) with its parent object filled in, so that the search can be restricted to a community/collection to reduce its scope.
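The plugin idea could look roughly like the sketch below. Everything in it is hypothetical: `Item` is a simplified stand-in for the item DSO, `DuplicateDetector` is an invented interface, and `find_duplicates` only illustrates the opt-in behaviour and the restriction to the parent collection. It is not an existing DSpace API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    """Hypothetical stand-in for the item DSO."""
    metadata: dict
    collection: Optional[str] = None  # parent object, used to narrow the search


class DuplicateDetector(ABC):
    """Hypothetical plugin interface: one user-defined comparison method."""

    @abstractmethod
    def is_duplicate(self, candidate: Item, existing: Item) -> bool:
        ...


class ExactFieldDetector(DuplicateDetector):
    """Bundled implementation for the strict case: one field must be equal."""

    def __init__(self, field_name="dc.identifier.uuid"):
        self.field_name = field_name

    def is_duplicate(self, candidate, existing):
        value = candidate.metadata.get(self.field_name)
        return value is not None and value == existing.metadata.get(self.field_name)


def find_duplicates(candidate, repository, detector=None, restrict_to_parent=True):
    """Return existing items that the detector considers duplicates.

    The check is opt-in: with no detector configured, nothing is compared.
    """
    if detector is None:
        return []
    scope = repository
    if restrict_to_parent and candidate.collection is not None:
        # Reduce the search scope to the candidate's parent collection.
        scope = [i for i in repository if i.collection == candidate.collection]
    return [i for i in scope if detector.is_duplicate(candidate, i)]
```

Passing `detector=None` keeps import at full speed, which matches the opt-in requirement; a site that wants the check would configure one of the bundled detectors or its own subclass.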

      Here are some recent discussions on this topic:

            Activity

            tdonohue Tim Donohue added a comment -

            Needs a volunteer to implement.

            A very simple version could just check file checksums on upload. If a file with that checksum already exists in DSpace, you could notify the submitter.
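A minimal sketch of that checksum check, assuming the checksums of existing files are available as a set of hex strings. The function names are hypothetical, and MD5 is only a default here; use whichever algorithm the stored checksums were computed with.

```python
import hashlib
import io


def file_checksum(stream, algorithm="md5", chunk_size=65536):
    """Hash a binary stream in chunks so large uploads are not read at once."""
    digest = hashlib.new(algorithm)
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def check_upload(stream, known_checksums, algorithm="md5"):
    """Return (checksum, already_exists) for an uploaded file."""
    checksum = file_checksum(stream, algorithm)
    return checksum, checksum in known_checksums


# Usage: the set stands in for the checksums already stored in the repository.
known = {file_checksum(io.BytesIO(b"existing file contents"))}
_, exists = check_upload(io.BytesIO(b"existing file contents"), known)
if exists:
    print("A file with this checksum already exists; notify the submitter.")
```

This only catches byte-identical files; the same document saved in a different format would not be detected.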


              People

              • Assignee:
                Unassigned
              • Reporter:
                helix84 Ivan Masár
              • Votes:
                2
              • Watchers:
                5
