  1. DSpace
  2. DS-1836

doi_seq in update-sequences.sql missing

    Details

    • Type: Bug
    • Status: More Details Needed
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 4.0
    • Fix Version/s: None
    • Component/s: DSpace API
    • Labels:
      None
    • Attachments:
      0
    • Comments:
      5
    • Documentation Status:
      In Description

      Description

      The file [dspace-source]/dspace/etc/<rdbs>/update-sequences.sql should update all database sequences in case a database dump has to be restored (see: https://wiki.duraspace.org/display/DSDOC4x/Storage+Layer#StorageLayer-MaintenanceandBackup). DSpace 4.0 introduces a doi_seq sequence in the database, but it is not mentioned in update-sequences.sql. A fix should be easy, but as I am not 100% sure how to fix this, I am opening this ticket to discuss two possible solutions.

      Every entry in update-sequences.sql sets a sequence to the value it presumably returned the last time it was used. Most sequences are used to auto-increment a dedicated column, so update-sequences.sql can take the maximum value of that column. Only for handles does it instead take the maximum handle suffix of any registered handle.

      For doi_seq we can either use the maximum value of the column doi_id or try to determine the largest DOI suffix. If we use the maximum value of the doi_id column, we could run into a problem if anyone manually adds DOIs to the DOI table that collide with the DOIs DSpace generates. On the other hand, determining the maximum value of the doi_id column in SQL is easy, while identifying the largest DOI suffix in SQL is not.

      Let's take a quick look at how DOIIdentifierProvider generates DOIs. A new DOI is built from three parts: the DOI prefix, the namespace separator, and the value of the column doi_id (https://github.com/DSpace/DSpace/blob/master/dspace-api/src/main/java/org/dspace/identifier/DOIIdentifierProvider.java#L819). To select the largest assigned DOI suffix in SQL, we would need to know the namespace separator, but there is no easy way to get it into the update-sequences.sql script.
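
      To make the trade-off concrete, here is a minimal Java sketch of the three-part DOI composition and of what "largest DOI suffix" would mean once the namespace separator is known. The class and method names, the "/" joining prefix and suffix, and the example values 10.5072 and dspace- are assumptions for illustration only; this is not the DOIIdentifierProvider code.

```java
import java.util.List;
import java.util.OptionalLong;

// Sketch only: names and example values are hypothetical, not DSpace API.
class DoiSuffixSketch {

    // Compose a DOI from prefix, namespace separator and doi_id.
    // The "/" between prefix and suffix is an assumption (usual DOI syntax).
    static String mintDoi(String prefix, String namespaceSeparator, long doiId) {
        return prefix + "/" + namespaceSeparator + doiId;
    }

    // Find the largest numeric suffix among DOIs that follow the minting
    // pattern. DOIs with another prefix, another separator, or a
    // non-numeric suffix (for example manually added ones) are ignored.
    static OptionalLong largestMintedSuffix(List<String> dois,
                                            String prefix,
                                            String namespaceSeparator) {
        String head = prefix + "/" + namespaceSeparator;
        return dois.stream()
                .filter(d -> d.startsWith(head))
                .map(d -> d.substring(head.length()))
                // keep ASCII-digit suffixes short enough to fit in a long
                .filter(s -> !s.isEmpty() && s.length() <= 18
                        && s.chars().allMatch(c -> c >= '0' && c <= '9'))
                .mapToLong(Long::parseLong)
                .max();
    }

    public static void main(String[] args) {
        List<String> dois = List.of(
                "10.5072/dspace-7", "10.5072/dspace-42", "10.5072/manual-99");
        System.out.println(mintDoi("10.5072", "dspace-", 43));
        System.out.println(largestMintedSuffix(dois, "10.5072", "dspace-"));
    }
}
```

      The filter only works because the separator is passed in as a parameter, which is exactly the piece of configuration a static SQL script does not have.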

      I personally think that we should use the maximum value of the column doi_id. If anyone adds DOIs manually to the DOI table, they should know what they are doing.

      What do you think? How should we fix update-sequences.sql? Should we document any of this anywhere other than in this ticket?

            Activity

            tdonohue Tim Donohue added a comment -

            I may be mistaken, but it sounds like using "max(doi_id)" could cause some major issues when restoring content via AIPs:

            https://wiki.duraspace.org/display/DSDOC4x/AIP+Backup+and+Restore

            Specifically, it sounds like you are assuming that "doi_id" = "doi_seq". Although generally in DSpace, that may be true, it is almost NEVER true when content has been restored from a backup (especially from an AIP). This is one of the primary reasons why "handle_seq" needs special treatment in "update-sequences.sql" and does not make the assumption that "handle_id" = "handle_seq":

            https://github.com/DSpace/DSpace/blob/master/dspace/etc/postgres/update-sequences.sql#L90

            When content is restored from AIPs, it often cannot reuse those table IDs "handle_id" or "doi_id"...so the sequence and id get out-of-sync. AIPs explicitly do not preserve table IDs as they are controlled automatically by the database (and when an AIP is restored, the order of initial object creation is not maintained....so an object which had a handle_id=10 could be restored to handle_id=2011).

            So, I'm worried about the assumption this makes. I'm not sure of any way around that assumption, yet, but it sounds like it will complicate the ability to use AIPs for backup purposes (as the DOI generation may encounter conflicts after you restore content from AIPs)
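
            The conflict described above can be illustrated with a small simulation. This is only a sketch under assumptions: the map stands in for a restored DOI table (doi_id mapped to the numeric suffix of the DOI stored in that row), the sequence is modelled as a plain counter, and all names are hypothetical.

```java
import java.util.Map;

// Sketch only: simulates a sequence and a restored DOI table in memory.
class RestoreCollisionSketch {

    // Largest table ID, i.e. what max(doi_id) would return.
    static long maxDoiId(Map<Long, Long> suffixByDoiId) {
        return suffixByDoiId.keySet().stream()
                .mapToLong(Long::longValue).max().orElse(0L);
    }

    // Largest DOI suffix already in use, regardless of table ID.
    static long maxSuffix(Map<Long, Long> suffixByDoiId) {
        return suffixByDoiId.values().stream()
                .mapToLong(Long::longValue).max().orElse(0L);
    }

    // Advance the simulated sequence `mints` times, starting after
    // `sequenceValue`, and return the first value that is already used
    // as a DOI suffix, or -1 if no collision occurs.
    static long firstCollision(Map<Long, Long> suffixByDoiId,
                               long sequenceValue, int mints) {
        long seq = sequenceValue;
        for (int i = 0; i < mints; i++) {
            seq++; // stands in for nextval on the sequence
            if (suffixByDoiId.containsValue(seq)) {
                return seq;
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        // After a restore, rows got fresh IDs 1..3 but keep old suffixes.
        Map<Long, Long> restored = Map.of(1L, 5L, 2L, 1L, 3L, 3L);
        System.out.println(firstCollision(restored, maxDoiId(restored), 5));
        System.out.println(firstCollision(restored, maxSuffix(restored), 5));
    }
}
```

            In this example, resetting the counter from the table IDs runs into an existing suffix on the second new DOI, while resetting it from the largest suffix does not.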

            pbecker Pascal-Nicolas Becker added a comment -

            This ticket was discussed in a developer meeting: http://irclogs.duraspace.org/index.php?date=2013-12-11

            pbecker Pascal-Nicolas Becker added a comment -

            Good that I asked; I had the feeling that I was not seeing all the side effects of update-sequences.sql.

            I don't know AIP Backup and Restore, as we don't use it here. From a quick look, AIP backs up item metadata and technical metadata. Whenever a DOI gets registered, it is written into the item metadata and will be restored as dc.identifier.uri. But the information in the DOI table gets lost. This information is important for generating new DOIs and for sending updated metadata to DataCite.

            We have two possibilities: either we store the information of the DOI table as technical metadata, or we try to detect DOIs whenever we restore a dc.identifier.uri field and regenerate the information in the DOI table from it. In both cases we could set the doi_id as expected to make the simple bugfix (using max(doi_id) to set doi_seq) work. Tim, can you confirm these ideas?

            I don't know if and when I'll have time to look into AIP. For us it is not important, as we want to use Item Level Versioning, which is not supported by AIP either. Nevertheless, I hope I'll have the time to write a bugfix in time for DSpace 4.1.

            tdonohue Tim Donohue added a comment -

            If the DOI is being copied to metadata (dc.identifier.uri), then it is already being automatically stored in the AIPs as dc.identifier.uri. Trying to store additional information from the DOI table in the AIPs seems duplicative (and hopefully unnecessary).

            It seems like the DOI table is really nearly the same idea as the Handle table. For Handles, they are stored in AIPs as dc.identifier.uri and they are automatically restored (rewritten to the handle table) by the InstallItem.restoreItem() method:
            https://github.com/DSpace/DSpace/blob/master/dspace-api/src/main/java/org/dspace/content/InstallItem.java#L97

            Essentially this takes the passed in Handle (in this case) and re-registers it in the Handle table. So, it might be best to do something similar with DOIs. I'm not very familiar yet with all the DOI code though, so I don't know if this can be done in the same manner or not. Ideally though, I don't think we should be writing the DOI table content into the AIPs. As I said, it's duplicative information and that DOI table is only there to map DOIs to objects, which should be something we can regenerate on restoration.
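
            As a rough illustration of the "regenerate on restoration" idea, the sketch below picks the DOIs out of a list of restored dc.identifier.uri values so that they could be re-registered. It assumes DOIs appear either as http://dx.doi.org/ URLs or with a doi: scheme; the class, method, and example values are hypothetical and do not reflect InstallItem or any existing DSpace API.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: hypothetical helper, not part of DSpace.
class DoiRestoreSketch {

    // Assumed forms in which a DOI may appear in dc.identifier.uri.
    static final String RESOLVER = "http://dx.doi.org/";
    static final String SCHEME = "doi:";

    // Return the bare DOIs (prefix/suffix) under the repository's own
    // prefix that were found among the restored identifier values.
    static List<String> doisToReRegister(List<String> identifierUris,
                                         String prefix) {
        List<String> found = new ArrayList<>();
        for (String uri : identifierUris) {
            String bare;
            if (uri.startsWith(RESOLVER)) {
                bare = uri.substring(RESOLVER.length());
            } else if (uri.startsWith(SCHEME)) {
                bare = uri.substring(SCHEME.length());
            } else {
                continue; // e.g. a Handle URL, not a DOI
            }
            // Skip DOIs that belong to a different prefix.
            if (bare.startsWith(prefix + "/")) {
                found.add(bare);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> uris = List.of(
                "http://hdl.handle.net/123456789/1",
                "http://dx.doi.org/10.5072/dspace-7");
        System.out.println(doisToReRegister(uris, "10.5072"));
    }
}
```

            Whether re-registration can reuse the existing DOI code in the same way Handles are re-registered is the open question raised in this comment; the sketch only covers the detection step.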

            tdonohue Tim Donohue added a comment -

            Rescheduling for 5.0. Needs some deeper analysis on a good resolution here.


              People

              • Assignee:
                Unassigned
                Reporter:
                pbecker Pascal-Nicolas Becker
              • Votes:
                0
                Watchers:
                2