DS-849: Create a non-Porter-stemming analyzer for DSpace

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 1.7.0
    • Fix Version/s: 1.8.0
    • Component/s: Documentation, DSpace API
    • Labels:
      None
    • Attachments:
      3
    • Comments:
      4
    • Documentation Status:
      Needed

      Description

      For some DSpace use cases, the index built by the standard search analyzer (org.dspace.search.DSAnalyzer) yields search results that are too imprecise. An alternate analyzer that omits PorterStemFilter would help in those cases. See these threads for more of the backstory:

      http://comments.gmane.org/gmane.comp.db.dspace.user/13404
      http://comments.gmane.org/gmane.comp.db.dspace.user/13407
      http://comments.gmane.org/gmane.comp.db.dspace.user/13420
      http://comments.gmane.org/gmane.comp.db.dspace.user/13427

      I'm attaching a patch, but it's more of a kit. You must first copy [dspace-src]/dspace-api/src/main/java/org/dspace/search/DSAnalyzer.java to [dspace-src]/dspace-api/src/main/java/org/dspace/search/DSNonStemmingAnalyzer.java, then you can apply the patch.

      After patching, you must alter your dspace.cfg file, uncommenting and changing the search.analyzer line so that it reads:

      search.analyzer = org.dspace.search.DSNonStemmingAnalyzer
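
      The "uncommenting" part matters because a commented-out key is simply not there when the file is read. Below is a minimal sketch of that, assuming dspace.cfg is read as a standard Java properties file. The class AnalyzerConfigCheck, the helper configuredAnalyzer, and the commented sample line are illustrative and not part of DSpace.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class AnalyzerConfigCheck {
    // Hypothetical helper: returns the value of search.analyzer,
    // or null when the key is missing or the line is commented out.
    static String configuredAnalyzer(String cfgText) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(cfgText));
        return props.getProperty("search.analyzer");
    }

    public static void main(String[] args) throws IOException {
        // Illustrative config fragments, not copied from a real dspace.cfg.
        String commented = "# search.analyzer = org.dspace.search.DSAnalyzer\n";
        String active = "search.analyzer = org.dspace.search.DSNonStemmingAnalyzer\n";

        // A line starting with '#' is a comment, so the key is absent.
        System.out.println(configuredAnalyzer(commented)); // prints "null"
        // Whitespace around '=' is ignored when the line is parsed.
        System.out.println(configuredAnalyzer(active));
    }
}
```

      Running the same kind of check against your real dspace.cfg before re-indexing is a cheap way to confirm the setting is active.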

      Then, do the following:

      • stop Tomcat (taking down your DSpace instance)
      • re-index all content in your DSpace by running:
        [dspace]/bin/dspace index-init
      • start Tomcat
      • test

      All credit for this work goes to Tim Donohue and Stuart Yeates; I just put the pieces together into this patch and ticket.

          Activity

          pottingerhj@umsystem.edu Hardy Pottinger (Inactive) added a comment -

          Posting a revised patch, as Maven complains and refuses to build DSpace with the previous patch applied. I have changed the name of the class from DSAnalyzer to DSNonStemmingAnalyzer so Maven will be happy. This patch is less of a kit: you do not need to copy anything before applying it. The patch is against r6496 of trunk.

          tdonohue Tim Donohue added a comment -

          Overall, patch looks good, Hardy. But, I have a slightly different twist on the patch (to avoid so much code duplication between DSAnalyzer & DSNonStemmingAnalyzer).

          So, I've attached a "DS-849-version3.patch" which does the following:

          • Slightly modifies DSAnalyzer to make it easier to extend (for this Analyzer & other custom Analyzers people may wish to create)
          • Changes DSNonStemmingAnalyzer to extend DSAnalyzer & just override the 'tokenStream()' method. This ensures that DSNonStemmingAnalyzer can inherit & use the same stopwords as DSAnalyzer (without having to repeat them).
          • This patch also adds an example config to dspace.cfg for DSNonStemmingAnalyzer
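
          The extend-and-override idea in the bullets above can be sketched without Lucene. This is only a toy under stated assumptions: SketchAnalyzer and SketchNonStemmingAnalyzer are hypothetical stand-ins for DSAnalyzer and DSNonStemmingAnalyzer, tokenStream() here returns a plain list rather than a Lucene TokenStream, and toyStem() is a crude suffix stripper, not the Porter algorithm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class AnalyzerSketch {
    public static void main(String[] args) {
        String text = "The Indexing of Archives";
        System.out.println(new SketchAnalyzer().tokenStream(text));            // prints "[index, archiv]"
        System.out.println(new SketchNonStemmingAnalyzer().tokenStream(text)); // prints "[indexing, archives]"
    }
}

// Hypothetical stand-in for DSAnalyzer: lowercases, drops stop words, then stems.
class SketchAnalyzer {
    // Stop words are declared once so subclasses inherit them instead of repeating them.
    protected static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the"));

    // Steps shared by every variant: split on non-alphanumerics, lowercase, remove stop words.
    protected List<String> baseTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                out.add(raw);
            }
        }
        return out;
    }

    // Default pipeline: the shared steps plus a stemming step.
    public List<String> tokenStream(String text) {
        List<String> out = new ArrayList<>();
        for (String token : baseTokens(text)) {
            out.add(toyStem(token));
        }
        return out;
    }

    // Crude suffix stripper for illustration only; NOT the Porter algorithm.
    static String toyStem(String token) {
        for (String suffix : new String[] {"ing", "es", "s"}) {
            if (token.length() > suffix.length() + 2 && token.endsWith(suffix)) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }
}

// Hypothetical stand-in for DSNonStemmingAnalyzer: same stop words, no stemming.
class SketchNonStemmingAnalyzer extends SketchAnalyzer {
    @Override
    public List<String> tokenStream(String text) {
        return baseTokens(text);
    }
}
```

          The subclass contains only the overridden method. The stop-word list and tokenizing steps live in the base class, which is the duplication the version 3 patch aims to avoid.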

          I don't think any of this should be controversial, so I'll likely commit it to Trunk soon for 1.8.0

          tdonohue Tim Donohue added a comment -

          Committed version 3 of patch to Trunk (r6511)

          tdonohue Tim Donohue added a comment -

          Also, forgot to mention, I added basic docs for this DSNonStemmingAnalyzer to: https://wiki.duraspace.org/display/DSDOCDEV/Configuration


            People

            • Assignee:
              tdonohue Tim Donohue
              Reporter:
              hardyoyo Hardy Pottinger