DSpace / DS-2869

Discovery Browse/Search loads full text of documents into Solr's documentCache, which can result in OOM errors

    Details

    • Attachments:
      0
    • Comments:
      3
    • Documentation Status:
      Not Required

      Description

      This is a companion ticket to DS-2832, as it's very similar in nature.

      Discovery will gradually fill your memory up with the indexed full text of documents as you Browse/Search content within DSpace.

      Essentially (as in DS-2832), Discovery never limits the fields returned by Solr, so Solr returns all fields (including the full_text field) for every browse/search query.

      I've verified this via YourKit. If you attach YourKit to DSpace 5.3, click "Browse by Title", and page through the results, memory usage gradually climbs until one of two things happens:

      • EITHER you run out of memory (OOM heap size error),
      • OR you've loaded 512 full-text documents into memory (as this is our current size limit for Solr's documentCache).
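      To illustrate why memory climbs until the cache limit is reached, here is a minimal sketch of a size-bounded cache. It assumes an LRU-style eviction policy and is not Solr's actual cache code; the class name BoundedDocCacheSketch and its helper methods are hypothetical, and the document sizes are made up for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a cache bounded by entry count, not by bytes.
public class BoundedDocCacheSketch {

    // LinkedHashMap in access order evicts the least recently used entry
    // once the entry count exceeds maxEntries.
    static class BoundedCache<K, V> extends LinkedHashMap<K, V> {
        private final int maxEntries;

        BoundedCache(int maxEntries) {
            super(16, 0.75f, true); // true = access order (LRU)
            this.maxEntries = maxEntries;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > maxEntries;
        }
    }

    // Total characters of "full text" currently held by the cache.
    static long retainedChars(Map<Integer, String> cache) {
        long total = 0;
        for (String doc : cache.values()) {
            total += doc.length();
        }
        return total;
    }

    // Simulates paging through docsLoaded documents, each carrying
    // charsPerDoc characters of full text (an assumed, illustrative size).
    static Map<Integer, String> simulate(int maxEntries, int docsLoaded, int charsPerDoc) {
        char[] filler = new char[charsPerDoc];
        Arrays.fill(filler, 'x');
        String fullText = new String(filler);

        Map<Integer, String> cache = new BoundedCache<>(maxEntries);
        for (int id = 0; id < docsLoaded; id++) {
            cache.put(id, fullText);
        }
        return cache;
    }

    public static void main(String[] args) {
        Map<Integer, String> cache = simulate(512, 600, 1000);
        System.out.println("entries=" + cache.size()
                + " retainedChars=" + retainedChars(cache));
    }
}
```

      The point of the sketch is that a limit on the number of entries says nothing about their size: if every cached document carries its full text, 512 entries can still exceed the available heap.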

      The reason this seems to occur is that Discovery's SolrServiceImpl class never actually sends a list of fields to Solr. This means Solr will always return everything in the result.

      The "resolveToSolrQuery()" method is the one that should pass an explicit list of fields to Solr: https://github.com/DSpace/DSpace/blob/dspace-5_x/dspace-api/src/main/java/org/dspace/discovery/SolrServiceImpl.java#L1614

      From searching in the DSpace codebase, it seems like we only ever use a small handful of fields from the Solr results:

      • handle (used to retrieve the DSO)
      • search.resourcetype & search.resourceid (used to retrieve the DSO)
      • Occasionally other fields are passed in via "DiscoveryQuery.addSearchField()", but they are never relayed to Solr

      Most of these default fields are used in one of two methods in that same class:

      The likely fix is to ensure any fields specified via "DiscoveryQuery.addSearchField()" are properly relayed to Solr. We should also ensure the identifier fields are always included. Passing an explicit field list in the Solr query ensures that Solr does NOT send back every field in the results (and load them all into its documentCache in memory).
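      A minimal sketch of the field-list logic described above, assuming the fix builds a comma-separated value for Solr's "fl" parameter. The class FieldListSketch and the method buildFieldList are hypothetical names, not part of SolrServiceImpl; only the three identifier field names come from this ticket.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: merges the always-required identifier fields with
// any fields the caller requested, producing a value for Solr's "fl" param.
public class FieldListSketch {

    // Identifier fields named in this ticket as needed to retrieve the DSO.
    static final List<String> IDENTIFIER_FIELDS =
            Arrays.asList("handle", "search.resourcetype", "search.resourceid");

    static String buildFieldList(Collection<String> requestedFields) {
        // LinkedHashSet keeps insertion order and drops duplicates.
        Set<String> fields = new LinkedHashSet<>(IDENTIFIER_FIELDS);
        for (String field : requestedFields) {
            if (field != null && !field.trim().isEmpty()) {
                fields.add(field.trim());
            }
        }
        return String.join(",", fields);
    }

    public static void main(String[] args) {
        // Stand-in for fields added via DiscoveryQuery.addSearchField().
        System.out.println(buildFieldList(Arrays.asList("dc.title", "handle")));
    }
}
```

      Because the identifier fields are seeded first, a caller that requests no extra fields still gets a restricted list, so the full text is never returned by default.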

      I should have a PR sometime tomorrow.

              People

              • Assignee:
                tdonohue Tim Donohue
                Reporter:
                tdonohue Tim Donohue