DSpace / DS-2832

Discovery indexing loads full text of documents into Solr's documentCache, which can result in OOM errors

    Details

    • Attachments:
      0
    • Comments:
      4
    • Documentation Status:
      Not Required

      Description

      This is related to DS-2823, and possibly also DS-2788.

      Our current settings for the Solr "documentCache" do not seem to be tuned for low memory usage. This matters most for smaller sites that want to run with only 1-2GB of memory allocated to Tomcat.

      Currently, the "documentCache" is set at 512 documents by default:
      https://github.com/DSpace/DSpace/blob/dspace-5_x/dspace/solr/search/conf/solrconfig.xml#L531

      While a setting of 512 may be reasonable for sites with more memory, it may be too large for sites running with just 1GB of memory if their indexed documents are also large. For one of our sites with larger PDFs, each Solr document seems to be at least 2MB, so a full cache of 512 documents can reach or exceed 1GB of memory.
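      The arithmetic behind that estimate can be sketched as follows. This is only a rough upper bound that assumes every cached entry holds a full-size document; the function name and the 2MB average are illustrative values taken from the site described above, not measurements of Solr internals.

```python
def document_cache_footprint_mb(cache_size: int, avg_doc_mb: float) -> float:
    """Rough upper bound on heap used by a completely full documentCache, in MB.

    Assumes every cached entry is about avg_doc_mb in size (an assumption,
    based on the ~2MB documents observed on one site).
    """
    return cache_size * avg_doc_mb


# Default cache size of 512 entries with ~2MB documents.
footprint_mb = document_cache_footprint_mb(cache_size=512, avg_doc_mb=2.0)
print(f"Estimated documentCache footprint: {footprint_mb:.0f} MB")
```

      With these inputs the estimate is 1024 MB, which is the whole heap of a site running Tomcat with 1GB.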

      As noted in the Solr Caching documentation: "The more fields you store in your documents, the higher the memory usage of this cache will be."
      https://wiki.apache.org/solr/SolrCaching
      https://cwiki.apache.org/confluence/display/solr/Query+Settings+in+SolrConfig

      The Solr documentation recommends keeping large fields out of the documentCache by enabling Lazy Loading (enableLazyFieldLoading=true) and by leaving larger fields out of the "fl" (field list) parameter.
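      As an illustration of that recommendation, here is a minimal sketch of a select request that names only small fields in "fl". The helper name is hypothetical, and the field names "handle" and "search.resourceid" are assumptions about what a small-field list might contain; "q", "fl" and "rows" are standard Solr query parameters.

```python
from typing import Sequence
from urllib.parse import urlencode


def build_select_query(user_query: str, fields: Sequence[str], rows: int = 10) -> str:
    """Build the path and query string of a Solr select request.

    Only the fields listed in `fields` are requested via "fl", so with lazy
    field loading enabled the larger stored fields need not be loaded.
    """
    params = {
        "q": user_query,
        "fl": ",".join(fields),  # explicit, small field list
        "rows": rows,
    }
    return "/select?" + urlencode(params)


# Hypothetical small-field list; no full-text field is requested.
query = build_select_query("climate", ["handle", "search.resourceid"])
print(query)
```

      Whether this is enough to keep the full text out of the cache in our setup is exactly what this issue questions, so treat it as a sketch of the documented advice rather than a verified fix.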

      Our "solrconfig.xml" DOES enable Lazy Loading of fields. However, it seems the default field (df) is always set to "search_text", which includes the content of all fields:

      Default field is "search_text"
      https://github.com/DSpace/DSpace/blob/dspace-5_x/dspace/solr/search/conf/solrconfig.xml#L815

      "search_text" is a copy of all fields:
      https://github.com/DSpace/DSpace/blob/dspace-5_x/dspace/solr/search/conf/schema.xml#L640

      The end result seems to be that our Solr documentCache ALWAYS includes the full text of indexed documents. If that full text is regularly large, the 512 documents in the cache can quickly use up 1GB (or more) of memory, which can result in OOM (heap size) errors for smaller sites.

      We should either find a way to have the full text be "lazy loaded", or lower the default size of the documentCache so that smaller sites are less likely to run out of memory when using this cache.
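      For the second option, one way to pick a smaller default is to work backwards from a memory budget. The sketch below is only a sizing heuristic: the function name is hypothetical, and the 25% share of the heap is an arbitrary illustrative value, not a Solr or DSpace recommendation.

```python
def max_cache_entries(heap_mb: float, cache_fraction: float, avg_doc_mb: float) -> int:
    """Largest documentCache size whose worst-case footprint fits the budget.

    heap_mb        -- heap allocated to Tomcat, in MB
    cache_fraction -- share of the heap we are willing to give the cache (assumed)
    avg_doc_mb     -- typical size of one cached Solr document, in MB (assumed)
    """
    if avg_doc_mb <= 0:
        raise ValueError("avg_doc_mb must be positive")
    budget_mb = heap_mb * cache_fraction
    return int(budget_mb // avg_doc_mb)


# 1GB heap, a quarter of it reserved for the cache, ~2MB per document.
entries = max_cache_entries(heap_mb=1024, cache_fraction=0.25, avg_doc_mb=2.0)
print(f"Suggested documentCache size: {entries}")
```

      Under these assumptions the cache would hold 128 entries instead of 512.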

      For the one site where I'm seeing this behavior the most, the OOM errors always occur when running `./dspace index-discovery` (no args). But, surprisingly, `./dspace index-discovery -b` runs fine.

              People

              • Assignee:
                tdonohue Tim Donohue
                Reporter:
                tdonohue Tim Donohue