Details

    • Roadmap Theme:
      Search

      Description

      Apache Tika is a toolkit that can extract text and metadata from a wide variety of MIME-typed formats (including PDF, via PDFBox). Employing Tika as an extraction engine in GSearch would immediately and greatly expand the range of material over which GSearch could operate. Going forward, GSearch would also benefit from new and better-performing parsers created as part of the Tika effort.

        Attachments

          Activity

          cwilper Chris Wilper added a comment -

          Gert, is this feature fair game for GSearch? If so, can you "Open" it?

          cwilper Chris Wilper added a comment -

          Assigning Gert just so he gets on the email list for this issue. Gert, our normal process for FCREPO issues is to review them and get consensus among committers if needed. If the issue is "fair game" for Fedora, we then put it into the Open state so people know that it can be worked on. I figured you'd be the best judge on this issue submitted by Adam.

          ajs6f@virginia.edu A. Soroka added a comment -

          Some useful info:

          Here:

          https://tika.apache.org/0.10/parser.html

          is the crucial API portion. As you can see at that link, Tika expects a SAX handler to receive information that it retrieves about an input. That seems to me to jibe quite well with the current GSearch architecture and shouldn't be difficult to integrate with any future architecture.

          Also, take a look at this format list:

          https://tika.apache.org/0.10/formats.html

          A real cornucopia!
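
          As a rough illustration of the SAX-handler pattern described above, here is a minimal sketch of a handler that collects text from SAX events. Tika is not on the classpath in this sketch: the JDK's own SAX parser stands in as the event source, and the class names TextCollectingHandler and SaxTextSketch are hypothetical. With Tika, the same handler would instead be passed to Parser.parse(InputStream, ContentHandler, Metadata, ParseContext), which reports extracted content as XHTML SAX events.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: accumulates the character data it receives as SAX events.
class TextCollectingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Append a space at every element end so adjacent elements do not run together.
        text.append(' ');
    }

    String getText() {
        // Trim the ends and collapse runs of whitespace into single spaces.
        return text.toString().trim().replaceAll("\\s+", " ");
    }
}

class SaxTextSketch {
    // Assumption: the input is well-formed XML/XHTML. Here the JDK SAX parser
    // produces the events; in GSearch, a Tika parser would drive the handler.
    static String extractText(String xhtml) throws Exception {
        TextCollectingHandler handler = new TextCollectingHandler();
        SAXParserFactory.newInstance()
                .newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getText();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><p>Hello</p><p>GSearch</p></body></html>";
        System.out.println(extractText(xhtml)); // prints "Hello GSearch"
    }
}
```

          Because the handler only depends on the SAX ContentHandler contract, it does not care which parser feeds it, which is why this style should fit a SAX-oriented extraction engine.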

          gertsp Gert Schmeltz Pedersen added a comment -

          I just published the branch fcrepo-1010 on GitHub to get feedback.
          It uses Tika 0.10; Tika 1.0 is expected in November 2011.
          The Tika jar is big, 24 MB, so PermGen space has to be raised,
          and it doubles the size of fedoragsearch.war.
          The branch adds two functions to GenericOperationsImpl:

          • getDatastreamFromTika: retrieves the text only
          • getDatastreamFromTikaWithMetadata: retrieves metadata also

          The branch comes with a test suite in gsearch.test.fgs24_1010,
          where the two functions are tested on both Lucene and Solr.
          The tests have docx, doc, and pdf datastreams,
          but potentially all the Tika formats are available,
          since the branch uses AutoDetectParser in Tika.
          gertsp Gert Schmeltz Pedersen added a comment -

          Adam,
          If you can take the time, please review this issue. I added a comment with some details when I first committed the branch fcrepo-1010. Yesterday, I committed the new Tika 1.0. Let me know if you need more info.

          ajs6f@virginia.edu A. Soroka added a comment -

          I'm now beginning some simple testing. Further bulletins as events warrant.

          ajs6f@virginia.edu A. Soroka added a comment -

          Gert's code does very well. Extraction is possible for both content and metadata. We are following through with some discussions about the best way to expose this excellent new functionality into GSearch XSLT.

          gertsp Gert Schmeltz Pedersen added a comment -

          Based on Adam's comments, things are being reconsidered.

          gertsp Gert Schmeltz Pedersen added a comment -

          New code has been committed and pushed. See the new explanation at

          FgsConfig/FgsConfigIndexTemplate/SolrTest/foxmlToSolr_fgs24_1010.xslt

          The choice between text-only, metadata-only, and both should be clearer now. Also, the tricky generation of tags is now clearer, in that all the tags are generated in one place, in TransformerToText.java.


            People

            • Assignee:
              gertsp Gert Schmeltz Pedersen
              Reporter:
              ajs6f@virginia.edu A. Soroka
              Reviewer:
              A. Soroka
            • Votes:
              0
              Watchers:
              1

              Dates

              • Created:
                Updated:
                Resolved: