Details

    • Roadmap Theme:
      Search

      Description

      Apache Tika is a toolkit that extracts text and metadata from a wide variety of MIME-typed formats (including PDF, via PDFBox). Employing Tika as an extraction engine in GSearch would immediately and greatly expand the range of material over which GSearch could operate. Going forward, GSearch would also benefit from new and better-performing parsers created as part of the Tika project.

        Activity

        Chris Wilper added a comment -
        Gert, is this feature fair game for GSearch? If so, can you "Open" it?
        Chris Wilper added a comment -
        Assigning Gert just so he gets on the email list for this issue. Gert, our normal process for FCREPO issues is to review them and get consensus among committers if needed. If the issue is "fair game" for Fedora, we then put it into the Open state so people know that it can be worked on. I figured you'd be the best judge on this issue submitted by Adam.
        A. Soroka added a comment -
        Some useful info:

        Here:

        https://tika.apache.org/0.10/parser.html

        is the crucial API portion. As you can see at that link, Tika expects a SAX handler to receive information that it retrieves about an input. That seems to me to jibe quite well with the current GSearch architecture and shouldn't be difficult to integrate with any future architecture.

        Also, take a look at this format list:

        https://tika.apache.org/0.10/formats.html

        A real cornucopia!
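
To make the SAX point concrete, here is a minimal sketch of the handler side of that contract. It assumes nothing about GSearch internals: the class name `TextCollectingHandler` and its `extract` helper are hypothetical, and the JDK's own SAX parser stands in for a Tika parser so the example runs without Tika on the classpath. With Tika, the same kind of handler would be passed to the parser's `parse` call instead.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the character data delivered through SAX events.
class TextCollectingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Insert a separator so text from adjacent elements does not run together.
        text.append(' ');
    }

    // Returns the collected text with runs of whitespace collapsed to one space.
    String getText() {
        return text.toString().trim().replaceAll("\\s+", " ");
    }

    // Stand-in for a Tika parse: the JDK SAX parser drives the handler
    // from an XHTML string (an assumption made only for this sketch).
    static String extract(String xhtml) {
        TextCollectingHandler handler = new TextCollectingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return handler.getText();
    }
}
```

Because the handler only sees SAX events, it does not need to know which format the events were extracted from, which is why this style of API should fit an XSLT-driven pipeline reasonably well.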

        Gert Schmeltz Pedersen added a comment -
        I just published the branch fcrepo-1010 on GitHub to get feedback.
        It uses Tika 0.10; Tika 1.0 is expected in November 2011.
        It is a big jar, 24 MB, so PermGen space has to be raised,
        and it doubles the size of fedoragsearch.war.
        The branch adds two functions to GenericOperationsImpl:
        - getDatastreamFromTika: retrieves the text only
        - getDatastreamFromTikaWithMetadata: retrieves metadata also
        The branch comes with a test suite in gsearch.test.fgs24_1010,
        where the two functions are tested on both Lucene and Solr.
        The tests have docx, doc, and pdf datastreams,
        but potentially all the Tika formats are available,
        since the branch uses AutoDetectParser in Tika.
        Gert Schmeltz Pedersen added a comment -
        Adam,
        If you can take the time, please review this issue. I added a comment with some details when I first committed the branch fcrepo-1010. Yesterday, I committed the new Tika 1.0. Let me know if you need more info.
        A. Soroka added a comment -
        I'm now beginning some simple testing. Further bulletins as events warrant.
        A. Soroka added a comment -
        Gert's code does very well. Extraction is possible for both content and metadata. We are following through with some discussions about the best way to expose this excellent new functionality into GSearch XSLT.
        Gert Schmeltz Pedersen added a comment -
        Based on Adam's comments, the approach is being reconsidered.
        Gert Schmeltz Pedersen added a comment -
        New code has been committed and pushed. See the new explanation at

        FgsConfig/FgsConfigIndexTemplate/SolrTest/foxmlToSolr_fgs24_1010.xslt

        The choice between text-only, metadata-only, and both should be clearer now. Also, the tricky generation of tags is now clearer, in that all the tags are generated in one place, in TransformerToText.java.
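
As a rough illustration of the three-way choice described above, the sketch below generates Solr-style `<field>` elements in a single place from already-extracted text and metadata. It is not the code in TransformerToText.java: the class `SolrFieldSketch`, the `Mode` enum, and the `ds.<id>` field-naming scheme are all hypothetical and chosen only for this example.

```java
import java.util.Map;

// Hypothetical sketch: one place that turns extracted text and metadata
// into Solr-style <field> elements, depending on the requested mode.
class SolrFieldSketch {
    enum Mode { TEXT_ONLY, METADATA_ONLY, BOTH }

    // Escapes the characters that are unsafe in XML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    static String field(String name, String value) {
        return "<field name=\"" + escape(name) + "\">" + escape(value) + "</field>\n";
    }

    // dsId, text, and metadata are assumed to come from an earlier extraction step.
    static String toFields(String dsId, String text, Map<String, String> metadata, Mode mode) {
        StringBuilder out = new StringBuilder();
        if (mode != Mode.METADATA_ONLY) {
            out.append(field("ds." + dsId, text));
        }
        if (mode != Mode.TEXT_ONLY) {
            for (Map.Entry<String, String> e : metadata.entrySet()) {
                // Assumed naming scheme: one field per metadata name.
                out.append(field("ds." + dsId + "." + e.getKey(), e.getValue()));
            }
        }
        return out.toString();
    }
}
```

Keeping the tag generation in one method like `toFields` means the escaping and naming rules are applied uniformly, whichever of the three modes the stylesheet asks for.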

          People

          • Assignee:
            Gert Schmeltz Pedersen
            Reporter:
            A. Soroka
            Reviewer:
            A. Soroka
          • Votes:
            0
            Watchers:
            1

            Dates

            • Created:
              Updated:
              Resolved: