
Open any file with Apache Tika

By Matti Tahvonen · On May 23, 2024 3:35:55 PM

I wrote a handy web utility for you. It can read pretty much any file, detect its MIME type and other file-specific metadata, and preview its content as text. I'm not sure it is really useful to anybody, but I hope it at least works as an example of using Apache Tika in your web app.

Apache Tika is a library for extracting metadata and text from a wide range of file formats. You can also extend its capabilities if you expect it to parse something more exotic. Tika fits dozens of use cases, from search indexing to training AI models.

The example app essentially replicates the “official” Tika CLI/desktop functionality as a web app. It uses Vaadin Upload to receive the file to analyze, and Grid and basic text components to display the results.

How was it built?

Creating the project and dependencies

I started with an empty Vaadin project generated from the Maven archetype vaadin-archetype-spring-application. Archetypes are a handy way to create a fresh project directly in any IDE with the latest supported Vaadin version. As a bonus, in a multi-module project, you can create your Vaadin UI directly inside your larger project structure.

The only relevant change for the pom.xml was to add two dependencies for Tika:

    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>2.9.2</version>
    </dependency>

    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers-standard-package</artifactId>
        <version>2.9.2</version>
    </dependency>

This brings in the core Tika API and the standard package for parsers.

Uploading the file to a Java server

The MainView contains a description, an area for results, and an Upload component. I configure the Upload component to use FileBuffer, which essentially saves the uploaded file to a temporary file on the server. That file can then later be accessed in the succeeded listener and passed to the previewContent() method (discussed next) like this:

    FileBuffer r = new FileBuffer();
    upload.setReceiver(r);
    upload.addSucceededListener(e -> {
        File tmpFile = r.getFileData().getFile();
        String fileNameFromBrowser = e.getFileName();
        previewContent(fileNameFromBrowser, tmpFile);
        tmpFile.delete();
    });

My demo server has plenty of disk space, but I still delete the file right after it has been handled by Tika.
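One caveat with the listener above: if previewContent() throws, the delete() call is never reached and the temporary file stays on disk. Below is a minimal sketch of a cleanup that always runs. TempFileCleanup and handleAndDelete are hypothetical names, not part of Vaadin or the demo app, and the Consumer stands in for previewContent().

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Consumer;

// Hypothetical helper, not part of Vaadin or the demo app
public class TempFileCleanup {

    // Runs the handler, then deletes the file even if the handler threw
    public static void handleAndDelete(Path tmpFile, Consumer<Path> handler) {
        try {
            handler.accept(tmpFile);
        } finally {
            try {
                Files.deleteIfExists(tmpFile);
            } catch (IOException e) {
                // A cleanup failure should not hide the handler's own exception
                System.err.println("Could not delete " + tmpFile + ": " + e.getMessage());
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("upload-", ".tmp");
        try {
            handleAndDelete(tmp, p -> {
                throw new IllegalStateException("parsing failed");
            });
        } catch (IllegalStateException expected) {
            // The handler failed, but the finally block has already run
        }
        System.out.println("Temp file still exists: " + Files.exists(tmp));
    }
}
```

In the upload listener, the same effect could be had by wrapping the previewContent() call in try and moving tmpFile.delete() into finally.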

I have sometimes complained a bit about our Upload API, as it needs a special Receiver to store the file, either in memory or as a temporary file. If you want to pass the input stream directly from the browser to Tika, you can replace the core Upload component with the UploadFileHandler component from the Viritin add-on (same UI component, simpler and more efficient Java API).

Parsing the file with Tika

The previewContent() method is where the Apache Tika API comes in. I'm using the basic parsing setup from the Tika documentation, which also collects the text content of the file. Passing in the original file name is not mandatory, as Tika also inspects the actual content of the file, but it may help Tika to provide better and faster results.

    private void previewContent(String originalFileName, File tmpFile) {
        AutoDetectParser parser = new AutoDetectParser();
        // Collects the body text; the no-arg constructor caps it at 100,000 characters
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        // The original file name is only a hint for type detection
        metadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, originalFileName);
        metadata.set("File size", tmpFile.length() + "B");

        try (InputStream stream = TikaInputStream.get(tmpFile)) {
            parser.parse(stream, handler, metadata);
            displayParsingResults(metadata, handler);
        } catch (WriteLimitReachedException ex) {
            // Text limit reached: still show what was collected so far
            Notification.show(ex.getMessage());
            displayParsingResults(metadata, handler);
        } catch (Exception ex) {
            result.add(new H2("Parsing Data failed: " + ex.getMessage()));
            throw new RuntimeException(ex);
        }
    }
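The role of the file name hint can be tried in isolation with Tika's simpler facade class, org.apache.tika.Tika, which is part of tika-core. The sketch below is not code from the demo app; DetectionSketch and its method names are made up for illustration. It compares detection by name only, by content only, and by both.

```java
import java.nio.charset.StandardCharsets;
import org.apache.tika.Tika;

// Illustrative class, not part of the demo app
public class DetectionSketch {

    private static final Tika TIKA = new Tika();

    // Detection from the file name alone (no content is read)
    public static String byName(String fileName) {
        return TIKA.detect(fileName);
    }

    // Detection from the first bytes of the content alone
    public static String byContent(byte[] prefix) {
        return TIKA.detect(prefix);
    }

    // Detection from the content, with the file name as an extra hint
    public static String byContentAndName(byte[] prefix, String fileName) {
        return TIKA.detect(prefix, fileName);
    }

    public static void main(String[] args) {
        byte[] pdfHeader = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(byName("report.pdf"));
        System.out.println(byContent(pdfHeader));
        System.out.println(byContentAndName(pdfHeader, "report.pdf"));
    }
}
```

Setting TikaCoreProperties.RESOURCE_NAME_KEY on the Metadata object, as previewContent() does, gives the AutoDetectParser the same kind of hint.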

Tika adds more entries to the Metadata object as it parses the file. The available metadata depends on the file type.

Once the file is parsed, or Tika has collected text up to the handler's default limit (100,000 characters for the no-argument BodyContentHandler), the results are passed to the displayParsingResults() method, which displays them in the browser.
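If the default limit does not suit your use case, BodyContentHandler also has a constructor that takes the write limit as a number of characters, with -1 disabling the limit. The following is a minimal sketch, assuming the same Tika 2.9.2 dependencies as above; WriteLimitSketch and extract are illustrative names, and the input is an in-memory byte array instead of an uploaded file.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import org.apache.tika.exception.WriteLimitReachedException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

// Illustrative class, not part of the demo app
public class WriteLimitSketch {

    // Extracts at most writeLimit characters of text; pass -1 for no limit
    public static String extract(byte[] content, int writeLimit) throws Exception {
        BodyContentHandler handler = new BodyContentHandler(writeLimit);
        try (InputStream stream = new ByteArrayInputStream(content)) {
            new AutoDetectParser().parse(stream, handler, new Metadata());
        } catch (WriteLimitReachedException ex) {
            // Limit hit: the handler still holds the text collected so far
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] text = "word ".repeat(1000).getBytes(StandardCharsets.UTF_8);
        System.out.println(extract(text, 50).length());
        System.out.println(extract(text, -1).length());
    }
}
```

Disabling the limit on a public server means a single large upload can produce a very large string in memory, so a bounded limit is usually the safer default.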

Showing results

The displayParsingResults() method is essentially simple Vaadin usage, where we display the metadata and the extracted text in the web UI.

For the metadata display, I extracted a separate MetadataGrid class that displays the key-value pairs from Tika's Metadata object. Even though this code could be expressed inline with fewer lines of code, and is only used once, it is a good convention to extract logical pieces of your UI into separate classes to improve maintainability.

    public class MetadataGrid extends Grid<String> {
        public MetadataGrid(Metadata metadata) {
            // Metadata keys as rows/items
            setItems(metadata.names());
            addColumn(s -> s).setHeader("Property");
            addColumn(s -> metadata.get(s)).setHeader("Value");
            if (metadata.names().length < 6) {
                // adjust size based on rows if only few rows of data
                setAllRowsVisible(true);
            }
        }
    }
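One detail to be aware of: a Tika Metadata key can hold several values, and get(String) returns only the first of them. If you want the grid to show every value, the Value column could join the result of getValues(String) instead. Below is a small sketch of that idea; MetadataValuesSketch and allValues are hypothetical names, and the sample key and values are made up.

```java
import org.apache.tika.metadata.Metadata;

// Illustrative class, not part of the demo app
public class MetadataValuesSketch {

    // Joins all values stored under a key, so multi-valued entries are not cut short
    public static String allValues(Metadata metadata, String key) {
        return String.join(", ", metadata.getValues(key));
    }

    public static void main(String[] args) {
        Metadata metadata = new Metadata();
        metadata.add("dc:creator", "Alice");
        metadata.add("dc:creator", "Bob");

        System.out.println(metadata.get("dc:creator"));        // first value only
        System.out.println(allValues(metadata, "dc:creator")); // every value, comma-separated
    }
}
```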

For the text content, I simply use the Pre component (which essentially wraps a pre HTML tag). Below is the full displayParsingResults() method.

    private void displayParsingResults(Metadata metadata, BodyContentHandler handler) {
        result.removeAll();
        result.add(new H2("Metadata:"));
        result.add(new MetadataGrid(metadata));
        result.add(new H2("Extracted text:"));
        var extractedText = new Pre(handler.toString());
        extractedText.getStyle().setPadding("1em");
        result.add(extractedText);
    }

If you want a better-formatted preview from Tika, you can also configure it to provide rich text output (HTML), which you could display on the Vaadin side with the Html component.
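As a rough sketch of the Tika side of that, a BodyContentHandler can wrap a ToXMLContentHandler so that the body of the document is serialized as XHTML instead of plain text. This is not code from the demo app; XhtmlOutputSketch and toXhtml are illustrative names, and the input is an in-memory byte array.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.ContentHandler;

// Illustrative class, not part of the demo app
public class XhtmlOutputSketch {

    // Returns the content of the document body as XHTML markup
    public static String toXhtml(byte[] content) throws Exception {
        // BodyContentHandler keeps only the body part; ToXMLContentHandler serializes it
        ContentHandler handler = new BodyContentHandler(new ToXMLContentHandler());
        try (InputStream stream = new ByteArrayInputStream(content)) {
            new AutoDetectParser().parse(stream, handler, new Metadata());
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] text = "Hello from Tika".getBytes(StandardCharsets.UTF_8);
        System.out.println(toXhtml(text));
    }
}
```

Two things to keep in mind if you try this: the Html component expects a fragment with exactly one root element, so the markup would likely need a wrapping element such as a div, and the markup originates from an uploaded file, so it should be sanitized before it is sent to the browser. This handler combination also has no write limit of its own.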

Check out the source code or try the online demo

The full source code of the demo is available on GitHub. If you just want to check out what your file actually is, feel free to try the app on a demo server (1MB input file size limit).

Matti Tahvonen
Matti Tahvonen has a long history in Vaadin R&D: developing the core framework from the dark ages of pure JS client side to the GWT era and creating a number of official and unofficial Vaadin add-ons. His current responsibility is to keep you up to date with the latest and greatest Vaadin-related technologies. You can follow him on Twitter – @MattiTahvonen