This article describes how to serialize and deserialize JCas objects using DKPro’s XmiWriter and XmiReader components. A runnable Maven project can be found on GitHub.

Dependencies

Only one dependency is necessary, which is available on Maven Central:

As usual in the context of DKPro Core, it is better to omit the version tag and to configure the version of DKPro Core centrally:

Serialization

The basic code for serialization looks as follows:

The target location is the folder where the cached JCases will be stored. You may either pass a String or a File object. Each JCas needs a DocumentMetaData feature structure in order to know the target filename. The filename can either be configured via DocumentMetaData.setDocumentId(String) or via setBaseURI(String) and setURI(String). For details, look at the provided sample project.

Deserialization

Deserialization works analogously, but the XmiReader is a reader rather than a consumer and therefore has to be the first component in the pipeline:

The source location is identical to the target location of the writer. Additionally, the reader requires patterns that describe which files to include (“[+]”) and which to exclude (“[-]”). The patterns follow the Ant pattern format.
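To make the prefix convention concrete, here is a minimal plain-Java sketch that separates include patterns from exclude patterns by their “[+]” and “[-]” prefixes. It is not the XmiReader’s implementation and does not perform any Ant-style matching. The class name PatternPrefixes, the method select, and both example patterns are made up for illustration, and the sketch simply ignores patterns without a prefix, which is an assumption of mine.

```java
import java.util.ArrayList;
import java.util.List;

class PatternPrefixes {
    static final String INCLUDE = "[+]";
    static final String EXCLUDE = "[-]";

    // Returns the patterns that start with the given prefix, with the prefix removed.
    // Patterns without that prefix are skipped (an assumption of this sketch).
    static List<String> select(String[] patterns, String prefix) {
        List<String> selected = new ArrayList<>();
        for (String pattern : patterns) {
            if (pattern.startsWith(prefix)) {
                selected.add(pattern.substring(prefix.length()));
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Hypothetical patterns: include every XMI file, exclude files starting with "broken".
        String[] patterns = { "[+]**/*.xmi", "[-]**/broken*.xmi" };
        System.out.println("include: " + select(patterns, INCLUDE));
        System.out.println("exclude: " + select(patterns, EXCLUDE));
    }
}
```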

Download

If you are interested in a “minimal working example”, you can find a Maven project on GitHub.

References

  • [1] Ant patterns

 

DKPro Core contains a component that wraps the popular TreeTagger.

Unfortunately, only the core component de.tudarmstadt.ukp.dkpro.core.treetagger-asl is directly available as a Maven artifact, because license restrictions prohibit redistributing the binaries (de.tudarmstadt.ukp.dkpro.core.treetagger-bin) and the models (de.tudarmstadt.ukp.dkpro.core.treetagger-model-{de,en,fr,…}). The DKPro Core developer team provides instructions on how to create the latter artifacts using an Ant build.xml script.

The Maven dependencies of the TreeTagger component look as follows. It is important to use dependency management in order to coordinate the versions of the three artifacts.

References

  • [1] TreeTagger project site
  • [2] Instructions on packaging the binary and model artifacts

This article describes how to make use of Google’s Web1T corpus. We use the reader provided by DKPro Core.

In 2006, Google Inc. released a corpus of n-grams with a length of up to 5, as announced in their research blog. The data can be obtained from the Linguistic Data Consortium (LDC) for a fee of $150 (non-members). For those who find this too costly, there is also a way to build one’s own corpus in Web1T format, using DKPro Core.

How to use it

Reading Web1T files is relatively easy: include the corresponding Maven dependency in your pom.xml, and the rest is a one-liner. The following snippet extracts all n-grams with a length of 1 to 3. Note that the lower bound must be 1 because of a known bug in version 1.4.0.

In the context of DKPro, it is always advisable to keep your corpora organized in a directory that the environment variable DKPRO_HOME points to. Many of the DKPro readers will try to find documents below this directory automatically.
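If you want to locate a corpus folder below DKPRO_HOME yourself, a plain-Java sketch along the following lines could work. It does not reproduce the lookup logic of the DKPro readers. The class name DkproHome, the method resolveCorpus, and the folder name “web1t” are hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class DkproHome {
    // Resolves a corpus folder below the given DKPRO_HOME value.
    static Path resolveCorpus(String dkproHome, String corpusName) {
        if (dkproHome == null || dkproHome.isEmpty()) {
            throw new IllegalStateException("Environment variable DKPRO_HOME is not set");
        }
        return Paths.get(dkproHome).resolve(corpusName);
    }

    public static void main(String[] args) {
        String home = System.getenv("DKPRO_HOME");
        if (home == null || home.isEmpty()) {
            System.out.println("DKPRO_HOME is not set");
            return;
        }
        // "web1t" is a hypothetical folder name, not one prescribed by DKPro.
        System.out.println(resolveCorpus(home, "web1t"));
    }
}
```

Passing the environment value as a parameter keeps resolveCorpus easy to test without touching the real environment.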

Afterwards, you can query the count of any phrase you like (separate multiple tokens with whitespace):

Results:

Count for ‘house’: 350467
Count for ‘like you’: 1632
Count for ‘What a wonderful’: 40

If you query n-grams that are not in the index, the reader will complain about this. An earlier post describes how to silence these complaints – another way would be to filter n-grams before handing them to the reader.

Where to get it

The code for this tutorial is available on GitHub.

Maven dependency for the Web1T reader:

The version information of the two dependencies is provided through Maven’s Dependency Management:

Links

  • [1] Announcement by Google Inc
  • [2] Download Web1T corpus from LDC
  • [3] Instructions on how to build custom Web1T files
  • [4] Code on GitHub

Many applications in NLP use n-grams, for instance, for calculating text similarity or for machine translation. This tutorial demonstrates how to use DKPro Core to obtain the n-grams from a text.

DKPro Core is a UIMA-based framework, so many components build heavily upon UIMA types. There is, however, a way to extract n-grams directly from a list of strings, which spares us the effort of first kicking off a UIMA pipeline to tokenize our text. The tutorial comprises two parts:

  1. Token-based n-grams that can be used in UIMA pipelines and
  2. String-based n-grams that can be obtained with a minimum of effort from a list of strings.

The code for this tutorial is available on GitHub.

Token-based N-grams

For this tutorial I use the example sentence “Mary gives John the apple.” First, we split the sentence into its tokens. As we need to use UIMA components anyway, I use the BreakIteratorSegmenter component for tokenization:

Afterwards, jCas contains the tokens of the sentence and we can build the n-grams from these tokens: NGramIterable’s factory method create takes an iterable of tokens and a maximum value for the n in our n-grams. In our case, I want to extract all bigrams and choose n=2.

As with every iterator, we can now use the iterator methods hasNext and next to retrieve the n-grams. Unfortunately, the iterator will return all n-grams up to a length of n, i.e., all unigrams/tokens and bigrams, but we only want the bigrams! We can use a little trick to identify the bigrams: a bigram always covers exactly two tokens, so we can use JCasUtil.selectCovered to check how many tokens an n-gram actually subsumes:

That’s it. When we run the application, we get the following output (I omitted some boilerplate/formatting code in the above listings):

Mary gives, gives John, John the, the apple

Note that the final period is not included in the bigrams.
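The filtering trick can be illustrated without UIMA. The following plain-Java sketch is a stand-in, not the DKPro API: it builds all n-grams up to a maximum length and then keeps only those that cover exactly two tokens. The class name BigramFilter and its methods are made up for illustration, and the period is simply left out of the input so that the result matches the output shown above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class BigramFilter {
    // Builds all n-grams of length 1..maxN, each kept as a list of its tokens.
    static List<List<String>> upTo(List<String> tokens, int maxN) {
        List<List<String>> result = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            for (int n = 1; n <= maxN && start + n <= tokens.size(); n++) {
                result.add(tokens.subList(start, start + n));
            }
        }
        return result;
    }

    // Keeps only the n-grams that cover exactly two tokens and joins them with a space.
    static List<String> onlyBigrams(List<List<String>> ngrams) {
        return ngrams.stream()
                .filter(ngram -> ngram.size() == 2)
                .map(ngram -> String.join(" ", ngram))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("Mary", "gives", "John", "the", "apple");
        // Prints: Mary gives, gives John, John the, the apple
        System.out.println(String.join(", ", onlyBigrams(upTo(tokens, 2))));
    }
}
```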

String-based N-grams

As for the previous example, I use the sample sentence “Mary gives John the apple.”

Compared to the token-based example, this one is much easier. We replace the segmenter component with a call to String.split. To keep the regular expression simple, I add a space before the period:

A second line of code already produces our desired iterator over the n-grams. Note that we may specify a minimal and maximal n for our n-grams here:

The rest is almost identical, but our n-grams are now Strings and we do not need to care about the n-grams’ length. The output code reduces to:

In contrast to the token-based approach, this n-gram iterator also produces the bigram that contains the period:

Mary gives, gives John, John the, the apple, apple .
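For readers who want to see the idea without DKPro Core on the classpath, here is a minimal plain-Java sketch of string-based n-grams with a minimal and a maximal n. It is a stand-in and not the DKPro iterator. The class name StringNGrams and the method ngrams are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class StringNGrams {
    // Returns all n-grams with minN <= n <= maxN, each joined with a single space.
    static List<String> ngrams(List<String> tokens, int minN, int maxN) {
        List<String> result = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            for (int n = minN; n <= maxN && start + n <= tokens.size(); n++) {
                result.add(String.join(" ", tokens.subList(start, start + n)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // A space before the period keeps the split expression simple.
        String sentence = "Mary gives John the apple .";
        List<String> tokens = Arrays.asList(sentence.split(" "));
        // Prints: Mary gives, gives John, John the, the apple, apple .
        System.out.println(String.join(", ", ngrams(tokens, 2, 2)));
    }
}
```

Setting both bounds to 2 yields only bigrams, so no length check is needed afterwards.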

Where to get it

The code for this tutorial is available on GitHub.

Maven dependency for the n-gram tools:

Maven dependency for the segmenter/tokenizer components:

The version information of the two dependencies is provided through Maven’s Dependency Management:

Links

  • [1] Working with n-grams (official DKPro Core ASL site)
  • [2] NGramIterable Javadoc (latest)
  • [3] Code on GitHub