DKPro Core: n-grams (token-based and String-based)

Many NLP applications use n-grams, for instance to calculate text similarity or in machine translation. This tutorial demonstrates how to use DKPro Core to obtain the n-grams from a text.

DKPro Core is a UIMA-based framework, so many of its components rely heavily on UIMA types. There is, however, a way to extract n-grams directly from a list of strings, which saves us the effort of first kicking off a UIMA pipeline just to tokenize our text. The tutorial comprises two parts:

  1. Token-based n-grams that can be used in UIMA pipelines and
  2. String-based n-grams that can be obtained with a minimum of effort from a list of strings.

The code for this tutorial is available on GitHub.

Token-based N-grams

For this tutorial I use the example sentence “Mary gives John the apple.” First, we split the sentence into its tokens. As we need to use UIMA components anyway, I use the BreakIteratorSegmenter component for tokenization:
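To give a rough idea of what the tokenization step yields, here is a minimal, library-free sketch built on java.text.BreakIterator from the JDK. It is not the DKPro Core BreakIteratorSegmenter component and it does not fill a jCas. I merely assume that the segmenter splits the sentence in a similar way, and the class name TokenizeSketch and the method tokenize are made up for this illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of DKPro Core.
class TokenizeSketch {

    // Splits the text at the word boundaries reported by the JDK's
    // BreakIterator and drops segments that consist only of whitespace.
    static List<String> tokenize(String text) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ENGLISH);
        words.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String segment = text.substring(start, end);
            if (!segment.trim().isEmpty()) {
                tokens.add(segment);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The final period becomes a token of its own.
        System.out.println(tokenize("Mary gives John the apple."));
    }
}
```

In the UIMA pipeline the tokens are stored as annotations in the jCas rather than as plain Strings, which is what the n-gram step below relies on.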

Afterwards, jCas contains the tokens of the sentence and we can build the n-grams from these tokens: NGramIterable’s factory method create takes an iterable of tokens and a maximum value for the n in our n-grams. In our case, I want to extract all bigrams and therefore choose n=2.

As with every iterator, we can now use the iterator methods hasNext and next to retrieve the n-grams. Unfortunately, the iterator returns all n-grams up to a length of n, i.e., all unigrams/tokens and all bigrams, but we only want the bigrams! We can use a little trick to identify them: a bigram always covers exactly two tokens, so we can use JCasUtil.selectCovered to check how many tokens an n-gram actually subsumes:
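The filtering idea itself does not depend on UIMA. The following sketch shows it on plain Strings: it first builds every n-gram up to a maximum length and then keeps only those that cover exactly two tokens. It does not call NGramIterable or JCasUtil.selectCovered. The class BigramFilterSketch and its methods are hypothetical names, and the token list is hard-coded without the final period.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of DKPro Core.
class BigramFilterSketch {

    // Returns every n-gram of length 1..maxN, each as a list of tokens.
    static List<List<String>> ngramsUpTo(List<String> tokens, int maxN) {
        List<List<String>> result = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            for (int n = 1; n <= maxN && start + n <= tokens.size(); n++) {
                result.add(new ArrayList<>(tokens.subList(start, start + n)));
            }
        }
        return result;
    }

    // Keeps only the n-grams that cover exactly two tokens.
    static List<String> bigramsOnly(List<List<String>> ngrams) {
        List<String> bigrams = new ArrayList<>();
        for (List<String> ngram : ngrams) {
            if (ngram.size() == 2) {
                bigrams.add(String.join(" ", ngram));
            }
        }
        return bigrams;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("Mary", "gives", "John", "the", "apple");
        // Five unigrams and four bigrams are generated; only the bigrams are printed.
        System.out.println(String.join(", ", bigramsOnly(ngramsUpTo(tokens, 2))));
    }
}
```

In the token-based version the size check is replaced by counting the token annotations that an n-gram annotation covers.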

That’s it. When we run the application, we get the following output (I omitted some boilerplate/formatting code in the above listings):

Mary gives, gives John, John the, the apple

Note that the final period is not included in the bigrams.

String-based N-grams

As for the previous example, I use the sample sentence “Mary gives John the apple.”

Compared to the token-based example, this one is much easier. We replace the segmenter component with a call to String.split. To keep the regular expression simple, I add a space before the period:

A second line of code already produces our desired iterator over the n-grams. Note that we may specify a minimal and maximal n for our n-grams here:
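To make the roles of the minimal and maximal n concrete, here is a small stand-alone sketch in plain Java. It does not use the DKPro Core iterator; the class StringNGramSketch and the method ngrams are hypothetical names, and the parameters minN and maxN only mirror the idea described above. The input is the example sentence with the extra space before the period, split with String.split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of DKPro Core.
class StringNGramSketch {

    // Returns all n-grams whose length lies between minN and maxN (inclusive),
    // ordered by start position and then by length.
    static List<String> ngrams(String[] tokens, int minN, int maxN) {
        List<String> result = new ArrayList<>();
        for (int start = 0; start < tokens.length; start++) {
            for (int n = minN; n <= maxN && start + n <= tokens.length; n++) {
                result.add(String.join(" ", Arrays.copyOfRange(tokens, start, start + n)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // A space before the period lets us split on single spaces.
        String[] tokens = "Mary gives John the apple .".split(" ");
        // minN = maxN = 2 yields bigrams only, so no filtering is needed.
        System.out.println(String.join(", ", ngrams(tokens, 2, 2)));
    }
}
```

Setting both bounds to 2 is what makes the bigram check from the token-based example unnecessary here.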

The rest is almost identical, but our n-grams are now Strings and we do not need to care about the n-grams’ length. The output code reduces to:

In contrast to the token-based approach, this n-gram iterator also produces the bigram that contains the period:

Mary gives, gives John, John the, the apple, apple .

Where to get it

The code for this tutorial is available on GitHub.

Maven dependency for the n-gram tools:

Maven dependency for the segmenter/tokenizer components:

The version information of the two dependencies is provided through Maven’s Dependency Management:

Links

  • [1] Working with n-grams (official DKPro Core ASL site)
  • [2] NGramIterable Javadoc (latest)
  • [3] Code on GitHub
