DKPro Core: Web1T corpus reader

This article describes how to make use of Google’s Web1T corpus. We use the reader provided by DKPro Core.

In 2006, Google Inc. released a corpus of n-grams with a length of up to 5, as announced on their research blog [1]. The data can be obtained from the Linguistic Data Consortium (LDC) [2] for a fee of $150 (non-members). For those who find this too costly, there is also a way to build one's own corpus in Web1T format using DKPro Core [3].

How to use it

Reading Web1T files is relatively easy: include the corresponding Maven dependency in your pom.xml, and setting up the reader is a one-liner. The following snippet extracts all n-grams with a length of 1 to 3. Note that the lower bound must be 1; this is a known bug in version 1.4.0.

In the context of DKPro, it is always advisable to keep your corpora organized in a directory that the environment variable DKPRO_HOME points to. Many of the DKPro readers will try to find documents below this directory automatically.
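To illustrate the DKPRO_HOME convention, here is a minimal sketch in plain Java that resolves a corpus directory below the path stored in that variable. It does not use any DKPro classes. The helper `resolveCorpusDir` and the sub-directory names `web1t` and `ENGLISH` are assumptions made for this example, so adapt them to wherever your index files actually live.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class DkproHomeResolver {

    // Hypothetical helper: builds a path below the given DKPRO_HOME value.
    // Fails early if the variable is missing, instead of silently using a wrong path.
    static Path resolveCorpusDir(String dkproHome, String... relativeParts) {
        if (dkproHome == null || dkproHome.trim().isEmpty()) {
            throw new IllegalStateException("Environment variable DKPRO_HOME is not set");
        }
        return Paths.get(dkproHome, relativeParts).normalize();
    }

    public static void main(String[] args) {
        String dkproHome = System.getenv("DKPRO_HOME");
        if (dkproHome == null) {
            System.err.println("DKPRO_HOME is not set; export it before running.");
            return;
        }
        // "web1t/ENGLISH" is an assumed layout, not something DKPro prescribes here.
        Path web1tDir = resolveCorpusDir(dkproHome, "web1t", "ENGLISH");
        System.out.println("Looking for Web1T index files in: " + web1tDir);
    }
}
```

Passing the variable's value in as a parameter, rather than reading it inside the helper, keeps the path logic easy to test without touching the real environment.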

Afterwards, you can query the count of any phrase you like (separate multiple tokens with whitespace):

Results:

Count for ‘house’: 350467
Count for ‘like you’: 1632
Count for ‘What a wonderful’: 40

If you query n-grams that are not in the index, the reader will complain about this. An earlier post describes how to silence these complaints – another way would be to filter n-grams before handing them to the reader.
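The filtering idea can be sketched in plain Java as follows. This is only one possible pre-filter, under the assumption that the reader was set up for n-grams of length 1 to 3 as described above: it normalizes whitespace and drops phrases whose token count falls outside that range. The class `NgramFilter` and its methods are hypothetical names for this example and not part of DKPro Core. Phrases of a valid length that simply do not occur in the corpus would still pass this filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NgramFilter {

    // Collapse runs of whitespace so tokens are separated by exactly one space.
    static String normalize(String phrase) {
        return phrase.trim().replaceAll("\\s+", " ");
    }

    // Number of whitespace-separated tokens; 0 for an empty or blank phrase.
    static int tokenCount(String phrase) {
        String normalized = normalize(phrase);
        return normalized.isEmpty() ? 0 : normalized.split(" ").length;
    }

    // Keep only phrases whose length lies within [minN, maxN] tokens.
    static List<String> filterByLength(List<String> phrases, int minN, int maxN) {
        List<String> kept = new ArrayList<>();
        for (String phrase : phrases) {
            int n = tokenCount(phrase);
            if (n >= minN && n <= maxN) {
                kept.add(normalize(phrase));
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList(
                "house", "like  you", "What a wonderful", "this phrase has four", "");
        // prints [house, like you, What a wonderful]
        System.out.println(filterByLength(candidates, 1, 3));
    }
}
```

Only the phrases that survive the filter would then be passed on to the reader for counting.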

Where to get it

The code for this tutorial is available on GitHub [4].

Maven dependency for the Web1T reader:

The version information of the two dependencies is provided through Maven’s Dependency Management:

Links

  • [1] Announcement by Google Inc
  • [2] Download Web1T corpus from LDC
  • [3] Instructions on how to build custom Web1T files
  • [4] Code on GitHub
