Determining per-class feature frequencies in ARFF files on the command line

In this post, I demonstrate how to analyze an ARFF file to find out how frequently a given feature occurs in a certain class. For simplicity, I assume only binary features, i.e., attributes of type NUMERIC whose value is either 0 or 1.

The numbers in the comments are examples from my dataset.

Find out the column number of an attribute

Determine the first line containing an @ATTRIBUTE:
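One way to do this, as a minimal sketch: the helper name first_attribute_line and the file name data.arff are placeholders of mine, not part of any tool. ARFF keywords are case-insensitive, hence the -i.

```shell
# Print the line number of the first @ATTRIBUTE declaration.
# $1 = path to the ARFF file
first_attribute_line() {
    # -n prefixes each match with its line number, -i ignores case;
    # head keeps the first match, cut strips everything after "NUMBER:".
    grep -n -i '^@attribute' "$1" | head -n 1 | cut -d: -f1
}

# Usage (data.arff is a placeholder file name):
#   first_attribute_line data.arff
```

Going by the calculation further down, this is line 8 in my dataset.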

Determine the line containing the desired feature (HasAtLeastOne_auch):
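Again a sketch under assumptions: feature_line is a hypothetical helper, and it expects the attribute name to be unquoted and followed by whitespace, as in "@ATTRIBUTE HasAtLeastOne_auch NUMERIC". The name is used as a regular expression, and -i makes the whole match case-insensitive, so names with regex metacharacters or names that differ only in case would need extra care.

```shell
# Print the line number of the @ATTRIBUTE declaration for one feature.
# $1 = attribute name, $2 = path to the ARFF file
feature_line() {
    # Require whitespace on both sides of the name so that a longer
    # attribute name sharing the same prefix does not match.
    grep -n -i "^@attribute[[:space:]][[:space:]]*$1[[:space:]]" "$2" \
        | head -n 1 | cut -d: -f1
}

# Usage (data.arff is a placeholder file name):
#   feature_line HasAtLeastOne_auch data.arff
```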

With the first @ATTRIBUTE on line 8 and the desired feature on line 191, the feature is in column 191 - 8 + 1 = 184, so there are 183 columns before it.

Of course, this whole calculation can be done automatically in a script.
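A possible sketch of such a script follows. Note that it does not subtract line numbers: it counts @ATTRIBUTE declarations with awk, which gives the same column number when the declarations are on consecutive lines and also tolerates blank or comment lines between them. The function name feature_column is made up for this example, and attribute names are assumed to be unquoted and free of spaces.

```shell
# Print the 1-based column number of a feature in the data section.
# $1 = attribute name, $2 = path to the ARFF file
feature_column() {
    awk -v name="$1" '
        tolower($1) == "@attribute" {
            col++                               # one column per declaration
            if ($2 == name) { print col; exit } # stop at the wanted feature
        }
    ' "$2"
}

# Usage (data.arff is a placeholder file name):
#   col=$(feature_column HasAtLeastOne_auch data.arff)
#   before=$((col - 1))    # number of columns preceding the feature
```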

Counting the per-class occurrences of that attribute

This is where the simplifying assumption kicks in: the following expressions expect every column to contain either a 0 or a 1. They match a sequence of 183 0s or 1s, each followed by a comma, and then a 1 (presence) or 0 (absence). We expect the class label (A or B) to be the last entry on each line. The -E option makes grep accept POSIX extended regular expressions (EREs).
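Expressions of this kind could look like the sketch below. The wrapper count_feature_in_class and the file name data.arff are hypothetical. It relies on the assumptions stated above: only 0/1 values before the feature, no spaces after the commas, and the class label as the last entry. The -c option makes grep print the number of matching lines instead of the lines themselves.

```shell
# Count data rows of one class in which the feature has a given value.
# $1 = number of columns before the feature (183 in the example above)
# $2 = feature value to look for (1 = present, 0 = absent)
# $3 = class label (A or B)
# $4 = path to the ARFF file
count_feature_in_class() {
    # ^([01],){N}  N leading binary columns
    # V,           the feature value itself
    # (.*,)?       any remaining columns
    # LABEL$       the class label as the whole last entry
    grep -E -c "^([01],){$1}$2,(.*,)?$3\$" "$4"
}

# Usage (data.arff is a placeholder file name):
#   count_feature_in_class 183 1 A data.arff   # feature present, class A
#   count_feature_in_class 183 0 A data.arff   # feature absent,  class A
#   count_feature_in_class 183 1 B data.arff   # feature present, class B
#   count_feature_in_class 183 0 B data.arff   # feature absent,  class B
```

Header lines never match, because they do not start with a 0 or 1, so the counts cover the data rows only.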



