java – Calculating the Shannon entropy of a string (e.g. E. coli genome)


This is exercise 3.1.34 from the book Computer Science: An Interdisciplinary Approach by Sedgewick & Wayne:

The Shannon entropy measures the information content of an input
string and plays a cornerstone role in information theory and data
compression. Given a string of n characters, let f(c) be the frequency
of occurrence of character c. The quantity p(c) = f(c)/n is an estimate
of the probability that c would be in the string if it were a random
string, and the entropy is defined to be the sum of the quantity -p(c)*log2(p(c)),
over all characters that appear in the string. The entropy is
said to measure the information content of a string: if each character
appears the same number of times, the entropy is at its maximum value
among strings of a given length. Write a program that takes the name
of a file as a command-line argument and prints the entropy of the
text in that file. Run your program on a web page that you read
regularly, a recent paper that you wrote, and the E. coli genome
found on the website.
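To illustrate the definition before the full program, here is a minimal standalone sketch (my own toy example, not part of the exercise): the string "aabb" has two characters, each with probability 1/2, so its entropy is -2 * (0.5 * log2(0.5)) = 1 bit, while a string of a single repeated character has entropy 0.

```java
import java.util.HashMap;
import java.util.Map;

public class EntropyDemo {
    // Shannon entropy in bits: the sum of -p(c)*log2(p(c)) over all characters c
    public static double entropy(String s) {
        Map<Character, Integer> freq = new HashMap<>();
        for (char c : s.toCharArray()) {
            freq.merge(c, 1, Integer::sum); // f(c): occurrences of each character
        }
        double h = 0.0;
        for (int f : freq.values()) {
            double p = (double) f / s.length(); // p(c) = f(c)/n
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(entropy("aabb")); // two equiprobable characters -> 1.0 bit
        System.out.println(entropy("aaaa")); // one character -> 0.0 bits
    }
}
```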

Here is my program:

public class ShannonEntropy
{
    public static String removeUnnecessaryChars()
    {
        String text = "";
        while (!StdIn.isEmpty())
        {
            String word = StdIn.readString();
            int wordLength = word.length();
            String newWord = "";
            for (int i = 0; i < wordLength; i++)
            {
                if (word.charAt(i) != '.' &&
                    word.charAt(i) != '!' &&
                    word.charAt(i) != '?' &&
                    word.charAt(i) != ',' &&
                    word.charAt(i) != '"' &&
                    word.charAt(i) != ':' &&
                    word.charAt(i) != ';' &&
                    word.charAt(i) != '(' &&
                    word.charAt(i) != ')')
                    {
                        newWord += word.charAt(i);
                    } 
            }
            text += newWord;
        }
        return text.toLowerCase();
    }
    // this method (below) is written specifically for texts without
    // unnecessary characters (e.g. E. coli genome)
    public static String convertTextToString() 
    {
        String text = "";
        while (!StdIn.isEmpty())
        {
            String word = StdIn.readString();
            text = word;
        }
        return text;
    }
    public static int[] findFrequencies(String text)
    {
        int textLength = text.length();
        /*
        char[] ALPHABET = {'a','b','c','d','e','f','g','h','i','j','k','l',
                           'm','n','o','p','q','r','s','t','u','v','w','x',
                           'y','z'};
        */
        char[] ALPHABET = {'a','c','g','t'}; // specifically used for genes and genomes
        int[] frequencies = new int[ALPHABET.length];
        for (int i = 0; i < textLength; i++)
        {
            for (int j = 0; j < ALPHABET.length; j++)
            {
                if (text.charAt(i) == ALPHABET[j])
                {
                    frequencies[j]++;
                    break; // to speed up the computation
                }
            }
        }
        return frequencies;
    }
    public static double[] findProbabilities(String text, int[] frequencies)
    {
        int textLength = text.length();
        int n = frequencies.length;
        double[] probabilities = new double[n];
        for (int i = 0; i < n; i++)
        {
            probabilities[i] = (double) frequencies[i]/textLength;
        }
        return probabilities;
    }
    public static double log2(double x)
    {
        return (Math.log(x)/Math.log(2));
    }
    public static double calculateEntropy(double[] probabilities)
    {
        double shannonEntropy = 0;
        int n = probabilities.length;
        for (int i = 0; i < n; i++)
        {
            if (probabilities[i] != 0)
            {
                shannonEntropy += probabilities[i]*log2(probabilities[i]);
            }
        }
        return -1*shannonEntropy;
    }
    public static void main(String[] args)
    {
        //final long time1 = System.currentTimeMillis();
        //String text = removeUnnecessaryChars();
        String text = convertTextToString();
        //final long time2 = System.currentTimeMillis();
        //System.out.println("Time to remove unnecessary characters: " + (time2-time1) + " ms");
        int[] frequencies = findFrequencies(text);
        //final long time3 = System.currentTimeMillis();
        //System.out.println("Time to calculate character frequencies: " + (time3-time2) + " ms");
        double[] probabilities = findProbabilities(text, frequencies);
        System.out.println("Shannon entropy of the E. coli genome: " + calculateEntropy(probabilities));
        String randomGene = "";
        for (int i = 0; i < 1000000; i++)
        {
            double r = Math.random();
            if      (r < 0.25) randomGene += "a";
            else if (r < 0.50) randomGene += "c";
            else if (r < 0.75) randomGene += "g";
            else if (r < 1.00) randomGene += "t";
        }
        int[] rFrequencies = findFrequencies(randomGene);
        double[] rProbabilities = findProbabilities(randomGene, rFrequencies);
        System.out.println("Shannon entropy of the random genome: " + calculateEntropy(rProbabilities));
    }
}

StdIn is a simple API written by the authors of the book. Here is one run of my program:

Input: E. coli genome

Output:

Shannon entropy of the E. coli genome: 1.9998212455541713 (which matches the answer from an online Shannon entropy calculator)

Shannon entropy of the random genome: 1.9999979438235416
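Both values come out close to 2 bits, which is what I expected: for a four-character alphabet, entropy is maximized when all characters are equiprobable, at log2(4) = 2 bits. A quick standalone check of that bound (my own sketch, separate from the program above):

```java
public class MaxEntropyCheck {
    // entropy of a uniform distribution over k symbols: k * (1/k) * log2(k) = log2(k)
    public static double uniformEntropy(int k) {
        double p = 1.0 / k; // each symbol is equally likely
        double h = 0.0;
        for (int i = 0; i < k; i++) {
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    public static void main(String[] args) {
        // prints a value very close to 2.0: the maximum for the {a,c,g,t} alphabet
        System.out.println(uniformEntropy(4));
    }
}
```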


Is there any way I can improve my program, especially its performance (in particular the removeUnnecessaryChars method)?

Thanks for your attention.