# java – Calculating the Shannon entropy of a string (e.g. E. coli genome)

This is exercise 3.1.34. from the book Computer Science An Interdisciplinary Approach by Sedgewick & Wayne:

The Shannon entropy measures the information content of an input
string and plays a cornerstone role in information theory and data
compression. Given a string of n characters, let f(c) be the frequency
of occurrence of character c. The quantity p(c) = f(c)/n is an estimate
of the probability that c would be in the string if it were a random
string, and the entropy is defined to be the sum of the quantity -p(c)*log2(p(c)),
over all characters that appear in the string. The entropy is
said to measure the information content of a string: if each character
appears the same number times, the entropy is at its minimum value
among strings of a given length. Write a program that takes the name
of a file as a command-line argument and prints the entropy of the
text in that file. Run your program on a web page that you read
regularly, a recent paper that you wrote, and the E. coli genome
found on the website.

Here is my program:

``````public class ShannonEntropy
{
public static String removeUnnecessaryChars()
{
String text = "";
while (!StdIn.isEmpty())
{
int wordLength = word.length();
String newWord = "";
for (int i = 0; i < wordLength; i++)
{
if (word.charAt(i) != '.' &&
word.charAt(i) != '!' &&
word.charAt(i) != '?' &&
word.charAt(i) != ',' &&
word.charAt(i) != '"' &&
word.charAt(i) != ':' &&
word.charAt(i) != ';' &&
word.charAt(i) != '(' &&
word.charAt(i) != ')')
{
newWord += word.charAt(i);
}
}
text += newWord;
}
return text.toLowerCase();
}
// this method (below) is written specifically for texts without
// unnecessary characters (e.g. E. coli genome)
public static String convertTextToString()
{
String text = "";
while (!StdIn.isEmpty())
{
text = word;
}
return text;
}
public static int() findFrequencies(String text)
{
int textLength = text.length();
/*
char() ALPHABET = {'a','b','c','d','e','f','g','h','i','j','k','l',
'm','n','o','p','q','r','s','t','u','v','w','x',
'y','z'};
*/
char() ALPHABET = {'a','c','g','t'}; // specifically used for genes and genomes
int() frequencies = new int(ALPHABET.length);
for (int i = 0; i < textLength; i++)
{
for (int j = 0; j < ALPHABET.length; j++)
{
if (text.charAt(i) == ALPHABET(j))
{
frequencies(j)++;
break; // to speed up the computation
}
}
}
return frequencies;
}
public static double() findProbabilities(String text, int() frequencies)
{
int textLength = text.length();
int n = frequencies.length;
double() probabilities = new double(n);
for (int i = 0; i < n; i++)
{
probabilities(i) = (double) frequencies(i)/textLength;
}
return probabilities;
}
public static double log2(double x)
{
return (Math.log(x)/Math.log(2));
}
public static double calculateEntropy(double() probabilities)
{
double shannonEntropy = 0;
int n = probabilities.length;
for (int i = 0; i < n; i++)
{
if (probabilities(i) != 0)
{
shannonEntropy += probabilities(i)*log2(probabilities(i));
}
}
return -1*shannonEntropy;
}
public static void main(String() args)
{
//final long time1 = System.currentTimeMillis();
//String text = removeUnnecessaryChars();
String text = convertTextToString();
//final long time2 = System.currentTimeMillis();
//System.out.println("Time to remove unnecessary characters: " + (time2-time1) + " ms");
int() frequencies = findFrequencies(text);
//final long time3 = System.currentTimeMillis();
//System.out.println("Time to calculate character frequencies: " + (time3-time2) + " ms");
double() probabilities = findProbabilities(text, frequencies);
System.out.println("Shannon entropy of the E. coli genome: " + calculateEntropy(probabilities));
String randomGene = "";
for (int i = 0; i < 1000000; i++)
{
double r = Math.random();
if      (r < 0.25) randomGene += "a";
else if (r < 0.50) randomGene += "c";
else if (r < 0.75) randomGene += "g";
else if (r < 1.00) randomGene += "t";
}
int() rFrequencies = findFrequencies(randomGene);
double() rProbabilities = findProbabilities(randomGene, rFrequencies);
System.out.println("Shannon entropy of the random genome: " + calculateEntropy(rProbabilities));
}
}
``````

StdIn is a simple API written by the authors of the book. Here is one instance of my program:

Input: E. coli genome

Output:

Shannon entropy of the E. coli genome: 1.9998212455541713 (which is exactly compatible with the answer from Online Shannon entropy calculator)

Shannon entropy of the random genome: 1.9999979438235416

Is there any way that I can improve my program (especially its performance (especially the method `removeUnnecessaryChars`))?