java – Counting the number of unique IP addresses in a very large file

I made a test job to the position of Junior Java Developer. I did not receive any answer from the employer, so I would like to get a review here.

Task description

A simple text file with IPv4 addresses is given. One line is one address, something like this:

145.67.23.4
8.34.5.23
89.54.3.124
89.54.3.124
3.45.71.5.
...

The file size is not limited and can occupy tens and hundreds of gigabytes.

It is necessary to calculate the number of unique IP addresses in this file, consuming as little memory and time as possible. There is a “naive” algorithm for solving this task (we read strings and put it in HashSet), it is desirable that your implementation is better than this simple, naive algorithm.

Test case

https://ecwid-vgv-storage.s3.eu-central-1.amazonaws.com/ip_addresses.zip

WARNING! This file is about 20GB, and is unpacking approximately 120GB. It consists of 8 billions lines.

My approach

I emerge from the fact that there are 2 * 32 valid unique ip addresses. We can use a bit array of 2 * 32 bits (unsigned integer) and put each bit of this array in line with one ip address. Such an array will take exactly 512 megabytes of memory and will be allocated once at the start of the program. Its size does not depend on size of the input data.

Unfortunately, Java does not have an unsigned int type and also there are no convenient bit operations in it, so I use BitSet for my purposes.

GitHub

You can see the full code of the project with a test file to 1_000_000 addresses here:

https://github.com/chptr-one/ip-addr-counter

Thank you!

Please feel free to tell me all the flaws that you can find.

Code

Main class

public class IpCounterApp {

    private static String parseFileName(String() args) {
        String fileName = null;
        if (args.length == 2 && "-file".equals(args(0))) {
            fileName = args(1);
        }
        return fileName;
    }

    public static void main(String() args) {
        String fileName = parseFileName(args);
        if (fileName == null) {
            System.out.println("Wrong arguments. Use '-file file_name' to specify file for processing");
            return;
        }

        UniqueIpCounter counter = new BitSetUniqueIpCounter();
        long numberOfUniqueIp = counter.countUniqueIp(fileName);
        if (numberOfUniqueIp != -1) {
            System.out.println("Found " + numberOfUniqueIp + " unique IP's");
        } else {
            System.out.println("Some errors here. Check log for details.");
        }
    }
}

UniqueIpCounter interface

public interface UniqueIpCounter {

    /*
    In total there are 2 ^ 32 valid IP addresses exists.
     */
    long NUMBER_OF_IP_ADDRESSES = 256L * 256 * 256 * 256;

    /*
    Map string representing the IP address in format 0-255.0-255.0-255.0-255 to number
    in the range of 0..2^32-1 inclusive.
    It is guaranteed that the input string contains a valid IP address.
     */
    static long toLongValue(String ipString) {
        StringBuilder field = new StringBuilder(3);
        int startIndex = 0;
        long result = 0;

        for (int i = 0; i < 3; i++) {
            int spacerPosition = ipString.indexOf('.', startIndex);
            field.append(ipString, startIndex, spacerPosition);
            int fieldValue = Integer.parseInt(field.toString());
            field.setLength(0);
            result += fieldValue * Math.pow(256, 3 - i);
            startIndex = spacerPosition + 1;
        }
        result += Integer.parseInt(ipString.substring(startIndex));

        return result;
    }

    /*
    Returns the number of unique IP addresses in the file whose name is pass by the argument.
    Returns the number from 0 to 2 ^ 32-1 inclusive.
    Returns -1 in case of any errors.
     */
    long countUniqueIp(String fileName);
}

BitSetUniqueIpCounter implementation

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.BitSet;
import java.util.logging.Level;
import java.util.logging.Logger;

public class BitSetUniqueIpCounter implements UniqueIpCounter {

    private final Logger logger = Logger.getLogger("BitSetUniqueIpCounter");

    /*
    To count unique IP's use a bit array, where each bit is set in accordance with one IP address.
    In Java, there is no unsigned int and maximum BitSet size is integer.MAX_VALUE therefore we use two arrays.
     */
    private final BitSet bitSetLow = new BitSet(Integer.MAX_VALUE); // 0 - 2_147_483_647
    private final BitSet bitSetHi = new BitSet(Integer.MAX_VALUE); // 2_147_483_648 - 4_294_967_295
    private long counter = 0;

    private void registerLongValue(long longValue) {
        int intValue = (int) longValue;
        BitSet workingSet = bitSetLow;
        if (longValue > Integer.MAX_VALUE) {
            intValue = (int) (longValue - Integer.MAX_VALUE);
            workingSet = bitSetHi;
        }

        if (!workingSet.get(intValue)) {
            counter++;
            workingSet.set(intValue);
        }
    }

    @Override
    public long countUniqueIp(String fileName) {
        logger.log(Level.INFO, "Reading file: " + fileName);
        try (BufferedReader in = new BufferedReader(new FileReader(fileName))) {
            long linesProcessed = 0;
            String line;
            // If already counted 2 ^ 32 unique addresses, then to the end of the file there will be only duplicates
            while ((line = in.readLine()) != null && counter <= NUMBER_OF_IP_ADDRESSES) {
                registerLongValue(UniqueIpCounter.toLongValue(line));
                linesProcessed++;
            }
            logger.log(Level.INFO, "Total lines processed: " + linesProcessed);
        } catch (FileNotFoundException e) {
            logger.log(Level.WARNING, "File '" + fileName + "' not found", e);
            counter = -1;
        } catch (IOException e) {
            logger.log(Level.WARNING, "IOException occurs", e);
            counter = -1;
        }
        return counter;
    }
}