# algorithms – Indexing a huge dataset (that does not fit into central memory)

The problem.

Let us consider a huge file with billions of lines, each containing a string. There are $$n$$ different strings and $$m$$ lines in the file, with $$m$$ much greater than $$n$$, although both are huge. In particular, neither the entire file nor the list of all distinct strings fit into central memory. The global file may be parsed and/or sorted (in external memory) a few times, though.

The goal is to build the file in which line $$i$$, for $$i=1..m$$, is integer $$k$$ if the string on line $$i$$ in the original file is the $$k+1$$-th distinct string appearing in the original file.

Example.

For instance, if the original file is:

``````bbb
cccc
cccc
bbb
aaa
ee
aaa
fff
dddd
cccc
bbb
fff
``````

then $$n=6$$, $$m=12$$, and we should obtain:

``````0
1
1
0
2
3
2
5
4
1
0
5
``````

My approach.

I take benefit of standard unix command-line tools, as follows.

I first build a numbered string occurence file:

``````zcat input.gz | awk '{print NR-1,\$0;}' | gzip -c > occurrences.gz
``````

I use it to build an index file for strings that respects occurence rank, and starts with character ‘-‘, lower than any occurence number:

``````zcat occurrences.gz |
sort -T. -S1g -k2,2 |
awk 'BEGIN{old="none";}{if (\$2!=old) print \$0; old=\$2;}' |
sort -T. -S1g -nk1,1 | awk '{print "-",\$2,NR-1;}' |
gzip -c > index.gz
``````

I merge the two files and sort the result to ensure that the line with a string index will be just before all other lines containing this string; I then replace each string, sort according to occurence rank, and I am done:

``````zcat occurrences.gz index.gz |
sort -T. -S1g -k2,2 -k1,1 |
awk '{if (\$1=="-") i=\$3; else print \$1,i;}' |
sort -T. -S1g -nk1,1 |
cut -d" " -f2 |
gzip -c > result.gz
``````

Technical detail: to avoid encoding issues, first set environment variable `LC_ALL` to `C`.

Question.

Is it possible to do significantly better than my approach, or to achieve similar performances with a significantly different approach, in practice?

Notice that I am not interested in minor, technical improvements, like using mawk, sed, etc, instead of awk. I am not interested in a purely theoretical solution either, that would be better asymptotically but would turn out to be so complex than no implementation is reasonably feasible.

Instead, my concern is the relevance and overall effectiveness of my approach, or variants.

For instance, would any external memory data structure be faster, while using as little memory, with available implementation?