Skip navigation.
Home

External Sorts in Java

I was given a relatively easy task, to sort a file according to two different keys in ascending order. The issue was that the file was quite large and the parsers provided created a great deal of objects, so memory was going to be an issue...

My first thought was to use Unix sort, but each record in the file is of variable lengths. The format doesn't guarantee that key_a is going to be the 13 column in each row... sometimes it is the 24th and sometimes it is the 3rd. I am sure I could create some kind of parser script in unix to find the key and then perform the sort but I already had parsers available to me in Java.

Ok so I have some parsers that will divide up each record into discrete objects. How can I sort them? I can do it very easily by creating a Comparator and then pass a list of those objects into Collections.sort(). This would work fine for a small number of reasonably sized objects. I had to deal with a million+ rows each parsed into up to 15 objects.

So next I moved to a store only the keys in memory type of solution. This way instead of 15 objects I was down to just one per row. I created a class which contained the two keys that I needed and implemented Comparable.


public class AMRULineNumberSorter implements Comparable {
    private String _baseCde, _acctNum;

    private long _lineNum;

    public AMRULineNumberSorter(long lineNum, Record record) {
        _lineNum = lineNum;
        _baseCde = record.getKey1();
        _acctNum = record.getKey2();
    }

    public long getLineNum() {
        return _lineNum;
    }

    public String toString() {
        return Long.toString(_lineNum);
    }

    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if ((obj == null) || (obj.getClass() != this.getClass()))
            return false;
        AMRULineNumberSorter test = (AMRULineNumberSorter) obj;
        return _baseCde.equals(test._baseCde) &&
               _acctNum.equals(test._acctNum);
    }

    public int hashCode() {
        int hash = 7;
        hash = 31 * hash + _baseCde.hashCode();
        hash = 31 * hash + _acctNum.hashCode();
        return hash;
    }

    public int compareTo(Object obj) {
        if (this.equals(obj))
            return 0;
        AMRULineNumberSorter test = (AMRULineNumberSorter) obj;
        int ret = this._baseCde.compareTo(test._baseCde);
        if (ret == 0) {
            ret = this._acctNum.compareTo(test._acctNum);
        }
        return ret;
    }
}

The class works pretty well and solves the 15 objects/row problem.. but what about the 1 million rows issue? At 165K rows my memory usage was 124M or 750b/row. So at 1M rows * 750b/row == 750M necessary Evil

Okay so now what? To conserve memory, I started thinking about disk-backed solutions or in other terms externalized sorting solutions.

One solution essentially entailed dividing the key sets into some number of buckets stored on disk.. sort each bucket.. then combine the buckets in the proper order to achieve the full sorted file. I stumbled onto this in my googling adventures.. and thought maybe a Map keyed by object hash (see hashCode() above) with a value of original line number might work. This is pretty much how the ConcurrentHashMap works but the ConcurrentHashMap is only in memory. To implement this I would need to write an implementation of Map that probably extended ConcurrentHashMap and implemented the buckets as files.

Another idea that popped into my mind was to make the input file Unix sort compatable..

  1. Read the file line by line
  2. Parse into objects
  3. Write to a new file: key1key2|original line
  4. Sort via unix or maybe a text editor like Ultraedit using that key
  5. Strip off everything before the |original line.

The last solution that I came up with is to use a database to sort the data by..

  1. Read the file line by line.
  2. Parse into objects.
  3. Write key1key2 into KEY field and original line into DATA field.
  4. Select DATA from TABLE ORDER BY KEY.
  5. Write new input file.

The worst part about the whole thing is that after taking a solid morning to try and sort the data... I ran a few test runs (using a small data set and the Comparable class) and found out that the data was already sorted in the correct order! Laughing out loud I guess I should have asked that question to begin with.

well.. why not following

well.. why not following the obvious choice? Set large memory parameters . since this is a one time task, just do the job..

Matt Fleming's picture

Re: well.. why not following

I hear what your saying.. since my work PCs are win2k, I can run the app using javaw and the proggie will suck up all the memory available in windows. But for whatever reason, the task got me thinking about alternatives and I guess I was in a it sucks RAM is so cheap frame of mind..Eye-wink

I was hoping to read about

I was hoping to read about an actual external sort implementation in java, like with B trees or multiple passes with merging.

Anyhoo, my life has recently been made easier by applying sed and sort to a problem much like yours.

forgot to mention how much

forgot to mention how much easier. these were csv files with 20M rows, weighing in at a few gigs.

Matt Fleming's picture

that's where I was going until..

I realized that files were already sorted the proper way. I was looking into sort algorithms and was putting together a plan on how to sort pieces then merge.. when I realized two things:

  1. There isn't any Java OSS that does this.. some commercial products seem to.
  2. Doing this in unix is way easier.

Sorting

What's wrong with using a database?

Matt Fleming's picture

Re: Sorting

You could use a db for this kind of thing. In DB2, you could setup a clustering index and just simply insert the all of the rows. The data would be physically stored in the correct order, then do a fetch all. Or you could just bulk insert the data as is and fetch with an order by. It just seems like a waste.. pushing the data across the network, then fetching it all back, table setup, table storage, writing a program to parse, insert and fetch properly. If you use the order by solution, the db subsystem is doing the same thing that you would do in unix.. a disk/memory backed sort. In short, simplicity is why I didn't opt for a db solution.