Data Parsing
Managing Data Imports in Java (Draft)

Inevitably, any task you set about as a data scientist will involve some level of data loading and manipulation. You will most likely find yourself spending considerable time making sure that your data is correctly formatted before you can proceed with any type of real processing or analytics task.

In the past, I was a huge fan of Perl and Python for their simple yet effective string manipulation operations. If you have ever coded in one of these environments, I am sure you will agree.

However, over the years, I have been burned several times by not picking the right data structure to hold my data while reading it from a text file, or by some other simple mistake that, when operating over large data sets, can significantly increase overall processing time.

Earlier this year, I was burned for the last time. I was attempting to clean some social media data we had received, and the data file had so many inconsistencies, mixed character encodings, and other problems that Python became more of a hindrance than a help.

It was taking upwards of an hour or two to get through most of the data set when bam! It would fail and I would have to start all over again debugging what went wrong. It turned out to be a character encoding problem which Python was just not sufficiently capable of dealing with.

I ended up re-writing all of my Python code back into Java in about 30 minutes and tried again. To my surprise, I was able to process the entire data file (around 300,000 entries) in a little under a minute. I vowed never to try to process my text in Python again and to take the time to write clean and manageable Java code to process any of the new social media data we might receive.

So you may ask, how do I import my data using Java? You could go the old route of reading in a text file line by line, and splitting the lines based on a delimiter, but you really should not go down this path since the Apache Software Foundation has already done most of the hard work for you by way of the Apache Commons CSV library. This library offers the ability to parse many types of data files quickly and efficiently with little setup involved.

Whether you are dealing with a comma-delimited file (standard CSV format) or one of many other delimited formats, the commons-csv parser will more than likely be able to help you solve your data import issues.
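To see why the hand-rolled approach is fragile, here is a minimal sketch using only the standard library. The class name, the naiveSplit helper, and the sample rows are made up for illustration. It shows a plain split on the delimiter miscounting the fields as soon as a quoted value contains a comma, which is the kind of case a dedicated CSV parser is built to handle.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative only: shows where splitting on a delimiter goes wrong.
class NaiveSplitDemo {

    // Hypothetical helper: splits one line on a delimiter with no quote handling.
    // Pattern.quote keeps delimiters such as '|' from being read as regex syntax.
    static List<String> naiveSplit(String line, char delimiter) {
        String regex = Pattern.quote(String.valueOf(delimiter));
        return Arrays.asList(line.split(regex, -1));
    }

    public static void main(String[] args) {
        // A simple row splits into the three fields we expect
        String simple = "Darth Vader,Star Wars,1977";
        System.out.println(naiveSplit(simple, ',').size());   // prints 3

        // The quoted name contains a comma, so the row is cut into four pieces
        String quoted = "\"Lecter, Hannibal\",The Silence of the Lambs,1991";
        System.out.println(naiveSplit(quoted, ',').size());   // prints 4, not 3
    }
}
```

You could keep patching the split logic to cope with quotes, escapes, and embedded line breaks, but at that point you are rewriting a CSV parser by hand.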

If you just want to read in a text file of single values, say one value per line, then I would still recommend reading the file in line by line and storing the values in an appropriate data structure such as an array or list.

We will start with the simple case first. Say you have a single data file called villains.txt which lists the names of your closest enemies.

The entries in the file might look something like the following:

Hannibal Lecter
Norman Bates
Darth Vader
HAL 9000
Terminator
The Joker

And you want to store these villains in a list appropriately called “villains”. How do we do this?

First we should define the structure that will hold our list of villains. You could use an array if you knew exactly how many villains you had, and this would give you efficient use of memory and fast access to the villains later on.

In Java, we define an array of a type such as String with the following construct.


public class VillainArray
{
	public static void main(String [] args)
	{
		// Create a string array of size 30 to store our villains in
		String[] villains = new String[30];

		// Now we could add entries one by one
		villains[0] = "Hannibal Lecter";
		villains[1] = "Norman Bates";
		villains[2] = "Darth Vader";
		villains[3] = "HAL 9000";

		// and so on...

		// length reports the capacity of the array, not how many entries we filled in
		System.out.println("Number of villains: " + villains.length);
	}
}

The output of this code tells us that the number of villains is 30, but this is not quite what we were looking for. We have only added 4 villains to the array; the other 26 slots are still empty (null). A fixed size of 30 also does not bode well for us if we have, say, 3000 villains we would like to store. We need a different data structure that can grow as needed to support a potentially unknown number of villains in our data file.
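One such structure is java.util.ArrayList, which the file-reading code uses. As a minimal sketch (the class name and the toList helper are made up for illustration), the list below starts empty, grows as entries are added, and reports only the entries it actually holds.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: contrasts with the fixed-size array above.
class GrowingListDemo {

    // Hypothetical helper: copies the given names into a list that grows as needed
    static List<String> toList(String... names) {
        List<String> villains = new ArrayList<>();
        for (String name : names) {
            villains.add(name);
        }
        return villains;
    }

    public static void main(String[] args) {
        List<String> villains = toList("Hannibal Lecter", "Norman Bates", "Darth Vader", "HAL 9000");

        // size() counts only the entries we added, with no unused slots
        System.out.println("Number of villains: " + villains.size());   // prints 4

        // Adding another entry needs no resizing code on our part
        villains.add("Terminator");
        System.out.println("Number of villains: " + villains.size());   // prints 5
    }
}
```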

Now let's look at some code to open the file and read it in line by line. Don't let the length of the code get you down. We can show you how to shorten this later on.



import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class VillainLoader
{
	public static void main(String [] args)
	{
		// We will use an ArrayList to store the villains
		// The diamond operator is a Java 7 shorthand that allows us to not have to duplicate
		// the type checking of our generic list of Strings
		// You could also have specified here new ArrayList<String>()
		// So long as both the left hand and right hand sides match
		// The left hand side <String> is required for strong type checking
		List<String> villainsList = new ArrayList<>();

		// This defines the location of our file relative to our project
		String dataFile = "./data/villains.txt";

		// All file operations can potentially throw an exception which must be caught
		// Now we need a file reader to open the file
		// For now, just know that a buffered reader takes a file reader
		// and provides a nice interface for going through the file
		// one line at a time
		// Declaring the reader inside try ( ... ) (Java 7) closes the file for us
		// once we are done reading the input, even if an exception is thrown
		try ( BufferedReader br = new BufferedReader( new FileReader( dataFile ) ) )
		{
			// This is a variable to store the contents of each line
			String line;
			while ( ( line = br.readLine() ) != null )
			{
				// we know there is a single entry per line we are interested in
				// clean it up just in case there was extra space
				// padding the villain's name in the text file
				String myVillain = line.trim();

				// Add this villain to the list!
				villainsList.add(myVillain);
			}

			// Now how many villains do we have?
			System.out.println("Number of villains loaded: " + villainsList.size());

			// Output the first 10 villains (or fewer if the file is shorter)
			for ( int i = 0; i < Math.min(10, villainsList.size()); i++ )
			{
				System.out.println("Villain [" + (i+1) + "] " + villainsList.get(i) );
			}

		} catch ( FileNotFoundException e ) {
			System.err.println("Error - the data file could not be located.");
		} catch ( IOException e ) {
			System.err.println("Error - the data file could not be read.");
		}
	}
}

Now, we have created a Java collection of type List and instantiated it with a specific type of list called ArrayList. The ArrayList class is very efficient for adding lots of elements to the end of the list. If you need to add and remove elements potentially out of order, then a LinkedList might be a more appropriate data type. For now, we will just work with the ArrayList as it is pretty simple to understand. Another big benefit is that we did not need to tell Java how big to make our ArrayList. It will automatically grow the list to be large enough for as many elements as we need to add.

The output of this code will now tell us the total number of villains in our text file and output the first ten.

Number of villains loaded: 26
Villain [1] Dr. Hannibal Lecter
Villain [2] Norman Bates
Villain [3] Darth Vader
Villain [4] Nurse Ratched
Villain [5] Mr. Potter
Villain [6] Queen Grimhilde
Villain [7] Michael Corleone
Villain [8] Alex DeLarge
Villain [9] HAL 9000
Villain [10] Annie Wilkes
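For comparison, the same load can be written more compactly with java.nio.file.Files, available since Java 7. This is only a sketch: the class name and the loadVillains helper are made up for illustration, and it assumes the file is encoded as UTF-8 and is small enough to read into memory in one go.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a shorter take on the line-by-line reader.
class VillainFileLoader {

    // Hypothetical helper: reads every line of the file and trims each entry.
    // Assumes UTF-8; pass a different Charset if your data uses another encoding.
    static List<String> loadVillains(Path dataFile) throws IOException {
        List<String> villains = new ArrayList<>();
        for (String line : Files.readAllLines(dataFile, StandardCharsets.UTF_8)) {
            villains.add(line.trim());
        }
        return villains;
    }

    public static void main(String[] args) {
        try {
            List<String> villains = loadVillains(Paths.get("./data/villains.txt"));
            System.out.println("Number of villains loaded: " + villains.size());
        } catch (IOException e) {
            System.err.println("Error - the data file could not be read: " + e.getMessage());
        }
    }
}
```

Because Files.readAllLines holds the whole file in memory, the line-by-line reader is likely the safer choice for very large files.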

There is a lot going on here, so we will try to break it down. First, if you are new to Java, you may be wondering what the System.out.println("") method is all about. This is an easy way to get output to your console window in Eclipse or any other IDE (Integrated Development Environment). The console is the primary way of communicating with your Java program if you are not working with a GUI (graphical user interface). The System.out.println("") method takes a string and prints it to your console, followed by a line break.

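You can also take user input from System.in. It is a raw input stream with no readLine() method of its own, so you would normally wrap it in a BufferedReader (which does have readLine()) or a java.util.Scanner.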

There is a great demo for reading input from a console available online here: Reading Strings from the Console

However, most of what we build are web-based applications, where users do not interact directly with the console. But if you need it, there it is.
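If you do need console input, a minimal sketch might look like the following. The class name and the readOneLine helper are made up for illustration, and the reader uses the platform's default character encoding, which may not match every console.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

// Illustrative only: reads a single line typed at the console.
class ConsoleInputDemo {

    // Hypothetical helper: reads one line of text from any input stream.
    // Taking an InputStream rather than System.in directly makes it easy to test.
    static String readOneLine(InputStream in) throws IOException {
        // Uses the platform default charset (an assumption; adjust if needed)
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        String line = reader.readLine();   // null if the stream is already at its end
        return line == null ? "" : line.trim();
    }

    public static void main(String[] args) throws IOException {
        System.out.print("Enter a villain: ");
        String name = readOneLine(System.in);
        System.out.println("You entered: " + name);
    }
}
```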