Java for Data Scientists and Statisticians
A Gentle Introduction (Draft)

Programming in Java for statisticians should focus on what matters most to the investigator: dealing with data! We will not spend as much time as a typical manual on the intricacies of programming (inheritance, constructors, garbage collection, etc.). You can find many books that do an ample job of teaching those more in-depth topics should you wish to pursue them. There are also many resources available online to help you along your way.

Assuming you have a background in R or Python, I will try to relate many of the Java concepts to ones already familiar to you from those languages. For example, in Python, most data is contained either in a dictionary (a sort of lookup table) or a list (a sequence of elements). We will cover the corresponding data structures in Java to help you perform your day-to-day tasks just as you would in one of these other environments.
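As a small preview, here is a minimal sketch of what those two structures might look like in Java, assuming the standard java.util.ArrayList and java.util.HashMap classes play the roles of a Python list and dictionary. The class name, method names, and values are made up purely for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical example class; the names and numbers are illustrative only
class CollectionsSketch {

    // Build a sequence of measurements, much like a Python list
    static List<Double> buildHeights() {
        List<Double> heights = new ArrayList<>();
        heights.add(61.5);
        heights.add(70.2);
        heights.add(65.0);
        return heights;
    }

    // Build a lookup table, much like a Python dictionary
    static Map<String, Integer> buildCounts() {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("treatment", 42);
        counts.put("control", 40);
        return counts;
    }

    public static void main(String[] args) {
        List<Double> heights = buildHeights();
        Map<String, Integer> counts = buildCounts();

        // Lists are indexed from zero, just as in Python
        System.out.println("Number of heights: " + heights.size());
        System.out.println("First height: " + heights.get(0));

        // Maps are read by key, like data["control"] in Python
        System.out.println("Control count: " + counts.get("control"));
    }
}
```

Notice that both structures declare the type of data they hold (Double, String, Integer). We will come back to why Java asks for this.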

Python and R (and even SAS) are interpreted languages, whereas Java is a compiled language. At first, this may seem like a hindrance, since in Python and R you can make simple changes to your code and see the results immediately. In Java, you must first compile your code before it can run. Years ago, this extra step made developing in compiled languages (such as C, C++, and Java) a slower process. However, with a modern IDE (Integrated Development Environment) such as Eclipse, this is not as burdensome as you might think, unless you are dealing with a very large and complex application. For the most part, if you are using Java to process data in the same manner you would with Python or R, you should see rather quickly that running your program inside the IDE is almost as quick as working in the RStudio or IPython notebook (http://ipython.org/notebook.html) editing environments.

One word of warning: whether or not you have been programming in R or Python for very long, you are likely accustomed to writing your code in a very procedural manner. That is, you might outline your data pre-processing step in the first few lines of code, making sure the features you need are available for analysis, and then move directly to the next line to perform a regression analysis. While this form of programming will still work in Java, I would urge you to begin thinking more in the Java way of operating. The power of object-oriented programming will hopefully become apparent as we progress, and you may start to see that there are portions of your code which you duplicate over and over again. These are ripe for extraction into a class and a set of functions that you can easily parameterize and reuse in the future. Good design, in the organization of both your data and the processes you put it through, will make it easier to reuse those methods later and will help document exactly what your code is doing.

Finally, we will touch on some of the publicly available libraries to help you on your way. Many have said that languages such as R and Python are “batteries included”, which is a short way of saying that many of the features you need to become productive quickly are built right into the standard distribution. In fact, the maintainers of the Python language have accepted a proposal to add a module of basic statistical functions directly to the standard library, rather than requiring you to import them from a third-party library (see https://www.python.org/dev/peps/pep-0450/ for more info on this).

While R was built for the sole purpose of facilitating the needs of statisticians, and has a great community behind the CRAN package system, I argue that for large-scale production needs R still falls short, unless you are fortunate enough to be working with an alternative like Revolution R (http://www.revolutionanalytics.com/revolution-r-enterprise) that enables parallel processing.

I would also argue that the R environment is still not an end-user-friendly toolkit, and likely not a tool that you would want to put into an end user's hands to run their own analysis without quite a bit of training. I would argue the same for Python, since not every machine necessarily has a Python interpreter installed.

Of course, with some work you could build web interfaces for either of these, but Java has dominated the web application space for longer than either of these alternatives. Furthermore, Java added several capabilities in the latest language revision (Java 8) to enable big data analytics and to take advantage of parallel processing with relatively little effort. We will get into this as we go along!
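To give a taste of what this looks like, here is a rough sketch using the stream API that arrived with Java 8. It sums the squares of the integers 1 through n, and the single call to parallel() asks Java to split the work across multiple threads. The class and method names are my own invention for this example, and it is only one of several ways you might write it.

```java
import java.util.stream.LongStream;

// Hypothetical example class, not part of the code repository
class ParallelSumSketch {

    // Sum of squares 1^2 + 2^2 + ... + n^2, computed with a parallel stream
    static long sumOfSquares(long n) {
        return LongStream.rangeClosed(1, n)
                .parallel()        // ask Java to spread the work over multiple threads
                .map(i -> i * i)   // square each value
                .sum();            // combine the partial results
    }

    public static void main(String[] args) {
        System.out.println("Sum of squares up to 1000: " + sumOfSquares(1000));
    }
}
```

Removing the parallel() line gives the same answer computed on a single thread, which is what makes this style attractive: the description of the calculation does not change.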

Beginning Java

Constructing Arguments

As in most compiled languages, you must first tell the system where to start. This is usually called your “main” method. It lets the machine know: “Hey! I want you to start executing my program here.” If you have ever used a command line interface (CLI), either a DOS box on Windows or a terminal on a Mac or Linux machine, you know that most programs you can type in will take a list of parameters following the command itself.

$ grep "foo" bar.txt

In this example, we are calling the grep command (a common utility found on most Linux and Mac systems), which enables you to search for a string in a file. Here we are searching for any occurrence of the word “foo” in the file “bar.txt”. Since we did not provide any other parameters to grep, it will search for “foo” exactly as we typed it. If we wanted to, say, ignore capitalization, grep takes an additional parameter before the search string:


$ grep -i "foo" bar.txt

Now we have told grep that we do not care whether the file contains “foo”, “FOO”, or “Foo”: just tell us if it is there, and ignore any variations in the way it was capitalized.


In the same way, our first Java program will set up a main method and take a single parameter, your name!

We will start by creating a simple program called “stats_p1.java” which you can find in the code repository [code_link].

If you are using Eclipse, you can provide arguments to your program by opening the “Run Configurations” dialog and going to the “Arguments” tab. When you then “Run” your program, Eclipse will pass these arguments to your program just as if you were running it directly from the command line. This is quite useful if you want to set up various tests of arguments, since you can save them under different run configurations and call them easily from Eclipse.

For now, let's go through a small code example.


public class stats_p1 {

	/**
	 * This is the main method, and as you can see it takes
	 * an array of strings as its arguments
	 * @param args the command line arguments passed to the program
	 */
	public static void main( String [] args )
	{
		// We only want to succeed if the user provided a single argument
		if ( args.length != 1 )
			System.out.println("I am sorry, but this program only takes a single " + 
				"parameter. Please try again and provide only your name");
		else
			System.out.println("Hello there my friend " + args[0]);

		// Exit without fanfare
		System.exit(0);
	}
	
}

After setting my argument to “Jeffery”, I receive the following output on the console tab of Eclipse.

Hello there my friend Jeffery

Think of the console tab as similar to the R console window. Anything you want to output is written to the console when you call System.out.println. I am sure you noticed that this is a lot of text to type simply to print something to the console. We can write a function to simplify this process.

Here is a second test demonstrating both a function call and adding a reference to another object.



public class stats_p2 {

	public static void log(String msg)
	{
		// Add the current date/time to our message before printing
		msg = new java.util.Date().toString() + " " + msg;
		System.out.println(msg);
	}
	
	/**
	 * This is the main method, and as you can see it takes
	 * an array of strings as its arguments
	 * @param args the command line arguments passed to the program
	 */
	public static void main( String [] args )
	{
		// We only want to succeed if the user provided a single argument
		if ( args.length != 1 )
			log("Please try again and provide only your name");
		else
			log("Hello there my friend " + args[0]);

		// Exit without fanfare
		System.exit(0);
	}
	
}

We have created a function called “log” which takes a string as its only parameter. Now we are cooking with gas. Rather than typing out the long statement to print to the console, we can simply call our new function, log("This is my message");, and it does the work for us. We have also used the java.util.Date class. This is a built-in data type that Java provides to help us work with dates. By default, when you create a new Date object, it automatically sets itself to the current date and time. The Date object also has a function called “toString()” which converts the object to a string that is easy to print to the screen. Finally, we used the plus (+) operator to concatenate the date, a space, and our message into a single string, which we then print to the console.

The definition of the function requires some explanation. First, every function must declare what type of data it will return. In this case, we are not returning any data, so the return type is void. Second, because of the object-oriented nature of Java, a function is usually associated with an instance of some class (that could be a class describing an employee, a shape, or a financial transaction). When you want a function that is not tied to any particular instance, you must declare that function “static”; that is, it can be called directly without any reference to a specific class instance.
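To make the distinction concrete, here is a small sketch built around a shape. The Circle class and its methods are hypothetical names I am introducing only for this illustration. The area() function needs a particular circle to work on, while the static areaOf() function can be called on the class itself.

```java
// Hypothetical example class used only to contrast instance and static functions
class Circle {

    private final double radius;

    // Each Circle object remembers its own radius
    Circle(double radius) {
        this.radius = radius;
    }

    // Instance function: uses the radius stored in this particular Circle
    double area() {
        return Math.PI * radius * radius;
    }

    // Static function: not tied to any Circle object, so the radius is passed in
    static double areaOf(double radius) {
        return Math.PI * radius * radius;
    }

    public static void main(String[] args) {
        Circle c = new Circle(2.0);
        System.out.println("Instance call: " + c.area());
        System.out.println("Static call:   " + Circle.areaOf(2.0));
    }
}
```

Both calls print the same number. The difference is only in where the radius comes from: the object itself, or the function's parameter.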

Later, we will define functions that return various data types. Java supports strings, integers, doubles, booleans, and many others. Every variable you use in Java must have its data type declared up front. This will seem like a lot of work at first, but it helps to ensure that your program does not behave in an unexpected way: if you try to use data of a type you did not intend, the compiler will give you informative error messages to help you pinpoint your bugs.
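As a brief sketch of what such functions might look like, the following hypothetical class (the names are mine, chosen only for this example) defines one function for each of the types just mentioned. The return type appears in front of each function name, where “void” appeared in our log function.

```java
// Hypothetical example class showing functions with different return types
class ReturnTypesSketch {

    // Returns an integer: how many values are greater than zero
    static int countPositive(double[] values) {
        int count = 0;
        for (double v : values) {
            if (v > 0) {
                count = count + 1;
            }
        }
        return count;
    }

    // Returns a double: the arithmetic mean of the values
    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum = sum + v;
        }
        return sum / values.length;
    }

    // Returns a boolean: true when there is no data at all
    static boolean isEmpty(double[] values) {
        return values.length == 0;
    }

    // Returns a string: a short text summary of the data
    static String describe(double[] values) {
        return "n=" + values.length + ", mean=" + mean(values);
    }

    public static void main(String[] args) {
        double[] data = { 1.0, -2.0, 4.0 };
        System.out.println(countPositive(data));
        System.out.println(mean(data));
        System.out.println(isEmpty(data));
        System.out.println(describe(data));

        // The next line would be rejected by the compiler, because a
        // string cannot be stored in a variable declared as an int:
        // int n = "three";
    }
}
```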

Eclipse provides two running modes. One is just “Run”: it takes your code, compiles it, and then executes it. The other mode is “Debug”, and debug mode should become your best friend. It is true that you can use functions like “log” to report on the console what is going on in your code, but when you run in debug mode, you can step through each line of your code and inspect the value of the variables at each step. It is possible to debug your R code as well if you are running in RStudio. Python also has various debuggers that work in a similar way, and if you keep programming in either Python or R, I would recommend learning to use the debuggers there as well.

Control statements

We have already introduced you to the most important control statement when we tested whether the user provided a single parameter or something else. This is the if/else statement. Python has if/else control statements too, but they are set up a little differently in Java.

In Python you might write a code snippet like the following:


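if (x in data):
	x = x * x
else:
	x = x * x * x

This is a simple way to test whether some variable x exists as a key in a dictionary called “data”. Python evaluates the test to one of its boolean values, which are spelled with a capital letter: True or False. This is one of the trickiest things to remember, since Java uses the all lower case forms “true” and “false” instead. (There is no need to compare the result to True yourself. In fact, writing x in data.keys() == True would always evaluate to False, because Python chains the two comparisons.)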

In this simple example, if the variable x is in our data set, we want to square it, otherwise, raise it to the third power.

How would you write this in Java? You would do it in much the same way. We will cover some data types in further detail later, but let's just assume you know there exists a data set similar to the above Python data dictionary.


if ( data.containsKey(x) == true )
{
	x = x * x;
} else {
	x = x * x * x;
}

In Python, the parentheses are not required for your if statement, but they are required in Java. In Python, you end the if statement with a colon; in Java, you block off the part of the code you want to run if the statement is true between two braces, and do the same for the else portion that follows. (The braces may be omitted when the block is a single statement, as in our main method earlier, but including them is a good habit.) Finally, in Java, every statement MUST end with a semicolon to tell the compiler you are done with that statement.

Python depends on indentation to organize the code. This can become quite tedious over time without a good editor to help you. In Java, whitespace and indentation are not a requirement, but rather a nicety to help you lay out and organize your code.
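Putting the pieces together, here is one way the Java snippet above could be turned into a complete program. This is only a sketch: I am assuming the data set is a java.util.HashMap with integer keys, and the class name, the transform function, and the sample values are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical example class wrapping the if/else snippet in a runnable program
class IfElseSketch {

    // Square x if it is a key in the data set, otherwise cube it
    static int transform(Map<Integer, String> data, int x) {
        if ( data.containsKey(x) )
        {
            x = x * x;
        } else {
            x = x * x * x;
        }
        return x;
    }

    public static void main(String[] args) {
        // A tiny made-up data set with a single key
        Map<Integer, String> data = new HashMap<>();
        data.put(3, "three");

        System.out.println(transform(data, 3)); // 3 is a key, so it is squared
        System.out.println(transform(data, 2)); // 2 is not a key, so it is cubed
    }
}
```

Running this prints 9 and then 8. Note that containsKey already returns a boolean, so the comparison with “true” can be left out without changing the result.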