### Jeff's Blog


Sample Means Example

Typically, when one thinks of Big Data and analytics, parallel processing seems to always rear its ugly head. One of the great things about the new language features built into Java 8 is that they make parallel processing almost laughably easy. Be warned, though: as I will show in this article, running certain methods in parallel may not be all it is cracked up to be.

The problem lies in how parallel computing processes are managed. This is true whether you are working in Java or any other high-performance computing environment: there is always some overhead involved in parallelizing your computation tasks.

Therefore, you must first understand some basics of how to interrogate your programs to determine when and where parallelizing a task makes sense. For any task to be parallelizable, you must be able to break it down into discrete and reasonably independent subtasks. Here, we will look at one of the new built-in data streams available in Java 8 and compare it with other, more direct ways of computing an arithmetic mean over your data set.
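To make "discrete and reasonably independent" concrete, here is a minimal sketch of how a mean decomposes: each chunk of the array can be summed without knowing anything about the other chunks, and the partial sums are combined at the end. The class name `ChunkedMean` and the chunk count are made up for illustration, and no threads are started here; the sketch only shows the decomposition that a parallel runtime would exploit.

```java
class ChunkedMean {

    /** Sums data[from..to) on its own; it depends on no other chunk. */
    static double partialSum(double[] data, int from, int to) {
        double sum = 0d;
        for (int i = from; i < to; i++) sum += data[i];
        return sum;
    }

    /** Splits the array into at most `chunks` ranges, sums each, then combines. */
    static double mean(double[] data, int chunks) {
        double total = 0d;
        int chunkSize = (data.length + chunks - 1) / chunks; // ceiling division
        for (int from = 0; from < data.length; from += chunkSize) {
            int to = Math.min(from + chunkSize, data.length);
            total += partialSum(data, from, to); // each call could run on its own thread
        }
        return total / data.length;
    }

    public static void main(String[] args) {
        double[] data = {2d, 4d, 6d, 8d, 10d, 12d};
        System.out.println("Mean from 3 chunks: " + mean(data, 3));
    }
}
```

The combining step at the end is the part that cannot be split up, and it is one of the places where parallel overhead comes from.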

To begin, we must first construct a sample data set large enough to push our system, so that we can measure the impact of various programming decisions on our code. While the example shown below only produces differences of milliseconds, those can add up quickly. They can mean the difference between an application that responds quickly and efficiently to user demands and one that angers your users by making them wait on your code before it hands control back to them.

It should also be noted that performance tuning should come only after you have really
understood the problem you are trying to solve and are confident that you are returning
the most accurate values possible. As Donald Knuth famously put it, "*premature
optimization is the root of all evil.*"

Let's dig into the following example code. The task here is fairly straightforward. We would like to know how long it takes to compute the mean over a set of one million random data points using three different methods:

- Direct Computation
- Apache commons-math descriptive statistics package
- Java 8 DoubleStream parallel computation method

```java
//
// Library Imports
//

// Built in Java Date object for logging
import java.util.Date;

// Java 8 Streams package
import java.util.stream.DoubleStream;

// The Apache commons-math library for simple statistics
import org.apache.commons.math3.distribution.NormalDistribution;
import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics;

public class MeanTest {

    /**
     * Simple logging method is a wrapper to print out our
     * messages to the console with a date and time stamp
     *
     * @param msg the message to print
     */
    public static void log(String msg) {
        System.out.println(new Date() + ": " + msg);
    }

    /**
     * Method in which we see how you can measure your code
     * for system performance when needed.
     *
     * @author Jeffery Painter
     * @param args command line arguments (unused)
     */
    public static void main(String[] args) {

        log("Begin");

        //
        // Create a data set consisting of 1,000,000 entries
        //
        // Since Java 7, you can now insert the underscore
        // symbol to denote places in numbers
        // which makes it easier to spot check your code!
        double[] data = new double[1_000_000];

        //
        // Create all variables before we start
        //
        double mean = 0d;
        double variance = 0d;
        double sd = 0d;
        double sum = 0d;
        double sampleSize = 0d;
        long startTime = 0L;
        long endTime = 0L;
        long computeTime = 0L;

        //
        // Populate the data with a random number set
        // from a normal distribution with mean = 50 and sd = 15
        //
        // The 'd' character following the number indicates to Java
        // that we wish the number entered to be of type double
        // You can assert numeric types in this way for
        // floats 'f', doubles 'd', and longs 'L'
        // Otherwise, a whole-number literal defaults to int and a
        // literal with a decimal point defaults to double
        // (there is no 'i' suffix)
        //
        NormalDistribution normalDistribution = new NormalDistribution(50d, 15d);
        log("Begin populating our data array");
        for (int i = 0; i < data.length; i++)
            data[i] = normalDistribution.sample();
        log("Data loaded: " + data.length);

        // -----------------------------------------------------
        // Compute the arithmetic mean directly
        // -----------------------------------------------------
        log("Direct computation");

        // nanoTime() returns the current value of the most precise
        // available system timer, in nanoseconds.
        startTime = System.nanoTime();

        // A variable to hold the sum of our data elements
        sum = 0d;

        // Compute the sum
        for (int i = 0; i < data.length; i++)
            sum = sum + data[i];

        // Set the sample size as a type double to
        // maintain precision in division
        sampleSize = (double) data.length;

        // Compute the arithmetic mean
        mean = sum / sampleSize;

        // Set end time for this section
        endTime = System.nanoTime();
        computeTime = (endTime - startTime) / (long) 1e6;
        log("Time to compute mean: " + computeTime + "ms - Value: " + mean);

        // -----------------------------------------------------
        // Use the Apache commons-math library to calculate
        // the mean, variance and standard deviation
        // -----------------------------------------------------
        log("commons-math computation");
        DescriptiveStatistics statistics = new DescriptiveStatistics(data);

        // Arithmetic Mean
        startTime = System.nanoTime();
        mean = statistics.getMean();
        endTime = System.nanoTime();
        computeTime = (endTime - startTime) / (long) 1e6;
        log("Time to compute mean: " + computeTime + "ms - Value: " + mean);

        // Variance
        startTime = System.nanoTime();
        variance = statistics.getVariance();
        endTime = System.nanoTime();
        computeTime = (endTime - startTime) / (long) 1e6;
        log("Time to compute variance: " + computeTime + "ms - Value: " + variance);

        // Standard Deviation
        startTime = System.nanoTime();
        sd = statistics.getStandardDeviation();
        endTime = System.nanoTime();
        computeTime = (endTime - startTime) / (long) 1e6;
        log("Time to compute std deviation: " + computeTime + "ms - Value: " + sd);

        // -----------------------------------------------------
        // Now attempt to use Java 8 Streams for parallel computing
        // -----------------------------------------------------
        log("Java8 Streams");

        //
        // Set up the data into a Java 8 DoubleStream that will
        // take advantage of parallel processing capabilities
        //
        DoubleStream dStream = DoubleStream.of(data).parallel();
        startTime = System.nanoTime();
        mean = dStream.summaryStatistics().getAverage();
        endTime = System.nanoTime();
        computeTime = (endTime - startTime) / (long) 1e6;
        log("Time to compute mean: " + computeTime + "ms - Value: " + mean);
    }
}
```

The output of this code follows. Your computation times may vary depending on your system hardware, but this should illustrate my point rather clearly.

```
Fri Nov 06 16:13:29 EST 2015: Begin
Fri Nov 06 16:13:29 EST 2015: Begin populating our data array
Fri Nov 06 16:13:29 EST 2015: Data loaded: 1000000
Fri Nov 06 16:13:29 EST 2015: Direct computation
Fri Nov 06 16:13:29 EST 2015: Time to compute mean: 6ms - Value: 50.00568115050615
Fri Nov 06 16:13:29 EST 2015: commons-math computation
Fri Nov 06 16:13:29 EST 2015: Time to compute mean: 17ms - Value: 50.005681150505424
Fri Nov 06 16:13:29 EST 2015: Time to compute variance: 13ms - Value: 225.02449104000374
Fri Nov 06 16:13:29 EST 2015: Time to compute std deviation: 5ms - Value: 15.00081634578611
Fri Nov 06 16:13:29 EST 2015: Java8 Streams
Fri Nov 06 16:13:29 EST 2015: Time to compute mean: 154ms - Value: 50.005681150505424
```

As you can see, the direct method was the fastest, taking as little as 6ms and as much as 11ms over several tries. But why? The good news is that all three methods agree on the mean value to within floating-point rounding. So what causes the Apache commons-math library to take nearly twice as long, and the parallel method anywhere from 10-17x as long? Shouldn't parallel processing speed up our code rather than slow it down? This should give you pause: parallel processing does not solve all of our Big Data problems if it is not applied appropriately.

I have to say I was also quite surprised to find that the Apache library took twice as long as our direct method to compute a simple mean. This led me to investigate, which highlights one of the great benefits of using an open source library like Apache commons: you can read the code. One way to inspect any class in Apache commons-math is to go to Google, type in the name of the class you are interested in, and append ".java" to the class name.

Or, you can simply download the source code of the libraries and attach it directly to your Eclipse project. Once you have done this, the source of every method in the Apache commons libraries will be accessible directly in your project. To find the code defining a particular method, double-click the method name to highlight it in Eclipse, then press the F3 key to jump straight to its definition in the commons source code.

Upon inspection of the Mean.java class, we can understand why it takes nearly twice as long to compute the mean as our direct method. The documentation of the code is very well done and tells us exactly what we need to know.

The Apache library makes two passes over your data set in order to correct for accumulated floating-point rounding error, as described in this article. Since the algorithm visits every data point exactly twice, it makes sense that our single-pass direct code runs in roughly half the time.
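The two-pass idea can be sketched in a few lines. This is my own illustration of the general technique (a first estimate from the plain formula, then a correction from the average deviation), not a copy of the commons-math source; the class and method names are made up for this example.

```java
class TwoPassMean {

    /** One pass: the plain definitional mean, sum / n. */
    static double naiveMean(double[] values) {
        double sum = 0d;
        for (double v : values) sum += v;
        return sum / values.length;
    }

    /** Two passes: take the naive estimate, then add the mean deviation from it. */
    static double twoPassMean(double[] values) {
        double estimate = naiveMean(values);           // first pass over the data
        double correction = 0d;
        for (double v : values) correction += (v - estimate); // second pass
        return estimate + correction / values.length;
    }

    public static void main(String[] args) {
        // Large values with small differences are where rounding error tends to show
        double[] data = {1e9 + 0.1, 1e9 + 0.2, 1e9 + 0.3, 1e9 + 0.4};
        System.out.println("One pass: " + naiveMean(data));
        System.out.println("Two pass: " + twoPassMean(data));
    }
}
```

In exact arithmetic the correction term is zero, so the second pass only matters because floating-point sums are not exact. The price is a second loop over every element, which matches the extra time we measured.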

The system call to **nanoTime()** is extremely useful for performance tuning. As you can see from my output
above, if we were to simply compare the logging date and time stamps, it would appear that
all of the code ran at the same instant, because the timestamps only resolve to whole seconds.
**nanoTime()** gives you the precision you need to really measure system performance.

By storing the **nanoTime()** value before and after the portions of code we are interested in,
we can compute the elapsed time in milliseconds by dividing the number of
nanoseconds by 10^6 (one million), written in Java as `1e6`.
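As a small sketch of that conversion: the example code above divides by `(long)1e6`, which is integer division and drops everything below a whole millisecond. The standard `TimeUnit` class does the same truncating conversion with the unit spelled out, and dividing by the double `1e6` keeps the fraction. The class name `ElapsedTime` and its helper methods are made up for this illustration.

```java
import java.util.concurrent.TimeUnit;

class ElapsedTime {

    /** Converts a nanosecond interval to whole milliseconds (truncating). */
    static long toMillis(long startNanos, long endNanos) {
        return TimeUnit.NANOSECONDS.toMillis(endNanos - startNanos);
    }

    /** Converts the same interval to milliseconds, keeping the fractional part. */
    static double toFractionalMillis(long startNanos, long endNanos) {
        return (endNanos - startNanos) / 1e6;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        double sum = 0d;
        for (int i = 0; i < 1_000_000; i++) sum += i; // some work to time
        long end = System.nanoTime();

        System.out.println("Sum: " + sum);
        System.out.println("Whole ms: " + toMillis(start, end));
        System.out.println("Fractional ms: " + toFractionalMillis(start, end));
    }
}
```

For measurements of only a few milliseconds, the fractional form is usually the more informative one, since truncation can hide a large share of the difference between two methods.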

So why does the parallel stream method take longer than the other two? In order to set up the parallel task, Java must look at the data, decide on the best way to split it up for parallelization, and then create and manage the threads that do the work. All of this creates overhead for the JVM, and bringing the partial results back together into a final result costs further time and system resources.
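You can peek at some of this machinery yourself. The sketch below uses the standard `Spliterator` API, which streams use to divide their source, to split an array once by hand, and then prints the size of the common fork/join pool that parallel streams run on by default. The class name `SplitPeek` and the `splitOnce` helper are made up for this illustration.

```java
import java.util.Arrays;
import java.util.Spliterator;
import java.util.concurrent.ForkJoinPool;

class SplitPeek {

    /** Splits an array's spliterator once and returns the two resulting range sizes. */
    static long[] splitOnce(double[] data) {
        Spliterator.OfDouble rest = Arrays.spliterator(data);
        Spliterator.OfDouble prefix = rest.trySplit(); // null when the range cannot be split
        long prefixSize = (prefix == null) ? 0L : prefix.estimateSize();
        return new long[] { prefixSize, rest.estimateSize() };
    }

    public static void main(String[] args) {
        double[] data = new double[1_000_000];
        long[] sizes = splitOnce(data);
        System.out.println("First part: " + sizes[0] + ", second part: " + sizes[1]);

        // Worker threads available to parallel streams by default
        System.out.println("Common pool parallelism: "
                + ForkJoinPool.commonPool().getParallelism());
        System.out.println("Available processors: "
                + Runtime.getRuntime().availableProcessors());
    }
}
```

A parallel stream repeats this kind of split recursively and hands the pieces to the pool's threads, so the work of splitting, scheduling, and merging is all added on top of the arithmetic itself.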

For many tasks, you will see a direct performance boost from parallelization, but be careful not to overuse it, and measure your performance so that you have proof of when it makes sense and when it does not.
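One way to gather that proof is to time each approach several times rather than once, since the first run of any Java code also pays for class loading and JIT compilation. The following is a rough sketch under my own assumptions, not a rigorous benchmark: the class name `WarmedUpTiming`, the warm-up and run counts, and the use of `java.util.Random` in place of the commons-math distribution are all choices made for this illustration.

```java
import java.util.Random;
import java.util.function.DoubleSupplier;
import java.util.stream.DoubleStream;

class WarmedUpTiming {

    // Accumulates results so the JIT cannot discard the timed work as unused
    static double sink = 0d;

    static double sequentialMean(double[] data) {
        return DoubleStream.of(data).average().orElse(Double.NaN);
    }

    static double parallelMean(double[] data) {
        return DoubleStream.of(data).parallel().average().orElse(Double.NaN);
    }

    /** Runs the task `warmups` times untimed, then returns the best of `runs` timed runs in ms. */
    static double bestMillis(DoubleSupplier task, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) sink += task.getAsDouble();
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            sink += task.getAsDouble();
            best = Math.min(best, System.nanoTime() - start);
        }
        return best / 1e6;
    }

    public static void main(String[] args) {
        // Same shape of data as the article: mean 50, standard deviation 15
        Random random = new Random(42L);
        double[] data = new double[1_000_000];
        for (int i = 0; i < data.length; i++) data[i] = 50d + 15d * random.nextGaussian();

        System.out.println("Sequential stream: "
                + bestMillis(() -> sequentialMean(data), 20, 10) + "ms");
        System.out.println("Parallel stream:   "
                + bestMillis(() -> parallelMean(data), 20, 10) + "ms");
    }
}
```

The numbers will differ from machine to machine, and they may not rank the methods the same way a single cold run does, which is exactly why it pays to measure on your own hardware and workload before deciding.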

Thanks for reading the article in full. I hope the example code above gets you started on evaluating the performance of your own code, and helps you determine when running a task in parallel will really give you the boost you are looking for, or, as here, when you should think of attacking your programming issues in another way.

As always, please feel free to send me your feedback on what you liked and didn't like about this article. Also, you can sign up on the JavaStats forum to discuss further.

Happy coding!