Box-Cox Transformations in Python

Many common machine learning algorithms assume data is normally distributed.

But what if your data isn't?

I experienced this frustration firsthand during my undergraduate thesis. I was attempting to predict which category of online slot machine a customer was using, based on information such as their bet size and speed of play. Unfortunately, no matter which algorithm I used or which hyperparameter I modified, I couldn't achieve accuracy above ~60%.

Nearing the end of the school semester, I was reading about improving classifier performance when I had my "Eureka!" moment: of course none of these algorithms were performing well. When people play slot machines, the vast majority bet the minimum stakes, with only the most adventurous and financially well-off players betting significantly more. My data was indeed not normally distributed. A quick Google search for "How to fix non-normally distributed data" revealed the Box-Cox transformation, a seemingly simple way to transform data to be closer to a normal distribution. After I wrote a simple script to perform the transformation, my accuracy jumped to nearly 80%, an incredible increase of roughly 20 percentage points.

The Transformation

The transformation relies primarily on a single parameter, lambda (λ), which typically takes a value between -5 and 5 and is calculated automatically to be optimal for your data, meaning the value that brings the transformed data closest to a normal distribution. Specifically, the data is transformed in the following way:
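For reference, the standard one-parameter Box-Cox transformation of a strictly positive value $y$ is usually written as follows (the symbol names are the conventional ones, not necessarily those used in the original figure):

```latex
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[1ex]
\ln y, & \lambda = 0
\end{cases}
```

As λ approaches 0, the first branch converges to ln y, which is why the logarithm is used for that case.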

Note: this formulation only holds for strictly positive values. For data that contains negative values, a second formulation can be used instead.
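One common way to handle negative values is the two-parameter form of Box-Cox, which adds a constant shift to the data before transforming it. This may or may not be the formulation the note has in mind. The sketch below assumes that approach. The helper name `shifted_boxcox`, its default shift, and the sample values are all hypothetical and chosen only for illustration.

```python
import numpy as np
from scipy import stats


def shifted_boxcox(values, shift=None):
    """Apply Box-Cox to data that may contain zeros or negative values.

    A constant `shift` is added so that every value is strictly positive,
    then SciPy estimates lambda on the shifted data. The shift plays the
    role of the second parameter in the two-parameter formulation.
    """
    values = np.asarray(values, dtype=float)
    if shift is None:
        # Hypothetical default: move the smallest value to exactly 1.0.
        shift = 1.0 - values.min()
    shifted = values + shift
    if np.any(shifted <= 0):
        raise ValueError("shift is too small: all shifted values must be > 0")
    # With no lmbda argument, boxcox returns (transformed data, fitted lambda).
    transformed, fitted_lambda = stats.boxcox(shifted)
    return transformed, fitted_lambda, shift


# Made-up sample containing negative values, sorted in ascending order.
sample = np.array([-3.0, -1.5, 0.0, 0.4, 1.2, 2.5, 6.0, 14.0])
transformed, fitted_lambda, shift = shifted_boxcox(sample)
print(f"shift: {shift}, fitted lambda: {fitted_lambda:.3f}")
```

The choice of shift affects the fitted lambda, so it is worth treating it as a tunable value rather than a fixed rule.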

Writing Code

While the transformation is a tad easier in R, we can still perform it relatively easily in Python using the SciPy library. I will use some sample data from the Bureau of Transportation Statistics, specifically flight durations. My specific dataset is available here.

Let's begin by loading the data and visualizing it as a histogram:
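A minimal sketch of this step follows. It does not use the flight dataset; instead it assumes synthetic, right-skewed stand-in values generated with NumPy. With a real CSV you would load the relevant column yourself (for example with `pandas.read_csv`), using your own file and column names. The helper name `plot_histogram` and the output file name are hypothetical.

```python
import numpy as np

# Stand-in for the flight durations (in minutes): synthetic, right-skewed
# values. These are NOT the Bureau of Transportation Statistics data.
rng = np.random.default_rng(seed=42)
durations = rng.lognormal(mean=4.5, sigma=0.6, size=5000)


def plot_histogram(values, title, bins=50):
    """Draw a histogram of `values` and return the matplotlib figure."""
    # Imported here so the data code above runs without matplotlib installed.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.hist(values, bins=bins)
    ax.set_title(title)
    ax.set_xlabel("Duration (minutes)")
    ax.set_ylabel("Count")
    return fig


if __name__ == "__main__":
    fig = plot_histogram(durations, "Flight duration (stand-in data)")
    # Saved to a file here; in an interactive session plt.show() works too.
    fig.savefig("flight_duration_hist.png")
```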

Output: [Figure: histogram of the flight durations before transformation]

This data, while not horrible, is significantly skewed. Let's see if we can improve the shape a little.
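The transformation step can be sketched with `scipy.stats.boxcox`, which returns both the transformed values and the fitted lambda when no lambda is supplied. As an assumption for illustration, the example again uses synthetic right-skewed stand-in data rather than the flight dataset, and reports sample skewness before and after as a rough numeric check on the shape.

```python
import numpy as np
from scipy import stats

# Synthetic, right-skewed stand-in for the flight durations (not real data).
rng = np.random.default_rng(seed=42)
durations = rng.lognormal(mean=4.5, sigma=0.6, size=5000)

# Box-Cox requires strictly positive input. With no lmbda argument, SciPy
# picks the lambda that maximizes the log-likelihood and returns it too.
transformed, fitted_lambda = stats.boxcox(durations)

# Skewness near 0 indicates a roughly symmetric distribution.
skew_before = stats.skew(durations)
skew_after = stats.skew(transformed)

print(f"fitted lambda:   {fitted_lambda:.3f}")
print(f"skewness before: {skew_before:.3f}")
print(f"skewness after:  {skew_after:.3f}")
```

Plotting `transformed` as a histogram, in the same way as the raw data, gives the visual comparison.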

Output: [Figure: histogram of the flight durations after the Box-Cox transformation]

The transformed data is now much closer to a normal distribution and ready to be used or transformed further.
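If the transformed data feeds a model, it is usually worth reusing the lambda fitted on the original data for any new values, and being able to map results back to the original scale. The sketch below assumes that workflow. It uses `scipy.stats.boxcox` with an explicit `lmbda` and `scipy.special.inv_boxcox` for the inverse. The data and variable names are made up for illustration.

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

# Synthetic stand-in "training" data and a few made-up new observations.
rng = np.random.default_rng(seed=0)
train = rng.lognormal(mean=4.5, sigma=0.6, size=2000)
new_values = np.array([45.0, 90.0, 180.0, 360.0])

# Fit lambda on the training data only.
train_transformed, fitted_lambda = stats.boxcox(train)

# When lmbda is given, boxcox returns just the transformed array.
new_transformed = stats.boxcox(new_values, lmbda=fitted_lambda)

# inv_boxcox undoes the transformation for the same lambda.
recovered = inv_boxcox(new_transformed, fitted_lambda)
print(recovered)
```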

Conclusion

Performing Box-Cox transformations is a powerful and elegant way of normalizing skewed data and can lead to significant improvements in machine learning performance. Our sample data transformation shows this:

Source Code

Full Project Repository

Kent MacDonald Data Science