Box Cox Transformations in Python

Many common machine learning algorithms assume data is normally distributed.

But what if your data isn't?

I experienced this frustration firsthand during my undergraduate thesis. I was attempting to predict the category of online slot machine a customer was using from information about their bet size, speed of play, and so on. Unfortunately, no matter what algorithm I used or which hyperparameter I modified, I couldn't achieve accuracy above ~60%.

Nearing the end of the school semester, I was reading about improving classifier performance when I had my "Eureka!" moment: of course none of these algorithms were performing well. When people play slot machines, the vast majority bet the minimum stakes, with only the most adventurous and financially well-off people betting significantly more. My data was indeed not normally distributed. A quick Google search for "How to fix non-normally distributed data" revealed the Box Cox Transformation, a seemingly simple way to transform data to be closer to a normal distribution. After writing a simple script to perform the transformation, my accuracy jumped to nearly 80%, an incredible improvement of roughly 20 percentage points.

The Transformation

The transformation relies primarily on a lambda (λ) parameter, typically searched over the range -5 to 5, whose value is estimated automatically to be optimal for your data (the value that makes the transformed data look most normal). Specifically, the data is transformed in the following way:
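For reference, the standard one-parameter Box Cox transformation of a strictly positive value y is usually written as below. The symbols y and y^(λ) are notation chosen here for illustration:

```latex
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\[6pt]
\ln y, & \lambda = 0,
\end{cases}
\qquad y > 0.
```

The λ = 0 case is the limit of the first expression as λ approaches 0, which is why a log transform is often described as a special case of Box Cox.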







Note: this formulation only holds for strictly positive values. For data containing zero or negative values, a second formulation can be used instead.
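As a rough sketch of how data with zero or negative values might be handled in SciPy, the snippet below shows two common options: shifting the data to be positive before calling `scipy.stats.boxcox`, and using `scipy.stats.yeojohnson`, a related power transform that accepts negative inputs. The sample data and the hand-picked shift are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

# Hypothetical sample containing zero-or-below values (not real data).
rng = np.random.default_rng(seed=1)
values = rng.lognormal(mean=0.0, sigma=0.8, size=1000) - 1.5

# Option 1: shift so every value is strictly positive, then apply Box Cox.
# The shift is picked by hand here (smallest value becomes 1.0), not fitted.
shift = 1.0 - values.min()
shifted_transformed, shifted_lambda = stats.boxcox(values + shift)

# Option 2: Yeo-Johnson, a related power transform that is defined for
# negative inputs, so no shift is needed.
yj_transformed, yj_lambda = stats.yeojohnson(values)

print("Box Cox (shifted) lambda:", shifted_lambda)
print("Yeo-Johnson lambda:      ", yj_lambda)
```

Which option suits a given dataset is a judgment call; the shift changes the fitted lambda, so it is worth recording alongside it.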

Writing Code

While the transformation is a tad easier in R, we can still perform it relatively easily in Python using the SciPy library. I will use some sample data from the Bureau of Transportation Statistics, specifically flight duration. My specific dataset is available here.

Let's begin by loading the data and visualizing it as a histogram:
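A minimal sketch of this step follows. Because the flight-duration file is not reproduced here, it substitutes synthetic, positively skewed values drawn from a log-normal distribution; the variable name `flight_minutes`, the distribution parameters, and the output filename are all assumptions, not the original data or script.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-in for the flight-duration column: positive, skewed
# values drawn from a log-normal distribution (not the real dataset).
rng = np.random.default_rng(seed=0)
flight_minutes = rng.lognormal(mean=4.5, sigma=0.6, size=5000)

# Plot the raw values as a histogram and save it to disk.
fig, ax = plt.subplots()
ax.hist(flight_minutes, bins=50)
ax.set_xlabel("Flight duration (minutes)")
ax.set_ylabel("Number of flights")
ax.set_title("Raw flight durations")
fig.savefig("raw_histogram.png")
```

With a real CSV, the synthetic array would be replaced by the relevant column loaded from the file.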


Output:
This data, while not horrible, is significantly skewed. Let's see if we can improve the shape a little.
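One way to apply the transformation with SciPy is sketched below. It uses `scipy.stats.boxcox`, which estimates lambda by maximum likelihood when `lmbda` is not supplied. The input is again a synthetic log-normal sample standing in for the flight durations (an assumption), and skewness is printed as a quick numeric check in place of a plot.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in data: positive and skewed (not the real dataset).
rng = np.random.default_rng(seed=0)
flight_minutes = rng.lognormal(mean=4.5, sigma=0.6, size=5000)

# With lmbda left unset, boxcox fits lambda by maximum likelihood and
# returns both the transformed array and the fitted lambda.
transformed, fitted_lambda = stats.boxcox(flight_minutes)

print("fitted lambda:  ", fitted_lambda)
print("skewness before:", stats.skew(flight_minutes))
print("skewness after: ", stats.skew(transformed))
```

A skewness close to zero after the transformation suggests a more symmetric shape; plotting a histogram of `transformed` gives the visual equivalent.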


Output:
The transformed data is now much closer to a normal distribution and is ready to be used or transformed further.

Conclusion

Performing Box Cox transformations is a powerful and elegant way of normalizing skewed data and can lead to significant improvements in machine learning performance. Our sample data transformation shows this:


Source Code



Comments

  1. What should be done when there are negative values?