Box Cox Transformations in Python

Many common machine learning algorithms, such as linear discriminant analysis and Gaussian naive Bayes, assume that the data is normally distributed.

But what if your data isn't?

I experienced this frustration firsthand during my undergraduate thesis. I was attempting to predict the category of online slot machine a customer was using, based on information such as their bet size and speed of play. Unfortunately, no matter what algorithm I used or which hyper-parameter I modified, I couldn't achieve an accuracy above ~60%.

Nearing the end of the school semester, I was reading about improving classifier performance when I had my "Eureka!" moment: of course none of these algorithms were performing well. When people play slot machines, the vast majority bet the minimum stakes, with only the most adventurous and financially well-off betting significantly more. My data was indeed not normally distributed. A quick Google search for "How to fix non-normally distributed data" revealed the Box Cox transformation, a seemingly simple way to transform data to be closer to a normal distribution. After writing a simple script to perform the transformation, my accuracy jumped to nearly 80%, an incredible increase of about 20 percentage points.

The Transformation

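The transformation relies primarily on a lambda (λ) parameter, typically considered in the range -5 to 5, whose value is estimated automatically as the one that brings the transformed data closest to a normal distribution. Specifically, each value y is transformed in the following way:

    y(λ) = (y^λ - 1) / λ    if λ ≠ 0
    y(λ) = log(y)           if λ = 0
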
Note: this formulation only holds for strictly positive values. For data containing zeros or negative values, a second formulation, which first adds a shift parameter to the data, can be used instead.
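
To make the formula concrete, here is a minimal sketch of the one-parameter transformation written by hand with NumPy and checked against `scipy.stats.boxcox` for a fixed λ. The helper names `boxcox_manual` and `boxcox_shifted` are hypothetical, and the shifted variant is only one common reading of the second formulation, assuming the shift is chosen so that every shifted value is positive.

```python
import numpy as np
from scipy import stats


def boxcox_manual(y, lam):
    """One-parameter Box Cox transform of strictly positive values."""
    y = np.asarray(y, dtype=float)
    if np.any(y <= 0):
        raise ValueError("Box Cox requires strictly positive values")
    if lam == 0:
        return np.log(y)
    return (y ** lam - 1.0) / lam


def boxcox_shifted(y, lam, shift):
    """Shifted variant: add `shift` first so that all values are positive."""
    return boxcox_manual(np.asarray(y, dtype=float) + shift, lam)


sample = np.array([0.5, 1.0, 2.0, 4.0, 8.0])

# The hand-written version should agree with SciPy when lambda is fixed.
manual = boxcox_manual(sample, 0.5)
scipy_result = stats.boxcox(sample, lmbda=0.5)
print(np.allclose(manual, scipy_result))

# Data containing zeros or negatives needs a shift before transforming.
with_negatives = np.array([-2.0, 0.0, 1.0, 3.0])
shifted = boxcox_shifted(with_negatives, 0.5, shift=3.0)
print(shifted)
```

Passing `lmbda` explicitly makes `scipy.stats.boxcox` return only the transformed array, which is why the comparison above does not unpack a second value.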

Writing Code

While the transformation is a tad easier in R, we can still perform it relatively easily in Python using the SciPy library. I will use some sample data from the Bureau of Transportation Statistics, specifically flight duration. My specific dataset is available here.

Let's begin by loading the data and visualizing it as a histogram:
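
A minimal sketch of this step, assuming the durations sit in a CSV file with one numeric column. The file name `flights.csv` and the column name `AIR_TIME` are hypothetical placeholders, so the sketch falls back to synthetic right-skewed data (drawn from a log-normal distribution) when the file is not present.

```python
import os

import numpy as np


def load_durations(csv_path="flights.csv", column="AIR_TIME"):
    """Return flight durations as a 1-D float array.

    The path and column name are hypothetical; if the file is missing,
    synthetic right-skewed data stands in for the real dataset.
    """
    if os.path.exists(csv_path):
        import pandas as pd  # only needed when a real file is available

        return pd.read_csv(csv_path)[column].dropna().to_numpy(dtype=float)
    rng = np.random.default_rng(seed=0)
    return rng.lognormal(mean=4.5, sigma=0.6, size=5000)


def plot_histogram(values, title, out_path, bins=50):
    """Save a histogram of `values` to the image file `out_path`."""
    import matplotlib

    matplotlib.use("Agg")  # render to a file, no display required
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.hist(values, bins=bins)
    ax.set_title(title)
    ax.set_xlabel("Flight duration")
    ax.set_ylabel("Count")
    fig.savefig(out_path)
    plt.close(fig)


durations = load_durations()
print(len(durations), durations.min() > 0)
# Uncomment to write the histogram to disk:
# plot_histogram(durations, "Original flight durations", "original.png")
```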


Output:
This data, while it isn't horrible, is significantly skewed. Let's see if we can improve the shape a little.
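
One way to apply the transformation with SciPy is `scipy.stats.boxcox`, which returns the transformed values together with the λ it estimated when no `lmbda` argument is passed. The sketch below uses synthetic log-normal data as a stand-in for the flight durations, so the printed λ will differ from the one for the real dataset.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed stand-in for the flight durations (an assumption).
rng = np.random.default_rng(seed=0)
durations = rng.lognormal(mean=4.5, sigma=0.6, size=5000)

# With lmbda left unset, SciPy estimates lambda by maximum likelihood.
transformed, fitted_lambda = stats.boxcox(durations)

print(f"fitted lambda: {fitted_lambda:.3f}")
print(f"skewness before: {stats.skew(durations):.3f}")
print(f"skewness after:  {stats.skew(transformed):.3f}")
```

A skewness close to zero after the transformation indicates a more symmetric distribution, which is what the histogram should show as well.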


Output:
The transformed data is now much closer to a normal distribution and ready to be used or transformed further.
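
If the transformed values feed a model whose results need to be reported in the original units, the transformation can be undone. A minimal sketch using `scipy.special.inv_boxcox`, again on synthetic stand-in data rather than the flight dataset:

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

# Synthetic right-skewed stand-in for the flight durations (an assumption).
rng = np.random.default_rng(seed=1)
durations = rng.lognormal(mean=4.5, sigma=0.6, size=1000)

transformed, fitted_lambda = stats.boxcox(durations)

# Map the transformed values back to the original scale with the same lambda.
recovered = inv_boxcox(transformed, fitted_lambda)
print(np.allclose(recovered, durations))
```

The inverse only works with the same λ that was used for the forward transformation, so it is worth storing that value alongside any model trained on the transformed data.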

Conclusion

Performing Box Cox transformations is a powerful and elegant way of normalizing skewed data and can lead to significant improvements in machine learning performance. Our sample data transformation shows this:
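
One way to put a number on the change in shape, rather than relying on histograms alone, is the correlation coefficient of a normal probability plot, which approaches 1 as data gets closer to normal. The sketch below uses `scipy.stats.probplot` on synthetic skewed data; it is an illustrative check, not the measurement behind the accuracy figures quoted earlier.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed stand-in for the flight durations (an assumption).
rng = np.random.default_rng(seed=2)
durations = rng.lognormal(mean=4.5, sigma=0.6, size=2000)
transformed, _ = stats.boxcox(durations)

# probplot returns the plot points and a least-squares fit (slope, intercept, r).
_, (_, _, r_before) = stats.probplot(durations, dist="norm")
_, (_, _, r_after) = stats.probplot(transformed, dist="norm")

print(f"probability-plot r before: {r_before:.4f}")
print(f"probability-plot r after:  {r_after:.4f}")
```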


Source Code



Comments

  1. What should be done when there are negative values?