Univariate Linear Regression in Python


Does x predict y?

This is the basic question that linear regression aims to answer, or at least give a hint about. More precisely, linear regression fits a straight line to the data and measures how well that line describes the relationship between two variables.

In this post we need to be familiar with the idea of dependent and independent variables. Generally, the dependent variable, or "y", is the variable that we are measuring (it can also help to think of it as the outcome). The independent variable, or "x", is the variable that is modified or changed. If these two variables are at all correlated, a change in the independent variable should result in a somewhat reliable change in the dependent variable.

For example, let's say we are interested in how rainfall affects umbrella usage. We could hypothesize that the more it rains, the more likely people are to use an umbrella. In this case our independent variable is rainfall, which we can quantify as mm per day. Our dependent variable is umbrella usage, which we can quantify as the number of umbrellas seen per day on some particular street. It seems obvious that people would be much more likely to use an umbrella when it is raining, but let's say you actually ran this study for a month, collected some data, and plotted it on a scatterplot like so:

As you can clearly see, more rain generally equals more umbrellas. We could easily state that a correlation exists between rainfall and umbrella usage. 

If an increase in the independent variable leads to an increase in the dependent variable, it is called a positive correlation. If an increase in the independent variable leads to a decrease in the dependent variable, it is called a negative correlation.
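
To make the two cases concrete, here is a minimal sketch that checks the sign of the correlation with NumPy. All of the numbers are made up for illustration (they are not measurements), and the sunglasses series and the helper name correlation_sign are my own inventions for the example.

```python
import numpy as np

# Made-up numbers for illustration only; they are not measurements.
rainfall_mm = np.array([0.0, 2.0, 5.0, 8.0, 12.0, 20.0])
umbrellas_seen = np.array([1, 4, 9, 15, 22, 35])    # rises with rainfall
sunglasses_seen = np.array([30, 26, 19, 12, 7, 2])  # falls with rainfall


def correlation_sign(x, y):
    """Return 'positive', 'negative', or 'none' based on the Pearson r of x and y."""
    r = np.corrcoef(x, y)[0, 1]  # off-diagonal entry of the 2x2 correlation matrix
    if r > 0:
        return "positive"
    if r < 0:
        return "negative"
    return "none"


print(correlation_sign(rainfall_mm, umbrellas_seen))   # positive
print(correlation_sign(rainfall_mm, sunglasses_seen))  # negative
```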

This is all fine and dandy when dealing with two variables that are so obviously related. However, when dealing with larger, less clearly related data, researchers need a way to quantify the relationship and to test objectively whether two variables are truly correlated. That is why many techniques, including linear regression, were developed.

Rather than writing at length about the specific details of how linear regression works, something that has been explained on countless other websites, I would encourage you to read about it somewhere online, say here and then return to the tutorial once you have a solid understanding.

Writing Code

To begin, we will need some example data to work with. I've chosen the cricket chirps vs degrees Fahrenheit from this lovely website. All I have done is downloaded the Excel spreadsheet version, saved it as a .csv file with the name cricket_data.csv, and made the headings a tad more informative. You can download the sample data here.

It is generally a good idea when working with any new dataset to first visualize it. This could easily be done in Excel, but let's instead do it in Python.
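
Here is a minimal sketch of such a plot using matplotlib. The values below are placeholders that I made up so the snippet runs on its own; they are NOT the real cricket data. The variable names, axis labels and output file name are also assumptions. With the real file you would load the two columns from cricket_data.csv instead (for example with pandas.read_csv), using whatever headings you gave them.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; remove this line to use an interactive window
import matplotlib.pyplot as plt
import numpy as np

# Placeholder values for illustration only, NOT the real cricket data.
chirps = np.array([14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0])
temp_f = np.array([70.0, 74.0, 75.0, 81.0, 82.0, 88.0, 90.0])

fig, ax = plt.subplots()
ax.scatter(chirps, temp_f)
ax.set_xlabel("Cricket chirps")
ax.set_ylabel("Temperature (degrees Fahrenheit)")
ax.set_title("Cricket chirps vs. temperature")
fig.savefig("cricket_scatter.png")  # hypothetical output file name
```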


It certainly looks like there is a positive correlation between these two variables, so let's import scikit-learn (sklearn) and try to quantify this relationship with linear regression.
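
A minimal sketch of the fit with sklearn's LinearRegression might look like this. I am assuming chirps as the independent variable and temperature as the dependent one, and the values are made-up placeholders rather than the real cricket data, so the score printed here will not match the one from the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder values for illustration only, NOT the real cricket data.
chirps = np.array([14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0])
temp_f = np.array([70.0, 74.0, 75.0, 81.0, 82.0, 88.0, 90.0])

# sklearn expects X as a 2-D array of shape (n_samples, n_features),
# so the single feature is reshaped into one column.
X = chirps.reshape(-1, 1)
y = temp_f

model = LinearRegression()
model.fit(X, y)

# score() returns R^2, evaluated here on the same data the model was fit on.
r_squared = model.score(X, y)
print(f"R^2: {r_squared:.2f}")
```

The reshape step is the part that most often trips people up: passing a 1-D array as X raises an error, because sklearn cannot tell whether it holds one feature or one sample.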

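After running the above code, it reported an R^2 score of 0.69, meaning that roughly 69% of the variation in the dependent variable is accounted for by the fitted line. Pretty good considering the scale of the data. You can read more about R^2, or the coefficient of determination, here.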
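
If you are curious where that number comes from, R^2 can be computed by hand as 1 minus the ratio of the unexplained variation to the total variation. The sketch below does this with NumPy only, using np.polyfit for the least-squares line. The data values are made-up placeholders, not the real cricket data, so the result will differ from 0.69.

```python
import numpy as np

# Placeholder values for illustration only, NOT the real cricket data.
chirps = np.array([14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0])
temp_f = np.array([70.0, 74.0, 75.0, 81.0, 82.0, 88.0, 90.0])

# Least-squares line; for degree 1, np.polyfit returns [slope, intercept].
slope, intercept = np.polyfit(chirps, temp_f, 1)
predicted = slope * chirps + intercept

ss_res = np.sum((temp_f - predicted) ** 2)      # variation the line does not explain
ss_tot = np.sum((temp_f - temp_f.mean()) ** 2)  # total variation around the mean
r_squared = 1.0 - ss_res / ss_tot
print(f"R^2: {r_squared:.2f}")
```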

Getting More Information

We could easily stop here, but for completeness' sake let's gather some more information about the regression.


Getting the coefficient is simple, though a little bit hidden in the way sklearn implements linear regression: the fitted model stores it in its coef_ attribute, as an array with one entry per feature.
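
As a minimal sketch (again on made-up placeholder values, not the real cricket data), reading the coefficient from a fitted model could look like this. Because coef_ is an array even when there is only one feature, we index it to get a plain number.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder values for illustration only, NOT the real cricket data.
chirps = np.array([14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0])
temp_f = np.array([70.0, 74.0, 75.0, 81.0, 82.0, 88.0, 90.0])

model = LinearRegression().fit(chirps.reshape(-1, 1), temp_f)

# coef_ holds one coefficient per feature; with a single feature
# it has shape (1,), so the slope is its first (and only) element.
slope = model.coef_[0]
print(f"Coefficient (slope): {slope:.3f}")
```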

Line of Best-Fit

If we want to plot the line of best fit, we need two values: the coefficient (the slope) and the intercept, which together define the line y = coefficient * x + intercept.
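
One way to draw it, sketched below, is to evaluate that line at the smallest and largest x values and plot it on top of the scatterplot. The data values are made-up placeholders (not the real cricket data), and the labels and output file name are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; remove this line to use an interactive window
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder values for illustration only, NOT the real cricket data.
chirps = np.array([14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0])
temp_f = np.array([70.0, 74.0, 75.0, 81.0, 82.0, 88.0, 90.0])

model = LinearRegression().fit(chirps.reshape(-1, 1), temp_f)
slope = model.coef_[0]        # one coefficient per feature
intercept = model.intercept_  # a single float when y is 1-D

# A straight line only needs two points: the ends of the x range.
line_x = np.array([chirps.min(), chirps.max()])
line_y = slope * line_x + intercept

fig, ax = plt.subplots()
ax.scatter(chirps, temp_f, label="Observations")
ax.plot(line_x, line_y, color="red", label="Line of best fit")
ax.set_xlabel("Cricket chirps")
ax.set_ylabel("Temperature (degrees Fahrenheit)")
ax.legend()
fig.savefig("cricket_best_fit.png")  # hypothetical output file name
```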


Pearson Correlation Coefficient

We will calculate the Pearson correlation coefficient and its p-value (the probability of seeing a correlation at least this strong if the two variables were actually unrelated) with SciPy, since sklearn's LinearRegression does not report them.
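
A minimal sketch using scipy.stats.pearsonr, which returns the coefficient and the p-value together. The data values are made-up placeholders, not the real cricket data, so the numbers printed are only illustrative.

```python
import numpy as np
from scipy.stats import pearsonr

# Placeholder values for illustration only, NOT the real cricket data.
chirps = np.array([14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0])
temp_f = np.array([70.0, 74.0, 75.0, 81.0, 82.0, 88.0, 90.0])

# pearsonr takes the two 1-D samples and returns (r, p-value).
r, p_value = pearsonr(chirps, temp_f)
print(f"Pearson r: {r:.3f}")
print(f"p-value:   {p_value:.3g}")

# With a single independent variable, r squared equals the R^2
# reported by the linear regression.
print(f"r squared: {r ** 2:.2f}")
```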


In this blog post we learned the basics of single-variable regression. Much of this translates closely to multivariate regression, which I will post about if there is significant interest.

Source Code and Project Repository:

