k-Folds Cross Validation in Python

How do we evaluate the accuracy of a machine learning algorithm?

It is customary when evaluating any machine learning classification model to split the dataset into separate training and testing sets.

One convention is 80% training and 20% testing, but what if your dataset is rather small? Testing on 20% of an already small dataset can give a misleading accuracy figure, because so few rows may not be representative of the full dataset. One solution to this problem is k-folds cross validation.

The Basic Idea

If we have 100 rows in our dataset, typically 20 of these rows would be selected as a testing set, leaving the remaining 80 rows as a training set.

In k-Folds Cross Validation we start out just like that, except after we have divided, trained and tested the data, we re-generate our training and testing datasets using a different 20% of the data as the testing set, and add our old testing set back into the remaining 80% for training. This process continues until every row in our original set has been included in a testing set exactly once. The k in k-folds stands for the number of folds, that is, how many times new training and testing sets are created. With 20% testing sets, k = 5.
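To make the rotation concrete, here is a minimal sketch in plain Python that only deals with row indices, not with a real model. The helper name `k_fold_indices` is made up for this illustration and is not part of any library. It cuts 100 rows into 5 contiguous folds and uses each fold as the testing set once.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds.

    Hypothetical helper for illustration only; not a library function.
    """
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        # Everything outside the current fold is used for training.
        train_idx = list(range(0, start)) + list(range(stop, n_rows))
        yield train_idx, test_idx
        start = stop


folds = list(k_fold_indices(n_rows=100, k=5))
for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"fold {i}: {len(train_idx)} training rows, {len(test_idx)} testing rows")
```

Each of the 5 folds ends up with 80 training rows and 20 testing rows, and every row index lands in a testing set exactly once.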

An illustrated example of k-folds cross validation

Starting Code

We will begin with a base-case example of a machine learning template. Our dataset will be the famous iris dataset which I have added headings to and saved as a .csv file available here. We will try to predict the class of plant, using sepal width, sepal length, petal length and petal width. To do this we will use Gaussian Naive-Bayes from the sklearn library. We will begin by splitting the dataset via the usual 80-20 split.
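A rough sketch of this starting point might look like the following. It assumes the copy of iris bundled with sklearn (`load_iris`) in place of the .csv file mentioned above, so the column handling and the exact accuracy may differ from the original code. The variable names are my own.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB

# Assumption: sklearn's bundled iris data stands in for the .csv file.
# Features are sepal length, sepal width, petal length and petal width.
X, y = load_iris(return_X_y=True)

# Naive 80-20 split: the first 80% of rows train, the last 20% test.
split = int(len(X) * 0.8)
train_X, test_X = X[:split], X[split:]
train_y, test_y = y[:split], y[split:]

model = GaussianNB()
model.fit(train_X, train_y)

predicted_y = model.predict(test_X)
accuracy = accuracy_score(test_y, predicted_y)
print(f"Accuracy: {accuracy:.1%}")
```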

Stratification

The above code outputs an accuracy of 93%; however, it has one major problem that becomes obvious when we look at the data contained in the actual splits:
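One way to inspect the splits is to count the class labels on each side of the cut. This sketch again assumes sklearn's bundled iris data, which is also sorted by class but labels the classes "setosa", "versicolor" and "virginica" rather than using the "iris-" prefix.

```python
from collections import Counter

from sklearn.datasets import load_iris

# Assumption: sklearn's bundled iris data stands in for the .csv file.
iris = load_iris()
y = iris.target
names = iris.target_names

split = int(len(y) * 0.8)

# Count how many rows of each class fall into each side of the split.
train_counts = Counter(str(names[label]) for label in y[:split])
test_counts = Counter(str(names[label]) for label in y[split:])

print("training set:", dict(train_counts))
print("testing set: ", dict(test_counts))
```

The testing set contains a single class, while the training set sees all of the first two classes and only part of the third.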

It turns out that the original dataset is sorted by class, so by slicing off the last 20% of rows we are selecting only data in the "iris-virginica" category. There are many ways to remedy this. One of them is stratification, which means giving each slice the same share of each class as the full dataset has. We will use the train_test_split function from sklearn to demonstrate it:
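A minimal sketch of a stratified split, assuming sklearn's bundled iris data rather than the .csv file. No `random_state` is set, so the partition (and the accuracy) changes from run to run.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Assumption: sklearn's bundled iris data stands in for the .csv file.
X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions the same in both splits.
# Pass random_state=<int> as well if you want a repeatable partition.
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.2, stratify=y
)

model = GaussianNB()
model.fit(train_X, train_y)
accuracy = accuracy_score(test_y, model.predict(test_X))

print("test rows per class:", np.bincount(test_y).tolist())
print(f"Accuracy: {accuracy:.1%}")
```

With 50 rows per class in the full dataset, the 30-row testing set now holds 10 rows of each class.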

Running the above code results in accuracy measures anywhere from 80% to a perfect 100%, depending on how the data happens to be partitioned each time the code is run. This variability is further evidence in support of cross-validation.

Adding k-Folds Cross Validation (finally)

We will now move on to adding proper, stratified k-Folds Cross Validation. (Note: there are many ways to do this; I am just showing one of the possibilities.)
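One possible version uses sklearn's `StratifiedKFold`. This is a sketch under the same assumption as before (sklearn's bundled iris data instead of the .csv file), and the choice of 5 shuffled folds is mine.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB

# Assumption: sklearn's bundled iris data stands in for the .csv file.
X, y = load_iris(return_X_y=True)

# 5 folds means each testing set is 20% of the data.
# shuffle=True without random_state gives different folds on each run.
skf = StratifiedKFold(n_splits=5, shuffle=True)

fold_accuracies = []
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    # Train a fresh model on each fold so no information leaks between folds.
    model = GaussianNB()
    model.fit(X[train_idx], y[train_idx])

    predicted = model.predict(X[test_idx])
    fold_accuracy = accuracy_score(y[test_idx], predicted)
    fold_accuracies.append(fold_accuracy)
    print(f"fold {fold}: {fold_accuracy:.1%}")
```

Each fold prints its own accuracy, so the spread between folds is visible at a glance.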

Output:

Combining Results

We can take this one step further by combining the results from all folds into final predicted_y and expected_y lists which can then be compared to get a measure of true classifier accuracy.
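A sketch of this combining step, again assuming sklearn's bundled iris data and a 5-fold `StratifiedKFold`. The list names follow the predicted_y and expected_y used in the text above.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB

# Assumption: sklearn's bundled iris data stands in for the .csv file.
X, y = load_iris(return_X_y=True)
skf = StratifiedKFold(n_splits=5, shuffle=True)

predicted_y = []
expected_y = []
for train_idx, test_idx in skf.split(X, y):
    model = GaussianNB()
    model.fit(X[train_idx], y[train_idx])

    # Collect this fold's predictions alongside the true labels.
    predicted_y.extend(model.predict(X[test_idx]))
    expected_y.extend(y[test_idx])

# Every row was tested exactly once, so this covers the whole dataset.
accuracy = accuracy_score(expected_y, predicted_y)
print(f"Accuracy: {accuracy:.1%}")
```

Because the folds are shuffled, the exact figure can shift slightly between runs.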

Output: 95.3% accuracy

Conclusion

As I stated above, this is just one of many ways to go about k-Folds Cross Validation. I personally find this method easiest to understand and expand upon but your mileage may vary.

Project Source:

Available Here

Comments

pg9 May 2026 at 22:30
Evaluating the accuracy of a machine learning algorithm is a key step in determining how well a model performs on unseen data. In Machine Learning, this is typically done by splitting the dataset into training data (to train the model) and testing data (to evaluate it). The simplest metric is accuracy, which is the ratio of correctly predicted instances to the total number of predictions. For example, if a model correctly predicts 90 out of 100 cases, its accuracy is 90%. However, accuracy alone may not always be reliable, especially when dealing with imbalanced datasets.

Kent MacDonald Data Science