k-Folds Cross Validation in Python

How do we evaluate the accuracy of a machine learning algorithm?

It is customary when evaluating any machine learning classification model to split the dataset into separate training and testing sets.

One convention is 80% training and 20% testing, but what if your dataset is rather small? 20% of an already minimal dataset can lead to false accuracy reporting. Furthermore, such small selections may not be truly representative of the full dataset. One of the solutions to this problem is k-folds cross validation.

The Basic Idea

If we have 100 rows in our data set, typically 20 of these rows would be selected as a testing set, leaving the remaining 80 rows as a training set.

In k-Folds Cross Validation we start out just like that, except after we have divided, trained and tested the data, we will re-generate our training and testing datasets using a different 20% of the data as the testing set and add our old testing set into the remaining 80% for training. This process continues until every row in our original set has been included in a testing set exactly once. The k in k-folds stand for how many times the new datasets are created.

An illustrated example of k-folds cross validation

Starting Code

We will begin with a base-case example of a machine learning template. Our dataset will be the famous iris dataset which I have added headings to and saved as a .csv file available here. We will try to predict the class of plant, using sepal width, sepal length, petal length and petal width. To do this we will use Gaussian Naive-Bayes from the sklearn library. We will begin by splitting the dataset via the usual 80-20 split.

	import pandas
	from sklearn.naive_bayes import GaussianNB
	from sklearn import metrics

	def import_data():
	# import total dataset
	data = pandas.read_csv('iris_data.csv')

	# split via 80-20
	row_partition = int(data.shape[0] * 0.8)
	train = data[:row_partition]
	test = data[row_partition:]

	# get a list of column names
	headers = list(train.columns.values)

	# partition data
	x_train = train[headers[:-1]]
	y_train = train[headers[-1:]].values.ravel()
	x_test = test[headers[:-1]]
	y_test = test[headers[-1:]].values.ravel()

	return x_train, x_test, y_train, y_test

	if __name__ == '__main__':
	# get training and testing sets
	x_train, x_test, y_train, y_test = import_data()

	# create and fit classifier
	classifier = GaussianNB()
	classifier.fit(x_train, y_train)

	# classify our test variables
	predictions = classifier.predict(x_test)

	# save and print accuracy
	accuracy = metrics.accuracy_score(y_test, predictions)
	print("Accuracy: " + accuracy.__str__())

view raw before_cross_validation.py hosted with ❤ by GitHub

Stratification

The above code outputs an accuracy of 93%, however; it has one major problem that becomes obvious when we look at the data contained in the actual splits:

It turns out that the original dataset is sorted based on class. So by slicing off the last 20% of rows, we are selecting only data in the "iris-virginica" category. There are many ways to remedy this, one of which is through stratification which is basically the act of including an equal share of each class in each slice. We will use the train_test_split function from sklearn to demonstrate it:

	def import_data():
	# import total dataset
	data = pandas.read_csv('iris_data.csv')

	# get a list of column names
	headers = list(data.columns.values)
	x = data[headers[:-1]]
	y = data[headers[-1:]].values.ravel()

	# partition data
	x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, stratify=y)

	return x_train, x_test, y_train, y_test

view raw train_test_split_example.py hosted with ❤ by GitHub

Running the above code results in accuracy measures anywhere from 80% to a perfect 100% based on random chance of how the data is partitioned each time the code is run, further evidence in support of cross-validation.

Adding k-Folds Cross Validation (finally)

We will now move on to adding proper, stratified k-Folds Cross Validation. (Note: there are many ways to do this, I am just showing one of the possibilities)

	if __name__ == '__main__':
	# get training and testing sets
	x, y = import_data()

	# set to 10 folds
	skf = StratifiedKFold(n_splits=10)

	for train_index, test_index in skf.split(x, y):
	# specific ".loc" syntax for working with dataframes
	x_train, x_test = x.loc[train_index], x.loc[test_index]
	y_train, y_test = y[train_index], y[test_index]

	# create and fit classifier
	classifier = GaussianNB()
	classifier.fit(x_train, y_train)

	# classify our test variables
	predictions = classifier.predict(x_test)

	# save and print accuracy
	accuracy = metrics.accuracy_score(y_test, predictions)
	print("Accuracy: " + accuracy.__str__())

view raw 10fold_cross_validation_partial.py hosted with ❤ by GitHub

Output:

	Accuracy: 0.933333333333
	Accuracy: 0.933333333333
	Accuracy: 1.0
	Accuracy: 0.933333333333
	Accuracy: 0.933333333333
	Accuracy: 0.933333333333
	Accuracy: 0.866666666667
	Accuracy: 1.0
	Accuracy: 1.0
	Accuracy: 1.0

view raw cross_validation_output.txt hosted with ❤ by GitHub

Combining Results

We can take this one step further by combining the results from all folds into final predicted_y and expected_y lists which can then be compared to get a measure of true classifier accuracy.

	import pandas
	from sklearn.naive_bayes import GaussianNB
	from sklearn import metrics
	from sklearn.model_selection import StratifiedKFold


	def import_data():
	# import total dataset
	data = pandas.read_csv('iris_data.csv')

	# get a list of column names
	headers = list(data.columns.values)

	# separate into independent and dependent variables
	x = data[headers[:-1]]
	y = data[headers[-1:]].values.ravel()

	return x, y

	if __name__ == '__main__':
	# get training and testing sets
	x, y = import_data()

	# set to 10 folds
	skf = StratifiedKFold(n_splits=10)

	# blank lists to store predicted values and actual values
	predicted_y = []
	expected_y = []

	# partition data
	for train_index, test_index in skf.split(x, y):
	# specific ".loc" syntax for working with dataframes
	x_train, x_test = x.loc[train_index], x.loc[test_index]
	y_train, y_test = y[train_index], y[test_index]

	# create and fit classifier
	classifier = GaussianNB()
	classifier.fit(x_train, y_train)

	# store result from classification
	predicted_y.extend(classifier.predict(x_test))

	# store expected result for this specific fold
	expected_y.extend(y_test)

	# save and print accuracy
	accuracy = metrics.accuracy_score(expected_y, predicted_y)
	print("Accuracy: " + accuracy.__str__())

view raw cross_validation_example.py hosted with ❤ by GitHub

Output: 95.3% accuracy

Conclusion

As I stated above, this is just one of many ways to go about k-Folds Cross Validation. I personally find this method easiest to understand and expand upon but your mileage may vary.

Project Source:

Available Here

Search This Blog

Kent MacDonald Data Science