Happy Hour of Code week! Today I am going to talk about a really fun topic: machine learning! So what is machine learning?
Have you ever seen a Google self-driving car? Have you ever looked in your email and noticed that some messages go straight to your inbox, while others are classified as spam? These are both examples of machine learning in action.
Your computer is not very smart without you. It does exactly what you tell it to do, and it does it very fast. But can it learn to do other things? Can it recognize patterns, just as humans do?
In this post, I am going to show you how to use a graphical program called WEKA (named after a really cute animal) that runs machine learning algorithms. Together, we will teach your computer to recognize different types of Iris flowers.
Install Java if you do not already have it (Java comes pre-installed on Mac). Now let’s download WEKA for your operating system from the official WEKA site. You can download either a Stable version or a Developer version. I ended up downloading the Stable version, since it is… ahem… less likely to crash on you.
When the download is finally done, run the WEKA program, and you will see the following window with the picture of an adorable bird from New Zealand. Why does all the wildlife in NZ and Australia have to be so cute?
But I digress. Let’s click the Explorer button, and when a new window appears, the Open file… button. This will let you choose the dataset we want to explore.
If you look inside the WEKA folder you downloaded, there should be a data folder with several different datasets in it. Let’s pick iris.arff, a very famous dataset that classifies iris flowers into different classes based on flower properties, such as petal length. The Iris dataset is famous enough to have its own Wikipedia page!

If you are curious enough to peek inside iris.arff, you will see that the dataset consists of rows of four numbers, each followed by an Iris flower class. We will use the four numbers together with the class first to train the classifier to distinguish between flower types, and then to test it: the classifier is trained on only a part of the dataset, and the remaining data is used to check how well it classifies. This is exactly what the algorithm will do.
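To give you an idea of what you will see in there: an ARFF file starts with a header that declares the attributes, followed by a @DATA section with one line per flower. The iris.arff that ships with WEKA begins roughly like this (trimmed here to just the first couple of data rows):

```
@RELATION iris

@ATTRIBUTE sepallength  REAL
@ATTRIBUTE sepalwidth   REAL
@ATTRIBUTE petallength  REAL
@ATTRIBUTE petalwidth   REAL
@ATTRIBUTE class        {Iris-setosa,Iris-versicolor,Iris-virginica}

@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
...
```

So each row is four measurements in centimeters, plus the species label that we want the computer to learn to predict.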
Let’s do some classification magic and switch to the Classify tab in WEKA. It should already have the ZeroR classifier selected, as well as cross-validation with 10 folds. Ten-fold cross-validation means that you split the dataset into 10 parts and train the algorithm on 9 of them, where the algorithm knows what the answer should be. Once the algorithm is trained, you use the remaining tenth part to test the classifier: here the algorithm does not know the answers and has to rely on its prior training to classify. With cross-validation, you repeat the process 10 times, so that each of the 10 parts of the dataset gets a turn as the test set.
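If that splitting-and-rotating procedure sounds confusing, here is a tiny Python sketch of the idea. This is just my toy illustration, not WEKA’s actual code, and the function name `cross_validation_splits` is made up:

```python
import random

def cross_validation_splits(data, folds=10, seed=1):
    """Shuffle the data, cut it into `folds` parts, and yield
    (train, test) pairs so that every part is the test set
    exactly once."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    # Part i takes every folds-th element starting at position i.
    parts = [shuffled[i::folds] for i in range(folds)]
    for i in range(folds):
        test = parts[i]
        train = [row for j, part in enumerate(parts) if j != i
                 for row in part]
        yield train, test

# With the 150 iris rows and 10 folds, every test set has
# 15 rows and every training set has the other 135.
rows = list(range(150))  # stand-ins for the 150 iris examples
for train, test in cross_validation_splits(rows):
    assert len(test) == 15 and len(train) == 135
```

Each of the 10 rounds produces one accuracy score, and WEKA reports the combined result, so no single lucky (or unlucky) split can fool you.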
When you click the Start button, the classifier is trained and tested. As you can see in the results window, the ZeroR classifier is not very good: it classified only 33% of the test data correctly. What if we try a different classification algorithm? Choose the RandomForest classifier inside the trees folder and click Start again. The results are much better: 95% of the data got classified correctly!
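Why is ZeroR so bad? It is the simplest possible baseline: it ignores the flower measurements entirely and always predicts whichever class was most common in the training data. Here is my own toy Python version of the idea (not WEKA’s code), which also shows where the 33% comes from, since the iris dataset has 50 examples of each of the 3 classes:

```python
from collections import Counter

def zero_r(train_labels):
    """ZeroR ignores all attributes and always predicts the
    most common class seen during training."""
    majority, _ = Counter(train_labels).most_common(1)[0]
    return lambda row: majority  # same answer for every row

# 50 examples of each class, just like the iris dataset.
labels = (["Iris-setosa"] * 50 + ["Iris-versicolor"] * 50
          + ["Iris-virginica"] * 50)
classify = zero_r(labels)

# Always guessing one class is right on 50 of the 150 rows,
# i.e. about 33% of the time.
correct = sum(classify(None) == label for label in labels)
print(correct, "out of", len(labels))
```

Any classifier worth using should beat this baseline, which is exactly what RandomForest does.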
Hopefully, playing with this data has inspired you to go exploring more datasets and algorithms. I am definitely going to talk in more detail about how these algorithms work in future posts. For now, have a wonderful week, play with some more datasets and classifiers, keep teaching your computer to be smart, and be CodeBrave!