Happy New Year everyone! I took a long break from blogging with all the vacations and end-of-year activities, but now I am happy to be back in the loop to deliver more insights about data science. At Microsoft HoloLens I have been diving head-first into the fun world of data, and being a software engineer, I am continuously looking for ways to improve my code and structure data science projects for better reusability.
Recycle your code
One of the first things I noticed when creating my first Jupyter notebook data science project was that I kept performing the same analysis on different parts of the data. Following the software reusability tenet, I quickly converted the repeated code into well-named, well-documented functions and imported them into my Jupyter notebooks with the %run IPython magic command, which runs the Python code in the file you pass it.
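As a sketch of what that looks like, here is a hypothetical helper module (the module name and function are my own illustration, not from a real project) and the notebook cell that pulls it in:

```python
# analysis_helpers.py -- a hypothetical module holding the extracted functions.
import pandas as pd

def summarize_missing(df):
    """Return the fraction of missing values per column, sorted descending."""
    return df.isna().mean().sort_values(ascending=False)

# In a notebook cell, you would then load and use it like this:
#   %run analysis_helpers.py
#   summarize_missing(df)
```

Because %run executes the file in the notebook's namespace, every function defined in it becomes available to subsequent cells, just as if you had written it inline.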
Some of those functions were mission critical, i.e. decisions of the entire group or company would be made based on the numbers and insights I provided. There could be no room for error. While looking around, I did not see a lot of Jupyter notebooks written as software should be written – with tests. Fortunately, the work we just did in the previous step, where we extracted reusable code blocks into their own functions, makes writing tests incredibly easy. If you do not know how to create unit tests or need a refresher, I would recommend taking a look at the Art of Unit Testing book by Roy Osherove. It saved my skin once, maybe it will save yours. You could go even further and require the tests to run any time you commit code to git, but I will talk about that in a future blog post.
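To make this concrete, here is a minimal sketch of a pytest-style test file for the kind of extracted helper described above. All names here (the file, the helper, the test cases) are illustrative assumptions; in practice you would import the helper from your own module rather than defining it inline:

```python
# test_analysis_helpers.py -- a hypothetical test file, runnable with pytest.
import pandas as pd

def summarize_missing(df):
    """The helper under test (normally imported from analysis_helpers)."""
    return df.isna().mean().sort_values(ascending=False)

def test_summarize_missing_counts_nans():
    # Column "a" is missing 2 of 3 values, column "b" is complete.
    df = pd.DataFrame({"a": [1, None, None], "b": [1, 2, 3]})
    result = summarize_missing(df)
    assert result["a"] == 2 / 3
    assert result["b"] == 0.0

def test_summarize_missing_sorts_descending():
    # The column with the most missing data should come first.
    df = pd.DataFrame({"a": [None], "b": [1]})
    assert list(summarize_missing(df).index) == ["a", "b"]
```

Running `pytest` in the project directory picks these up automatically, so verifying the mission-critical numbers becomes a one-command habit instead of a manual spot check.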
Find a way to get your data back
Imagine you performed an experiment. You found a really cool data insight. You committed the Jupyter notebook code to git and moved on with your life to cooler, fresher, more exciting projects. A couple of months later the team comes back to you: “We were wondering, did the statistic X change since you ran the experiment way back when?” You run back to the Jupyter notebook only to discover that while the code still works, you have no idea how you extracted and munged the data to run it. You have to spend an entire day reinventing the wheel of data extraction, which is beyond inefficient.
A lot of blogs out there advise putting the correctly shaped data together with the code you are committing. While this may work just fine for small public datasets, when you have a couple hundred thousand rows with hundreds of columns of private user data, that practice is not an option. Imagine cloning a git repo with hundreds of projects, each full of 1GB data files. I might as well go and play Witcher 3 now. Also, remember, your data may have to comply with GDPR, HIPAA, and other blah blah important boring government edicts.
A much better way to make your data reproducible is to put all your data extraction and data munging into one script. This script could pull from your Hadoop storage, or a SQL server, or a publicly stored file, and shape the data according to the needs of your insights extraction machine. Voilà!
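A minimal sketch of such a script might look like the following. Every name here (the file, the columns, the cleaning rules) is a made-up example; a CSV stands in for whatever store your team actually pulls from:

```python
# get_data.py -- a hypothetical one-stop extraction script, committed to git
# next to the notebook so the data can always be rebuilt from scratch.
import pandas as pd

def extract(source):
    """Pull the raw rows. In practice this might be pd.read_sql against the
    team's SQL server or a Hive query; reading a CSV stands in here."""
    return pd.read_csv(source)

def munge(raw):
    """Shape the raw frame for analysis: drop incomplete rows and
    normalize the column names."""
    clean = raw.dropna()
    clean.columns = [c.strip().lower() for c in clean.columns]
    return clean

def get_data(source):
    """The single entry point the notebook calls to rebuild its input."""
    return munge(extract(source))
```

The notebook then starts with a single call to get_data(), so six months later the answer to “how did you get this data?” is one small, version-controlled file instead of a day of archaeology.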
I could continue rambling about the perks of software engineering practices in data science for all eternity, but there is only so much time in the day. For now, remember to place your reusable Python code into functions, treat those functions like libraries and write unit tests for them, and finally, make sure you commit to git the steps you took to extract the data for your project, because in the end this will save you hours of work when you have to rerun your experiments. Go and play with data and be CodeBrave!