Pandas is a very useful Python package that is widely used for data analysis. However, it has so many functions, features and options that beginners are often overwhelmed by what exactly to do and how. To complicate matters, the same task can be performed in many different ways depending on the situation. To help newcomers get started with this wonderful package, we will walk through an actual task that most managers will easily relate to, and identify and explain the functions and options as and when we need them for the task at hand. This will help the reader identify a minimal core set of functionality that can be applied to a wide range of similar tasks. Later on, they will be in a position to look for other functions, not covered in this minimal set, that may be required for future work.
In this exercise, we demonstrate the use of basic Pandas commands to carry out an analysis of sales data that is available on Kaggle. For ease of access, without having to log in to Kaggle, the same data is made available at the author's GitHub repository.
There are literally hundreds of functions available in the pandas module, and an exhaustive list of them is available in the official documentation, which should help one get started. For charts and graphs, one may look at the visualisation section of the documentation. However, it is faster, easier and certainly far less boring to work one's way through a reasonably complex data analysis problem and learn about the functions that are used more than others as and when you need them. Or Just In Time.
The following Pandas functions, along with a few Python built-ins such as list() and len(), are demonstrated in this exercise:
* df.copy()
* df.plot()
* df.drop()
* df['column'].unique()
* df.groupby().sum()
* df.loc[..]
* df.cumsum()
* df[ list ]
* df.dtypes
* df.shape
* df.columns
* df.index
* pd.to_numeric()
* pd.to_datetime()
* pd.read_csv()
* pd.pivot_table()
* pd.cut()
* list()
* str.replace()
* dt.strftime()
* .sort_values()
* .to_frame()
* .round()
* .isin(list)
* len()
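To give a feel for how several of these functions fit together, here is a minimal sketch on a tiny, made-up table. The column names (order_date, region, sales) and the values are assumptions chosen for illustration; they are not taken from the actual sales dataset or the author's notebook.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative assumptions.
df = pd.DataFrame({
    "order_date": ["2014-01-05", "2014-01-20", "2014-02-11", "2014-02-15"],
    "region": ["East", "West", "East", "West"],
    "sales": ["1,200", "800", "450", "2,050"],  # numbers stored as text
})

# str.replace() + pd.to_numeric(): strip thousands separators, then convert
df["sales"] = pd.to_numeric(df["sales"].str.replace(",", "", regex=False))

# pd.to_datetime() + dt.strftime(): parse dates, then derive a year-month label
df["order_date"] = pd.to_datetime(df["order_date"])
df["month"] = df["order_date"].dt.strftime("%Y-%m")

# groupby().sum() + sort_values(): total sales per region, largest first
by_region = df.groupby("region")["sales"].sum().sort_values(ascending=False)

# pd.pivot_table(): months as rows, regions as columns, summed sales as values
pivot = pd.pivot_table(df, index="month", columns="region",
                       values="sales", aggfunc="sum")

# cumsum(): running total of sales in row order
df["running_total"] = df["sales"].cumsum()

# pd.cut(): bucket each sale into a labelled size band
df["size"] = pd.cut(df["sales"], bins=[0, 500, 1500, 5000],
                    labels=["small", "medium", "large"])

# df.loc[...] with isin(): pick rows by condition and columns by name
east_only = df.loc[df["region"].isin(["East"]), ["month", "sales"]]

print(by_region)
print(pivot)
print(df)
```

Each step here mirrors one or two entries from the list above; the real exercise applies the same ideas to a much larger table.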
Four years of sales data (51000 rows x 21 columns) is available as a CSV file. Analysing this much data in a spreadsheet is not convenient, which is why we are using Python Pandas. The actual code is available in a Jupyter notebook that can be opened and run on a Google Colab VM. Readers should open this notebook by clicking on the blue 'open with colab' button and execute the commands that are discussed in this tutorial.
Do read the comments before each command to understand what is being done and why.
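As a warm-up before opening the notebook, the sketch below shows what loading and inspecting a CSV typically looks like. It reads from a small inline string instead of the real file, so the column names (order_id, category, profit) and values are purely hypothetical; in the notebook, pd.read_csv() would be pointed at the actual data file instead.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real CSV file; contents are made up.
csv_text = """order_id,category,profit
A1,Furniture,10.456
A2,Technology,20.5
A3,Furniture,-3.25
"""

# pd.read_csv() accepts a file path, a URL, or a file-like object
raw = pd.read_csv(io.StringIO(csv_text))

# First look at the data: size, column names, column types, row labels
print(raw.shape)          # (rows, columns)
print(list(raw.columns))  # column names as a plain Python list
print(raw.dtypes)         # data type of each column
print(raw.index)          # row labels
print(len(raw))           # number of rows

# df.copy(): work on a copy so the raw data stays untouched
work = raw.copy()

# df.drop(): remove a column that is not needed for the analysis
work = work.drop(columns=["order_id"])

# unique(): distinct values in one column
categories = work["category"].unique()

# groupby().sum(), round() and to_frame(): a tidy one-column summary table
summary = work.groupby("category")["profit"].sum().round(1).to_frame()
print(summary)
```

Working on a copy is a small habit that makes it easy to return to the original data if a later step goes wrong.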
Image credit: Pandas for Data Science (Learning Path) – Real Python