What’s Data Mining?

The goal of data mining is to gain actionable insights from an organization’s data: businesses decide what they can do based on what they know about the data they have.

Put another way, data mining is all about taking a huge amount of data and extracting insights from it, much like how physical mining extracts a small amount of precious metal from large piles of raw ore.
Data mining, however, uses statistics, code, and machine learning algorithms instead of explosives and smelting. Many data mining tools are available through the Python programming language and its extensive ecosystem of third-party modules.

Tools of the Trade

Let’s get acquainted with some of the available data mining tools for Python, which we’ll use to do a very basic analysis of a publicly available dataset provided by the FBI:

Pandas: a Python module for working with data (particularly in table form) which is fast and flexible.

Matplotlib: a plotting library for Python.

Seaborn: a data visualization library for Python, based on matplotlib.

Jupyter: a web app for creating, running, and sharing documents that contain live code. It is very popular among data scientists, and Python is one of its supported languages.

Statsmodels: a Python module for statistical modeling.

The rest of this article assumes you have Python and the above software installed. Please refer to the documentation for each tool in order to install it on your system.
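Before starting, a quick check like the one below can confirm that the tools are importable from the Python environment your notebook uses. This is only a sketch: the module names are the usual import names, and checking for `jupyter_core` as a sign that Jupyter is installed is an assumption rather than an official test.

```python
import importlib.util

# Import names for the tools listed above; "jupyter_core" is an assumed
# stand-in for "Jupyter is installed".
modules = ["pandas", "matplotlib", "seaborn", "statsmodels", "jupyter_core"]

# find_spec returns None when a top-level module cannot be found.
status = {name: importlib.util.find_spec(name) is not None for name in modules}

for name, found in status.items():
    print(f"{name}: {'installed' if found else 'missing'}")
```

Anything reported as missing needs to be installed before the cells in the next section will run.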

Take a Byte Out of Crime

First, download the Excel file for the Offenses Known to Law Enforcement by City for the state of California. Then, fire up a new Jupyter notebook.

Before we do any actual analysis, we’ll have to import Pandas, Matplotlib’s pyplot module, and Seaborn and then tell Jupyter to display our graphics inline so that we can view them in our notebook:

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

Next, let’s use Pandas to read the Excel file into a dataframe object:

In [2]:
df = pd.read_excel("../table_8_offenses_known_to_law_enforcement_california_by_city_2013.xls")

Just to make sure that all the data in the Excel file cleanly imported into our dataframe, let’s look at the top of it:

In [3]:
df.head()

Out [3]:

[Image: the first rows of the dataframe imported from the Excel file]

Since the title of the table is in the spreadsheet itself, that title got swept up during the import process, resulting in NaNs at the top of almost all the columns. Real data very often requires at least some cleaning before you can process and analyze it.

After making a copy of this Excel file with the title cells deleted, replacing whitespace with underscores in almost all the multi-word column titles, and importing this modified copy, we end up with a clean dataframe:

In [4]:
df = pd.read_excel("../table_8_offenses_known_to_law_enforcement_california_by_city_2013 (cleaned).xls")
df.head()

Out [4]:

[Image: the first rows of the cleaned dataframe]
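As an alternative to editing the spreadsheet by hand, the same kind of cleanup can be sketched in Pandas itself. The example below does not use the FBI file: it builds a small, made-up dataframe that imitates a sheet read without a header, where title rows sit above the real column names. The row position of the header and all of the values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for a sheet read with header=None: the first two
# rows hold a title and the third row holds the real column names.
raw = pd.DataFrame([
    ["Table 8", None, None],
    ["Offenses Known to Law Enforcement", None, None],
    ["City", "Population", "Property crime"],
    ["Town A", 1000, 25],
    ["Town B", 5000, 140],
])

header_row = 2  # assumption: index of the row that holds the column names

# Keep only the rows below the header, then promote the header row to
# column names, replacing whitespace with underscores.
clean = raw.iloc[header_row + 1:].reset_index(drop=True)
clean.columns = [str(name).strip().replace(" ", "_") for name in raw.iloc[header_row]]

# The numeric columns were read as generic objects, so convert them.
for col in ["Population", "Property_crime"]:
    clean[col] = pd.to_numeric(clean[col])

print(clean)
```

With a real file, passing `skiprows` to `pd.read_excel` is another way to skip title rows at import time. How many rows to skip depends on the spreadsheet, so check the result with `df.head()`.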

Our dataframe object has a handy method for displaying basic statistical information about each of the columns with numerical data:

In [5]:
df.describe()

Out [5]:

[Image: output of df.describe()]
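To see what `describe()` reports without the FBI file at hand, here is a minimal sketch on a tiny dataframe. The towns and numbers are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical numbers, not taken from the FBI table.
sample = pd.DataFrame({
    "City": ["Town A", "Town B", "Town C", "Town D"],
    "Population": [1000, 5000, 20000, 80000],
    "Property_crime": [25, 140, 610, 2300],
})

# By default, describe() summarizes only the numeric columns, so "City"
# is left out of the result.
summary = sample.describe()
print(summary)

# Individual statistics can be looked up by row label and column name.
print(summary.loc["mean", "Population"])
```

The rows of the summary are the count, mean, standard deviation, minimum, quartiles, and maximum of each numeric column.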

Next, we’ll plot a linear regression on top of a plot of city population vs. property crime, using Seaborn:

In [6]:
sns.regplot(x="Population", y="Property_crime", data=df, fit_reg=True)
plt.show()

Out [6]:

[Image: scatter plot of city population vs. property crime, with a fitted regression line]

For something more quantitative, let’s use the ordinary least squares (OLS) function, ols, from the Statsmodels formula API to produce a summary of that regression:

In [7]:
from statsmodels.formula.api import ols
m = ols("Property_crime ~ Population", data=df).fit()
print(m.summary())

Out [7]:

[Image: OLS regression summary produced by Statsmodels]

Not surprisingly, there’s a strong correlation between the number of people in a town and the total number of reported property crimes for that town.
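To make the idea of a fitted line and a "strong correlation" more concrete, the sketch below computes a least-squares slope and intercept, the R-squared value, and the Pearson correlation coefficient with NumPy. It uses invented numbers rather than the FBI data, and it illustrates the underlying arithmetic rather than reproducing the Statsmodels output.

```python
import numpy as np

# Hypothetical data, not taken from the FBI table.
population = np.array([1000, 5000, 20000, 80000, 150000], dtype=float)
property_crime = np.array([25, 140, 610, 2300, 4100], dtype=float)

# Fit a straight line by least squares: property_crime ~ slope * population + intercept
slope, intercept = np.polyfit(population, property_crime, deg=1)

# R-squared: the share of the variance in property_crime explained by the line.
predicted = slope * population + intercept
ss_res = np.sum((property_crime - predicted) ** 2)
ss_tot = np.sum((property_crime - property_crime.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Pearson correlation coefficient between the two variables.
r = np.corrcoef(population, property_crime)[0, 1]

print(f"slope={slope:.4f}, intercept={intercept:.2f}")
print(f"R-squared={r_squared:.4f}, r={r:.4f}")
```

For a regression with a single predictor and an intercept, R-squared equals the square of the correlation coefficient, which is why a strong correlation and a high R-squared go together here.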

Python’s ease of use, coupled with its many powerful modules, makes it a versatile tool for data mining and analysis, especially for those looking for the gold in their mountains of data.