
Data Science Pipeline in Python


We all look for code snippets while working on data science and machine learning problems, be it a case study, a competition, or a project. Many data scientists maintain their own repositories of code snippets to save time and focus on the problem itself rather than hunting for code all over the internet.


Today, I am going to share my own Python code repository, which I have arranged and generalized into a kind of data science pipeline. From importing libraries to tuning hyper-parameters, everything is there; you just need to customize it a bit for your own dataset or problem.


Please note that it's a basic pipeline; you may still need to write some code to handle anything specific to your problem. While I have pasted the code snippets in this blog itself, you can refer to my Python notebook on GitHub for properly formatted code.


Happy coding…


ankitrathi169/Data-Science-with-Python


Import libraries

# import basic libraries
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# import plot libraries
import seaborn as sns
sns.set_palette('Set2')
import matplotlib.pyplot as plt
%matplotlib inline

# import ml libraries
from sklearn.metrics import confusion_matrix, accuracy_score, mean_squared_error
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import linear_model, datasets
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.svm import LinearSVC, SVC

# list files in the data directory
import os
print(os.listdir("data"))

Read data

# read data
from subprocess import check_output
print(check_output(["ls", "../input/"]).decode("utf8"))  # list input files (Kaggle-style path)

train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")

Check shape

# check shape
print("Train rows and columns : ", train.shape)
print("Test rows and columns : ", test.shape)

Check column types

# check column types
ctype = train.dtypes.reset_index()
ctype.columns = ["Count", "Column Type"]
ctype.groupby("Column Type").aggregate('count').reset_index()

train.info()

Display data header

# display data header
train.head()

Numerical data distribution

# numerical data distribution
train.describe()

Categorical data distribution

# categorical data distribution
train.describe(include=['O'])

Check missing values

# check missing values
missing_df = train.isnull().sum(axis=0).reset_index()
missing_df.columns = ['column_name', 'missing_count']
missing_df = missing_df[missing_df['missing_count'] > 0]
missing_df = missing_df.sort_values(by='missing_count')
missing_df
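If you prefer percentages to raw counts, a quick variant on the same idea:

# missing values as a percentage of rows, highest first
missing_pct = (train.isnull().mean() * 100).sort_values(ascending=False)
missing_pct[missing_pct > 0]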

Impute/treat missing values

# impute/treat missing values
train['col1'] = train['col1'].fillna(train['col1'].value_counts().index[0])  # categorical: most frequent value
train['col2'].fillna(train['col2'].mean(), inplace=True)  # numerical: mean (or median)
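As an alternative, scikit-learn's SimpleImputer (in sklearn.impute, available in newer versions) does the same job and, once fitted on train, can be reused on the test set; a minimal sketch, assuming 'col1' is categorical and 'col2' is numerical:

from sklearn.impute import SimpleImputer

# categorical column: fill with the most frequent value
cat_imputer = SimpleImputer(strategy='most_frequent')
train[['col1']] = cat_imputer.fit_transform(train[['col1']])
test[['col1']] = cat_imputer.transform(test[['col1']])

# numerical column: fill with the mean (or strategy='median')
num_imputer = SimpleImputer(strategy='mean')
train[['col2']] = num_imputer.fit_transform(train[['col2']])
test[['col2']] = num_imputer.transform(test[['col2']])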

Check outliers

# check outliers
fmean = train.col.mean()
fstd = train.col.std()
train.loc[train.col - fmean > 3*fstd, 'col']     # upper outliers
train.loc[train.col - fmean < -(3*fstd), 'col']  # lower outliers

Treat outliers

# treat outliers (cap at mean +/- 3 standard deviations)
fmean = train.col.mean()
fstd = train.col.std()
train.loc[train.col - fmean > 3*fstd, 'col'] = fmean + 3*fstd     # cap upper outliers
train.loc[train.col - fmean < -(3*fstd), 'col'] = fmean - 3*fstd  # cap lower outliers
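The 3-sigma rule assumes roughly normal data; for skewed columns, an IQR-based cap is a common, more robust alternative (a sketch for the same 'col'):

# cap outliers at 1.5 * IQR beyond the quartiles
q1, q3 = train['col'].quantile([0.25, 0.75])
iqr = q3 - q1
train['col'] = train['col'].clip(lower=q1 - 1.5*iqr, upper=q3 + 1.5*iqr)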

Univariate analysis

# univariate analysis

# histogram of a numerical column
plt.figure(figsize=(12,8))
sns.distplot(train["num"].values, bins=10, kde=False)
plt.xlabel('num', fontsize=12)
plt.title("num histogram", fontsize=14)
plt.show()

# pie plot for a categorical column
counts = train['cat'].value_counts()
fig1, ax1 = plt.subplots()
ax1.pie(counts, labels=counts.index, autopct='%1.1f%%', shadow=True)
ax1.axis('equal')
plt.show()

# bar plot for a categorical column
fig2, ax2 = plt.subplots()
sns.countplot(x='col', data=train)
plt.show()

Bi-variate analysis

# bivariate analysis
sns.countplot(x='cat1', hue='cat2', data=train)  # categorical vs categorical
sns.violinplot(x='cat', y='num', data=train)     # categorical vs numerical
sns.regplot(x='num1', y='num2', data=train)      # numerical vs numerical

Multivariate analysis

# multivariate analysis
temp = train[cols_to_use]  # cols_to_use: list of numerical columns to correlate
corrmat = temp.corr(method='spearman')
f, ax = plt.subplots(figsize=(20, 20))
sns.heatmap(corrmat, vmax=1., square=True, cmap="YlGnBu", annot=True)
plt.title("numerical variables correlation map", fontsize=15)
plt.show()
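The heatmap is also a quick way to spot redundant features. As a small follow-up sketch (assuming the same corrmat and an arbitrary 0.9 cutoff), you can list columns that are highly correlated with an earlier column and are therefore candidates for dropping:

# columns with absolute correlation > 0.9 to an earlier column
upper = corrmat.where(np.triu(np.ones(corrmat.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col].abs() > 0.9).any()]
to_drop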

Split data

# split data
y = train.target
X = train.drop('target', axis=1, inplace=False)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=123)
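By default train_test_split holds out 25% of the rows; for classification problems it is usually worth stratifying so both splits keep the class proportions of y (a variant of the call above):

# stratified split, preserving class balance in train and valid
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=123)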

Feature engineering on train/valid

# feature engineering on train/valid
label = LabelEncoder()  # for categorical data
X_train['cat'] = label.fit_transform(X_train['cat'])
X_val['cat'] = label.transform(X_val['cat'])  # reuse the encoder fitted on train

scaler = MinMaxScaler()  # for numerical data
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_val = scaler.transform(X_val)
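LabelEncoder assigns categories an arbitrary numeric order, which linear models can misread as magnitude. A common alternative (a sketch, assuming the same 'cat' column) is one-hot encoding with pd.get_dummies, applied before the split so train and valid get identical columns:

# one-hot encode the categorical column, then re-split
X = pd.get_dummies(X, columns=['cat'])
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=123)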

Build model on train

# build model on train
model = linear_model.LogisticRegression(C=1e5)  # or RandomForestClassifier(), SVC(), RandomForestRegressor(), etc.
model.fit(X_train, y_train)
y_pred = model.predict(X_val)
model.score(X_train, y_train)

Evaluate on valid

# evaluate on valid
confusion_matrix(y_val, y_pred)    # for categorical target
mean_squared_error(y_val, y_pred)  # for numerical target
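For classification, accuracy_score (already imported above) and classification_report give a fuller picture; a quick sketch:

from sklearn.metrics import classification_report

print(accuracy_score(y_val, y_pred))         # overall accuracy
print(classification_report(y_val, y_pred))  # per-class precision, recall, F1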

K-fold cross-validation

# k-fold cross-validation
model = SVC(kernel='linear', C=1)
scores = cross_val_score(model, X_train, y_train, cv=5)
print("Score: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
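cv=5 above uses scikit-learn's default splitting; to control shuffling, or to keep the class balance within each fold, you can pass an explicit splitter such as StratifiedKFold:

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
scores = cross_val_score(model, X_train, y_train, cv=skf)
print("Score: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))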

Hyper-parameter tuning

# hyper-parameter tuning: grid search over gamma and C
best_score = 0
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        # for each combination of parameters, train an SVC
        svm = SVC(gamma=gamma, C=C)
        # perform cross-validation
        scores = cross_val_score(svm, X_train, y_train, cv=5)
        # compute mean cross-validation accuracy
        score = np.mean(scores)
        # if we got a better score, store the score and parameters
        if score > best_score:
            best_score = score
            best_parameters = {'C': C, 'gamma': gamma}

# rebuild a model on the combined training and validation set
svm = SVC(**best_parameters)
svm.fit(X_train, y_train)
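The same grid search can be written more compactly with scikit-learn's GridSearchCV, which cross-validates every combination and refits the best model for you; an equivalent sketch:

from sklearn.model_selection import GridSearchCV

param_grid = {'gamma': [0.001, 0.01, 0.1, 1, 10, 100],
              'C': [0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_score_, grid.best_params_)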

If you liked this post, have a look at another one, where I talk about 'How to launch your DS/AI career in 12 weeks?':

How to launch your DS/AI career in 12 weeks?

Ankit Rathi is an AI architect, published author & well-known speaker. His interest lies primarily in building end-to-end AI applications/products following best practices of Data Engineering and Architecture.

Why don't you connect with Ankit on Twitter, LinkedIn or Instagram?
