• ankitrathi

Data Science Pipeline in Python

We all look for code snippets while working on data science or machine learning problems, whether it is a case study, a competition or a project. Many data scientists keep their own repositories of code snippets so they can save time and focus on the problem itself rather than hunting for code across the internet.

Today, I am going to share my own Python code repository, which I have arranged and generalized into a kind of data science pipeline. Everything from importing libraries to tuning hyper-parameters is there; you just need to customize it a bit for your own dataset or problem.

Please note that it's a basic pipeline, so you might still need to write some code for anything specific to your problem. While I have pasted the code snippets in this blog itself, you can refer to my Python notebook on GitHub for properly formatted code.

Happy coding!

  1. Import libraries

# import basic libraries
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
# import plot libraries
import seaborn as sns
sns.set_palette('Set2')
import matplotlib.pyplot as plt
%matplotlib inline
# import ml libraries
from sklearn.metrics import confusion_matrix, accuracy_score, mean_squared_error
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import linear_model, datasets
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.svm import LinearSVC, SVC
# list files in the data directory
import os
print(os.listdir("data"))
  1. Read data

# read data
from subprocess import check_output
print(check_output(["ls", "data"]).decode("utf8")) # list the input files
train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")
  1. Check shape

# check shape
print("Train rows and columns : ", train.shape)
print("Test rows and columns : ", test.shape)
  1. Check column types

# check column types
ctype = train.dtypes.reset_index()
ctype.columns = ["Count", "Column Type"]
ctype.groupby("Column Type").aggregate('count').reset_index()
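A shorter way to get a per-type column count is `value_counts` on the `dtypes` Series. This is a minimal sketch on a hypothetical toy DataFrame, not the notebook's own code:

```python
import pandas as pd

# Hypothetical toy frame: two integer columns and one float column
train = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5], "c": [3, 4]})

# number of columns per dtype
type_counts = train.dtypes.value_counts()
print(type_counts)
```

The result is a Series indexed by dtype, so it can be sorted or plotted like any other count.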
  1. Display data header

# display data header
  1. Numerical data distribution

# numerical data distribution
  1. Categorical data distribution

# categorical data distribution
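The three steps above (data header, numerical distribution, categorical distribution) are listed without code. A minimal sketch, assuming pandas' built-in `head` and `describe` are what is wanted here; the toy DataFrame and its column names are hypothetical:

```python
import pandas as pd

# Hypothetical toy data standing in for train.csv
train = pd.DataFrame({
    "age": [22, 35, 58, 41, 29],
    "fare": [7.25, 71.28, 26.55, 8.05, 13.00],
    "sex": ["male", "female", "female", "male", "male"],
})

# data header: first rows of the data
header = train.head(3)

# numerical distribution: count, mean, std, min, quartiles, max
num_summary = train.describe()

# categorical distribution: count, unique, top (most frequent), freq
cat_summary = train.describe(exclude=["number"])

print(header)
print(num_summary)
print(cat_summary)
```

With no arguments, `describe` only reports on numerical columns when any exist, which is why the categorical summary needs the explicit `exclude` filter.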
  1. Check missing values

# check missing values
missing_df = train.isnull().sum(axis=0).reset_index()
missing_df.columns = ['column_name', 'missing_count']
missing_df = missing_df[missing_df['missing_count'] > 0]
missing_df = missing_df.sort_values(by='missing_count')
missing_df
  1. Impute/treat missing values

# impute/treat missing values
train['col1'] = train['col1'].fillna(train['col1'].value_counts().index[0]) # for categorical (most frequent value)
train['col2'] = train['col2'].fillna(train['col2'].mean()) # for numerical (mean or median)
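To see both imputation strategies end to end, here is a small self-contained sketch. The DataFrame and the column names `city` and `income` are made up for illustration, and the median is used for the numerical column as one of the two options mentioned above:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps; 'city' and 'income' are illustrative names
train = pd.DataFrame({
    "city": ["Delhi", "Pune", None, "Delhi", "Delhi"],
    "income": [40.0, np.nan, 55.0, 61.0, np.nan],
})

# categorical: fill with the most frequent value (the mode)
train["city"] = train["city"].fillna(train["city"].mode()[0])

# numerical: fill with the median, which is less sensitive to outliers than the mean
train["income"] = train["income"].fillna(train["income"].median())

print(train)
```

Assigning the result back to the column, rather than calling `fillna` with `inplace=True` on a column selection, avoids chained-assignment warnings in recent pandas versions.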
  1. Check outliers

# check outliers
fmean = train.col.mean()
fstd = train.col.std()
train.loc[train.col - fmean > (3*fstd), 'col'] # upper outliers
train.loc[train.col - fmean < -(3*fstd), 'col'] # lower outliers
  1. Treat outliers

# treat outliers
fmean = train.col.mean()
fstd = train.col.std()
upper = train.col - fmean > (3*fstd) # mask of upper outliers
lower = train.col - fmean < -(3*fstd) # mask of lower outliers
train.loc[upper, 'col'] = fmean + (3*fstd) # cap upper outliers
train.loc[lower, 'col'] = fmean - (3*fstd) # cap lower outliers
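A self-contained sketch of the three-standard-deviation rule on synthetic data may help when adapting this step. The data is made up, and `clip` is used here as one possible way to cap values, not necessarily the notebook's approach:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: 50 values near 10 plus one extreme value
rng = np.random.default_rng(0)
train = pd.DataFrame({"col": np.append(rng.normal(10, 1, 50), 100.0)})

fmean = train["col"].mean()
fstd = train["col"].std()
upper = fmean + 3 * fstd
lower = fmean - 3 * fstd

# check: rows outside mean +/- 3 standard deviations
outliers = train.loc[(train["col"] > upper) | (train["col"] < lower), "col"]
print(outliers)

# treat: cap values at the bounds computed before any change
train["col"] = train["col"].clip(lower=lower, upper=upper)
```

Note that the mean and standard deviation are themselves pulled by extreme values, so for heavily skewed columns a quantile-based rule may be a better fit.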
  1. Univariate analysis

# univariate analysis
# histogram of numerical column
plt.figure(figsize=(12,8))
sns.histplot(train["num"], bins=10, kde=False)
plt.xlabel('num', fontsize=12)
plt.title("num histogram", fontsize=14)
# pie plot for categorical column
counts = train['cat'].value_counts()
labels = counts.index
sizes = counts.values
fig1, ax1 = plt.subplots()
ax1.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=True)
ax1.axis('equal')
# bar plot for categorical column
fig2, ax2 = plt.subplots()
sns.countplot(x="col", data=train)
  1. Bi-variate analysis

# bivariate analysis
sns.barplot(x='cat1', y='cat2', data=train) # categorical vs categorical (cat2 must be numerically encoded)
sns.violinplot(x='cat', y='num', data=train) # categorical vs numerical
sns.regplot(x="num1", y="num2", data=train) # numerical vs numerical
  1. Multivariate analysis

# multivariate analysis
temp = train[cols_to_use] # cols_to_use: list of numerical column names
corrmat = temp.corr(method='spearman')
f, ax = plt.subplots(figsize=(20, 20))
sns.heatmap(corrmat, vmax=1., square=True, cmap="YlGnBu", annot=True)
plt.title("numerical variables correlation map", fontsize=15)
  1. Split data

# split data
y = train['target']
X = train.drop('target', axis=1)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=123)
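For classification problems it can be worth making the split size explicit and stratifying on the target. A minimal sketch on a hypothetical toy frame; `test_size=0.25` is simply scikit-learn's default written out, and `stratify` is an optional addition not used in the snippet above:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with a balanced binary 'target' column
train = pd.DataFrame({
    "num": range(20),
    "target": [0, 1] * 10,
})

y = train["target"]
X = train.drop("target", axis=1)

# stratify=y keeps the class ratio roughly equal in both parts
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=123
)
print(X_train.shape, X_val.shape)
```

Fixing `random_state` makes the split reproducible between runs, which matters when comparing models.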
  1. Feature engineering on train/valid

# feature engineering on train/valid
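label = LabelEncoder()
X_train['cat'] = label.fit_transform(X_train['cat']) # for categorical data
X_val['cat'] = label.transform(X_val['cat']) # reuse the mapping learned on train
scaler = MinMaxScaler() # for numerical data
X_train = scaler.fit_transform(X_train) # fit on train only
X_val = scaler.transform(X_val)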
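The key point in this step is that encoders and scalers learn their parameters from the training data only and are then applied unchanged to the validation data. A small sketch with hypothetical frames (`cat` and `num` are placeholder column names):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Hypothetical train/validation frames; 'cat' and 'num' are placeholder columns
X_train = pd.DataFrame({"cat": ["a", "b", "a", "c"], "num": [1.0, 5.0, 3.0, 9.0]})
X_val = pd.DataFrame({"cat": ["b", "c"], "num": [2.0, 7.0]})

# categorical: learn the mapping on train, reuse it on validation
label = LabelEncoder()
X_train["cat"] = label.fit_transform(X_train["cat"])
X_val["cat"] = label.transform(X_val["cat"])  # raises ValueError on unseen categories

# numerical: learn min/max on train only, then apply the same scaling to both
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

print(X_val_scaled)
```

Because the scaler is fitted on train only, validation values can fall outside the 0 to 1 range; that is expected and avoids leaking validation statistics into training.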
  1. Build model on train

# build model on train
model = linear_model.LogisticRegression(C=1e5) # or RandomForestClassifier(), SVC(), RandomForestRegressor() etc.
model.fit(X_train, y_train)
y_pred = model.predict(X_val)
model.score(X_train, y_train) # score on the training data
  1. Evaluate on valid

# evaluate on valid
confusion_matrix(y_val, y_pred) # for categorical target
from sklearn.metrics import mean_squared_error
mean_squared_error(y_val, y_pred) # for numerical target
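The build and evaluate steps can be tried end to end on synthetic data. This is a minimal sketch that assumes a binary classification target; `make_classification` stands in for a real dataset, and the default `LogisticRegression` settings are used for simplicity:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for a real dataset
X, y = make_classification(n_samples=200, n_features=5, random_state=123)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=123)

# fit on the training part only
model = LogisticRegression()
model.fit(X_train, y_train)

# evaluate on the held-out validation part
y_pred = model.predict(X_val)
cm = confusion_matrix(y_val, y_pred)  # rows: true class, columns: predicted class
acc = accuracy_score(y_val, y_pred)

print(cm)
print("Validation accuracy: %0.2f" % acc)
```

The diagonal of the confusion matrix holds the correct predictions, so accuracy equals the diagonal sum divided by the total count.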
  1. K-fold cross-validation

# k-fold cross-validation
model = SVC(kernel='linear', C=1)
scores = cross_val_score(model, X_train, y_train, cv=5)
print("Score: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
  1. Hyper-parameter tuning

# hyper-parameter tuning
best_score = 0
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        # for each combination of parameters, train an SVC
        svm = SVC(gamma=gamma, C=C)
        # perform cross-validation
        scores = cross_val_score(svm, X_train, y_train, cv=5)
        # compute mean cross-validation accuracy
        score = np.mean(scores)
        # if we got a better score, store the score and parameters
        if score > best_score:
            best_score = score
            best_parameters = {'C': C, 'gamma': gamma}
# rebuild a model on the training set with the best parameters
svm = SVC(**best_parameters)
svm.fit(X_train, y_train)
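scikit-learn's `GridSearchCV` performs the same exhaustive search over `C` and `gamma` with cross-validation. The sketch below is an alternative to the manual loop, not the notebook's own code, and uses the Iris dataset purely as a small stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Iris is used here only as a small stand-in dataset
X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=123)

param_grid = {
    "C": [0.001, 0.01, 0.1, 1, 10, 100],
    "gamma": [0.001, 0.01, 0.1, 1, 10, 100],
}

# 5-fold cross-validation for every C/gamma combination;
# refit=True (the default) retrains the best model on all of X_train
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best CV accuracy: %0.2f" % search.best_score_)
print("Validation accuracy: %0.2f" % search.score(X_val, y_val))
```

After fitting, `search` behaves like the refitted best model, so `search.predict` can be used directly on new data.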

If you liked this post, have a look at another where I talk about ‘How to launch your DS/AI career in 12 weeks?’:

Thank you for reading my post. I regularly write about Data & Technology on LinkedIn & Medium. If you would like to read my future posts, simply ‘Connect’ or ‘Follow’. Also, feel free to visit my webpage.

#DataPipeline #DataScience #Github #Python3


©  2020  Ankit Rathi