
Data Science: The Complete Reference

Many great books have been written and excellent courses launched on Data Science, yet I feel there is nothing like a ‘The Complete Reference’ or ‘Hitchhiker’s Guide To’ kind of reading or reference material for the field. You can treat this post as a reference for any topic in Data Science (now AI).

Note: If you feel there is a topic not covered or there are better references available for certain topics, just let me know.

I have been working in data & technology for the last 13 years, and in my initial years I really loved books with names like ‘The Complete Reference’ or ‘Hitchhiker’s Guide To’. These books used to cover everything under the umbrella of the language/technology to sufficient depth that readers could start working and explore each topic further on their own.

My idea with this post is to cover everything under the sun in the Data Science field to a decent depth and connect the dots so that the big picture becomes clear in the reader’s mind. I do not plan to write everything from scratch; rather, I will refer to existing articles & posts wherever I can.

Before continuing with this post, if you are enjoying the content, check out my post on ‘How to launch your DS/AI Career in 12 weeks?’

How to launch your DS/AI career in 12 weeks?

So let's start with the table of contents for this post:

  • Context & Introduction

  • Data Science Prerequisites

  • Data Science Concepts

  • Machine Learning Algorithms

  • Deep Neural Networks

  • Data Science Process

  • Data Science Tools

  • Case Studies

  • Appendix

1. Context & Introduction

This section gives the context of this post and introduces you to data science field.

1.1 Context of ‘The Complete Reference’

Why have I written this long post, and what purpose does it serve? Get the context here before proceeding, if you haven’t already.

Data Science: The Complete Reference (Series)

1.2 Data Science Introduction

Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms, both structured and unstructured, similar to data mining.

This post touches upon the what, why & how of the data science field.

Data Science Introduction

2. Data Science Prerequisites

This section covers the prerequisites you need to learn to work on data science projects/problems effectively.

2.1 Linear Algebra

Linear algebra is the branch of mathematics concerning linear equations such as a₁x₁ + … + aₙxₙ = b, linear functions, and their representations through matrices and vector spaces. Linear algebra is central to almost all areas of mathematics.

In this post, you will get to know Linear Algebra in the context of Data Science.

Linear Algebra for Data Science
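The linked post goes deeper; as a minimal plain-Python sketch (all names illustrative), here is the matrix-vector product, the operation at the heart of linear models, PCA and neural network layers:

```python
# A matrix-vector product Av: each output entry is the dot product
# of one row of A with the vector v.

def mat_vec(A, v):
    """Multiply matrix A (a list of rows) by vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

A = [[1, 2],
     [3, 4]]
v = [1, 1]

print(mat_vec(A, v))  # [3, 7]
```

In practice you would use NumPy (`A @ v`), but the arithmetic is exactly this.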

2.2 Multivariate Calculus

Multivariate calculus is the extension of calculus in one variable to calculus with functions of several variables: the differentiation and integration of functions involving multiple variables, rather than just one.

This post jots down the topics related to Multivariate Calculus you need to be aware of before working on Data Science projects/problems.

Multivariate Calculus for Data Science
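The key idea data science borrows from multivariate calculus is the gradient: the vector of partial derivatives that gradient descent follows to minimize a loss. A small illustrative sketch using central finite differences (function and numbers chosen just for demonstration):

```python
# Approximate the gradient of f at `point` by nudging one coordinate
# at a time (central differences).

def gradient(f, point, h=1e-6):
    grads = []
    for i in range(len(point)):
        up = list(point); up[i] += h
        down = list(point); down[i] -= h
        grads.append((f(up) - f(down)) / (2 * h))
    return grads

# f(x, y) = x^2 * y  ->  df/dx = 2xy, df/dy = x^2
f = lambda p: p[0] ** 2 * p[1]

print(gradient(f, [3.0, 2.0]))  # approximately [12.0, 9.0]
```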

2.3 Probability & Statistics

Probability is the measure of the likelihood that an event will occur. Statistics is a branch of mathematics dealing with data collection, organization, analysis, interpretation and presentation.

Probability & Statistics are important areas to be covered if you want to know how algorithms actually work. This post series intuitively covers all related topics.

Probability & Statistics for Data Science (Series)
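As a taste of both topics, here is a short sketch using only the Python standard library: a coin-flip simulation illustrating the law of large numbers, and basic descriptive statistics (the sample values are arbitrary):

```python
import random
import statistics

# The empirical mean of many Bernoulli trials converges to the
# true probability (law of large numbers).
random.seed(42)
flips = [random.random() < 0.5 for _ in range(10_000)]
print(sum(flips) / len(flips))  # close to 0.5

# Descriptive statistics on a small sample
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(statistics.mean(data))    # 5
print(statistics.pstdev(data))  # 2.0
```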

2.4 Language/Tools

To work on data science problems, you need to learn some languages/tools. This post highlights relevant tools.

Top-Rated Development Tools for Machine Learning

3. Data Science Concepts

3.1 Terminology (AI, DS, ML, DL)

What are artificial intelligence, data science, machine learning & deep learning, and how do these terms differ? We keep hearing these terms frequently and interchangeably; let's learn the terminology in this post.

Explaining the terms AI, ML, DL, DS

3.2 Supervised Learning (Classification, Regression)

Within data science, there are sub-fields which solve specific type of problems. Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples.

Want to learn more about supervised learning? Read this post.

Supervised Machine Learning: Regression Vs Classification

3.3 Unsupervised Learning (Clustering, Anomaly Detection)

Unsupervised learning is another sub-field where we apply machine learning without having a target to map with. Unsupervised learning is a type of machine learning algorithm used to draw inferences from data-sets consisting of input data without labeled responses.

You will learn more about unsupervised learning in this post.

Unsupervised learning demystified

3.4 Reinforcement Learning

Reinforcement learning is an area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.

Here is a post that introduces you to reinforcement learning.

An introduction to Reinforcement Learning
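To make "actions, rewards, and learning from them" concrete, here is a toy multi-armed bandit sketch (arm names and reward values are invented for illustration). The agent keeps a value estimate per action, mostly exploits the best estimate, occasionally explores, and ends up preferring the highest-reward arm:

```python
import random

# Toy bandit: three actions with hidden (here deterministic) rewards.
rewards = {"a": 1.0, "b": 5.0, "c": 2.0}
q = {action: 0.0 for action in rewards}        # agent's value estimates
counts = {action: 0 for action in rewards}

for action in rewards:                         # try each action once
    counts[action] = 1
    q[action] = rewards[action]

random.seed(0)
for step in range(200):
    if random.random() < 0.1:                  # explore occasionally
        action = random.choice(list(rewards))
    else:                                      # otherwise exploit
        action = max(q, key=q.get)
    counts[action] += 1
    # incremental-average update of the value estimate
    q[action] += (rewards[action] - q[action]) / counts[action]

print(max(q, key=q.get))  # "b", the highest-reward arm
```

Real reinforcement learning adds states, stochastic rewards and long-horizon credit assignment, but this epsilon-greedy loop is the smallest version of the idea.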

3.5 Natural Language Processing (NLP)

Natural language processing is a sub-field of computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human languages, in particular how to program computers to process and analyze large amounts of natural language data.

Learn more about natural language processing in this post.

Natural Language Processing is Fun!
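The first step of most NLP pipelines is turning raw text into numbers. A minimal bag-of-words sketch using only the standard library (the tokenizer regex is a deliberately crude stand-in for a real one):

```python
import re
from collections import Counter

# Tokenize text and count word frequencies: a bag-of-words model.
def bag_of_words(text):
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

counts = bag_of_words("The cat sat on the mat. The cat slept.")
print(counts.most_common(2))  # [('the', 3), ('cat', 2)]
```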

3.6 Deep Learning (CNN, RNN, LSTM, GAN)

Deep learning is part of a broader family of machine learning methods based on learning data representations, as opposed to task-specific algorithms. Learning can be supervised, semi-supervised or unsupervised.

Here is the quick guide to get you started with deep learning.

Want to know how Deep Learning works? Here’s a quick guide for everyone.

4. Machine Learning Algorithms

This section covers major machine learning algorithms used in data science space.

4.1 Decision Trees

A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements.

This post covers the intrinsic details of how a decision tree works.

Decision Trees in Machine Learning
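The core move a decision tree makes is picking the split that best separates the classes, then repeating recursively. Here is a sketch of just one such split, a depth-1 "stump" on a single feature (data invented for illustration):

```python
# A decision stump: find the threshold that minimizes
# misclassifications when predicting 1 for x >= threshold.

def best_split(xs, ys):
    best_t, best_err = None, float("inf")
    for t in xs:
        err = sum((x >= t) != y for x, y in zip(xs, ys))
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

xs = [1, 2, 3, 10, 11, 12]
ys = [0, 0, 0, 1, 1, 1]
print(best_split(xs, ys))  # (10, 0): "x >= 10" classifies perfectly
```

A full tree (and libraries like scikit-learn) uses impurity measures such as Gini or entropy instead of raw error, and applies this search recursively on each branch.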

4.2 Random Forest

Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes or mean prediction of the individual trees.

Let's learn more about random forests in this post.

The Random Forest Algorithm

4.3 Linear Regression

Linear regression is a linear approach to modelling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables).

Want a detailed view of linear regression? Refer to this post.

Linear Regression — Detailed View
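For one explanatory variable, the least-squares solution has a closed form: the slope is the covariance of x and y divided by the variance of x. A minimal sketch (toy data chosen so the fit is exact):

```python
# Simple linear regression: slope = cov(x, y) / var(x),
# intercept = mean(y) - slope * mean(x).

def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]          # exactly y = 2x + 1
print(fit_line(xs, ys))    # (2.0, 1.0)
```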

4.4 Bias-Variance Tradeoff

Bias occurs when an algorithm has limited flexibility to learn the true signal from a data-set. Variance refers to an algorithm’s sensitivity to specific sets of training data. An optimized data science model tries to find a balance between both.

The following post covers the bias-variance trade-off topic in detail.

Understanding the Bias-Variance Tradeoff

4.5 Regularization (L1/L2)

In data science, regularization is the process of adding information in order to solve an ill-posed problem or to prevent overfitting. Regularization applies to objective functions in ill-posed optimization problems.

Let's learn more about regularization in this post.

Regularization in Machine Learning
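To see the shrinkage effect directly, here is a sketch of L2 (ridge) regularization for the one-feature regression above: adding the penalty strength lambda to the denominator pulls the fitted slope toward zero, trading a little bias for lower variance (toy data, illustrative lambda values):

```python
# Ridge regression for one feature: slope = Sxy / (Sxx + lam).
# lam = 0 recovers ordinary least squares; larger lam shrinks the slope.

def ridge_slope(xs, ys, lam):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / (sxx + lam)

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
print(ridge_slope(xs, ys, 0.0))   # 2.0  (unregularized)
print(ridge_slope(xs, ys, 5.0))   # 1.0  (shrunk toward zero)
```

L1 (lasso) regularization penalizes the absolute value of the weights instead, which can drive some weights exactly to zero.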

4.6 Logistic Regression

The logistic model is a widely used statistical model that, in its basic form, uses a logistic function to model a binary dependent variable; many more complex extensions exist.

This post gives a fair understanding of logistic regression.

Understanding Logistic Regression
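The heart of the model is the logistic (sigmoid) function, which squashes any real number into (0, 1) so outputs can be read as probabilities of the positive class. A sketch with hand-picked (illustrative) weight and bias:

```python
import math

# The sigmoid squashes any real z into (0, 1).
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# A fitted model would learn the weight and bias; here they are
# hand-picked: p(y=1 | x) = sigmoid(2x - 1).
predict = lambda x: sigmoid(2 * x - 1)

print(sigmoid(0))   # 0.5: the decision boundary
print(predict(3))   # close to 1 -> confident positive
print(predict(-3))  # close to 0 -> confident negative
```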

4.7 k-Nearest Neighbours

The k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space.

This post covers the algorithm in the context of machine learning.

Machine Learning Basics with the K-Nearest Neighbors Algorithm
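k-NN is simple enough to write from scratch: classify a point by majority vote among the k closest labeled examples. A sketch with invented 2-D points and labels:

```python
from collections import Counter

# Classify `query` by majority vote among the k nearest training points.
def knn_predict(train, query, k=3):
    """train: list of ((x, y), label) pairs; query: an (x, y) point."""
    dist = lambda p, q: (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((0, 0), "blue"), ((1, 0), "blue"), ((0, 1), "blue"),
         ((5, 5), "red"), ((6, 5), "red"), ((5, 6), "red")]

print(knn_predict(train, (1, 1)))  # "blue"
print(knn_predict(train, (5, 4)))  # "red"
```

Note there is no training step at all: k-NN simply memorizes the data, which is what "non-parametric" means here.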

4.8 Support Vector Machines

In machine learning, support-vector machines are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis.

Learn more about support vector machines here.

Support Vector Machine — Introduction to Machine Learning Algorithms

4.9 k-Means Clustering

K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.

This post covers k-means clustering algorithm from A to Z.

K-Means Clustering: From A to Z
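The standard algorithm (Lloyd's) alternates two steps: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. A one-dimensional sketch with made-up data and starting centroids:

```python
# Lloyd's algorithm for k-means in one dimension.
def kmeans_1d(points, centroids, iters=10):
    for _ in range(iters):
        clusters = {c: [] for c in centroids}
        for p in points:                                 # assignment step
            nearest = min(centroids, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        centroids = [sum(ps) / len(ps) if ps else c      # update step
                     for c, ps in clusters.items()]
    return sorted(centroids)

points = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5]
print(kmeans_1d(points, centroids=[0.0, 5.0]))  # close to [1.0, 10.0]
```

Real k-means works in any number of dimensions and typically restarts from several random initializations, since the result depends on where the centroids start.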

4.10 Anomaly Detection

In data science, anomaly detection (also outlier detection) is the identification of rare items, events or observations which raise suspicions by differing significantly from the majority of the data.

This post gives you an introduction to anomaly detection.

Introduction to Anomaly Detection
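The simplest statistical approach is the z-score: flag points that lie many standard deviations from the mean. A sketch on invented sensor readings (the threshold of 2 is illustrative; 3 is a common default):

```python
import statistics

# Flag points whose z-score exceeds a threshold.
def find_anomalies(data, threshold=3.0):
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)
    return [x for x in data if abs(x - mu) / sigma > threshold]

readings = [10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 55.0]
print(find_anomalies(readings, threshold=2.0))  # [55.0]
```

This assumes roughly normal data with a single mode; density-based and isolation-based methods handle messier distributions.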

4.11 Neural Nets

Artificial neural networks are computing systems inspired by the biological neural networks that constitute animal brains. The neural network itself is not an algorithm, but rather a framework for many different machine learning algorithms to work together and process complex data inputs.

Know what, how & why of neural networks here.

Understanding Neural Networks: What, How and Why?

Types of Optimization Algorithms used in Neural Networks and Ways to Optimize Gradient Descent
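Each artificial neuron is just a weighted sum passed through a nonlinearity. Here is a tiny forward pass with hand-picked (not learned) weights that computes XOR, something no single linear model can do:

```python
import math

# One neuron: weighted sum of inputs, plus bias, through a sigmoid.
def neuron(inputs, weights, bias):
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Two hidden neurons plus an output neuron compute XOR.
def xor_net(x1, x2):
    h1 = neuron([x1, x2], [20, 20], -10)    # acts like OR
    h2 = neuron([x1, x2], [-20, -20], 30)   # acts like NAND
    out = neuron([h1, h2], [20, 20], -30)   # acts like AND
    return round(out)

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, "->", xor_net(a, b))  # prints the XOR truth table
```

In real networks these weights are not hand-picked but learned by backpropagation using the optimization algorithms covered in the post above.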

5. Deep Neural Networks

This section highlights major deep learning frameworks used in data science space to solve various specific problems.

5.1 Deep Neural Networks

A deep neural network is a neural network with a certain level of complexity, typically one with more than two layers. Deep neural networks use sophisticated mathematical modeling to process data in complex ways.

Let's learn more about deep neural networks in this post.

Introducing Deep Learning and Neural Networks — Deep Learning for Rookies (1)

5.2 Convolutional Neural Networks

In deep learning, a convolutional neural network is a class of deep neural networks, most commonly applied to analyzing visual imagery. CNNs use a variation of multilayer perceptrons designed to require minimal preprocessing.

This is an intuitive guide to convolutional neural networks.

An intuitive guide to Convolutional Neural Networks
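The operation that gives CNNs their name is easy to sketch: slide a small filter over an image and take dot products at every position. Here a tiny hand-made filter (illustrative, not learned) lights up where pixel intensity changes horizontally:

```python
# "Valid" 2-D convolution (really cross-correlation, as in most
# deep learning libraries): slide the kernel over the image and
# take dot products.

def conv2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            row.append(sum(image[i + di][j + dj] * kernel[di][dj]
                           for di in range(kh) for dj in range(kw)))
        out.append(row)
    return out

image = [[0, 0, 1, 1],      # dark on the left, bright on the right
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
kernel = [[-1, 1]]          # responds to horizontal intensity change

print(conv2d(image, kernel))  # [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
```

A CNN learns the kernel values from data and stacks many such filter layers, interleaved with pooling and nonlinearities.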

5.3 Recurrent Neural Networks

A recurrent neural network is a class of artificial neural network where connections between nodes form a directed graph along a temporal sequence. This allows it to exhibit temporal dynamic behavior. Unlike feedforward neural networks, RNNs can use their internal state to process sequences of inputs.

To learn more about recurrent neural networks, follow this illustrated guide.

Illustrated Guide to Recurrent Neural Networks
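The essence of an RNN fits in one loop: fold a sequence into a running hidden state, reusing the same weights at every step. A scalar sketch with arbitrary illustrative weights:

```python
import math

# A one-unit recurrent network: the new hidden state depends on the
# current input and the previous hidden state (the "memory").
def rnn(sequence, w_in=0.5, w_rec=0.8):
    h = 0.0
    for x in sequence:
        h = math.tanh(w_in * x + w_rec * h)
    return h

print(rnn([1.0, 0.0, 0.0]))  # > 0: the first input still echoes in the state
print(rnn([0.0, 0.0, 0.0]))  # 0.0: no input, nothing to remember
```

Because the state is repeatedly squashed and multiplied, the echo of early inputs fades quickly; that vanishing-gradient problem is what LSTMs (next section) were designed to fix.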

5.4 LSTMs

Long Short Term Memory networks — usually just called “LSTMs” — are a special kind of RNN, capable of learning long-term dependencies. They are explicitly designed to avoid the long-term dependency problem.

Want to know more about LSTMs and how they differ from RNNs? Refer to this post.

Recurrent Neural Networks and LSTM

5.5 Autoencoders

An autoencoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. The aim of an autoencoder is to learn a representation for a set of data, typically for dimensionality reduction, by training the network to ignore signal “noise.”

Get an inside view of autoencoders in this post.

Deep inside: Autoencoders

6. Data Science Process

This section touches upon main processes used in data science projects.

6.1 Business Understanding

Before you can even start on a data science project, it is critical that you understand the problem you are trying to solve. Data scientists should keep asking "why". They need to ensure that every decision made in the company is supported by data and has a high probability of achieving results.

This post gives you a view on how to convert a business question into a data science task.

From Business Question to Data Science Task

6.2 Data Understanding

The data understanding stage is like the brainstorming of data analysis. This is where you understand the patterns and bias in your data. It could involve pulling up and analyzing a random subset of the data using Pandas, plotting a histogram or distribution curve to see the general trend, or even creating an interactive visualization that lets you dive down into each data point and explore the story behind the outliers.

The following post provides a gentle introduction to EDA.

A Gentle Introduction to Exploratory Data Analysis

6.3 Data Preparation

Data preparation or wrangling is the most time-consuming step of all. This is especially true in big data projects, which often involve terabytes of data.

You can refer to this post for a comprehensive introduction to data wrangling.

A Comprehensive Introduction to Data Wrangling - Springboard Blog

6.4 Feature Engineering

Feature engineering is the process of using domain knowledge to transform your raw data into informative features that represent the business problem you are trying to solve. This stage will directly influence the accuracy of the predictive model you construct in the next stage.

Want to know more about feature engineering? Go through this post.

Feature Engineering: What Powers Machine Learning
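As a small concrete sketch of the idea (the record fields and derived features here are invented for illustration), raw transaction data becomes model-ready features by encoding domain knowledge about what might drive the target:

```python
from datetime import date

# Turn a raw record into informative features: a calendar field
# becomes day-of-week and a weekend flag, and two raw numbers
# become a more meaningful ratio.
def make_features(record):
    d = date.fromisoformat(record["date"])
    return {
        "day_of_week": d.weekday(),            # 0 = Monday
        "is_weekend": d.weekday() >= 5,
        "price_per_unit": record["total"] / record["quantity"],
    }

raw = {"date": "2020-02-01", "total": 90.0, "quantity": 3}
print(make_features(raw))
# {'day_of_week': 5, 'is_weekend': True, 'price_per_unit': 30.0}
```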

6.5 Modeling & Validation

Predictive modeling is where machine learning finally comes into your data science project. Based on the questions you asked in the business understanding stage, this is where you decide which model to pick for your problem. Once you've trained your model, it is critical that you evaluate its success. A process called k-fold cross-validation is commonly used to measure the accuracy of a model.

Learn more about cross-validation here.
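The mechanics of k-fold cross-validation are simple to sketch: split the data indices into k folds, let each fold take one turn as the validation set while the rest train the model, and average the k scores (this version uses contiguous folds and assumes n is divisible by k; libraries shuffle and handle remainders):

```python
# Generate (train_indices, val_indices) pairs for k-fold
# cross-validation over n examples.
def kfold_splits(n, k):
    fold_size = n // k
    indices = list(range(n))
    for i in range(k):
        val = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, val

for train, val in kfold_splits(6, 3):
    print("train:", train, "val:", val)
# every index appears in exactly one validation fold
```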


6.6 Deployment & Hosting

Training a model is one thing, but deploying your model to solve a business problem is another. Deploying into production can mean something different for each of you.

For some people, putting a model into production means making it accessible to anyone who can use it to calculate, measure or see something. For others, it means having the model do something or interact with customers.

Take a look at one of the best options for deploying your deep learning models in this post.

Deploying a Machine Learning Model as a REST API

6.7 Monitoring & Maintenance

Model monitoring & maintenance is nobody's favorite activity, but it is something you really must do, and in fact plan for, before you even build your model.

Let's understand why your models need maintenance in this post.

Why your Models need Maintenance

7. Data Science Tools

This section points towards major languages/tools used in data science field across industries.

7.1 SQL

Basic SQL For Data Analysis

7.2 Python

Python for Data Science: From Scratch (Part I)

7.3 R

5 Lines of Code to Convince You to Learn R

7.4 Keras

Introduction to Deep Learning with Keras

8. Case Studies

This section covers basic case studies one can work on to get their hands dirty with data science problems.

8.1 Classification

In this challenge, Kaggle asks you to complete the analysis of what sorts of people were likely to survive the Titanic disaster. In particular, you are asked to apply the tools of machine learning to predict which passengers survived the tragedy.

Titanic: Machine Learning from Disaster

8.2 Regression

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

House Prices: Advanced Regression Techniques

8.3 Natural Language Processing

This Rotten Tomatoes movie review dataset is a corpus of movie reviews used for sentiment analysis. You are asked to label phrases on a scale of five values: negative, somewhat negative, neutral, somewhat positive, positive.

Movie Review Sentiment Analysis (Kernels Only)

8.4 Deep Neural Network

MNIST (“Modified National Institute of Standards and Technology”) is the de facto “hello world” dataset of computer vision. In this competition, your goal is to correctly identify digits from a dataset of tens of thousands of handwritten images.

Digit Recognizer

9. Appendix

This section highlights some additional but important topics which a data science starter should be aware of.

9.1 Data Basics

Data Basics

9.2 Data Sources

A Beginner’s Guide to Data Engineering — Part I

9.3 Data Pipelines

Building a Data Pipeline from Scratch

Ankit Rathi is an AI architect, published author & well-known speaker. His interest lies primarily in building end-to-end AI applications/products following best practices of Data Engineering and Architecture.

Why don't you connect with Ankit on Twitter, LinkedIn or Instagram?




©  2020  Ankit Rathi