Data Science: The Complete Reference (Series)
Great books have been written and excellent courses have been launched in Data Science, yet I feel there is nothing like a ‘The Complete Reference’ or ‘Hitchhiker’s Guide To’ kind of reading or reference material. You can consider this post a reference for any topic in Data Science (now AI).
Note: If you feel there is a topic not covered or there are better references available for certain topics, just let me know.
Visit ankitrathi.com now to:
— read my blog posts on various topics of AI/ML
— keep a tab on the latest & relevant news/articles from the AI/ML world daily
— find free & useful AI/ML resources
— buy my books at a discounted price
— know more about me and what I am up to these days
I have been working in data & technology for the last 13 years, and in my initial years I really loved books with titles like ‘The Complete Reference’ or ‘Hitchhiker’s Guide To’. These books covered everything under the umbrella of the language/technology in sufficient depth that the reader could start working and explore each topic further on their own.
My idea with this post is to cover everything under the sun in the Data Science field to a decent depth and connect the dots, so that the big picture becomes clear in the reader’s mind. I do not plan to write everything from scratch; rather, I will refer to existing articles & posts wherever I can.
Before continuing with this post, if you are enjoying the content, check out my post on ‘How to launch your DS/AI Career in 12 weeks?’
So let’s start with the table of contents for this post:
Context & Introduction
Data Science Prerequisites
Data Science Concepts
Machine Learning Algorithms
Deep Neural Networks
Data Science Process
Data Science Tools
Case Studies
Additional Topics
1. Context & Introduction
This section gives the context of this post and introduces you to the data science field.
1.1 Context of ‘The Complete Reference’
Why have I written this long post, and what purpose does it serve? Get the context here before proceeding, if you haven’t already.
1.2 Data Science Introduction
Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms, both structured and unstructured, similar to data mining.
This post touches upon the what, why & how of the data science field.
2. Data Science Prerequisites
This section covers the prerequisites you need to learn to work on data science projects/problems effectively.
2.1 Linear Algebra
Linear algebra is the branch of mathematics concerning linear equations, linear functions, and their representations through matrices and vector spaces. Linear algebra is central to almost all areas of mathematics.
In this post, you will get to know Linear Algebra in the context of Data Science.
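To make this concrete, here is a tiny NumPy sketch (my own toy example with made-up numbers, not taken from any referenced post) showing two operations you will use constantly in data science: applying a matrix to a vector and solving a linear system:

```python
import numpy as np

# A 2x2 system of linear equations: 2x + y = 5 and x + 3y = 10
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([5.0, 10.0])

# Matrix-vector product: apply the linear map A to a vector
v = np.array([1.0, 1.0])
Av = A @ v                    # [3.0, 4.0]

# Solve the system A x = b for x
x = np.linalg.solve(A, b)     # [1.0, 3.0], i.e. x = 1, y = 3
```

Feature matrices, weight vectors and covariance matrices in machine learning are all manipulated with exactly these primitives.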
2.2 Multivariate Calculus
Multivariate calculus is the extension of calculus in one variable to calculus with functions of several variables: the differentiation and integration of functions involving multiple variables, rather than just one.
This post jots down the topics related to Multivariate Calculus you need to be aware of before working on Data Science projects/problems.
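The single most used idea from multivariate calculus in data science is the gradient, the vector of partial derivatives that gradient descent follows. Here is a minimal sketch (my own illustration) that approximates a gradient numerically with central differences:

```python
def grad(f, point, h=1e-5):
    """Approximate the gradient of f at `point` using central differences,
    nudging one coordinate at a time."""
    g = []
    for i in range(len(point)):
        forward = list(point)
        backward = list(point)
        forward[i] += h
        backward[i] -= h
        g.append((f(forward) - f(backward)) / (2 * h))
    return g

def f(p):
    x, y = p
    return x**2 + 3*y      # analytic gradient: (2x, 3)

g = grad(f, [2.0, 1.0])    # approximately [4.0, 3.0]
```

In practice libraries compute gradients analytically or via automatic differentiation, but numerical gradients like this are a standard way to sanity-check them.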
2.3 Probability & Statistics
Probability is the measure of the likelihood that an event will occur. Statistics is a branch of mathematics dealing with data collection, organization, analysis, interpretation and presentation.
Probability & Statistics are important areas to cover if you want to know how algorithms actually work. This post series covers all the related topics intuitively.
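As a quick taste, here is a small sketch (toy data of my own, using only the standard library) of two of the most basic ideas: summary statistics of a sample, and the law of large numbers for a simulated fair coin:

```python
import random
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
mu = statistics.mean(data)        # 5.0
sigma = statistics.pstdev(data)   # population standard deviation: 2.0

# Law of large numbers: the empirical frequency of heads approaches
# the true probability 0.5 as the number of flips grows.
random.seed(42)
flips = [random.random() < 0.5 for _ in range(100_000)]
freq = sum(flips) / len(flips)    # close to 0.5
```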
2.4 Languages & Tools
To work on data science problems, you need to learn some languages/tools. This post highlights the relevant ones.
3. Data Science Concepts
3.1 Terminology (AI, DS, ML, DL)
What are artificial intelligence, data science, machine learning & deep learning, and how do these terms differ? We keep hearing these terms frequently and interchangeably; let’s learn the terminology in this post.
3.2 Supervised Learning (Classification, Regression)
Within data science, there are sub-fields that solve specific types of problems. Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples.
Want to learn more about supervised learning? Read this post.
3.3 Unsupervised Learning (Clustering, Anomaly Detection)
Unsupervised learning is another sub-field, in which we apply machine learning without a target to map to. It is a type of machine learning used to draw inferences from data-sets consisting of input data without labeled responses.
You get to learn more about unsupervised learning in this post.
3.4 Reinforcement Learning
Reinforcement learning is an area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.
Here is a post that introduces you to reinforcement learning.
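To see the reward-maximization idea in code, here is a minimal tabular Q-learning sketch (a toy environment of my own invention: a 4-state corridor where moving right from the last-but-one state earns reward 1):

```python
import random

N_STATES, ACTIONS = 4, [0, 1]            # action 0 = left, 1 = right

def step(state, action):
    """Deterministic corridor dynamics; reaching the last state ends the episode."""
    nxt = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward, nxt == N_STATES - 1

Q = [[0.0, 0.0] for _ in range(N_STATES)]
alpha, gamma, eps = 0.5, 0.9, 0.2        # learning rate, discount, exploration
random.seed(0)

for _ in range(500):                     # episodes
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda a: Q[s][a])
        s2, r, done = step(s, a)
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

# The greedy policy should be "go right" in every non-terminal state
policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES)]
```

Real problems replace this table with a function approximator (e.g. a neural network), but the update rule is the same.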
3.5 Natural Language Processing (NLP)
Natural language processing is a sub-field of computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human languages, in particular how to program computers to process and analyze large amounts of natural language data.
Learn more about natural language processing in this post.
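The very first step in most NLP pipelines is turning raw text into counts. Here is a crude bag-of-words sketch (my own illustration; real pipelines use proper tokenizers and handle punctuation, stop words, etc.):

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters, a crude first tokenizer."""
    return re.findall(r"[a-z']+", text.lower())

def bag_of_words(text):
    """Map a document to token counts, the simplest text representation."""
    return Counter(tokenize(text))

bow = bag_of_words("The cat sat on the mat. The cat slept.")
# bow["the"] == 3, bow["cat"] == 2
```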
3.6 Deep Learning (CNN, RNN, LSTM, GAN)
Deep learning is part of a broader family of machine learning methods based on learning data representations, as opposed to task-specific algorithms. Learning can be supervised, semi-supervised or unsupervised.
Here is the quick guide to get you started with deep learning.
4. Machine Learning Algorithms
This section covers major machine learning algorithms used in the data science space.
4.1 Decision Trees
A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements.
This post covers the intrinsic details of how a decision tree works.
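The core idea is easiest to see in a depth-1 tree, a “decision stump”: try every threshold split on a feature and keep the one that makes the fewest mistakes. A full tree simply applies this search recursively. A minimal sketch with made-up toy data:

```python
def fit_stump(xs, ys):
    """Find the single threshold split on one feature that misclassifies
    the fewest training points (a depth-1 decision tree)."""
    best = None
    for t in sorted(set(xs)):
        for left_label in (0, 1):
            right_label = 1 - left_label
            errors = sum(
                (left_label if x <= t else right_label) != y
                for x, y in zip(xs, ys)
            )
            if best is None or errors < best[0]:
                best = (errors, t, left_label)
    return best[1], best[2]              # threshold, label for x <= threshold

# Toy data: the label flips to 1 once x exceeds 5
xs = [1, 2, 3, 6, 7, 8]
ys = [0, 0, 0, 1, 1, 1]
t, left_label = fit_stump(xs, ys)
predict = lambda x: left_label if x <= t else 1 - left_label
```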
4.2 Random Forest
Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes or mean prediction of the individual trees.
Let’s learn more about random forests in this post.
4.3 Linear Regression
Linear regression is a linear approach to modelling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables).
Want a detailed view of linear regression? Refer to this post.
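For a single explanatory variable, ordinary least squares has a closed form: slope = cov(x, y) / var(x) and intercept = mean(y) − slope · mean(x). A minimal sketch on noiseless toy data (my own example), where OLS recovers the line exactly:

```python
def fit_line(xs, ys):
    """Ordinary least squares for one explanatory variable."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    slope = cov / var
    return slope, my - slope * mx

# Data generated from y = 2x + 1, so OLS should recover slope 2, intercept 1
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
```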
4.4 Bias-Variance Tradeoff
Bias occurs when an algorithm has limited flexibility to learn the true signal from a data-set. Variance refers to an algorithm’s sensitivity to specific sets of training data. An optimized data science model tries to find a balance between both.
The following post covers the bias-variance trade-off in detail.
4.5 Regularization (L1/L2)
In data science, regularization is the process of adding information in order to solve an ill-posed problem or to prevent overfitting. Regularization applies to objective functions in ill-posed optimization problems.
Let’s learn more about regularization in this post.
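L2 (ridge) regularization even has a closed form, which makes the shrinkage effect easy to demonstrate. A sketch with synthetic data of my own making: the same fit with and without the penalty, where the penalized coefficients end up strictly smaller in norm:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam * I)^-1 X^T y.
    The L2 penalty lam shrinks the coefficients toward zero."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([3.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=50)   # slightly noisy targets

w_ols = ridge_fit(X, y, lam=0.0)      # ordinary least squares
w_ridge = ridge_fit(X, y, lam=50.0)   # heavily penalized, shrunken weights
```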
4.6 Logistic Regression
The logistic model is a widely used statistical model that, in its basic form, uses a logistic function to model a binary dependent variable; many more complex extensions exist.
This post gives a fair understanding of logistic regression.
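The essence fits in a few lines: squash a linear score through the logistic (sigmoid) function and fit the weights by gradient descent on the log loss. A minimal one-feature sketch on toy data of my own:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy data: class 1 for positive x, class 0 for negative x
xs = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 1, 1]

# Fit weight w and bias b by gradient descent on the log loss
w, b, lr = 0.0, 0.0, 0.5
for _ in range(2000):
    gw = gb = 0.0
    for x, y in zip(xs, ys):
        err = sigmoid(w * x + b) - y    # gradient of log loss wrt the logit
        gw += err * x
        gb += err
    w -= lr * gw / len(xs)
    b -= lr * gb / len(xs)

predict = lambda x: 1 if sigmoid(w * x + b) >= 0.5 else 0
```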
4.7 k-Nearest Neighbours
The k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space.
This post covers the algorithm in the context of machine learning.
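Because k-NN has no training phase at all, it is one of the few algorithms you can implement completely in a few lines. A sketch with two made-up clusters:

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training
    points under squared Euclidean distance."""
    neighbours = sorted(
        train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], query))
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Two toy clusters: class 'a' near the origin, class 'b' near (5, 5)
train = [((0, 0), "a"), ((1, 0), "a"), ((0, 1), "a"),
         ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b")]
label = knn_predict(train, (0.5, 0.5), k=3)    # "a"
```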
4.8 Support Vector Machines
In machine learning, support-vector machines are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis.
Learn more about support vector machines here.
4.9 k-Means Clustering
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.
This post covers k-means clustering algorithm from A to Z.
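The standard way to compute k-means is Lloyd’s algorithm: alternate between assigning each point to its nearest centroid and moving each centroid to the mean of its cluster. A minimal sketch on two well-separated toy blobs (my own data):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm for 2-D points: alternate assignment and
    centroid-update steps for a fixed number of iterations."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # initialize from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: (p[0] - centroids[i][0]) ** 2
                                            + (p[1] - centroids[i][1]) ** 2)
            clusters[i].append(p)
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]               # keep old centroid if empty
            for i, c in enumerate(clusters)
        ]
    return centroids

# Two well-separated blobs around (0, 0) and (10, 10)
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
centroids = sorted(kmeans(points, k=2))          # one centroid per blob
```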
4.10 Anomaly Detection
In data science, anomaly detection (also outlier detection) is the identification of rare items, events or observations which raise suspicions by differing significantly from the majority of the data.
This post gives you an introduction to anomaly detection.
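The simplest statistical detector flags any point far from the mean in units of standard deviation (a z-score). A sketch on made-up sensor readings; note the threshold is deliberately modest because a large outlier in a small sample inflates the standard deviation itself:

```python
import statistics

def zscore_outliers(data, threshold=2.0):
    """Flag points more than `threshold` standard deviations from the
    sample mean, the simplest statistical anomaly detector."""
    mu = statistics.mean(data)
    sd = statistics.pstdev(data)
    return [x for x in data if abs(x - mu) / sd > threshold]

# Six normal readings near 10 and one faulty sensor value
readings = [10.1, 9.8, 10.0, 10.2, 9.9, 10.1, 55.0]
outliers = zscore_outliers(readings)             # [55.0]
```

Production systems use more robust variants (median/MAD, isolation forests), but the idea of “differing significantly from the majority” is the same.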
4.11 Neural Nets
Artificial neural networks are computing systems inspired by the biological neural networks that constitute animal brains. The neural network itself is not an algorithm, but rather a framework for many different machine learning algorithms to work together and process complex data inputs.
Learn the what, how & why of neural networks here.
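The smallest neural network is a single neuron, the perceptron: a weighted sum passed through a step activation, trained with the classic error-correction rule. A sketch (my own toy example) that learns the logical AND function, which is linearly separable:

```python
def train_perceptron(data, epochs=20, lr=0.1):
    """A single artificial neuron: weighted sum + step activation,
    trained with the classic perceptron update rule."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), target in data:
            out = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = target - out               # -1, 0 or +1
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b

# Learn logical AND from its four labeled examples
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(data)
predict = lambda x1, x2: 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
```

Stacking many such units in layers, with smooth activations and gradient-based training, gives the networks covered in the next section.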
5. Deep Neural Networks
This section highlights major deep learning architectures used in the data science space to solve various specific problems.
5.1 Deep Neural Networks
A deep neural network is a neural network with a certain level of complexity, generally one with more than two layers. Deep neural networks use sophisticated mathematical modeling to process data in complex ways.
Let’s learn more about deep neural networks in this post.
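To demystify the layered structure, here is the forward pass of a tiny fully connected network in NumPy (my own illustration). The weights are hand-picked so that the hidden ReLU layer computes relu(x) and relu(−x), which the output layer sums, so the whole network computes |x|:

```python
import numpy as np

def forward(x, params):
    """Forward pass of a small fully connected network:
    one hidden layer with ReLU, then a linear output layer."""
    W1, b1, W2, b2 = params
    h = np.maximum(0.0, W1 @ x + b1)     # ReLU non-linearity
    return W2 @ h + b2

# Hand-picked weights: hidden units compute relu(x) and relu(-x),
# and the output adds them, so the network computes |x|.
W1 = np.array([[1.0], [-1.0]])
b1 = np.zeros(2)
W2 = np.array([[1.0, 1.0]])
b2 = np.zeros(1)
params = (W1, b1, W2, b2)

y = forward(np.array([-3.0]), params)    # [3.0]
```

Training replaces the hand-picked weights with ones learned by backpropagation; the forward computation stays exactly this shape, just with more and wider layers.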
5.2 Convolutional Neural Networks
In deep learning, a convolutional neural network is a class of deep neural networks, most commonly applied to analyzing visual imagery. CNNs use a variation of multilayer perceptrons designed to require minimal preprocessing.
This is an intuitive guide to convolutional neural networks.
5.3 Recurrent Neural Networks
A recurrent neural network is a class of artificial neural network where connections between nodes form a directed graph along a temporal sequence. This allows it to exhibit temporal dynamic behavior. Unlike feedforward neural networks, RNNs can use their internal state to process sequences of inputs.
To learn more about recurrent neural networks, follow this illustrated guide.
5.4 Long Short-Term Memory Networks
Long Short-Term Memory networks — usually just called “LSTMs” — are a special kind of RNN, capable of learning long-term dependencies. They are explicitly designed to avoid the long-term dependency problem.
Want to know more about LSTMs and how they differ from RNNs? Refer this post.
5.5 Autoencoders
An autoencoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. The aim of an autoencoder is to learn a representation for a set of data, typically for dimensionality reduction, by training the network to ignore signal “noise.”
Get an inside view of autoencoders in this post.
6. Data Science Process
This section touches upon the main processes used in data science projects.
6.1 Business Understanding
Before you can even start on a data science project, it is critical that you understand the problem you are trying to solve. Data scientists should keep asking the whys. They need to ensure that every decision made in the company is supported by data and is likely to achieve results.
This post gives you a view on how to convert a business question into a data science task.
6.2 Data Understanding
The data understanding stage is like the brainstorming of data analysis. This is where you understand the patterns and bias in your data. It could involve pulling up and analyzing a random subset of the data using Pandas, plotting a histogram or distribution curve to see the general trend, or even creating an interactive visualization that lets you dive down into each data point and explore the story behind the outliers.
The following post provides a gentle introduction to EDA.
6.3 Data Preparation
Data preparation, or wrangling, is the most time-consuming step of all; this is especially true in big data projects, which often involve terabytes of data.
You can refer to this post for a comprehensive introduction to data wrangling.
6.4 Feature Engineering
Feature engineering is the process of using domain knowledge to transform your raw data into informative features that represent the business problem you are trying to solve. This stage will directly influence the accuracy of the predictive model you construct in the next stage.
Want to know more about feature engineering? Go through this post.
6.5 Modeling & Validation
Predictive modeling is where the machine learning finally comes into your data science project. Based on the questions you asked in the business understanding stage, this is where you decide which model to pick for your problem. Once you’ve trained your model, it is critical that you evaluate its success. A process called k-fold cross validation is commonly used to measure the accuracy of a model.
Learn more about cross-validation here.
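The mechanics of k-fold cross validation are simple: split the indices into k folds and let each fold take one turn as the held-out test set. A minimal sketch (my own, using round-robin folds; libraries like scikit-learn also offer shuffled and stratified variants):

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k round-robin folds; each fold serves
    once as the held-out test set while the rest form the training set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# 10 samples, 5 folds: every sample is tested exactly once
splits = list(kfold_indices(10, 5))
```

The model is trained k times, once per split, and the k test scores are averaged for a more reliable accuracy estimate than a single train/test split.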
6.6 Deployment & Hosting
Training a model is one thing, but deploying it to solve a business problem is another. Deploying into production can mean something different for everyone.
For some people, putting a model into production means making it accessible to anyone who can use it to calculate, measure or see something. For others, it means having the model do something or interact with customers.
Take a look at one of the best options for deploying your deep learning models in this post.
6.7 Monitoring & Maintenance
Model monitoring & maintenance is not anyone’s favorite activity, but it is something that you really must do, and in fact plan for, before you even build your model.
Let’s understand why your models need maintenance in this post.
7. Data Science Tools
This section points towards major languages/tools used in the data science field across industries.
8. Case Studies
This section covers basic case studies one can work through to get their hands dirty with data science problems.
8.1 Binary Classification
In this Kaggle challenge, you are asked to complete the analysis of what sorts of people were likely to survive the Titanic disaster. In particular, you apply the tools of machine learning to predict which passengers survived the tragedy.
8.2 Regression
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
8.3 Natural Language Processing
This Rotten Tomatoes movie review dataset is a corpus of movie reviews used for sentiment analysis. You are asked to label phrases on a scale of five values: negative, somewhat negative, neutral, somewhat positive, positive.
8.4 Deep Neural Network
MNIST (“Modified National Institute of Standards and Technology”) is the de facto “hello world” dataset of computer vision. In this competition, your goal is to correctly identify digits from a dataset of tens of thousands of handwritten images.
9. Additional Topics
This section highlights some additional but important topics that a data science starter should be aware of.
9.1 Data Basics
9.2 Data Sources
9.3 Data Pipelines
Thank you for reading my post. I regularly write about Data & Technology on LinkedIn & Medium. If you would like to read my future posts then simply ‘Connect’ or ‘Follow’. Also, feel free to visit my webpage https://ankitrathi.com