Book Title
Welcome
Colophon
License
Contributors
I Background and Preparation
1
Programming Bootcamp
1.1
Programming fundamentals
1.1.1
Code components
1.1.2
Designing with pseudo-code
1.1.3
From pseudo-code to code that runs
1.2
Basics of R
1.2.1
Why Use R?
1.2.2
Installing R / R Studio
1.2.3
Test test test
1.2.4
Customize your RStudio
1.2.5
Elements of code in base R
1.2.6
tidyverse
1.3
Basics of python
1.3.1
Elements of code in python
1.3.2
pandas and numpy
1.4
Other programming frameworks
1.5
Short Examples - Programming in R
1.5.1
Packages and Libraries
1.5.2
Commonly Used Libraries
1.5.3
Help and Documentation
1.5.4
The R Workspace
1.5.5
Simple Data Manipulation
1.5.6
Exploring Data
1.5.7
A Word About NAs
1.6
Additional resources
1.6.1
Upgrade R and/or RStudio
2
Overview of Multivariate Calculus
3
Overview of Linear Algebra
4
A Survey of Optimization
4.1
Single-Objective Optimization Problem
4.1.1
Feasible and Optimal Solutions
4.1.2
Infeasible and Unbounded Problems
4.1.3
Possible Tasks Involving Optimization Problems
4.2
Calculus Sidebar and Lagrange Multipliers
4.3
Classification of Optimization Problems and Types of Algorithms
4.3.1
Classification
4.3.2
Algorithms
4.4
Linear Programming
4.4.1
Linear Programming Duality
4.4.2
Methods for Solving Linear Programming Problems
4.5
Mixed-Integer Linear Programming
4.5.1
Cutting Planes
4.6
A Sample of Useful Modeling Techniques
4.6.1
Activation
4.6.2
Disjunction
4.6.3
Soft Constraints
4.7
Software Solvers
5
Overview of Probability Theory
5.1
Basics
5.1.1
Sample Spaces and Events
5.1.2
Counting Techniques
5.1.3
Ordered Samples
5.1.4
Unordered Samples
5.1.5
Probability of an Event
5.1.6
Conditional Probability and Independent Events
5.1.7
Bayes’ Theorem
5.2
Discrete Distributions
5.2.1
Random Variables and Distributions
5.2.2
Expectation of a Discrete Random Variable
5.2.3
Binomial Distributions
5.2.4
Geometric Distributions
5.2.5
Negative Binomial Distribution
5.2.6
Poisson Distributions
5.2.7
Other Discrete Distributions
5.3
Continuous Distributions
5.3.1
Continuous Random Variables
5.3.2
Expectation of a Continuous Random Variable
5.3.3
Normal Distributions
5.3.4
Exponential Distributions
5.3.5
Gamma Distributions
5.3.6
Normal Approximation of the Binomial Distribution
5.3.7
Other Continuous Distributions
5.4
Joint Distributions
5.5
Central Limit Theorem and Sampling Distributions
5.5.1
Sampling Distributions
5.5.2
Central Limit Theorem
5.5.3
Sampling Distributions (Reprise)
6
Introductory Statistical Analysis
7
A Linear Regression Cheatsheet
8
Analysis of Variance and Design of Experiments
9
Survey Sampling Methods
10
Overview of Simulation Methods
11
A Primer of Time Series and Forecasting
12
Non-Technical Aspects of Quantitative Work
II Introduction to Data Science
13
Data Science Basics
13.1
Introduction
13.1.1
What Is Data?
13.1.2
From Objects and Attributes to Datasets
13.1.3
Data in the News
13.1.4
The Analog/Digital Data Dichotomy
13.2
Conceptual Frameworks for Data Work
13.2.1
Three Modeling Strategies
13.2.2
Information Gathering
13.2.3
Cognitive Biases
13.3
Ethics in the Data Science Context
13.3.1
The Need for Ethics
13.3.2
What Is/Are Ethics?
13.3.3
Ethics and Data Science
13.3.4
Guiding Principles
13.3.5
The Good, the Bad, and the Ugly
13.4
Analytics Workflow
13.4.1
The “Analytical” Method
13.4.2
Data Collection, Storage, Processing, and Modeling
13.4.3
Model Assessment and Life After Analysis
13.4.4
Automated Data Pipelines
13.5
Non-Technical Aspects of Data Work
13.5.1
The Data Science Framework
13.5.2
Multiple “I”s Approach to Quantitative Work
13.5.3
Roles and Responsibilities
13.5.4
Non-Technical Data Cheat Sheet
13.6
Getting Insight From Data
13.6.1
Asking the Right Questions
13.6.2
Structuring and Organizing Data
13.6.3
Basic Data Analysis Techniques
13.6.4
Common Statistical Procedures in R
13.6.5
Quantitative Methods
14
Data Preparation
14.1
Introduction
14.2
General Principles
14.2.1
Approaches to Data Cleaning
14.2.2
Pros and Cons
14.2.3
Tools and Methods
14.3
Data Quality
14.3.1
Common Error Sources
14.3.2
Detecting Invalid Entries
14.4
Missing Values
14.4.1
Missing Value Mechanisms
14.4.2
Imputation Methods
14.4.3
Multiple Imputation
14.5
Anomalous Observations
14.5.1
Anomaly Detection
14.5.2
Outlier Tests
14.5.3
Visual Outlier Detection
14.6
Data Transformations
14.6.1
Common Transformations
14.6.2
Box-Cox Transformations
14.6.3
Scaling
14.6.4
Discretizing
14.6.5
Creating Variables
14.7
Data Wrangling
15
Data Visualization
15.1
Data and Charts
15.1.1
Pre-Analysis Uses
15.1.2
Presenting Results
15.1.3
Multivariate Elements in Charts
15.1.4
Visualization Catalogue
15.2
A Word About Accessibility
15.3
Fundamental Principles of Analytical Design
15.3.1
Comparisons
15.3.2
Causality, Mechanism, Structure, Explanation
15.3.3
Multivariate Analysis
15.3.4
Integration of Evidence
15.3.5
Documentation
15.3.6
Content Counts Most of All
15.4
Basic Visualizations in R
15.4.1
Scatterplots
15.4.2
Barplots
15.4.3
Histograms
15.4.4
Curves
15.4.5
Boxplots
15.4.6
Other Examples
15.4.7
Exercises
15.5
Introduction to Dashboards
15.5.1
Dashboard Fundamentals
15.5.2
Dashboard Structure
15.5.3
Dashboard Design
15.5.4
Examples
15.6
ggplot2 Visualizations in R
15.6.1
Basics of ggplot2’s Grammar
15.6.2
ggplot2 Miscellanea
15.6.3
Examples
16
Machine Learning 101
16.1
Statistical Learning
16.1.1
Types of Learning
16.1.2
Data Science and Machine Learning Tasks
16.2
Association Rules Mining
16.2.1
Overview
16.2.2
Generating Rules
16.2.3
The A Priori Algorithm
16.2.4
Validation
16.2.5
Case Study: Danish Medical Data
16.2.6
Toy Example: Titanic Dataset
16.3
Classification and Value Estimation
16.3.1
Overview
16.3.2
Classification Algorithms
16.3.3
Decision Trees
16.3.4
Performance Evaluation
16.3.5
Case Study: Minnesota Tax Audit
16.3.6
Toy Example: Kyphosis Dataset
16.4
Clustering
16.4.1
Overview
16.4.2
Clustering Algorithms
16.4.3
k-Means
16.4.4
Clustering Validation
16.4.5
Case Study: Livehoods
16.4.6
Toy Example: Iris Dataset
16.5
Issues and Challenges
16.5.1
Bad Data
16.5.2
Overfitting/Underfitting
16.5.3
Appropriateness and Transferability
16.5.4
Myths and Mistakes
16.6
R Examples
16.6.1
Association Rules Mining: Titanic Dataset
16.6.2
Classification: Kyphosis Dataset
16.6.3
Clustering: Iris Dataset
17
Data Engineering and Data Management
18
Reporting and Deployment
III Intermediate Data Science
19
Web Scraping and Automated Data Collection
20
Text Mining and Text Analysis
21
Regression and Value Estimation
22
Spotlight on Classification
23
Spotlight on Clustering
24
Feature Selection and Dimension Reduction
IV Advanced Data Science
25
Anomaly Detection and Outlier Analysis
26
Bayesian Data Analysis
27
Network Data Analysis
28
Recommender Engines
29
An Introduction to Deep Learning
30
Natural Language Processing
V Special Topics in Data Science
31
Big Data and Parallel Computing
32
Reinforcement Learning
33
Computer Vision and Image Analysis
34
Data Science with Streams
35
Frequent Event Mining
36
Ranking and Ordering
VI Data Science Through Examples
37
Music Dataset
37.1
Spotify Music Datasets
38
Web Scraping and Automated Data Collection
39
Data Engineering and Pipelines
40
Data Understanding
40.1
Data Structure and Summarization
40.2
Data Preparation and Data Cleaning
40.3
Exploratory Data Analysis
41
Feature Selection and Dimension Reduction
42
Regression and Value Estimation
43
Machine Learning Tasks
43.1
Association Rules Mining
43.2
Clustering
43.2.1
k-Means
43.2.2
Mixed-Type Clustering
43.2.3
Hierarchical Clustering
43.2.4
Spectral Clustering
43.2.5
Expectation-Maximization Clustering
43.2.6
Clustering Ensembles
43.3
Classification
43.3.1
Logistic Regression
43.3.2
Discriminant Analysis
43.3.3
Decision Trees
43.3.4
Naïve Bayes Classification
43.3.5
Support Vector Machines
43.3.6
Artificial Neural Networks
43.3.7
Ensemble Learning Methods
44
Text Mining
44.1
Text Processing
44.2
Text Visualization
44.3
Simple Text Analysis
44.4
Sentiment Analysis
45
Anomaly Detection and Outlier Analysis
46
Bayesian Data Analysis
47
Network Analysis
48
Recommender Engines
49
Ranking and Ordering
Appendix
A
Place holder
B
Cheatsheet for the authors
B.1
Writing in markdown
B.1.1
Convert latex to markdown
B.2
bookdown
B.3
Free images
B.4
Figures
B.4.1
Accessible figures
B.5
Including verbatim R code chunks inside Rmarkdown
C
A List of Projects
C.1
Musician productivity
Data and background material
Target techniques
Guided steps
C.2
Moral machine
Data and background material
Target technique
Guided steps
C.3
Cost of lying
Data and background material
Target technique
Guided steps
References
Data Analysis, Data Science, Data Understanding
Module 36
Ranking and Ordering