Syllabus - INTRODUCTION TO DATA SCIENCE (CD404)
CSE-Data Science/Data Science
IV
Unit – I
Introduction
Introduction to Data Science – Evolution of Data Science – Data Science Roles – Stages in a Data Science Project – Applications of Data Science in various fields – Data Security Issues.
Unit – II
Data Collection and Data Pre-Processing
Data Collection Strategies – Data Pre-Processing Overview – Data Cleaning – Data Integration and Transformation – Data Reduction – Data Discretization.
Unit – III
Exploratory Data Analytics
Descriptive Statistics – Mean, Standard Deviation, Skewness and Kurtosis – Box Plots – Pivot Table – Heat Map – Correlation Statistics – ANOVA.
Unit – IV
Model Development
Simple and Multiple Regression – Model Evaluation using Visualization – Residual Plot – Distribution Plot – Polynomial Regression and Pipelines – Measures for In-sample Evaluation – Prediction and Decision Making.
Unit – V
Model Evaluation
Generalization Error – Out-of-Sample Evaluation Metrics – Cross Validation – Overfitting – Underfitting and Model Selection – Prediction by using Ridge Regression – Testing Multiple Parameters by using Grid Search.
Practicals
- READING AND WRITING DIFFERENT TYPES OF DATASETS using Python
- Reading different types of datasets (.txt, .csv) from the web and from disk, and writing them to a file at a specific disk location.
- Reading an Excel data sheet in Python.
- Reading an XML dataset in Python.
- VISUALIZATIONS: Find the data distributions using box and scatter plot.
- Find outliers using plots.
- Plot a histogram, bar chart and pie chart on sample data.
- EXPLORATORY DATA ANALYSIS (EDA): Perform EDA on Credit Card Fraud Detection Dataset (open source dataset) for analyzing the data.
- LINEAR REGRESSION MODEL FOR PREDICTION: Apply regression modeling techniques to predict future values on openly available datasets.
- LOGISTIC REGRESSION MODEL: Import the Red-Wine dataset, which has three wine quality classes, from the UCI Machine Learning Repository. Apply a logistic regression model for multi-class classification of the wine categories.
- MODEL EVALUATION USING RESIDUAL PLOT: Plotting Accuracy and Error Metrics against number of iterations for evaluation of model performance.
- EVALUATING UNDER-FITTING AND OVER-FITTING: Plot learning curves to evaluate a model for under-fitting and over-fitting.
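As a rough illustration of the first practical (reading and writing .csv datasets), the sketch below writes a small table to disk and reads it back using only the Python standard library. The sample values, the file name `sample.csv`, and the helper names `write_rows` and `read_rows` are assumptions made for illustration; they are not part of the syllabus, and a course may well use a different library.

```python
import csv
import tempfile
from pathlib import Path


def write_rows(path, header, rows):
    """Write a header line and data rows to a CSV file at `path`."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def read_rows(path):
    """Read a CSV file into a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


# Hypothetical sample data, used only to demonstrate the round trip.
sample_header = ["id", "score"]
sample_rows = [[1, 72.5], [2, 88.0], [3, 64.0]]

# Write to a specific disk location (a temporary directory here).
out_dir = Path(tempfile.mkdtemp())
csv_path = out_dir / "sample.csv"
write_rows(csv_path, sample_header, sample_rows)

# Read the file back; csv returns every field as a string,
# so numeric columns need an explicit conversion.
records = read_rows(csv_path)
scores = [float(r["score"]) for r in records]
print(len(records), "rows; mean score:", sum(scores) / len(scores))
```

Plain .txt files can be handled the same way with `open()` alone; the `csv` module is used here so that quoting and delimiters are handled consistently.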
Reference Books
- Jojo Moolayil, “Smarter Decisions: The Intersection of IoT and Data Science”, Packt, 2016.
- Cathy O’Neil and Rachel Schutt, “Doing Data Science”, O'Reilly, 2015.
- David Dietrich, Barry Heller, Beibei Yang, “Data Science and Big Data Analytics”, EMC, 2013.
- Raj, Pethuru, “Handbook of Research on Cloud Infrastructures for Big Data Analytics”, IGI Global.