Syllabus - INTRODUCTION TO DATA SCIENCE (CD404)


CSE-Data Science/Data Science

IV

Unit – I

Introduction

Introduction to Data Science – Evolution of Data Science – Data Science Roles – Stages in a Data Science Project – Applications of Data Science in various fields – Data Security Issues.

Unit – II

Data Collection and Data Pre-Processing

Data Collection Strategies – Data Pre-Processing Overview – Data Cleaning – Data Integration and Transformation – Data Reduction – Data Discretization.

Unit – III

Exploratory Data Analytics

Descriptive Statistics – Mean, Standard Deviation, Skewness and Kurtosis – Box Plots – Pivot Table – Heat Map – Correlation Statistics – ANOVA.
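
As an illustration of the descriptive statistics listed in this unit, the following is a minimal sketch using only the Python standard library. The function name `describe` and the sample values are hypothetical, and the skewness and kurtosis use the simple moment-based definitions (population moments, excess kurtosis); statistical packages may apply bias corrections and report slightly different numbers.

```python
import statistics


def describe(values):
    """Return mean, sample standard deviation, skewness and excess kurtosis."""
    n = len(values)
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample standard deviation (divides by n - 1)

    # Population central moments, used for the moment-based shape measures.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n

    skewness = m3 / m2 ** 1.5            # > 0 means a longer right tail
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # 0 for a normal distribution
    return {
        "mean": mean,
        "std": std,
        "skewness": skewness,
        "excess_kurtosis": excess_kurtosis,
    }


# Hypothetical sample data, chosen only for illustration.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
summary = describe(sample)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

The positive skewness for this sample reflects the two larger values (7 and 9) pulling the right tail out, which is the same asymmetry a box plot of the data would show.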

Unit – IV

Model Development

Simple and Multiple Regression – Model Evaluation using Visualization – Residual Plot – Distribution Plot – Polynomial Regression and Pipelines – Measures for In-sample Evaluation – Prediction and Decision Making.
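
A minimal sketch of simple regression with in-sample evaluation, written with the standard library only so that each step is visible. The function names and the data points are assumptions made for illustration; a course practical would more likely use a library such as scikit-learn and plot the residuals rather than print them.

```python
def fit_simple_regression(x, y):
    """Least-squares fit of y = intercept + slope * x for a single predictor."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def in_sample_metrics(x, y, slope, intercept):
    """Return residuals, mean squared error and R^2 on the training data."""
    predictions = [intercept + slope * xi for xi in x]
    residuals = [yi - pi for yi, pi in zip(y, predictions)]
    ss_res = sum(r ** 2 for r in residuals)
    y_mean = sum(y) / len(y)
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    mse = ss_res / len(y)
    r2 = 1.0 - ss_res / ss_tot
    return residuals, mse, r2


# Hypothetical data lying close to a straight line.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_simple_regression(x, y)
residuals, mse, r2 = in_sample_metrics(x, y, slope, intercept)
print(f"slope={slope:.3f} intercept={intercept:.3f} mse={mse:.4f} r2={r2:.4f}")
```

The `residuals` list is what a residual plot displays against `x`: values scattered around zero with no visible pattern suggest that a linear model is adequate, while a curved pattern suggests trying a polynomial term.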

Unit – V

Model Evaluation

Generalization Error – Out-of-Sample Evaluation Metrics – Cross Validation – Overfitting – Underfitting and Model Selection – Prediction using Ridge Regression – Testing Multiple Parameters using Grid Search.
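
The sketch below shows how cross validation, ridge regression and grid search fit together, assuming a single predictor so that the ridge solution has a simple closed form. All function names, the candidate penalty values and the synthetic data are hypothetical; library implementations (for example in scikit-learn) handle multiple predictors and offer more splitting strategies.

```python
import random


def fit_ridge_1d(x, y, alpha):
    """Fit y = intercept + slope * x with an L2 penalty `alpha` on the slope."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / (sxx + alpha)  # alpha = 0 gives ordinary least squares
    intercept = y_mean - slope * x_mean
    return slope, intercept


def k_fold_indices(n, k):
    """Yield (train, test) index lists; fold j tests indices j, j + k, j + 2k, ..."""
    for j in range(k):
        test = list(range(j, n, k))
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        yield train, test


def cross_val_mse(x, y, alpha, k=5):
    """Average out-of-sample mean squared error over k folds."""
    fold_errors = []
    for train, test in k_fold_indices(len(x), k):
        slope, intercept = fit_ridge_1d(
            [x[i] for i in train], [y[i] for i in train], alpha
        )
        errors = [(y[i] - (intercept + slope * x[i])) ** 2 for i in test]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / k


def grid_search(x, y, alphas, k=5):
    """Score every candidate penalty and return the one with the lowest CV error."""
    scores = {alpha: cross_val_mse(x, y, alpha, k) for alpha in alphas}
    best_alpha = min(scores, key=scores.get)
    return best_alpha, scores


# Synthetic data: a noisy straight line, generated with a fixed seed.
rng = random.Random(0)
x = [i / 2 for i in range(20)]
y = [1.5 * xi + 2.0 + rng.gauss(0, 1.0) for xi in x]

best_alpha, scores = grid_search(x, y, alphas=[0.0, 0.1, 1.0, 10.0, 100.0])
for alpha, score in scores.items():
    print(f"alpha={alpha:<6} cv_mse={score:.4f}")
print("selected alpha:", best_alpha)
```

A very large penalty shrinks the slope towards zero and under-fits, which shows up as a higher cross-validated error; the grid search simply picks the candidate where that out-of-sample error is lowest.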

Practicals

  • READING AND WRITING DIFFERENT TYPES OF DATASETS using Python:

      • Read different types of datasets (.txt, .csv) from the web and from disk, and write them to a file at a specific disk location.

      • Read an Excel data sheet in Python.

      • Read an XML dataset in Python.

  • VISUALIZATIONS:

      • Find the data distributions using box and scatter plots.

      • Find outliers using plots.

      • Plot a histogram, bar chart and pie chart on sample data.

  • EXPLORATORY DATA ANALYSIS (EDA): Perform EDA on Credit Card Fraud Detection Dataset (open source dataset) for analyzing the data.

  • LINEAR REGRESSION MODEL FOR PREDICTION: Apply regression modelling techniques to predict future values on openly available datasets.

  • LOGISTIC REGRESSION MODEL: Import the Red-Wine dataset, which contains three qualities of wine, from the UCI Machine Learning Repository. Apply a logistic regression model for multi-class classification of the wine categories.

  • MODEL EVALUATION USING RESIDUAL PLOT: Plot accuracy and error metrics against the number of iterations to evaluate model performance.

  • EVALUATING UNDER-FITTING AND OVER-FITTING: Plot learning curves to evaluate a model for under-fitting and over-fitting.
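
As a starting point for the first practical, the following is a minimal sketch of writing a CSV file to disk, reading it back, and parsing a small XML dataset, using only the Python standard library. The file name `sample.csv`, the record fields and the XML layout (`<dataset>` containing `<record>` elements) are hypothetical; real datasets will have their own schema, and Excel files need a third-party package that is not shown here.

```python
import csv
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical records; CSV stores everything as text, so values are strings.
rows = [
    {"id": "1", "name": "alpha", "score": "3.5"},
    {"id": "2", "name": "beta", "score": "4.0"},
]


def write_csv(path, records):
    """Write a list of dictionaries to `path` with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(records[0].keys()))
        writer.writeheader()
        writer.writerows(records)


def read_csv(path):
    """Read a CSV file with a header row into a list of dictionaries."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def read_xml_records(xml_text):
    """Turn each <record> element into a dictionary of child tag -> text."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: child.text for child in record}
        for record in root.findall("record")
    ]


# Write to a specific location (a temporary folder here) and read it back.
with tempfile.TemporaryDirectory() as folder:
    csv_path = Path(folder) / "sample.csv"
    write_csv(csv_path, rows)
    loaded = read_csv(csv_path)

xml_text = (
    "<dataset>"
    "<record><id>1</id><name>alpha</name></record>"
    "<record><id>2</id><name>beta</name></record>"
    "</dataset>"
)
xml_records = read_xml_records(xml_text)

print(loaded)
print(xml_records)
```

Replacing the temporary folder with a fixed path writes the file to a chosen disk location, and a plain `.txt` file can be handled the same way with `open(...)` and `read()` or `write()`.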

Reference Books

  • Jojo Moolayil, “Smarter Decisions: The Intersection of IoT and Data Science”, Packt, 2016.

  • Cathy O’Neil and Rachel Schutt, “Doing Data Science”, O’Reilly, 2015.

  • David Dietrich, Barry Heller, Beibei Yang, “Data Science and Big Data Analytics”, EMC, 2013.

  • Pethuru Raj, “Handbook of Research on Cloud Infrastructures for Big Data Analytics”, IGI Global.