Lesson 1: Visualizing relationships in data
Seeing relationships in data and predicting based on them; Simpson’s paradox
Lesson 2: Probability
Probability; Bayes Rule; Correlation vs. Causation
Lesson 3: Estimation
Maximum Likelihood Estimation; Mean, Median, Mode; Standard Deviation, Variance
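These estimators can be sketched with Python's standard library (an illustrative example, not course material). For normally distributed data, the maximum likelihood estimates of the mean and variance happen to be exactly the sample mean and population variance computed here:

```python
# Illustrative sketch: point estimates with the statistics module.
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)      # 5.0
median = statistics.median(data)  # 4.5
mode = statistics.mode(data)      # 4 (most frequent value)

# Population variance/std (divide by n); statistics.variance and
# statistics.stdev give the sample versions (divide by n - 1).
var = statistics.pvariance(data)  # 4.0
std = statistics.pstdev(data)     # 2.0
```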
Lesson 4: Outliers and Normal Distribution
Outliers, Quartiles; Binomial Distribution; Central Limit Theorem; Manipulating Normal Distribution
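The quartile-based outlier check covered here can be sketched in a few lines (an illustrative example using the common 1.5 × IQR rule, with made-up data):

```python
# Illustrative sketch: flagging outliers with the 1.5 * IQR rule.
import statistics

data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 14]

q1, q2, q3 = statistics.quantiles(data, n=4)  # quartile cut points
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [x for x in data if x < lower or x > upper]  # [102]
```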
Lesson 5: Inference
Confidence intervals; Hypothesis Testing
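A minimal sketch of a confidence interval (illustrative only; it uses the normal approximation with z = 1.96, whereas a t interval would be more appropriate for a sample this small):

```python
# Illustrative sketch: an approximate 95% confidence interval
# for a sample mean, using the normal approximation.
import math
import statistics

sample = [5.1, 4.9, 5.0, 5.2, 4.8, 5.1, 4.9, 5.0]
n = len(sample)
mean = statistics.mean(sample)
sem = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean

ci = (mean - 1.96 * sem, mean + 1.96 * sem)
```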
Lesson 6: Regression
Linear regression; correlation
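Both topics reduce to a few sums; here is a from-scratch sketch (illustrative, no external libraries):

```python
# Illustrative sketch: least-squares line fit and Pearson correlation.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]            # perfectly linear: y = 2x
slope, intercept = fit_line(xs, ys)
r = pearson_r(xs, ys)            # 1.0 for a perfect positive relationship
```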
Lesson 7: Final Exam
Join instructor Karthik Ramasamy in the first Udacity-Twitter Storm Hackathon to cover the motivation and practice of real-time, distributed, fault-tolerant data processing. Dive into basic Storm Topologies by linking to a real-time d3 Word Cloud Visualization using Redis, Flask, and d3.
Explore Storm basics by programming Bolts, linking Spouts, and finally connecting to the live Twitter API to process real-time tweets. Explore open source components by connecting a Rolling Count Bolt to your topology to visualize Rolling Top Tweeted Words.
Go beyond Storm basics by exploring multi-language capabilities to download and parse real-time Tweeted URLs in Python using Beautiful Soup. Integrate complex open source bolts to calculate Top-N words to visualize real-time Top-N Hashtags. Finally, use stream grouping concepts to easily create a streaming join that connects and dynamically processes multiple streams.
Work on your final project while we cover additional questions and topics raised by Hackathon participants. Explore Vagrant, VirtualBox, Redis, Flask, and d3 further if you are interested!
Final Project: Construct a Storm Topology
Design a Storm Topology and a new bolt that uses streaming joins to dynamically calculate Top-N Hashtags and display real-time tweets that contain trending Top Hashtags. Post your visualization to the forum and tweet it to your Twitter followers.
Use additional features of the real-time Twitter sample stream or use any data source to drive your real-time d3 visualization.
Lesson 1: Data Extraction Fundamentals
- Assessing the Quality of Data
- Intro to Tabular Formats
- Parsing CSV
- Parsing XLS with XLRD
- Intro to JSON
- Using Web APIs
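The parsing topics above rely mostly on Python's standard library; a minimal sketch (illustrative data, not course exercises):

```python
# Illustrative sketch: parsing CSV and JSON with the standard library.
import csv
import io
import json

csv_text = "name,city\nAda,London\nGrace,Arlington\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
# Each row becomes a dict keyed by the header line.

# JSON, as commonly returned by web APIs.
payload = json.loads('{"results": [{"id": 1}, {"id": 2}]}')
ids = [r["id"] for r in payload["results"]]
```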
Lesson 2: Data in More Complex Formats
- Intro to XML
- XML Design Principles
- Parsing XML
- Web Scraping
- Parsing HTML
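A minimal sketch of XML parsing with the standard library (the course also uses dedicated libraries such as Beautiful Soup for HTML; the document below is made up):

```python
# Illustrative sketch: parsing XML with xml.etree.ElementTree.
import xml.etree.ElementTree as ET

doc = """
<catalog>
  <book id="b1"><title>Data Wrangling</title></book>
  <book id="b2"><title>Cleaning at Scale</title></book>
</catalog>
"""

root = ET.fromstring(doc)
titles = [b.find("title").text for b in root.iter("book")]
ids = [b.get("id") for b in root.iter("book")]
```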
Lesson 3: Data Quality
- What is Data Cleaning?
- Sources of Dirty Data
- Measuring Data Quality
- A Blueprint for Cleaning
- Auditing Validity
- Auditing Accuracy
- Auditing Completeness
- Auditing Consistency
- Auditing Uniformity
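The flavor of an auditing pass can be sketched as follows (hypothetical records and a hypothetical validity rule, not course data):

```python
# Illustrative sketch: auditing one field for validity and completeness.
records = [
    {"name": "Springfield", "population": "30720"},
    {"name": "Shelbyville", "population": ""},
    {"name": "Ogdenville",  "population": "n/a"},
]

def audit_population(records):
    valid, invalid, missing = [], [], []
    for r in records:
        value = r["population"].strip()
        if not value:
            missing.append(r["name"])   # completeness problem
        elif value.isdigit():
            valid.append(r["name"])     # passes the validity rule
        else:
            invalid.append(r["name"])   # validity problem
    return valid, invalid, missing

valid, invalid, missing = audit_population(records)
```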
Lesson 4: Working with MongoDB
- Data Modelling in MongoDB
- Introduction to PyMongo
- Field Queries
- Projection Queries
- Getting Data into MongoDB
- Using mongoimport
- Operators like $gt, $lt, $exists, $regex
- Querying Arrays and using $in and $all Operators
- Changing entries: update with $set and $unset
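To show the semantics of a few query operators without a running MongoDB server, here is a pure-Python stand-in that evaluates filter documents against plain dicts (illustrative only; with PyMongo the same filter documents would be passed to `collection.find()`):

```python
# Illustrative sketch: evaluating MongoDB-style filters ($gt, $lt,
# $exists, exact match) against in-memory documents.
def matches(doc, query):
    for field, cond in query.items():
        if isinstance(cond, dict):           # operator form, e.g. {"$gt": 10}
            for op, arg in cond.items():
                present = field in doc
                if op == "$gt" and not (present and doc[field] > arg):
                    return False
                if op == "$lt" and not (present and doc[field] < arg):
                    return False
                if op == "$exists" and present != arg:
                    return False
        elif doc.get(field) != cond:         # exact-match form
            return False
    return True

cities = [
    {"name": "Paris", "population": 2240000},
    {"name": "Lyon",  "population": 513000},
    {"name": "Nice"},
]

big = [c["name"] for c in cities
       if matches(c, {"population": {"$gt": 1000000}})]
no_pop = [c["name"] for c in cities
          if matches(c, {"population": {"$exists": False}})]
```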
Lesson 5: Analyzing Data
- Examples of Aggregation Framework
- The Aggregation Pipeline
- Aggregation Operators: $match, $project, $unwind, $group
- Multiple Stages Using a Given Operator
Lesson 6: Case Study – OpenStreetMap Data
- Using iterative parsing for large datafiles
- OpenStreetMap XML Overview
- Exercises around OpenStreetMap data
- Final Project Instructions
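Iterative parsing is the key trick for the large OSM files in this case study; a minimal sketch (tiny in-memory document standing in for a multi-gigabyte file):

```python
# Illustrative sketch: streaming XML parsing with iterparse, clearing
# each element after use so large OSM-style files fit in memory.
import io
import xml.etree.ElementTree as ET

osm_like = io.BytesIO(b"""
<osm>
  <node id="1"><tag k="amenity" v="cafe"/></node>
  <node id="2"><tag k="highway" v="bus_stop"/></node>
</osm>
""")

amenities = []
for event, elem in ET.iterparse(osm_like, events=("end",)):
    if elem.tag == "node":
        for tag in elem.findall("tag"):
            if tag.get("k") == "amenity":
                amenities.append(tag.get("v"))
        elem.clear()   # free the subtree we no longer need
```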
Lessons 1-4: Supervised Classification
Naive Bayes: We jump in headfirst, learning perhaps the world’s greatest algorithm for classifying text.
Support Vector Machines (SVMs): One of the top 10 algorithms in machine learning, and a must-try for many classification tasks. What makes it special? The ability to generate new features independently and on the fly.
Decision Trees: Extremely straightforward, often just as accurate as an SVM but (usually) way faster. The launch point for more sophisticated methods, like random forests and boosting.
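To give a feel for the first of these, here is a from-scratch multinomial Naive Bayes text classifier with Laplace smoothing (illustrative only; the course uses library implementations, and the toy documents below are made up):

```python
# Illustrative sketch: a tiny Naive Bayes text classifier.
import math
from collections import Counter, defaultdict

def train(docs):                          # docs: list of (text, label)
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        class_counts[label] += 1
        for w in text.split():
            word_counts[label][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab

def predict(text, model):
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)   # log prior
        total_words = sum(word_counts[label].values())
        for w in text.split():
            # Laplace smoothing keeps unseen words from zeroing the score.
            p = (word_counts[label][w] + 1) / (total_words + len(vocab))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train([("buy cheap pills now", "spam"),
               ("cheap pills cheap", "spam"),
               ("meeting agenda for monday", "ham"),
               ("monday project meeting", "ham")])
```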
Lesson 5: Datasets and Questions
Behind any great machine learning project is a great dataset that the algorithm can learn from. We were inspired by a treasure trove of email and financial data from the Enron corporation, which would normally be strictly confidential but became public when the company went bankrupt in a blizzard of fraud. Follow our lead as we wrestle this dataset into a machine-learning-ready format, in anticipation of trying to predict cases of fraud.
Lesson 6 and 7: Regressions and Outliers
Regressions are some of the most widely used machine learning algorithms, and rightly share prominence with classification. What’s a fast way to make mistakes in regression, though? Have troublesome outliers in your data. We’ll tackle how to identify and clean away those pesky data points.
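One common cleaning strategy sketched here: fit, drop the points with the largest residuals, refit (illustrative data; in practice you might remove the worst ~10% rather than a single point):

```python
# Illustrative sketch: residual-based outlier removal for regression.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.0, 8.1, 9.9, 40.0]   # last point is an outlier

slope, intercept = fit_line(xs, ys)
residuals = [abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)]

# Drop the single worst-fitting point, then refit on the rest.
worst = residuals.index(max(residuals))
clean = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i != worst]
slope2, intercept2 = fit_line([x for x, _ in clean], [y for _, y in clean])
```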
Lesson 8: Unsupervised Learning
K-Means Clustering: The flagship algorithm when you don’t have labeled data to work with, and a quick method for pattern-searching when approaching a dataset for the first time.
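The two alternating steps of k-means (assign points to the nearest center, then move each center to its cluster's mean) can be sketched from scratch on 1-D data:

```python
# Illustrative sketch: k-means on 1-D points.
def kmeans(points, centers, iterations=10):
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to its cluster's mean.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

points = [1.0, 1.2, 0.8, 10.0, 10.2, 9.8]
centers, clusters = kmeans(points, centers=[0.0, 5.0])
```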
Lessons 9-12: Features, Features, Features
Feature Creation: Taking your human intuition about the world and turning it into data that a computer can use.
Feature Selection: Einstein said it best: make everything as simple as possible, and no simpler. In this case, that means identifying the most important features of your data.
Principal Component Analysis: A more sophisticated take on feature selection, and one of the crown jewels of unsupervised learning.
Feature Scaling: Simple tricks for making sure your data and your algorithm play nicely together.
Learning from Text: More information is in text than any other format, and there are some effective but simple tools for extracting that information.
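Feature scaling is simple enough to write out directly; here is min-max rescaling to [0, 1] (illustrative values, not course data):

```python
# Illustrative sketch: min-max feature scaling to the range [0, 1].
def rescale(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

salaries = [40000, 60000, 100000]
scaled = rescale(salaries)   # smallest -> 0.0, largest -> 1.0
```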
Lessons 13-14: Validation and Evaluation
Training/testing data split: How do you know that what you’re doing is working? You don’t, unless you validate. The train-test split is simple to do, and the gold standard for understanding your results.
Cross-validation: Take the training/testing split and put it on steroids. Validate your machine learning results like a pro.
Precision, recall, and F1 score: After all this data-driven work, quantify your results with metrics tailored to what is most important to you.
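These three metrics follow directly from the counts of true positives, false positives, and false negatives; a minimal sketch with made-up labels:

```python
# Illustrative sketch: precision, recall, and F1 from labels.
def scores(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]   # one miss, one false alarm
precision, recall, f1 = scores(y_true, y_pred)
```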
Lesson 15: Wrapping it all Up
We take a step back and review what we’ve learned, and how it all fits together.
Mini-project at the end of each lesson
Final project: searching for signs of corporate fraud in Enron data
Lesson 1: What is EDA? (1 hour)
We’ll start by learning about what exploratory data analysis (EDA) is and why it is important. You’ll meet the amazing instructors for the course and find out about the course structure and final project.
Lesson 2: R Basics (3 hours)
EDA, which comes before formal hypothesis testing and modeling, makes use of visual methods to analyze and summarize data sets. R will be our tool for generating those visuals and conducting analyses. In this lesson, we will install RStudio and packages, learn the layout and basic commands of R, practice writing basic R scripts, and inspect data sets.
Lesson 3: Explore One Variable (4 hours)
We perform EDA to understand the distribution of a variable and to check for anomalies and outliers. Learn how to quantify and visualize individual variables within a data set as we begin to make sense of a pseudo-data set of Facebook users. While the data set does not contain real user data, it does contain a wealth of information. Through the lesson, we will create histograms and boxplots, transform variables, and examine tradeoffs in visualizations.
Problem Set 3 (2 hours)
Lesson 4: Explore Two Variables (4 hours)
EDA allows us to identify the most important variables and relationships within a data set before building predictive models. In this lesson, we will learn techniques for exploring the relationship between any two variables in a data set. We’ll create scatter plots, calculate correlations, and investigate conditional means.
Problem Set 4 (2 hours)
Lesson 5: Explore Many Variables (4 hours)
Data sets can be complex. In this lesson, we will learn powerful methods and visualizations for examining relationships among multiple variables. We’ll learn how to reshape data frames and how to use aesthetics like color and shape to uncover more information. Extending our knowledge of previous plots, we’ll continue to build intuition around the Facebook data set and explore some new data sets as well.
Problem Set 5 (2 hours)
Lesson 6: Diamonds and Price Predictions (2 hours)
Investigate the diamonds data set alongside Facebook data scientist Solomon Messing. He’ll recap many of the strategies covered in the course and show how predictive modeling can allow us to determine a good price for a diamond. As a final project, you will create your own exploratory data analysis on a data set of your choice.
Final Project (10+ hours)
You’ve explored simulated Facebook user data and the diamonds data set. Now, it’s your turn to conduct your own exploratory data analysis. Choose one data set to explore (one provided by Udacity or your own) and create an RMD file that uncovers the patterns, anomalies, and relationships of the data set.
Lesson 1a Visualization Fundamentals (2 hours)
Learn about the elements of great data visualization. In this lesson, you will meet data visualization experts, learn about data visualization in the context of data science, and learn how to represent data values in visual form.
Lesson 1b D3 Building Blocks (4 hours)
Learn how to use the open standards of the web to create graphical elements. You’ll learn how to select elements on the page, add SVG elements, and how to style SVG elements. Make use of all the Instructor Notes throughout this lesson if you have little to no experience with HTML and CSS.
Mini-Project 1: RAW Visualization (2 hours)
Create a data visualization using software of your choice. We will provide recommendations for visualization software as well as data sets. We want you to get right into making data visualizations, so here’s your first chance!
Lesson 2a Design Principles (2 hours)
Which chart type should I use for my data? Which colors should I avoid when making graphics? How do I know if my graphic is effective? Investigate these questions, and learn about the World Cup data set, which will be used throughout the rest of the course.
Lesson 2b Dimple.js (4 hours)
Mini-Project 2: Take Two (2-5 hours)
Find an existing data visualization, critique it for what it does well and what it doesn’t do well, and finally, recreate the graphic using a software tool of your choice. We recommend using Dimple.js, which is covered in Lesson 2b, but we don’t want you to feel constrained by the choice of tools. Use any tool that works for you.
At this point in the course, you can start the final project. The remaining content of the course covers narrative structures, types of bias, and maps. All of the code in Lesson 3 and Lesson 4 pertains to d3.js. If you’d like to learn d3.js and complete the final project using d3.js, then please continue. If you prefer to stop, you can complete the final project using dimple.js.
Lesson 3 Narratives (5 hours)
Learn how to incorporate different narrative structures into your visualizations and code along with Jonathan as you create a graphic for the World Cup data set. You’ll learn about different types of bias in the data visualization process and learn how to add context to your data visualizations. By the end of this lesson, you’ll have a solid foundation in D3.js.
Lesson 4 Animation and Interaction (5 hours)
Static graphics are great, but interactive graphics can be even better. Learn how to leverage animation and interaction to bring more data insights to your audience. Code along with Jonathan once again as you learn how to create a bubble map for the World Cup data set.
Final Project: Making an Effective Data Visualization (2 hours or more)
You will create a data visualization that conveys a clear message about a data set. You will use dimple.js or d3.js and collect feedback along the way to arrive at a polished product.
Lesson 1: Introduction to Data Science
- Introduction to Data Science
- What is a Data Scientist
- Pi-Chaun (Data Scientist @ Google): What is Data Science?
- Gabor (Data Scientist @ Twitter): What is Data Science?
- Problems Solved by Data Science
- Create a New Dataframe
Lesson 2: Data Wrangling
- What is Data Wrangling?
- Acquiring Data
- Common Data Formats
- What are Relational Databases?
- Aadhaar Data
- Aadhaar Data and Relational Databases
- Introduction to Database Schemas
- Data in JSON Format
- How to Access an API Efficiently
- Missing Values
- Easy Imputation
- Impute using Linear Regression
- Tip of the Imputation Iceberg
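The "easy imputation" topic above amounts to filling gaps with the mean of the observed values; a minimal sketch (made-up heights, with None marking missing entries):

```python
# Illustrative sketch: mean imputation for missing values.
import statistics

heights = [170, None, 165, 180, None, 175]

observed = [h for h in heights if h is not None]
fill = statistics.mean(observed)            # mean of the known values
imputed = [h if h is not None else fill for h in heights]
```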
Lesson 3: Data Analysis
- Statistical Rigor
- Kurt (Data Scientist @ Twitter) – Why is Stats Useful?
- Introduction to Normal Distribution
- T Test
- Welch T Test
- Non-Parametric Tests
- Non-Normal Data
- Stats vs. Machine Learning
- Different Types of Machine Learning
- Prediction with Regression
- Cost Function
- How to Minimize Cost Function
- Coefficient of Determination
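As a taste of the testing material above, the Welch t statistic (for two samples with unequal variances) can be computed directly from its formula (illustrative samples; a full test would also compute degrees of freedom and a p-value):

```python
# Illustrative sketch: the Welch t statistic for two samples.
import math
import statistics

def welch_t(a, b):
    va, vb = statistics.variance(a), statistics.variance(b)
    return ((statistics.mean(a) - statistics.mean(b))
            / math.sqrt(va / len(a) + vb / len(b)))

a = [5.0, 5.5, 6.0, 5.2, 5.8]
b = [4.0, 4.2, 3.9, 4.4, 4.1]
t = welch_t(a, b)   # a large |t| suggests the means really differ
```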
Lesson 4: Data Visualization
- Effective Information Visualization
- Napoleon’s March on Russia
- Don (Principal Data Scientist @ AT&T): Communicating Findings
- Rishiraj (Principal Data Scientist @ AT&T): Communicating Findings Well
- Visual Encodings
- Perception of Visual Cues
- Plotting in Python
- Data Scales
- Visualizing Time Series Data
Lesson 5: MapReduce
- Big Data and MapReduce
- Basics of MapReduce
- MapReduce with Aadhaar Data
- MapReduce with Subway Data
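The shape of a MapReduce job can be sketched locally as the classic word count, with a manual shuffle step standing in for what the framework does between map and reduce (illustrative only):

```python
# Illustrative sketch: word count in the MapReduce style.
from collections import defaultdict

def mapper(line):
    for word in line.lower().split():
        yield (word, 1)                 # emit a (key, value) pair per word

def reducer(word, counts):
    return (word, sum(counts))          # combine all values for one key

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle phase: group emitted values by key, as the framework would.
grouped = defaultdict(list)
for line in lines:
    for word, count in mapper(line):
        grouped[word].append(count)

result = dict(reducer(w, c) for w, c in grouped.items())
```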