In order to include categorical features in your Machine Learning model, you have to encode them numerically using "dummy" or "one-hot" encoding. But how do you do this correctly using scikit-learn?
In this video, you'll learn how to use OneHotEncoder and ColumnTransformer to encode your categorical features and prepare your feature matrix in a single step. You'll also learn how to include this step within a Pipeline so that you can cross-validate your model and preprocessing steps simultaneously. Finally, you'll learn why you should use scikit-learn (rather than pandas) for preprocessing your dataset.
AGENDA:
0:00 Introduction
0:22 Why should you use a Pipeline?
2:30 Preview of the lesson
3:35 Loading and preparing a dataset
6:11 Cross-validating a simple model
10:00 Encoding categorical features with OneHotEncoder
15:01 Selecting columns for preprocessing with ColumnTransformer
19:00 Creating a two-step Pipeline
19:54 Cross-validating a Pipeline
21:44 Making predictions on new data
23:43 Recap of the lesson
24:50 Why should you use scikit-learn (rather than pandas) for preprocessing?
CODE FROM THIS VIDEO: https://github.com/justmarkham/scikit-learn-videos/blob/master/10_categorical_features.ipynb
WANT TO JOIN MY NEXT LIVE WEBCAST? Become a member ($5/month):
https://www.patreon.com/dataschool
=== RELATED RESOURCES ===
OneHotEncoder documentation: https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-categorical-features
ColumnTransformer documentation: https://scikit-learn.org/stable/modules/compose.html#columntransformer-for-heterogeneous-data
Pipeline documentation: https://scikit-learn.org/stable/modules/compose.html#pipeline
My video on cross-validation: https://www.youtube.com/watch?v=6dbrR-WymjI&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=7
My video on grid search: https://www.youtube.com/watch?v=Gol_qOgRqfA&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=8
My lesson notebook on StandardScaler: https://nbviewer.jupyter.org/github/justmarkham/DAT8/blob/master/notebooks/19_advanced_sklearn.ipynb
=== WANT TO GET BETTER AT MACHINE LEARNING? ===
1) WATCH my scikit-learn video series: https://www.youtube.com/playlist?list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A
2) SUBSCRIBE for more videos: https://www.youtube.com/dataschool?sub_confirmation=1
3) ENROLL in my Machine Learning course: https://www.dataschool.io/learn/
4) LET'S CONNECT!
- Newsletter: https://www.dataschool.io/subscribe/
- Twitter: https://twitter.com/justmarkham
- Facebook: https://www.facebook.com/DataScienceSchool/
- LinkedIn: https://www.linkedin.com/in/justmarkham/