Skrub: machine learning for dataframes
Jérôme Dockès, Guillaume Lemaitre, Riccardo Cappuzzo
In this tutorial, we teach how to use skrub to tackle datasets that would be challenging to analyze using scikit-learn alone. In particular, we show how skrub can be combined with scikit-learn to address a time series forecasting use case.
First, we give a short introduction to the scope of the skrub library. We show that some tedious tasks around machine learning can be reduced with a few off-the-shelf features.
Then, we focus on skrub DataOps, which make it possible to combine data wrangling operations, written with common tools such as pandas or polars, with standard machine learning pipelines built with scikit-learn. We use time series forecasting as the running example. First, we show how to build some common time series preprocessing steps using polars. Next, we show that these transformations can be recorded into a graph, which lets us replay the same transformations later on new data. We then combine this preprocessing stage with a classic scikit-learn regressor to predict the desired target. Finally, we show how to evaluate this pipeline with cross-validation and how to perform hyperparameter search.
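To make the idea of recording transformations and replaying them on new data more concrete, here is a minimal sketch in plain Python. It is a toy illustration of the concept only and does not use or reproduce skrub's DataOps API: the names `RecordedPlan`, `add_lag`, and `drop_incomplete`, as well as the `hour` and `load` columns and their values, are hypothetical and chosen purely for this example.

```python
from dataclasses import dataclass, field


def add_lag(rows, column, lag):
    """Return copies of `rows` with an extra `<column>_lag<lag>` field.

    The new field holds the value of `column` observed `lag` rows earlier,
    or None for the first `lag` rows, where no earlier value exists.
    """
    out = []
    for i, row in enumerate(rows):
        new_row = dict(row)
        new_row[f"{column}_lag{lag}"] = rows[i - lag][column] if i >= lag else None
        out.append(new_row)
    return out


def drop_incomplete(rows):
    """Keep only the rows in which every field has a value."""
    return [row for row in rows if all(v is not None for v in row.values())]


@dataclass
class RecordedPlan:
    """Hypothetical recorder: stores steps in order and replays them on any table."""

    steps: list = field(default_factory=list)

    def record(self, func, **kwargs):
        # Store the step instead of running it, so it can be re-applied later.
        self.steps.append((func, kwargs))
        return self

    def replay(self, rows):
        # Apply every recorded step, in order, to the given rows.
        for func, kwargs in self.steps:
            rows = func(rows, **kwargs)
        return rows


# Record the preprocessing once (a linear chain, the simplest kind of graph).
plan = (
    RecordedPlan()
    .record(add_lag, column="load", lag=1)
    .record(add_lag, column="load", lag=2)
    .record(drop_incomplete)
)

# Made-up historical data and made-up "future" data with the same columns.
history = [{"hour": h, "load": v} for h, v in enumerate([10.0, 12.0, 11.0, 15.0])]
new_data = [{"hour": h, "load": v} for h, v in enumerate([20.0, 21.0, 19.0], start=4)]

# The same recorded steps are replayed on both tables.
train_features = plan.replay(history)
new_features = plan.replay(new_data)
```

In this sketch the plan is a simple ordered list of steps; a real computation graph can branch and merge, and the features produced by the replay would then be passed to a regressor. The point is only that the preprocessing is described once and applied identically to training data and to data that arrives later.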
We conclude with a brief summary of ongoing development and future enhancements to skrub.
The material and instructions for the tutorial will be available here:
You can find an extended version here