Skrub: machine learning for dataframes
Jérôme Dockès, Guillaume Lemaitre, Riccardo Cappuzzo
In this tutorial, we will show how to use skrub to tackle datasets that would be challenging to analyze with scikit-learn alone.
First, we will consider a dataset containing a single table, with information about company employees such as hiring date and role description. While its structure is relatively simple, this dataset would be challenging to handle without skrub because of the variety of data types it contains, including dates, categories and text. We will show how skrub can create an interactive report that displays a sample of the data, distribution plots, summary statistics and measures of association between columns. We will then show that a single line of code is enough to obtain very good generalization performance, thanks to skrub's pre-built learner, which makes appropriate modelling choices for the different kinds of columns found in the data. Next, we will dive into the internals of that generic pipeline to show the encoders it uses for dates and for high- and low-cardinality categories. These encoders can also be useful on their own.
In the second part, we will consider a dataset that cannot be processed in a scikit-learn pipeline. The task is to detect fraud in e-commerce transactions, and the data consists of two tables with a one-to-many relationship: a table of orders and a table of products, where each order can contain multiple products. Even though the dataset is simple, it would be difficult to handle correctly without skrub, because the rich, heterogeneous data in the products table must be vectorized before it can be aggregated and joined to the orders (and thus to the prediction targets). An upcoming addition to skrub (that will be merged in the next few weeks) makes it easy to build a pipeline that handles multiple tables and contains arbitrary dataframe transformations. Moreover, it allows any of the choices we make to be tuned with scikit-learn's grid or randomized search, without manually constructing a complex hyperparameter grid. We will explain the concepts behind these tools and illustrate them by building an effective learner for this dataset. We will also discuss the modelling choices involved and explore the available options through hyperparameter search.
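To make the vectorize, aggregate and join steps concrete, here is a minimal sketch written in plain pandas. It does not use the upcoming skrub feature, whose API is not described here; it only illustrates the manual steps that a multi-table pipeline has to encapsulate. The tables, column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical toy tables: one row per order, and one row per product in an order.
orders = pd.DataFrame({"order_id": [1, 2, 3], "fraud": [0, 1, 0]})
products = pd.DataFrame(
    {
        "order_id": [1, 1, 2, 3, 3, 3],
        "category": ["phone", "case", "laptop", "case", "case", "cable"],
        "price": [300.0, 15.0, 1200.0, 10.0, 12.0, 5.0],
    }
)

# Step 1: vectorize the product rows (here, a simple one-hot encoding).
product_features = pd.get_dummies(products, columns=["category"], dtype=float)

# Step 2: aggregate the numeric product features to one row per order.
per_order = product_features.groupby("order_id").sum().add_prefix("products_")

# Step 3: join the aggregated features to the orders, which carry the target.
training_table = orders.merge(
    per_order, left_on="order_id", right_index=True, how="left"
)
```

Note that `pd.get_dummies` is stateless: applied by hand to new data, it may produce a different set of columns than at training time. This is one reason it helps to have these steps live inside a single fitted pipeline.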
Finally, we will give a brief summary of ongoing development and planned enhancements to skrub.