Hi! I'm a PhD candidate with the INDE Lab at the University of Amsterdam. So far, my research has focused on learned table representations and their applications, such as data preparation, exploration, and analysis. My broader interest is in Intelligent Data Systems, with a particular focus on systems for relational tables.
My interest in 'learning from tables' started at the MIT Media Lab, where I developed Sherlock, a deep learning method for detecting table semantics at scale, enabling applications like data validation. The huge interest from industry in Sherlock and the dominant presence of tables across data systems inspired me to start a PhD in 2020 to focus on learned table models and their effectiveness in practice. A piece of this puzzle is GitTables: a continuously growing dataset of 1.7M tables extracted from CSV files on GitHub and enriched with table semantics such as semantic column types.
In my opinion, tables have been overlooked for too long as a modality for representation learning, despite their considerable application potential. Therefore, I founded the Table Representation Learning workshop (hosted at NeurIPS 2022).
As part of the wider research community, I support JSys as Assistant Editor, co-organize DEEM (SIGMOD 2023) and the SemTab challenge (2021/2022), and review for various workshops and tracks at venues such as VLDB, EDBT, NeurIPS, and WWW.
Besides academia, I am a member of the supervisory board of a student consulting firm and was a data scientist for 2+ years, working on automating ML-driven analyses.
You can read more in my CV.
Feel welcome to reach out (click on any channel in the footer)!
The projects below are close to my main research interest. But I enjoy working on other topics too. Check my profile on Google Scholar for my full publication record.
Corpus of 1.7M relational tables extracted from CSV files on GitHub, with columns annotated with semantic types.
paper | website | dataset | code | video presentation | slides
A dataset of approximately 50K real-world database schemas extracted from SQL files from GitHub.
paper | code/dataset
Adaptive semantic column type detection system focusing on productization in industry contexts.
paper | video presentation
Deep learning method for semantic data type detection of table columns (top-10 MIT Media Lab repos, 3/10/22).
paper | website | code
Corpus of over 31 million datasets from open data repositories, for benchmarking visualization studies.
paper | website
Mar 05, 2023
Excited to attend the workshop on ML for Systems and Systems for ML at BTW 2023 and visit the Information Systems group at Hasso-Plattner Institute. On both occasions, I’ll give a talk about (learning representations of) tables. See the slides of my talk here.
Feb 04, 2023
I’m excited to give a (3-hour!) tutorial at SIGMOD 2023 about Models and Practice of Neural Table Representations together with Xiang Deng, Huan Sun, and Paolo Papotti. This tutorial will give an overview of the field and include a hands-on session.
Dec 15, 2022
I’m co-organizing a workshop on Data Management for End-to-End Machine Learning (DEEM) at SIGMOD 2023. Read more here, and I hope to see you in Seattle soon!
Dec 13, 2022
This year I co-organized the workshop on Table Representation Learning (TRL) at NeurIPS 2022. The workshop was a great success and received much interest from various communities (NLP/ML/DB), which illustrates the importance and impact of TRL.
Jul 14, 2022
I’m co-organizing a workshop on Table Representation Learning (TRL) at NeurIPS 2022. Read more here, and I hope to see you in New Orleans very soon!