
About me
I'm a PhD candidate with the INDE Lab at the University of Amsterdam. In my opinion, tables are a promising modality for representation learning with too much application potential to ignore. Therefore, I research Table Representation Learning (TRL) and its applications in data management and analysis. My broader interest is in Intelligent Data Systems with a focus on relational tables.
This interest started at the MIT Media Lab, where I developed Sherlock, a deep learning method for detecting table semantics at scale, enabling applications like data validation. Industry interest in Sherlock, together with the observed dominance of tables across the data landscape, inspired me in 2020 to dedicate my PhD to learned table models and their applications in practice. A piece of this puzzle is GitTables: a continuously growing dataset of 1.7M tables extracted from CSV files on GitHub and enriched with table semantics such as semantic column types.
To stimulate research on TRL, I initiated the Table Representation Learning workshop (NeurIPS).
As part of the wider research community, I support JSys as Assistant Editor and co-organize Data Management for E2E ML (SIGMOD 2023) and the SemTab challenge (2021/2022). I also review for various tracks and workshops at venues such as VLDB, EDBT, NeurIPS, ICML, and WWW.
Besides academia, I am a member of the supervisory board of a student consulting firm, and I worked as a data scientist for 2+ years on automating ML-driven analyses.
👉 Read more in my CV.
Selected projects
The projects below reflect my main research interest, but I enjoy working on other topics too. Check my profile on Google Scholar for my full publication record.
Framework for analyzing learned table embeddings based on the relational model and data distributions.
paper [TBC] | code
Corpus of 1.7M relational tables extracted from GitHub CSVs. Columns annotated w/ semantic types.
paper | website | dataset | code | video presentation | slides | podcast
A dataset of approximately 50K real-world database schemas extracted from SQL files from GitHub.
paper | code/dataset
Adaptive semantic column type detection system focusing on productization in industry contexts.
paper | video presentation
Method for semantic data type detection that takes column context into account, extends Sherlock.
paper | code
DL method for semantic data type detection of table columns (top-5 MIT Media Lab repos, 2 Aug 23).
paper | website | code
Corpus of over 31 million datasets from open data repositories, for benchmarking visualization studies.
paper | website
Recent news
Sep 12, 2023
Grateful to have received the Best Reviewer Award at the VLDB 2023 PhD workshop! Hearing that my reviews are considered valuable means a lot to me.
Aug 04, 2023
Excited to be invited to talk about Transformers for Tables at Transformers at Work (15 Sep 2023), and about GitTables at the TaDA workshop (remote) at VLDB (1 Sep 2023). You are very welcome to join!
Jul 20, 2023
Had a nice chat on the Disseminate Podcast w/ Jack about the thoughts and processes behind GitTables and the potential of learned table representations. Listen to the podcast here. Thanks for hosting me, Jack!
Jul 13, 2023
Looking forward to catching up w/ the latest in Table Representation Learning and performance/applications of LLMs over tables, at the Table Representation Learning workshop 2023!