
About me
I'm a faculty member at CWI, where I lead the Table Representation Learning (TRL) Lab and am a member of the Database Architectures group. I'm also an ELLIS faculty member with the Amsterdam unit. Previously, I was a postdoctoral fellow at UC Berkeley and obtained my PhD at the University of Amsterdam with research at Sigma Computing and MIT. My research has been supported by an NWO AiNed grant (>$1M), an Accenture-BIDS Fellowship, and industry sponsors. Before academia, I spent 2+ years in industry working on automating data analysis pipelines with ML.
My research is focused on Table Representation Learning (TRL) and generative models for tabular data, as tables are prevalent in the data landscape, contain valuable data, and power important decisions in organizations such as governments, enterprises, and hospitals. The objective: democratizing insights from structured data ✨.
To establish tabular data as a key modality for AI, akin to images and text, I've been driving several TRL initiatives since 2020. In particular, I founded the Table Representation Learning workshop series at NeurIPS and ACL, and the TRL research theme at the ELLIS unit Amsterdam. I organize various related efforts at Dagstuhl, SIGMOD and beyond, and review for tracks/workshops at e.g. VLDB, SIGMOD, NeurIPS, ICLR, WWW. Read more in my CV.
Research interests
I'm generally interested in Table Representation Learning (TRL), with a particular focus on:
Selected projects
The projects below reflect my main research interests, but I enjoy working on other topics too. Check my profile on Google Scholar for my full publication record.
The first benchmark for evaluating table retrieval methods in generative pipelines (e.g. RAG) over structured data.
paper | website
1) Survey results surfacing why, what, and how people search for data, along with open challenges and system desiderata.
2) System (tbc).
1) paper survey
A dataset of approximately 221K real-world database schemas extracted from SQL files from GitHub.
paper | dataset | code
1) Framework for analyzing table embeddings based on the relational model, and desiderata for TRL models.
2) Library for extracting table embeddings at the row, column, and cell level.
1) analysis paper | 2) library paper | code
Corpus of 1.7M relational tables extracted from GitHub CSVs. Columns annotated w/ semantic types.
paper | website | dataset | code | video presentation | slides | podcast
Adaptive semantic column type detection system focusing on productization in industry contexts.
paper | video presentation
DL method for semantic data type detection of table columns (top-5 MIT Media Lab repos, 2 Aug 23).
paper | website | code
Corpus of over 31 million datasets from open data repositories, for benchmarking visualization studies.
paper | website
Recent news
Nov 27, 2024
Very excited to have established the new research theme on Table Representation Learning at ELLIS unit Amsterdam! As part of this theme, I will soon organize a monthly seminar and workshops on TRL. The first ELLIS workshop around TRL is on 27 February in Amsterdam!
Aug 27, 2024
As I am moving back from the Bay Area to Amsterdam – permanently, this time :) – I am very excited to become a member of the European Laboratory for Learning and Intelligent Systems (ELLIS) society, and strengthen the roots of my research around TRL in Europe. Expect some “TRL in Europe” initiatives to come soon..!
Aug 26, 2024
Excited to organize the third edition of the Table Representation Learning workshop at NeurIPS 2024! Besides hosting the latest work on models/performance/applications of representation learning and generative models over tables (looking forward!), we’ll have some exciting announcements to share with the TRL community..
Jun 20, 2024
Attended SIGMOD in Chile! It was fun and busy, Santiago was lovely. Co-organized the DEEM workshop and presented the survey on dataset search in practice at HILDA. Generally, discussions and talks about retrieval, (structured) data semantics, and vector databases had my particular interest this year.
Mar 20, 2024
Thrilled to share that I’ve been awarded the AiNed Fellowship Grant (worth $1M) to lead the 5-year DataLibra research project at CWI in Amsterdam starting fall 2024. DataLibra is focused on democratizing insight retrieval from structured data through representation learning and generative models for relational tables.