About me

I'm a faculty member at CWI, where I lead the Table Representation Learning (TRL) Lab and am a member of the Database Architectures group. I'm also an ELLIS faculty member with the Amsterdam unit. Previously, I was a postdoctoral fellow at UC Berkeley and obtained my PhD at the University of Amsterdam with research at Sigma Computing and MIT. My research has been supported by an NWO AiNed grant (>$1M), an Accenture-BIDS Fellowship, and industry sponsors. Before academia, I spent 2+ years in industry working on automating data analysis pipelines with ML.

My research is focused on Table Representation Learning (TRL) and generative models for tabular data, as tables are prevalent in the data landscape, contain valuable data, and power important decisions in organizations such as governments, enterprises, and hospitals. The objective: democratizing insights from structured data ✨.

To establish tabular data as a key modality for AI, akin to images and text, I've been driving several TRL initiatives since 2020. In particular, I founded the Table Representation Learning workshop series at NeurIPS and ACL, and the TRL research theme at the ELLIS unit Amsterdam. I organize various related efforts at Dagstuhl, SIGMOD and beyond, and review for tracks/workshops at e.g. VLDB, SIGMOD, NeurIPS, ICLR, WWW. Read more in my CV.

Research interests

I'm generally interested in Table Representation Learning (TRL), with a particular focus on:
  • (Relational) Table Embeddings
  • Generative Models (e.g. LLMs) for Relational Data, for tasks such as QA/text2sql and data wrangling
  • Retrieval over Tabular Data sources (data lakes, relational databases)
  • End-to-end (Agentic) Systems for Data Analysis and Data Science over Tabular Data

To stay updated on relevant events/updates from the broader TRL community, consider joining the TRL Discord, TRL Bluesky, and TRL mailing list.

    Selected projects

    The projects below reflect my main research interests, but I enjoy working on other topics too. Check my profile on Google Scholar for my full publication record.

    TARGET Benchmark [TRL@NeurIPS, 2024]
    The first benchmark for evaluating table retrieval methods in generative pipelines (e.g. RAG) over structured data.
    paper | website

    Dataset Search [HILDA@SIGMOD, 2024]
    1) Survey results surfacing why, what, and how people search for data, along with open challenges and system desiderata.
    2) System (tbc).
    1) survey paper

    SchemaPile [SIGMOD 2024]
    A dataset of approximately 221K real-world database schemas extracted from SQL files from GitHub.
    paper | dataset | code

    Observatory [PVLDB, TRL@NeurIPS, 2023]
    1) Framework for analyzing table embeddings based on the relational model, and desiderata for TRL models.
    2) Library for extracting table embeddings at the row, column, and cell level.
    1) analysis paper | 2) library paper | code

    GitTables [SIGMOD, 2023]
    Corpus of 1.7M relational tables extracted from GitHub CSVs. Columns annotated w/ semantic types.
    paper | website | dataset | code | video presentation | slides | podcast

    AdaTyper [CIDR, 2022]
    Adaptive semantic column type detection system focusing on productization in industry contexts.
    paper | video presentation

    Sherlock [KDD, 2019]
    DL method for semantic data type detection of table columns (top-5 MIT Media Lab repos, 2 Aug 23).
    paper | website | code

    VizNet [CHI, 2019]
    Corpus of over 31 million datasets from open data repositories, for benchmarking visualization studies.
    paper | website

    Recent news

  • Launched a Table Representation Learning initiative at ELLIS unit Amsterdam

    Nov 27, 2024

    Very excited to have established the new research theme on Table Representation Learning at ELLIS unit Amsterdam! Building on this, I will soon organize a monthly seminar and workshops on TRL. The first ELLIS workshop around TRL is on 27 February in Amsterdam!

  • Joined the ELLIS society

    Aug 27, 2024

    As I am moving back from the Bay Area to Amsterdam – permanently, this time :) – I am very excited to become a member of the European Laboratory for Learning and Intelligent Systems (ELLIS) society, and to strengthen the roots of my research around TRL in Europe. Expect some “TRL in Europe” initiatives to come soon..!

  • Table Representation Learning workshop @ NeurIPS 2024

    Aug 26, 2024

    Excited to organize the third edition of the Table Representation Learning workshop at NeurIPS 2024! Besides hosting the latest work on models/performance/applications of representation learning and generative models over tables (looking forward!), we’ll have some exciting announcements to share with the TRL community..

  • SIGMOD recap!

    Jun 20, 2024

    Attended SIGMOD in Chile! It was fun and busy, and Santiago was lovely. Co-organized the DEEM workshop and presented the survey on dataset search in practice at HILDA. Discussions and talks about retrieval, (structured) data semantics, and vector databases particularly held my interest this year.

  • Awarded AiNed Fellowship Grant funding 5-year research project at CWI

    Mar 20, 2024

    Thrilled to share that I’ve been awarded the AiNed Fellowship Grant (worth $1M) to lead the 5-year DataLibra research project at CWI in Amsterdam starting fall 2024. DataLibra is focused on democratizing insight retrieval from structured data through representation learning and generative models for relational tables.