publications
Below is a selection of my publications. Find a complete and up-to-date overview on my Google Scholar page.
2025
- Rethinking Dataset Discovery with DataScout. Rachel Lin, Bhavya Chopra, Wenjing Lin, and 3 more authors. In Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology, 2025.
Dataset Search—the process of finding appropriate datasets for a given task—remains a critical yet under-explored challenge in data science workflows. Assessing dataset suitability for a task (e.g., training a classification model) is a multi-pronged affair that involves understanding: data characteristics (e.g. granularity, attributes, size), semantics (e.g., data semantics, creation goals), and relevance to the task at hand. Present-day dataset search interfaces are restrictive—users struggle to convey implicit preferences and lack visibility into the search space and result inclusion criteria—making query iteration challenging. To bridge these gaps, we introduce DataScout to proactively steer users through the process of dataset discovery via—(i) AI-assisted query reformulations informed by the underlying search space, (ii) semantic search and filtering based on dataset content, including attributes (columns) and granularity (rows), and (iii) dataset relevance indicators, generated dynamically based on the user-specified task. A within-subjects study with 12 participants comparing DataScout to keyword and semantic dataset search reveals that users uniquely employ DataScout’s features not only for structured explorations, but also to glean feedback on their search queries and build conceptual models of the search space.
- How well do LLMs reason over tabular data, really? Cornelius Wolff and Madelon Hulsebos. In Proceedings of the 4th Table Representation Learning Workshop, 2025.
Large Language Models (LLMs) excel in natural language tasks, but less is known about their reasoning capabilities over tabular data. Prior analyses devise evaluation strategies that poorly reflect an LLM’s realistic performance on tabular queries. Moreover, we have a limited understanding of the robustness of LLMs towards realistic variations in tabular inputs. Therefore, we ask: Can general-purpose LLMs reason over tabular data, really?, and focus on two questions: 1) are tabular reasoning capabilities of general-purpose LLMs robust to real-world characteristics of tabular inputs, and 2) how can we realistically evaluate an LLM’s performance on analytical tabular queries? Building on a recent tabular reasoning benchmark, we first surface shortcomings of its multiple-choice prompt evaluation strategy, as well as commonly used free-form text metrics such as SacreBLEU and BERTScore. We show that an LLM-as-a-judge procedure yields more reliable performance insights and unveil a significant deficit in tabular reasoning performance of LLMs. We then extend the tabular inputs reflecting three common characteristics in practice: 1) missing values, 2) duplicate entities, and 3) structural variations. Experiments show that the tabular reasoning capabilities of general-purpose LLMs suffer from these variations, stressing the importance of improving their robustness for realistic tabular inputs.
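To make the robustness variations concrete, here is a minimal sketch, using pandas, of how a tabular input could be perturbed with missing values and duplicate entities before being serialized into a prompt. The table, column names, and perturbation rates are illustrative, not those used in the paper.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# An illustrative table; the paper builds on an existing tabular reasoning benchmark.
table = pd.DataFrame({
    "city": ["Amsterdam", "Berlin", "Paris", "Madrid"],
    "population_m": [0.9, 3.6, 2.1, 3.3],
})

def add_missing_values(df: pd.DataFrame, frac: float = 0.2) -> pd.DataFrame:
    """Randomly blank out a fraction of cells to mimic incomplete real-world tables."""
    out = df.copy()
    mask = rng.random(out.shape) < frac
    return out.mask(mask)

def add_duplicate_entities(df: pd.DataFrame, n: int = 1) -> pd.DataFrame:
    """Append duplicate rows, as often produced by messy data integration."""
    dupes = df.sample(n=n, random_state=0)
    return pd.concat([df, dupes], ignore_index=True)

perturbed = add_duplicate_entities(add_missing_values(table))
print(perturbed.to_string(index=False))  # serialized table that would be placed in the prompt
```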
- Metadata Matters in Dense Table Retrieval. Daniel Gomm and Madelon Hulsebos. In ELLIS Workshop on Representation Learning and Generative Models for Structured Data, 2025.
Recent advances in Large Language Models have enabled powerful systems that perform tasks by reasoning over tabular data. While these systems typically assume relevant data is provided with a query, real-world use cases are mostly open-domain, meaning they receive a query without context regarding the underlying tables. Retrieving relevant tables is typically done over dense embeddings of serialized tables. Yet, there is a limited understanding of the effectiveness of different inputs and serialization methods for using such off-the-shelf text-embedding models for table retrieval. In this work, we show that different serialization strategies result in significant variations in retrieval performance. Additionally, we surface shortcomings in commonly used benchmarks applied in open-domain settings, motivating further study and refinement.
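As an illustration of why serialization choices matter, the sketch below embeds the same table under two strategies, one using only column names and one including metadata such as the table title. The model name and example table are assumptions for the illustration, not the paper's setup.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any off-the-shelf text-embedding model

title = "EU capital cities"
columns = ["city", "country", "population_m"]
rows = [["Amsterdam", "Netherlands", 0.9], ["Berlin", "Germany", 3.6]]

# Strategy 1: column names only (cheap, but ignores content and metadata).
s1 = ", ".join(columns)

# Strategy 2: title plus markdown-style rows (richer context, longer input).
s2 = title + "\n" + " | ".join(columns) + "\n" + "\n".join(
    " | ".join(str(v) for v in row) for row in rows
)

emb1, emb2 = model.encode([s1, s2])
print(emb1.shape, emb2.shape)  # same dimensionality, potentially very different retrieval behavior
```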
- Detecting Contextually Sensitive Data with AI. Liang Telkamp, Melanie Rabier, Javier Teran, and 1 more author. 2025.
The rise of data sharing through private and public data portals necessitates more attention to detecting and protecting sensitive data before datasets get published. While research and practice have converged on the importance of documenting Personally Identifiable Information (PII), automatic, accurate, and scalable methods for detecting such data in (tabular) datasets lag behind. Moreover, we argue that sensitive data detection is more than PII type detection, and methods should consider the more fine-grained context of the dataset and how its publication can be misused beyond the identification of individuals. To guide research in this direction, we present a novel framework for contextual sensitive data detection based on type contextualization and domain contextualization. For type contextualization, we introduce the detect-then-reflect mechanism, in which large language models (LLMs) first detect potential sensitive column types in tables (e.g. PII types such as email address), and then assess their actual sensitivity based on the full table context. For domain contextualization, we propose the retrieve-then-detect mechanism that contextualizes LLMs in external domain knowledge, such as data governance instruction documents, to identify sensitive data beyond PII. Experiments on synthetic and humanitarian datasets show that: 1) the detect-then-reflect mechanism significantly reduces the number of false positives for type-based sensitive data detection, whereas 2) the retrieve-then-detect mechanism is an effective stepping stone for domain-specific sensitive data detection, and retrieval-augmented LLM explanations already provide useful input for making manual data auditing processes more efficient.
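A minimal sketch of the detect-then-reflect idea, assuming a hypothetical `call_llm(prompt)` helper that wraps whatever LLM API is available; the prompts and column handling are illustrative, not the framework's actual implementation.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical helper wrapping an LLM API call; replace with your provider's client."""
    raise NotImplementedError

def detect_then_reflect(table_snippet: str, column: str) -> str:
    # Step 1: detect a potential sensitive type for the column (e.g. a PII type such as email address).
    detected = call_llm(
        f"Which PII type, if any, does column '{column}' contain?\n{table_snippet}"
    )
    # Step 2: reflect on actual sensitivity given the full table context,
    # which can filter out false positives such as publicly listed organization contacts.
    verdict = call_llm(
        f"Column '{column}' was flagged as '{detected}'. Given the full table below, "
        f"is publishing this column actually sensitive? Answer yes or no with a reason.\n{table_snippet}"
    )
    return verdict
```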
2024
- TARGET: Benchmarking Table Retrieval for Generative Tasks. Xingyu Ji, Aditya Parameswaran, and Madelon Hulsebos. In NeurIPS 2024 Third Table Representation Learning Workshop, 2024.
The data landscape is rich with structured data, often of high value to organizations, driving important applications in data analysis and machine learning. Recent progress in representation learning and generative models for such data has led to the development of natural language interfaces to structured data, including those leveraging text-to-SQL. Contextualizing interactions, whether through conversational interfaces or agentic components, in structured data via retrieval-augmented generation can provide substantial benefits in the form of freshness, accuracy, and comprehensiveness of answers. The key question is: how do we retrieve the right table(s) for the analytical query or task at hand? To this end, we introduce TARGET: a benchmark for evaluating TAble Retrieval for GEnerative Tasks. With TARGET we analyze the retrieval performance of different retrievers in isolation, as well as their impact on downstream tasks. We find that dense embedding-based retrievers far outperform a BM25 baseline, which is less effective for table retrieval than it is for retrieval over unstructured text. We also surface the sensitivity of retrievers across various metadata (e.g., missing table titles), and demonstrate a stark variation of retrieval performance across datasets and tasks. TARGET is available at https://target-benchmark.github.io.
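For intuition on the retrieval measurements, here is a small sketch of recall@k over dense embeddings of serialized tables. The corpus, queries, and model are placeholders; TARGET itself spans multiple datasets, retrievers, and downstream tasks.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Placeholder corpus of serialized tables and analytical queries with known gold tables.
tables = {"t1": "city | country | population", "t2": "product | price | category"}
queries = [("Which city has the largest population?", "t1"),
           ("What is the average price per category?", "t2")]

table_ids = list(tables)
table_emb = model.encode([tables[t] for t in table_ids])

def recall_at_k(k: int = 1) -> float:
    """Fraction of queries whose gold table appears in the top-k retrieved tables."""
    hits = 0
    for query, gold in queries:
        q = model.encode([query])[0]
        scores = table_emb @ q  # dot-product similarity between query and table embeddings
        topk = [table_ids[i] for i in np.argsort(-scores)[:k]]
        hits += gold in topk
    return hits / len(queries)

print(recall_at_k(k=1))
```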
- It Took Longer than I was Expecting: Why is Dataset Search Still so Hard? Madelon Hulsebos, Wenjing Lin, Shreya Shankar, and 1 more author. In Proceedings of the 2024 Workshop on Human-In-the-Loop Data Analytics, Santiago, Chile, 2024.
Dataset search is a long-standing problem across both industry and academia. While most industry tools focus on identifying one or more datasets matching a user-specified query, most recent academic papers focus on the subsequent problems of join and union discovery for a given dataset. In this paper, we take a step back and ask: is the task of identifying an initial dataset really a solved problem? Are join and union discovery indeed the most pressing problems to work on? To answer these questions, we survey 89 data professionals and surface the objectives, processes, and tools used to search for structured datasets, along with the challenges faced when using existing systems. We uncover characteristics of data content and metadata that are most important for data professionals during search, such as granularity and data freshness. Informed by our analysis, we argue that dataset search is not yet a solved problem, but is, in fact, difficult to solve. To move the needle in the right direction, we distill four desiderata for future dataset search systems: iterative interfaces, hybrid querying, task-driven search and result diversity.
- SchemaPile: A Large Collection of Relational Database Schemas. Till Döhmen, Radu Geacu, Madelon Hulsebos, and 1 more author. Proc. ACM Manag. Data, May 2024.
Access to fine-grained schema information is crucial for understanding how relational databases are designed and used in practice, and for building systems that help users interact with them. Furthermore, such information is required as training data to leverage the potential of large language models (LLMs) for improving data preparation, data integration and natural language querying. Existing single-table corpora such as GitTables provide insights into how tables are structured in-the-wild, but lack detailed schema information about how tables relate to each other, as well as metadata like data types or integrity constraints. On the other hand, existing multi-table (or database schema) datasets are rather small and attribute-poor, leaving it unclear to what extent they actually represent typical real-world database schemas. In order to address these challenges, we present SchemaPile, a corpus of 221,171 database schemas, extracted from SQL files on GitHub. It contains 1.7 million tables with 10 million column definitions, 700 thousand foreign key relationships, 7 million integrity constraints, and data content for more than 340 thousand tables. We conduct an in-depth analysis on the millions of schema metadata properties in our corpus, as well as its highly diverse language and topic distribution. In addition, we showcase the potential of the corpus to improve a variety of data management applications, e.g., fine-tuning LLMs for schema-only foreign key detection, improving CSV header detection and evaluating multi-dialect SQL parsers. We publish the code and data for recreating SchemaPile and a permissively licensed subset SchemaPile-Perm.
- SPADE: Synthesizing Data Quality Assertions for Large Language Model Pipelines. Shreya Shankar, Haotian Li, Parth Asawa, and 7 more authors. Proc. VLDB Endow., Aug 2024.
Large language models (LLMs) are being increasingly deployed as part of pipelines that repeatedly process or generate data of some sort. However, a common barrier to deployment is the frequent and often unpredictable errors that plague LLMs. Acknowledging the inevitability of these errors, we propose data quality assertions to identify when LLMs may be making mistakes. We present SPADE, a method for automatically synthesizing data quality assertions that identify bad LLM outputs. We make the observation that developers often identify data quality issues during prototyping prior to deployment, and attempt to address them by adding instructions to the LLM prompt over time. SPADE therefore analyzes histories of prompt versions over time to create candidate assertion functions and then selects a minimal set that fulfills both coverage and accuracy requirements. In testing across nine different real-world LLM pipelines, SPADE efficiently reduces the number of assertions by 14% and decreases false failures by 21% when compared to simpler baselines. SPADE has been deployed as an offering within LangSmith, LangChain’s LLM pipeline hub, and has been used to generate data quality assertions for over 2000 pipelines across a spectrum of industries.
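To make "data quality assertions" concrete: if a developer had added the instruction "answer in fewer than 100 words and never mention competitor names" to a prompt, assertions in that spirit might look like the sketch below. This is illustrative Python under assumed prompt instructions, not SPADE's actual generated code.

```python
COMPETITORS = {"AcmeCorp", "Globex"}  # hypothetical list implied by the prompt instruction

def assert_word_limit(response: str, limit: int = 100) -> bool:
    """Passes if the LLM respected the 'fewer than 100 words' instruction."""
    return len(response.split()) < limit

def assert_no_competitors(response: str) -> bool:
    """Passes if no competitor name leaked into the output."""
    return not any(name.lower() in response.lower() for name in COMPETITORS)

# A pipeline would run the selected assertions on each LLM output and flag failures.
response = "Our product offers the best value for analytical workloads."
print(all(check(response) for check in (assert_word_limit, assert_no_competitors)))
```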
2023
- Observatory: Characterizing Embeddings of Relational Tables. Tianji Cong, Madelon Hulsebos, Zhenjie Sun, and 2 more authors. Proc. VLDB Endow., Dec 2023.
Language models and specialized table embedding models have recently demonstrated strong performance on many tasks over tabular data. Researchers and practitioners are keen to leverage these models in many new application contexts; but limited understanding of the strengths and weaknesses of these models, and the table representations they generate, makes the process of finding a suitable model for a given task reliant on trial and error. There is an urgent need to gain a comprehensive understanding of these models to minimize inefficiency and failures in downstream usage. To address this need, we propose Observatory, a formal framework to systematically analyze embedding representations of relational tables. Motivated both by invariants of the relational data model and by statistical considerations regarding data distributions, we define eight primitive properties, and corresponding measures to quantitatively characterize table embeddings for these properties. Based on these properties, we define an extensible framework to evaluate language and table embedding models. We collect and synthesize a suite of datasets and use Observatory to analyze nine such models. Our analysis provides insights into the strengths and weaknesses of learned representations over tables. We find, for example, that some models are sensitive to table structure such as column order, that functional dependencies are rarely reflected in embeddings, and that specialized table embedding models have relatively lower sample fidelity. Such insights help researchers and practitioners better anticipate model behaviors and select appropriate models for their downstream tasks, while guiding researchers in the development of new models.
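One such property, sensitivity to column order, can be probed roughly as in the sketch below: embed a serialized table and a column-permuted copy and compare the embeddings. The text-embedding model and toy table are stand-ins for the table models and measures analyzed in the paper.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

original = "name | country | year\nAda | UK | 1815"
permuted = "year | name | country\n1815 | Ada | UK"  # same content, columns reordered

e1, e2 = model.encode([original, permuted])
cosine = float(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))
# A column-order-invariant model would score close to 1.0 here.
print(cosine)
```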
- GitTables: A Large-Scale Corpus of Relational Tables. Madelon Hulsebos, Çagatay Demiralp, and Paul Groth. Proc. ACM Manag. Data, May 2023.
The success of deep learning has sparked interest in improving relational table tasks, like data preparation and search, with table representation models trained on large table corpora. Existing table corpora primarily contain tables extracted from HTML pages, limiting the capability to represent offline database tables. To train and evaluate high-capacity models for applications beyond the Web, we need resources with tables that resemble relational database tables. Here we introduce GitTables, a corpus of 1M relational tables extracted from GitHub. Our continuing curation aims at growing the corpus to at least 10M tables. Analyses of GitTables show that its structure, content, and topical coverage differ significantly from existing table corpora. We annotate table columns in GitTables with semantic types, hierarchical relations and descriptions from Schema.org and DBpedia. The evaluation of our annotation pipeline on the T2Dv2 benchmark illustrates that our approach provides results on par with human annotations. We present three applications of GitTables, demonstrating its value for learned semantic type detection models, schema completion methods, and benchmarks for table-to-KG matching, data search, and preparation. We make the corpus and code available at https://gittables.github.io.
- Models and Practice of Neural Table Representations. Madelon Hulsebos, Xiang Deng, Huan Sun, and 1 more author. In Companion of the 2023 International Conference on Management of Data, Seattle, WA, USA, 2023.
In the last few years, the natural language processing community witnessed advances in neural representations of free-form text with transformer-based language models (LMs). Given the importance of knowledge available in relational tables, recent research efforts extend LMs by developing neural representations for tabular data. In this tutorial, we present these proposals with three main goals. First, we aim at introducing the potentials and limitations of current models to a database audience. Second, we want the attendees to see the benefit of such a line of work in a large variety of data applications. Third, we would like to empower the audience with a new set of tools and to inspire them to tackle some of the important directions for neural table representations, including model and system design, evaluation, application and deployment. To achieve these goals, the tutorial is organized in two parts. The first part covers the background for neural table representations, including a survey of the most important systems. The second part is designed as a hands-on session, where attendees will use their laptops to explore this new framework and test neural models involving text and tabular data.
- Introducing the Observatory Library for End-to-End Table Embedding Inference. Tianji Cong, Zhenjie Sun, Paul Groth, and 2 more authors. In NeurIPS 2023 Second Table Representation Learning Workshop, 2023.
Transformer-based table embedding models have become prevalent for a wide range of applications involving tabular data. Such models require the serialization of a table as a sequence of tokens for model ingestion and embedding inference. Different downstream tasks require different kinds or levels of embeddings such as column or entity embeddings. Hence, various serialization and encoding methods have been proposed and implemented. Surprisingly, this conceptually simple process of creating table embeddings is not straightforward in practice for a few reasons: 1) a model may not natively expose a certain level of embedding; 2) choosing the correct table serialization and input preprocessing methods is difficult because there are many available; and 3) tables with a massive number of rows and columns cannot fit the input limit of models. In this work, we extend Observatory, a framework for characterizing embeddings of relational tables, by streamlining end-to-end inference of table embeddings, which eases the use of table embedding models in practice. The codebase of Observatory is publicly available at https://github.com/superctj/observatory.
2020
- Sato: Contextual Semantic Type Detection in Tables. Dan Zhang, Madelon Hulsebos, Yoshihiko Suhara, and 3 more authors. Proc. VLDB Endow., Jul 2020.
Detecting the semantic types of data columns in relational tables is important for various data preparation and information retrieval tasks such as data cleaning, schema matching, data discovery, and semantic search. However, existing detection approaches either perform poorly with dirty data, support only a limited number of semantic types, fail to incorporate the table context of columns or rely on large sample sizes for training data. We introduce Sato, a hybrid machine learning model to automatically detect the semantic types of columns in tables, exploiting the signals from the table context as well as the column values. Sato combines a deep learning model trained on a large-scale table corpus with topic modeling and structured prediction to achieve support-weighted and macro average F1 scores of 0.925 and 0.735, respectively, exceeding the state-of-the-art performance by a significant margin. We extensively analyze the overall and per-type performance of Sato, discussing how individual modeling components, as well as feature categories, contribute to its performance.
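The two headline metrics differ in how they aggregate per-type scores; the quick scikit-learn sketch below, on toy labels invented for the example, shows the distinction between support-weighted and macro-averaged F1.

```python
from sklearn.metrics import f1_score

# Toy semantic-type predictions: the rare type 'isbn' drags the macro average down
# more than the support-weighted one, which is dominated by the frequent types.
y_true = ["city", "city", "city", "city", "name", "name", "isbn"]
y_pred = ["city", "city", "city", "name", "name", "name", "city"]

print(f1_score(y_true, y_pred, average="weighted"))  # support-weighted F1
print(f1_score(y_true, y_pred, average="macro"))     # macro-averaged F1
```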
2019
- VizNet: Towards A Large-Scale Visualization Learning and Benchmarking Repository. Kevin Hu, Snehalkumar ’Neil’ S. Gaikwad, Madelon Hulsebos, and 7 more authors. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, Glasgow, Scotland, UK, 2019.
Researchers currently rely on ad hoc datasets to train automated visualization tools and evaluate the effectiveness of visualization designs. These exemplars often lack the characteristics of real-world datasets, and their one-off nature makes it difficult to compare different techniques. In this paper, we present VizNet: a large-scale corpus of over 31 million datasets compiled from open data repositories and online visualization galleries. On average, these datasets comprise 17 records over 3 dimensions, and across the corpus we find 51% of the dimensions record categorical data, 44% quantitative, and only 5% temporal. VizNet provides the necessary common baseline for comparing visualization design techniques, and developing benchmark models and algorithms for automating visual analysis. To demonstrate VizNet’s utility as a platform for conducting online crowdsourced experiments at scale, we replicate a prior study assessing the influence of user task and data distribution on visual encoding effectiveness, and extend it by considering an additional task: outlier detection. To contend with running such studies at scale, we demonstrate how a metric of perceptual effectiveness can be learned from experimental results, and show its predictive power across test datasets.
- Sherlock: A Deep Learning Approach to Semantic Data Type Detection. Madelon Hulsebos, Kevin Hu, Michiel Bakker, and 5 more authors. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 2019.
Correctly detecting the semantic type of data columns is crucial for data science tasks such as automated data cleaning, schema matching, and data discovery. Existing data preparation and analysis systems rely on dictionary lookups and regular expression matching to detect semantic types. However, these matching-based approaches often are not robust to dirty data and only detect a limited number of types. We introduce Sherlock, a multi-input deep neural network for detecting semantic types. We train Sherlock on 686,765 data columns retrieved from the VizNet corpus by matching 78 semantic types from DBpedia to column headers. We characterize each matched column with 1,588 features describing the statistical properties, character distributions, word embeddings, and paragraph vectors of column values. Sherlock achieves a support-weighted F1 score of 0.89, exceeding that of machine learning baselines, dictionary and regular expression benchmarks, and the consensus of crowdsourced annotations.
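A miniature of the kind of per-column featurization involved: the sketch below computes only a handful of statistical and character-distribution features, whereas the actual pipeline extracts 1,588 features, including word and paragraph embeddings.

```python
import numpy as np

def column_features(values: list[str]) -> dict:
    """Compute a few hand-crafted features over a column's cell values."""
    lengths = np.array([len(v) for v in values])
    return {
        "n_values": len(values),
        "frac_unique": len(set(values)) / len(values),
        "mean_length": float(lengths.mean()),
        "std_length": float(lengths.std()),
        "frac_numeric_chars": sum(c.isdigit() for v in values for c in v) / max(lengths.sum(), 1),
        "frac_with_at_sign": sum("@" in v for v in values) / len(values),  # hints at email addresses
    }

print(column_features(["alice@example.com", "bob@example.org", "carol@example.net"]))
```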