publications
Below is a selection of my publications. Find a complete and up-to-date overview on my Google Scholar page.
2025
- Rethinking Dataset Discovery with DataScout. Rachel Lin, Bhavya Chopra, Wenjing Lin, and 3 more authors. In Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology, 2025.
Dataset Search—the process of finding appropriate datasets for a given task—remains a critical yet under-explored challenge in data science workflows. Assessing dataset suitability for a task (e.g., training a classification model) is a multi-pronged affair that involves understanding: data characteristics (e.g. granularity, attributes, size), semantics (e.g., data semantics, creation goals), and relevance to the task at hand. Present-day dataset search interfaces are restrictive—users struggle to convey implicit preferences and lack visibility into the search space and result inclusion criteria—making query iteration challenging. To bridge these gaps, we introduce DataScout to proactively steer users through the process of dataset discovery via—(i) AI-assisted query reformulations informed by the underlying search space, (ii) semantic search and filtering based on dataset content, including attributes (columns) and granularity (rows), and (iii) dataset relevance indicators, generated dynamically based on the user-specified task. A within-subjects study with 12 participants comparing DataScout to keyword and semantic dataset search reveals that users uniquely employ DataScout’s features not only for structured explorations, but also to glean feedback on their search queries and build conceptual models of the search space.
- How well do LLMs reason over tabular data, really? Cornelius Wolff and Madelon Hulsebos. In Proceedings of the 4th Table Representation Learning Workshop, 2025.
Large Language Models (LLMs) excel in natural language tasks, but less is known about their reasoning capabilities over tabular data. Prior analyses devise evaluation strategies that poorly reflect an LLM’s realistic performance on tabular queries. Moreover, we have a limited understanding of the robustness of LLMs towards realistic variations in tabular inputs. Therefore, we ask: Can general-purpose LLMs reason over tabular data, really?, and focus on two questions: 1) are tabular reasoning capabilities of general-purpose LLMs robust to real-world characteristics of tabular inputs, and 2) how can we realistically evaluate an LLM’s performance on analytical tabular queries? Building on a recent tabular reasoning benchmark, we first surface shortcomings of its multiple-choice prompt evaluation strategy, as well as commonly used free-form text metrics such as SacreBLEU and BERTScore. We show that an LLM-as-a-judge procedure yields more reliable performance insights and unveil a significant deficit in tabular reasoning performance of LLMs. We then extend the tabular inputs reflecting three common characteristics in practice: 1) missing values, 2) duplicate entities, and 3) structural variations. Experiments show that the tabular reasoning capabilities of general-purpose LLMs suffer from these variations, stressing the importance of improving their robustness for realistic tabular inputs.
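To make the robustness variations concrete, here is a minimal sketch, using pandas, of how a tabular input could be perturbed with missing values and duplicate entities before being serialized into a prompt. The table, column names, and perturbation rates are illustrative, not those used in the paper.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# An illustrative table; the paper builds on an existing tabular reasoning benchmark.
table = pd.DataFrame({
    "city": ["Amsterdam", "Berlin", "Paris", "Madrid"],
    "population_m": [0.9, 3.6, 2.1, 3.3],
})

def add_missing_values(df: pd.DataFrame, frac: float = 0.2) -> pd.DataFrame:
    """Randomly blank out a fraction of cells to mimic incomplete real-world tables."""
    out = df.copy()
    mask = rng.random(out.shape) < frac
    return out.mask(mask)

def add_duplicate_entities(df: pd.DataFrame, n: int = 1) -> pd.DataFrame:
    """Append duplicate rows, as often produced by messy data integration."""
    dupes = df.sample(n=n, random_state=0)
    return pd.concat([df, dupes], ignore_index=True)

perturbed = add_duplicate_entities(add_missing_values(table))
print(perturbed.to_string(index=False))  # serialized table that would be placed in the prompt
```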
- Metadata Matters in Dense Table Retrieval. Daniel Gomm and Madelon Hulsebos. In ELLIS Workshop on Representation Learning and Generative Models for Structured Data, 2025.
Recent advances in Large Language Models have enabled powerful systems that perform tasks by reasoning over tabular data. While these systems typically assume relevant data is provided with a query, real-world use cases are mostly open-domain, meaning they receive a query without context regarding the underlying tables. Retrieving relevant tables is typically done over dense embeddings of serialized tables. Yet, there is a limited understanding of the effectiveness of different inputs and serialization methods for using such off-the-shelf text-embedding models for table retrieval. In this work, we show that different serialization strategies result in significant variations in retrieval performance. Additionally, we surface shortcomings in commonly used benchmarks applied in open-domain settings, motivating further study and refinement.
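As an illustration of why serialization choices matter, the sketch below embeds the same table under two strategies, one using only column names and one including metadata such as the table title. The model name and example table are assumptions for the illustration, not the paper's setup.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any off-the-shelf text-embedding model

title = "EU capital cities"
columns = ["city", "country", "population_m"]
rows = [["Amsterdam", "Netherlands", 0.9], ["Berlin", "Germany", 3.6]]

# Strategy 1: column names only (cheap, but ignores content and metadata).
s1 = ", ".join(columns)

# Strategy 2: title plus markdown-style rows (richer context, longer input).
s2 = title + "\n" + " | ".join(columns) + "\n" + "\n".join(
    " | ".join(str(v) for v in row) for row in rows
)

emb1, emb2 = model.encode([s1, s2])
print(emb1.shape, emb2.shape)  # same dimensionality, potentially very different retrieval behavior
```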
- Detecting Contextually Sensitive Data with AI. Liang Telkamp, Melanie Rabier, Javier Teran, and 1 more author. 2025.
The rise of data sharing through private and public data portals necessitates more attention to detecting and protecting sensitive data before datasets get published. While research and practice have converged on the importance of documenting Personally Identifiable Information (PII), automatic, accurate, and scalable methods for detecting such data in (tabular) datasets lag behind. Moreover, we argue that sensitive data detection is more than PII type detection, and methods should consider the more fine-grained context of the dataset and how its publication can be misused beyond the identification of individuals. To guide research in this direction, we present a novel framework for contextual sensitive data detection based on type contextualization and domain contextualization. For type contextualization, we introduce the detect-then-reflect mechanism, in which large language models (LLMs) first detect potential sensitive column types in tables (e.g. PII types such as email address), and then assess their actual sensitivity based on the full table context. For domain contextualization, we propose the retrieve-then-detect mechanism that contextualizes LLMs in external domain knowledge, such as data governance instruction documents, to identify sensitive data beyond PII. Experiments on synthetic and humanitarian datasets show that: 1) the detect-then-reflect mechanism significantly reduces the number of false positives for type-based sensitive data detection, whereas 2) the retrieve-then-detect mechanism is an effective stepping stone for domain-specific sensitive data detection, and retrieval-augmented LLM explanations already provide useful input for making manual data auditing processes more efficient.
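A minimal sketch of the detect-then-reflect idea, assuming a hypothetical `call_llm(prompt)` helper that wraps whatever LLM API is available; the prompts and column handling are illustrative, not the framework's actual implementation.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical helper wrapping an LLM API call; replace with your provider's client."""
    raise NotImplementedError

def detect_then_reflect(table_snippet: str, column: str) -> str:
    # Step 1: detect a potential sensitive type for the column (e.g. a PII type such as email address).
    detected = call_llm(
        f"Which PII type, if any, does column '{column}' contain?\n{table_snippet}"
    )
    # Step 2: reflect on actual sensitivity given the full table context,
    # which can filter out false positives such as publicly listed organization contacts.
    verdict = call_llm(
        f"Column '{column}' was flagged as '{detected}'. Given the full table below, "
        f"is publishing this column actually sensitive? Answer yes or no with a reason.\n{table_snippet}"
    )
    return verdict
```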
2024
- TARGET: Benchmarking Table Retrieval for Generative Tasks. Xingyu Ji, Aditya Parameswaran, and Madelon Hulsebos. In NeurIPS 2024 Third Table Representation Learning Workshop, 2024.
The data landscape is rich with structured data, often of high value to organizations, driving important applications in data analysis and machine learning. Recent progress in representation learning and generative models for such data has led to the development of natural language interfaces to structured data, including those leveraging text-to-SQL. Contextualizing interactions, whether through conversational interfaces or agentic components, in structured data via retrieval-augmented generation can provide substantial benefits in the form of freshness, accuracy, and comprehensiveness of answers. The key question is: how do we retrieve the right table(s) for the analytical query or task at hand? To this end, we introduce TARGET: a benchmark for evaluating TAble Retrieval for GEnerative Tasks. With TARGET we analyze the retrieval performance of different retrievers in isolation, as well as their impact on downstream tasks. We find that dense embedding-based retrievers far outperform a BM25 baseline, which is less effective for table retrieval than it is for retrieval over unstructured text. We also surface the sensitivity of retrievers across various metadata (e.g., missing table titles), and demonstrate a stark variation of retrieval performance across datasets and tasks. TARGET is available at https://target-benchmark.github.io.
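For intuition on the retrieval measurements, here is a small sketch of recall@k over dense embeddings of serialized tables. The corpus, queries, and model are placeholders; TARGET itself spans multiple datasets, retrievers, and downstream tasks.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Placeholder corpus of serialized tables and analytical queries with known gold tables.
tables = {"t1": "city | country | population", "t2": "product | price | category"}
queries = [("Which city has the largest population?", "t1"),
           ("What is the average price per category?", "t2")]

table_ids = list(tables)
table_emb = model.encode([tables[t] for t in table_ids])

def recall_at_k(k: int = 1) -> float:
    """Fraction of queries whose gold table appears in the top-k retrieved tables."""
    hits = 0
    for query, gold in queries:
        q = model.encode([query])[0]
        scores = table_emb @ q  # dot-product similarity between query and table embeddings
        topk = [table_ids[i] for i in np.argsort(-scores)[:k]]
        hits += gold in topk
    return hits / len(queries)

print(recall_at_k(k=1))
```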
- It Took Longer than I was Expecting: Why is Dataset Search Still so Hard? Madelon Hulsebos, Wenjing Lin, Shreya Shankar, and 1 more author. In Proceedings of the 2024 Workshop on Human-In-the-Loop Data Analytics, Santiago, Chile, 2024.
Dataset search is a long-standing problem across both industry and academia. While most industry tools focus on identifying one or more datasets matching a user-specified query, most recent academic papers focus on the subsequent problems of join and union discovery for a given dataset. In this paper, we take a step back and ask: is the task of identifying an initial dataset really a solved problem? Are join and union discovery indeed the most pressing problems to work on? To answer these questions, we survey 89 data professionals and surface the objectives, processes, and tools used to search for structured datasets, along with the challenges faced when using existing systems. We uncover characteristics of data content and metadata that are most important for data professionals during search, such as granularity and data freshness. Informed by our analysis, we argue that dataset search is not yet a solved problem, but is, in fact, difficult to solve. To move the needle in the right direction, we distill four desiderata for future dataset search systems: iterative interfaces, hybrid querying, task-driven search and result diversity.
- SchemaPile: A Large Collection of Relational Database Schemas. Till Döhmen, Radu Geacu, Madelon Hulsebos, and 1 more author. Proc. ACM Manag. Data, May 2024.
Access to fine-grained schema information is crucial for understanding how relational databases are designed and used in practice, and for building systems that help users interact with them. Furthermore, such information is required as training data to leverage the potential of large language models (LLMs) for improving data preparation, data integration and natural language querying. Existing single-table corpora such as GitTables provide insights into how tables are structured in-the-wild, but lack detailed schema information about how tables relate to each other, as well as metadata like data types or integrity constraints. On the other hand, existing multi-table (or database schema) datasets are rather small and attribute-poor, leaving it unclear to what extent they actually represent typical real-world database schemas. In order to address these challenges, we present SchemaPile, a corpus of 221,171 database schemas, extracted from SQL files on GitHub. It contains 1.7 million tables with 10 million column definitions, 700 thousand foreign key relationships, 7 million integrity constraints, and data content for more than 340 thousand tables. We conduct an in-depth analysis on the millions of schema metadata properties in our corpus, as well as its highly diverse language and topic distribution. In addition, we showcase the potential of the corpus to improve a variety of data management applications, e.g., fine-tuning LLMs for schema-only foreign key detection, improving CSV header detection and evaluating multi-dialect SQL parsers. We publish the code and data for recreating SchemaPile and a permissively licensed subset SchemaPile-Perm.
- SPADE: Synthesizing Data Quality Assertions for Large Language Model Pipelines. Shreya Shankar, Haotian Li, Parth Asawa, and 7 more authors. Proc. VLDB Endow., Aug 2024.
Large language models (LLMs) are being increasingly deployed as part of pipelines that repeatedly process or generate data of some sort. However, a common barrier to deployment is the frequent and often unpredictable errors that plague LLMs. Acknowledging the inevitability of these errors, we propose data quality assertions to identify when LLMs may be making mistakes. We present SPADE, a method for automatically synthesizing data quality assertions that identify bad LLM outputs. We make the observation that developers often identify data quality issues during prototyping prior to deployment, and attempt to address them by adding instructions to the LLM prompt over time. SPADE therefore analyzes histories of prompt versions over time to create candidate assertion functions and then selects a minimal set that fulfills both coverage and accuracy requirements. In testing across nine different real-world LLM pipelines, SPADE efficiently reduces the number of assertions by 14% and decreases false failures by 21% when compared to simpler baselines. SPADE has been deployed as an offering within LangSmith, LangChain’s LLM pipeline hub, and has been used to generate data quality assertions for over 2000 pipelines across a spectrum of industries.
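To make "data quality assertions" concrete: if a developer had added the instruction "answer in fewer than 100 words and never mention competitor names" to a prompt, assertions in that spirit might look like the sketch below. This is illustrative Python under assumed prompt instructions, not SPADE's actual generated code.

```python
COMPETITORS = {"AcmeCorp", "Globex"}  # hypothetical list implied by the prompt instruction

def assert_word_limit(response: str, limit: int = 100) -> bool:
    """Passes if the LLM respected the 'fewer than 100 words' instruction."""
    return len(response.split()) < limit

def assert_no_competitors(response: str) -> bool:
    """Passes if no competitor name leaked into the output."""
    return not any(name.lower() in response.lower() for name in COMPETITORS)

# A pipeline would run the selected assertions on each LLM output and flag failures.
response = "Our product offers the best value for analytical workloads."
print(all(check(response) for check in (assert_word_limit, assert_no_competitors)))
```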
2023
- Observatory: Characterizing Embeddings of Relational Tables. Tianji Cong, Madelon Hulsebos, Zhenjie Sun, and 2 more authors. Proc. VLDB Endow., Dec 2023.
Language models and specialized table embedding models have recently demonstrated strong performance on many tasks over tabular data. Researchers and practitioners are keen to leverage these models in many new application contexts; but limited understanding of the strengths and weaknesses of these models, and the table representations they generate, makes the process of finding a suitable model for a given task reliant on trial and error. There is an urgent need to gain a comprehensive understanding of these models to minimize inefficiency and failures in downstream usage. To address this need, we propose Observatory, a formal framework to systematically analyze embedding representations of relational tables. Motivated both by invariants of the relational data model and by statistical considerations regarding data distributions, we define eight primitive properties, and corresponding measures to quantitatively characterize table embeddings for these properties. Based on these properties, we define an extensible framework to evaluate language and table embedding models. We collect and synthesize a suite of datasets and use Observatory to analyze nine such models. Our analysis provides insights into the strengths and weaknesses of learned representations over tables. We find, for example, that some models are sensitive to table structure such as column order, that functional dependencies are rarely reflected in embeddings, and that specialized table embedding models have relatively lower sample fidelity. Such insights help researchers and practitioners better anticipate model behaviors and select appropriate models for their downstream tasks, while guiding researchers in the development of new models.
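One such property, sensitivity to column order, can be probed roughly as in the sketch below: embed a serialized table and a column-permuted copy and compare the embeddings. The text-embedding model and toy table are stand-ins for the table models and measures analyzed in the paper.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

original = "name | country | year\nAda | UK | 1815"
permuted = "year | name | country\n1815 | Ada | UK"  # same content, columns reordered

e1, e2 = model.encode([original, permuted])
cosine = float(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))
# A column-order-invariant model would score close to 1.0 here.
print(cosine)
```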
- GitTables: A Large-Scale Corpus of Relational Tables. Madelon Hulsebos, Çagatay Demiralp, and Paul Groth. Proc. ACM Manag. Data, May 2023.
The success of deep learning has sparked interest in improving relational table tasks, like data preparation and search, with table representation models trained on large table corpora. Existing table corpora primarily contain tables extracted from HTML pages, limiting the capability to represent offline database tables. To train and evaluate high-capacity models for applications beyond the Web, we need resources with tables that resemble relational database tables. Here we introduce GitTables, a corpus of 1M relational tables extracted from GitHub. Our continuing curation aims at growing the corpus to at least 10M tables. Analyses of GitTables show that its structure, content, and topical coverage differ significantly from existing table corpora. We annotate table columns in GitTables with semantic types, hierarchical relations and descriptions from Schema.org and DBpedia. The evaluation of our annotation pipeline on the T2Dv2 benchmark illustrates that our approach provides results on par with human annotations. We present three applications of GitTables, demonstrating its value for learned semantic type detection models, schema completion methods, and benchmarks for table-to-KG matching, data search, and preparation. We make the corpus and code available at https://gittables.github.io.
- Models and Practice of Neural Table Representations. Madelon Hulsebos, Xiang Deng, Huan Sun, and 1 more author. In Companion of the 2023 International Conference on Management of Data, Seattle, WA, USA, 2023.
In the last few years, the natural language processing community witnessed advances in neural representations of free-form text with transformer-based language models (LMs). Given the importance of knowledge available in relational tables, recent research efforts extend LMs by developing neural representations for tabular data. In this tutorial, we present these proposals with three main goals. First, we aim at introducing the potentials and limitations of current models to a database audience. Second, we want the attendees to see the benefit of such a line of work in a large variety of data applications. Third, we would like to empower the audience with a new set of tools and to inspire them to tackle some of the important directions for neural table representations, including model and system design, evaluation, application and deployment. To achieve these goals, the tutorial is organized in two parts. The first part covers the background for neural table representations, including a survey of the most important systems. The second part is designed as a hands-on session, where attendees will use their laptops to explore this new framework and test neural models involving text and tabular data.
- Introducing the Observatory Library for End-to-End Table Embedding Inference. Tianji Cong, Zhenjie Sun, Paul Groth, and 2 more authors. In NeurIPS 2023 Second Table Representation Learning Workshop, 2023.
Transformer-based table embedding models have become prevalent for a wide range of applications involving tabular data. Such models require the serialization of a table as a sequence of tokens for model ingestion and embedding inference. Different downstream tasks require different kinds or levels of embeddings such as column or entity embeddings. Hence, various serialization and encoding methods have been proposed and implemented. Surprisingly, this conceptually simple process of creating table embeddings is not straightforward in practice for a few reasons: 1) a model may not natively expose a certain level of embedding; 2) choosing the correct table serialization and input preprocessing methods is difficult because there are many available; and 3) tables with a massive number of rows and columns cannot fit the input limit of models. In this work, we extend Observatory, a framework for characterizing embeddings of relational tables, by streamlining end-to-end inference of table embeddings, which eases the use of table embedding models in practice. The codebase of Observatory is publicly available at https://github.com/superctj/observatory.
2020
- Sato: Contextual Semantic Type Detection in Tables. Dan Zhang, Madelon Hulsebos, Yoshihiko Suhara, and 3 more authors. Proc. VLDB Endow., Jul 2020.
Detecting the semantic types of data columns in relational tables is important for various data preparation and information retrieval tasks such as data cleaning, schema matching, data discovery, and semantic search. However, existing detection approaches either perform poorly with dirty data, support only a limited number of semantic types, fail to incorporate the table context of columns or rely on large sample sizes for training data. We introduce Sato, a hybrid machine learning model to automatically detect the semantic types of columns in tables, exploiting the signals from the table context as well as the column values. Sato combines a deep learning model trained on a large-scale table corpus with topic modeling and structured prediction to achieve support-weighted and macro average F1 scores of 0.925 and 0.735, respectively, exceeding the state-of-the-art performance by a significant margin. We extensively analyze the overall and per-type performance of Sato, discussing how individual modeling components, as well as feature categories, contribute to its performance.
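The two headline metrics differ in how they aggregate per-type scores; the quick scikit-learn sketch below, on toy labels invented for the example, shows the distinction between support-weighted and macro-averaged F1.

```python
from sklearn.metrics import f1_score

# Toy semantic-type predictions: the rare type 'isbn' drags the macro average down
# more than the support-weighted one, which is dominated by the frequent types.
y_true = ["city", "city", "city", "city", "name", "name", "isbn"]
y_pred = ["city", "city", "city", "name", "name", "name", "city"]

print(f1_score(y_true, y_pred, average="weighted"))  # support-weighted F1
print(f1_score(y_true, y_pred, average="macro"))     # macro-averaged F1
```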
2019
- VizNet: Towards A Large-Scale Visualization Learning and Benchmarking Repository. Kevin Hu, Snehalkumar ’Neil’ S. Gaikwad, Madelon Hulsebos, and 7 more authors. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, Glasgow, Scotland, UK, 2019.
Researchers currently rely on ad hoc datasets to train automated visualization tools and evaluate the effectiveness of visualization designs. These exemplars often lack the characteristics of real-world datasets, and their one-off nature makes it difficult to compare different techniques. In this paper, we present VizNet: a large-scale corpus of over 31 million datasets compiled from open data repositories and online visualization galleries. On average, these datasets comprise 17 records over 3 dimensions, and across the corpus we find 51% of the dimensions record categorical data, 44% quantitative, and only 5% temporal. VizNet provides the necessary common baseline for comparing visualization design techniques, and developing benchmark models and algorithms for automating visual analysis. To demonstrate VizNet’s utility as a platform for conducting online crowdsourced experiments at scale, we replicate a prior study assessing the influence of user task and data distribution on visual encoding effectiveness, and extend it by considering an additional task: outlier detection. To contend with running such studies at scale, we demonstrate how a metric of perceptual effectiveness can be learned from experimental results, and show its predictive power across test datasets.
- Sherlock: A Deep Learning Approach to Semantic Data Type Detection. Madelon Hulsebos, Kevin Hu, Michiel Bakker, and 5 more authors. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 2019.
Correctly detecting the semantic type of data columns is crucial for data science tasks such as automated data cleaning, schema matching, and data discovery. Existing data preparation and analysis systems rely on dictionary lookups and regular expression matching to detect semantic types. However, these matching-based approaches often are not robust to dirty data and only detect a limited number of types. We introduce Sherlock, a multi-input deep neural network for detecting semantic types. We train Sherlock on 686,765 data columns retrieved from the VizNet corpus by matching 78 semantic types from DBpedia to column headers. We characterize each matched column with 1,588 features describing the statistical properties, character distributions, word embeddings, and paragraph vectors of column values. Sherlock achieves a support-weighted F1 score of 0.89, exceeding that of machine learning baselines, dictionary and regular expression benchmarks, and the consensus of crowdsourced annotations.
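A miniature of the kind of per-column featurization involved: the sketch below computes only a handful of statistical and character-distribution features, whereas the actual pipeline extracts 1,588 features, including word and paragraph embeddings.

```python
import numpy as np

def column_features(values: list[str]) -> dict:
    """Compute a few hand-crafted features over a column's cell values."""
    lengths = np.array([len(v) for v in values])
    return {
        "n_values": len(values),
        "frac_unique": len(set(values)) / len(values),
        "mean_length": float(lengths.mean()),
        "std_length": float(lengths.std()),
        "frac_numeric_chars": sum(c.isdigit() for v in values for c in v) / max(lengths.sum(), 1),
        "frac_with_at_sign": sum("@" in v for v in values) / len(values),  # hints at email addresses
    }

print(column_features(["alice@example.com", "bob@example.org", "carol@example.net"]))
```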