Responsible Data Integration: Next-generation Challenges

Tutorial at SIGMOD 2022

Data integration has been extensively studied by the data management community and is a core task in the data pre-processing step of ML pipelines. When the integrated data is used for analysis and model training, responsible data science requires addressing concerns about data quality and bias. We present a tutorial on data integration and responsibility, highlighting existing efforts in responsible data integration along with research opportunities and challenges. In this tutorial, we encourage the community to audit data integration tasks with responsibility measures and to develop integration techniques that optimize for the requirements of responsible data science. We focus on three critical aspects: (1) the requirements to be considered when evaluating and auditing data integration tasks for quality and bias; (2) the data integration tasks that call for attention to data responsibility measures, and the methods for satisfying these requirements; and (3) techniques, tasks, and open problems in data integration that help achieve data responsibility.
Cite as:
Fatemeh Nargesian, Abolfazl Asudeh, and H. V. Jagadish. Responsible Data Integration: Next-generation Challenges. In Proceedings of the 2022 International Conference on Management of Data (SIGMOD '22), June 12-17, 2022, Philadelphia, PA, USA.
Link to PDF:
Responsible Data Integration: Next-generation Challenges
Link to Slides:
Tutorial Organization:
Part 1: next-generation requirements of responsible AI
Part 2: revisiting data integration
Part 3: fairness-aware data integration
Part 4: open problems