Responsible Data Science and Data Ethics, Big Data Management and Analysis, Data Mining, Machine Learning, Algorithm Design, Ranking, Compact Maxima Representatives, Combinatorial Geometry, Web Data Retrieval.
Publications
Abolfazl Asudeh, Azade Nazi, Nan Zhang, Gautam Das, H. V. Jagadish. RRR: Rank-Regret Representative. Accepted in SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data, ACM, 2019.
Abolfazl Asudeh, H. V. Jagadish, Gerome Miklau, Julia Stoyanovich. On Obtaining Stable Rankings. In Proceedings of the VLDB Endowment (PVLDB), Vol. 12, Issue 3, 2019.
Abolfazl Asudeh, H. V. Jagadish, Julia Stoyanovich, and Gautam Das. Designing Fair Ranking Schemes. Accepted in SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data, ACM, 2019.
Abolfazl Asudeh, Azade Nazi, Jees Augustine, Saravanan Thirumuruganathan, Nan Zhang, Gautam Das, Divesh Srivastava. Leveraging Similarity Joins for Signal Reconstruction, in Proceedings of the VLDB Endowment (PVLDB), 11 (10): 1276-1288, 2018.
Sona Hasani, Saravanan Thirumuruganathan, Abolfazl Asudeh, Nick Koudas, Gautam Das. Efficient Construction of Approximate Ad-Hoc ML Models Through Materialization and Reuse. Accepted in Proceedings of the VLDB Endowment (PVLDB), 2018.
Abolfazl Asudeh, Azade Nazi, Nan Zhang, and Gautam Das. Efficient Computation of Regret-ratio Minimizing Set: A Compact Maxima Representative, in SIGMOD '17: Proceedings of the 2017 International Conference on Management of Data, ACM, 2017.
Md Farhadur Rahman, Abolfazl Asudeh, Nick Koudas, Gautam Das. Efficient Computation of Subspace Skyline over Categorical Domains. In ACM International Conference on Information and Knowledge Management (CIKM), ACM, 2017.
Abolfazl Asudeh, Nan Zhang, and Gautam Das. Query Reranking As A Service, in Proceedings of the VLDB Endowment (PVLDB), Vol. 9 Issue 11, 2016.
Abolfazl Asudeh, Saravanan Thirumuruganathan, Nan Zhang, and Gautam Das. Discovering the Skyline of Web Databases, in Proceedings of the VLDB Endowment (PVLDB), Vol. 9 Issue 7, 2016.
Ning Yan, Sona Hasani, Abolfazl Asudeh, Chengkai Li, Generating Preview Tables for Entity Graphs, in SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data, ACM, 2016.
Abolfazl Asudeh, Gensheng Zhang, Naeemul Hassan, Chengkai Li, and Gergely V. Zaruba, Crowdsourcing Pareto-Optimal Object Finding by Pairwise Comparisons, in Proceedings of the 24th ACM International Conference on Information and Knowledge Management (CIKM), ACM, 2015.
Abolfazl Asudeh, Gergely V. Zaruba, and Sajal K. Das. A General Model for MAC Protocol Selection in Wireless Sensor Networks, in Ad Hoc Networks Journal, Volume 36, Part 1, pp. 189-202, ISSN 1570-8705, Elsevier, 2016.
Working Papers
Abolfazl Asudeh, Zhongjun Jin, H. V. Jagadish. Assessing and Remedying Coverage for a Given Dataset. CoRR, abs/1810.06742, 2018.
Abolfazl Asudeh, Azade Nazi, Nick Koudas, Gautam Das. Assisting Service Providers in Peer-to-Peer Marketplaces: Maximizing Gain Over Flexible Attributes. CoRR, abs/1705.03028, April 2017.
Demo Papers
Ke Yang, Julia Stoyanovich, Abolfazl Asudeh, Bill Howe, H. V. Jagadish, Gerome Miklau, A Nutritional Label for Rankings, in SIGMOD, 2018.
Yeshwanth D. Gunasekaran, Abolfazl Asudeh, Sona Hasani, Nan Zhang, Ali Jaoua, Gautam Das, QR2: A Third-party Query Reranking Service Over Web Databases, in ICDE, 2018.
Projects
Data Ethics and Responsible Data Management: Ranking
In PVLDB 2019
We often have to rank items with multiple attributes in a dataset. A typical method to achieve this is to compute a goodness score for each item as a weighted sum of its attribute values, and then to rank by sorting on this score. Clearly, the ranking obtained depends on the weights used for this summation. Ideally, we would want the ranked order not to change if the weights are changed slightly. We call this property stability of the ranking. A consumer of a ranked list may trust the ranking more if it has high stability. A producer of a ranked list prefers to choose weights that result in a stable ranking, both to earn the trust of potential consumers and because a stable ranking is intrinsically likely to be more meaningful.
In this paper, we develop a framework that can be used to assess the stability of a provided ranking and to obtain a stable ranking within an "acceptable" range of weight values (called "the region of interest"). We address the case where the user cares about the rank order of the entire set of items, and also the case where the user cares only about the top-k items. Using a geometric interpretation, we propose algorithms for producing stable rankings. We also propose a randomized algorithm that uses Monte Carlo estimation. To do so, we first propose an unbiased sampler for drawing rankings (or top-k results) uniformly at random from the region of interest. In addition to the theoretical analyses, we conduct extensive experiments on real datasets that validate our proposal.
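The core mechanism described above (weighted-sum scoring, plus Monte Carlo sampling of weight vectors from a region of interest) can be sketched in a few lines. This is an illustrative toy, not the paper's algorithm: the items, the two-attribute setup, and the one-parameter region are all assumptions made for the example.

```python
import random

# Toy items with two numeric attributes (hypothetical data).
items = {"a": (0.9, 0.3), "b": (0.5, 0.6), "c": (0.2, 0.8)}

def ranking(w1, w2):
    """Rank item ids by descending weighted-sum score."""
    return tuple(sorted(items, key=lambda i: -(w1 * items[i][0] + w2 * items[i][1])))

def stability(target, region=(0.3, 0.7), samples=10_000, seed=0):
    """Estimate the fraction of weight vectors in the region of
    interest that reproduce the target ranking (Monte Carlo)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        w1 = rng.uniform(*region)
        hits += ranking(w1, 1.0 - w1) == target
    return hits / samples
```

A ranking whose stability estimate is close to 1 is robust to small weight perturbations within the region; a value near 0 suggests the chosen weights sit close to an ordering-change boundary.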
Accepted in SIGMOD 2019
Items from a database are often ranked based on a combination of multiple criteria. A user may have the flexibility to accept combinations that weigh these criteria differently, within limits. On the other hand, this choice of weights can greatly affect the fairness of the produced ranking.
We develop a system that helps users choose criterion weights that lead to greater fairness. We consider ranking functions that compute the score of each item as a weighted sum of (numeric) attribute values, and then sort items on their score. Each ranking function can be expressed as a vector of weights, or as a point in a multi-dimensional space. For a broad range of fairness criteria, we show how to efficiently identify regions in this space that satisfy these criteria. Using this identification method, our system is able to tell users whether their proposed ranking function satisfies the desired fairness criteria and, if it does not, to suggest the smallest modification that does. We develop user-controllable approximation and indexing techniques that are applied during preprocessing and support sub-second response times during the online phase.
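As a rough illustration of testing fairness criteria over the space of weight vectors, the sketch below uses a hypothetical two-attribute dataset with group labels, an assumed proportional-representation criterion on the top-k, and a brute-force grid scan in place of the paper's efficient geometric region computation.

```python
# Each item: (id, attr1, attr2, group); values and groups are hypothetical.
data = [
    ("a", 0.9, 0.3, "G1"),
    ("b", 0.5, 0.6, "G2"),
    ("c", 0.2, 0.8, "G2"),
    ("d", 0.7, 0.5, "G1"),
]

def top_k(w1, k=2):
    """Top-k items under the weighted-sum scoring function (w1, 1 - w1)."""
    w2 = 1.0 - w1
    return sorted(data, key=lambda t: -(w1 * t[1] + w2 * t[2]))[:k]

def is_fair(w1, k=2, min_g2=1):
    """Assumed fairness criterion: at least `min_g2` of the top-k are from G2."""
    return sum(t[3] == "G2" for t in top_k(w1, k)) >= min_g2

def nearest_fair_weight(w1, step=0.01):
    """Grid-scan for the satisfying weight closest to the user's proposed w1
    (a brute-force stand-in for the geometric region identification)."""
    candidates = [i * step for i in range(int(1 / step) + 1)]
    fair = [w for w in candidates if is_fair(w)]
    return min(fair, key=lambda w: abs(w - w1)) if fair else None
```

On this toy data, a weight vector that heavily favors the first attribute fails the criterion, and the scan suggests the nearest weight that satisfies it.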
Algorithmic decisions often result in scoring and ranking individuals to determine credit worthiness, qualifications for college admissions and employment, and compatibility as dating partners. While automatic and seemingly objective, ranking algorithms can result in low diversity. Furthermore, ranked results are often unstable: small changes in the input data or in the ranking methodology may lead to drastic changes in the output, making the result uninformative and easy to manipulate. Similar concerns apply in cases where items other than individuals are ranked, including colleges, academic departments, or products.
In this demonstration we present Ranking Facts, a Web-based application that generates a “nutritional label” for rankings. Ranking Facts is made up of a collection of visual widgets that implement our latest research results on fairness, stability, and transparency for rankings, and that communicate details of the ranking methodology, or of the output, to the end user. We will showcase Ranking Facts on real datasets from different domains, including college rankings, criminal risk assessment, and financial services.
in PVLDB 2018
The signal reconstruction problem (SRP) is an important optimization problem in which the objective is to identify a solution to an underdetermined system of linear equations that is closest to a given prior. It has a substantial number of applications in diverse areas, including network traffic engineering, medical image reconstruction, acoustics, and astronomy. Most common approaches for SRP do not scale to large problem sizes. We propose a dual formulation of this problem and show how adapting database techniques developed for scalable similarity joins provides a significant speedup.
*: Invited to Best of VLDB '18, the special issue of the VLDB Journal.
Accepted in SIGMOD 2019
Selecting the best items in a dataset is a common task in data exploration. However, the concept of "best" lies in the eyes of the beholder: different users may consider different attributes more important, and hence arrive at different rankings. Nevertheless, one can remove "dominated" items and create a "representative" subset of the dataset, comprising the "best" items in it. A Pareto-optimal representative is guaranteed to contain the best item of each possible ranking, but it can be almost as big as the full data. A smaller representative can be found if we relax the requirement to include the best item for every possible user, and instead just limit the users' "regret". Existing work defines regret as the loss in score from limiting consideration to the representative instead of the full dataset, for any chosen ranking function. However, the score is often not a meaningful number, and users may not understand its absolute value. Sometimes small ranges in score can include large fractions of the dataset. In contrast, users do understand the notion of rank ordering. Therefore, we instead consider the position of the items in the ranked list for defining the regret, and propose the rank-regret representative as the minimal subset of the data containing at least one of the top-k of any possible ranking function. This problem is NP-complete. We use the geometric interpretation of items to bound their ranks on ranges of functions, and utilize combinatorial geometry notions to develop effective and efficient approximation algorithms for the problem. Experiments on real datasets demonstrate that we can efficiently find small subsets with small rank-regrets.
*: One of only 3 of the 159 submissions accepted directly in the first submission round of SIGMOD '19.
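The rank-regret of a candidate subset, as defined above, can be estimated with a simple Monte Carlo sketch: sample linear ranking functions and record the worst rank achieved by the best subset member. This is an illustrative stand-in, with made-up two-dimensional data, for the paper's combinatorial-geometry algorithms.

```python
import random

def rank_regret(subset, data, samples=2000, seed=1):
    """Monte Carlo estimate of the rank-regret of `subset`: over sampled
    linear ranking functions, the worst (largest) rank of the best subset
    member in the full ranking of `data`."""
    rng = random.Random(seed)
    worst = 0
    for _ in range(samples):
        w = rng.random()
        ranked = sorted(data, key=lambda p: -(w * p[0] + (1 - w) * p[1]))
        best_rank = 1 + min(ranked.index(p) for p in subset)
        worst = max(worst, best_rank)
    return worst

# Hypothetical 2-d points; the two extreme points cover the top-2
# of every linear ranking function over this data.
data = [(1.0, 0.0), (0.0, 1.0), (0.7, 0.7), (0.4, 0.4)]
```

A rank-regret of k means the subset always contains at least one of the top-k items, whichever (sampled) linear function the user prefers.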
In SIGMOD 2017
Finding the maxima of a database based on a user preference, especially when the ranking function is a linear combination of the attributes, has been the subject of recent research. A critical observation is that the convex hull is the subset of tuples that can be used to find the maxima of any linear function. However, in real-world applications the convex hull can be a significant portion of the database, which greatly reduces its usefulness as a compact representative. Thus, computing a subset limited to r tuples that minimizes the regret ratio (a measure of the user's dissatisfaction with the result from the limited set versus the one from the entire database) is of interest. In this paper, we make several fundamental theoretical as well as practical advances in developing such a compact set. In the case of two-dimensional databases, we develop an optimal linearithmic time algorithm by leveraging the ordering of skyline tuples. In the case of higher dimensions, the problem is known to be NP-complete. As one of the main results of this paper, we develop an approximation algorithm that runs in linearithmic time and guarantees a regret ratio within any arbitrarily small user-controllable distance of the optimal regret ratio. A comprehensive set of experiments on both synthetic and publicly available real datasets confirms the efficiency, quality of output, and scalability of our proposed algorithms.
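The regret ratio mentioned above has a direct computational reading: for each ranking function, compare the best score achievable within the compact subset against the best score in the full database, and take the worst relative loss. A minimal sketch over a hypothetical two-dimensional dataset and a fixed set of weight vectors:

```python
def regret_ratio(subset, data, weights):
    """Regret ratio of `subset` w.r.t. `data` over the given linear
    ranking functions: the worst relative loss in top score when the
    user is restricted to the subset."""
    worst = 0.0
    for w1, w2 in weights:
        best_full = max(w1 * x + w2 * y for x, y in data)
        best_sub = max(w1 * x + w2 * y for x, y in subset)
        worst = max(worst, 1.0 - best_sub / best_full)
    return worst

# Hypothetical data: the single point (0.8, 0.8) loses at most 20% of
# the top score against any of these three weight vectors.
data = [(1.0, 0.0), (0.0, 1.0), (0.8, 0.8)]
weights = [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0)]
```

In the paper the maximization is over all nonnegative linear functions rather than a finite list; the finite list here keeps the sketch exact and easy to check.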
in PVLDB 2018
Machine learning has become an essential toolkit for complex analytic processing. Data is typically stored in large data warehouses with multiple dimension hierarchies. Often, data used for building an ML model are aligned on OLAP hierarchies such as location or time. In this paper, we investigate the feasibility of efficiently constructing approximate ML models for new queries from previously constructed ML models by leveraging the concepts of model materialization and reuse. For example, is it possible to construct an approximate ML model for data from the year 2017 if one already has ML models for each of its quarters? We propose algorithms that can support a wide variety of ML models such as generalized linear models for classification along with K-Means and Gaussian Mixture models for clustering. We propose a cost-based optimization framework that identifies appropriate ML models to combine at query time, and conduct extensive experiments on real-world and synthetic datasets. Our results indicate that our framework can support analytic queries on ML models with superior performance, achieving dramatic speedups of several orders of magnitude on very large datasets.
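The quarters-to-year example above can be made concrete for one simple model class. The sketch below is an assumption-laden illustration, not the paper's framework: each quarter materializes the sufficient statistics of a one-feature least-squares model, and merging them yields the yearly model without another pass over the raw data. For linear regression this reuse happens to be exact; for K-Means or GMMs it would be approximate.

```python
def materialize(xs, ys):
    """Per-partition sufficient statistics for least squares with intercept."""
    return {
        "n": len(xs),
        "sx": sum(xs), "sy": sum(ys),
        "sxx": sum(x * x for x in xs),
        "sxy": sum(x * y for x, y in zip(xs, ys)),
    }

def merge(stats_list):
    """Combine materialized statistics from several partitions (e.g. quarters)."""
    return {k: sum(s[k] for s in stats_list) for k in ["n", "sx", "sy", "sxx", "sxy"]}

def fit(stats):
    """Solve the 2x2 normal equations for (slope, intercept)."""
    n, sx, sy, sxx, sxy = (stats[k] for k in ["n", "sx", "sy", "sxx", "sxy"])
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept
```

At query time, answering "model for 2017" then reduces to merging four small dictionaries and solving a tiny system, instead of rescanning a year of warehouse data.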
In PVLDB 2016 [paper][BibTex][slides]
The ranked retrieval model has rapidly become the de facto way for search query processing in client-server databases, especially those on the web. Despite the extensive efforts in the database community on designing better ranking functions/mechanisms, many such databases in practice still fail to address the diverse and sometimes contradictory preferences of users on tuple ranking, perhaps (at least partially) due to the lack of expertise and/or motivation for the database owner to design truly effective ranking functions. This paper takes a different route to addressing the issue by defining a novel query reranking problem, i.e., we aim to design a third-party service that uses nothing but the public search interface of a client-server database to enable the on-the-fly processing of queries with any user-specified ranking functions (with or without selection conditions), whether or not the ranking function is supported by the database. We analyze the worst-case complexity of the problem and introduce a number of ideas, e.g., on-the-fly indexing, domination detection and virtual tuple pruning, to reduce the average-case cost of the query reranking algorithm. We also present extensive experimental results on real-world datasets, in both offline and live online systems, that demonstrate the effectiveness of our proposed techniques.
In PVLDB 2016 [paper][technical report][BibTex][slides]
Many web databases are "hidden" behind proprietary search interfaces that enforce the top-k output constraint, i.e., each query returns at most k of all matching tuples, preferentially selected and returned according to a proprietary ranking function.
In this paper, we initiate research into the novel problem of skyline discovery over top-k hidden web databases. Since skyline tuples provide critical insights into the database and include the top-ranked tuple for every possible ranking function following the monotonic order of attribute values, skyline discovery from a hidden web database can enable a wide variety of innovative third-party applications over one or multiple web databases.
In ICDE 2018 Demo Track
The ranked retrieval model has rapidly become the de facto way for search query processing in web databases. Despite extensive efforts on designing better ranking mechanisms, in practice, many such databases fail to address the diverse and sometimes contradictory preferences of users.
We present QR2, a third-party service that uses nothing but the public search interface of a web database and enables the on-the-fly processing of queries with any user-specified ranking functions, whether or not the ranking function is supported by the database.
In CIKM 2017
Platforms such as AirBnB, Zillow, Yelp, and related sites have transformed the way we search for accommodation, restaurants, etc. The underlying datasets in such applications have numerous attributes that are mostly Boolean or categorical. Discovering the skyline of such datasets over a subset of attributes would identify entries that stand out while enabling numerous applications.
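Skyline computation over categorical attributes reduces to a dominance test once each attribute's categories are mapped to a preference order. The quadratic baseline below is illustrative only (the paper's contribution is doing far better than this); the Boolean amenity encoding is an assumption made for the example.

```python
# Assumed preference order on categories (higher = better),
# e.g. Boolean amenities such as wifi/parking/pool.
ORDER = {"no": 0, "yes": 1}

def dominates(a, b, subspace):
    """a dominates b on the chosen subspace of attribute indices if it is
    at least as good everywhere and strictly better somewhere."""
    ge = all(ORDER[a[i]] >= ORDER[b[i]] for i in subspace)
    gt = any(ORDER[a[i]] > ORDER[b[i]] for i in subspace)
    return ge and gt

def skyline(tuples, subspace):
    """Quadratic baseline: keep every tuple not dominated by another."""
    return [t for t in tuples
            if not any(dominates(u, t, subspace) for u in tuples)]
```

Note that the skyline depends on the chosen subspace: a listing dominated on (wifi, parking) may still stand out once pool is considered, which is exactly why subspace skylines are interesting.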
ArXiv e-prints 1705.03028
Peer-to-peer marketplaces enable the transactional exchange of services directly between people. In such platforms, those providing a service are faced with various choices. For example, in peer-to-peer travel marketplaces, although some amenities (attributes) of a property are fixed, others are relatively flexible and can be provided without significant effort. Providing an attribute is usually associated with a cost. Naturally, different sets of attributes may have different “gains” (monetary or otherwise) for a service provider. Consequently, given a limited budget, deciding which attributes to offer is challenging. In this project, we propose techniques that help service providers in decision making.
In CIKM 2015 [paper][slides][BibTex]
This is the first study on crowdsourcing Pareto-optimal object finding, which has applications in public opinion collection, group decision making, and information exploration. Departing from prior studies on crowdsourcing skyline and ranking queries, it considers the case where objects do not have explicit attributes and preference relations on objects are strict partial orders. The partial orders are derived by aggregating crowdsourcers' responses to pairwise comparison questions. The goal is to find all Pareto-optimal objects with the fewest possible questions. It employs an iterative question-selection framework. Guided by the principle of eagerly identifying non-Pareto-optimal objects, the framework only chooses candidate questions which must satisfy three conditions. This design is both sufficient and efficient, as it is proven to find a short terminal question sequence. The framework is further steered by two ideas: macro-ordering and micro-ordering. By different micro-ordering heuristics, the framework is instantiated into several algorithms with varying power in pruning questions. Experimental results using both a real crowdsourcing marketplace and simulations exhibited not only orders-of-magnitude reductions in questions when compared with a brute-force approach, but also close-to-optimal performance from the most efficient instantiation.
Code: The implementation of the framework and its instantiations (in C++) is located here.
In SIGMOD 2016 [paper]
Users and developers are tapping into big, complex entity graphs for numerous applications. It is challenging to select entity graphs for a particular need, given abundant datasets from many sources and the oftentimes scarce information available for them. We propose methods to automatically produce preview tables for entity graphs, for compact presentation of important entity types and relationships. The preview tables assist users in attaining a quick and rough preview of the data. They can be shown in a limited display space for a user to browse and explore, before she decides to spend time and resources to fetch and investigate the complete dataset. We formulate several optimization problems that look for previews with the highest scores according to intuitive goodness measures, under various constraints on preview size and distance between preview tables. The optimization problem under distance constraint is NP-hard. We design a dynamic-programming algorithm and an Apriori-style algorithm for finding optimal previews. The experiments and user studies on Freebase demonstrated both the scoring measures' accuracy and the discovery algorithms' efficiency.
*: Awarded the Most Reproducible Paper Award of SIGMOD 2017
In Ad-Hoc Networks Journal, 2016 [pre-print version][BibTex] [Usage Report]
Wireless Sensor Networks (WSNs) are being deployed for different applications, each having its own structure, goals and requirements. Medium access control (MAC) protocols play a significant role in WSNs and hence should be tuned to the applications. However, there is no general model for selecting MAC protocols for different situations. Therefore, it is hard to decide which MAC protocol is suitable for a given situation. Having a precise model for each MAC protocol, on the other hand, is almost impossible. Using the intuition that protocols in the same behavioral category perform similarly, our goal in this paper is to introduce a general model that selects the protocol(s) that satisfy the given requirements from the category that performs best for a given context. We define the Combined Performance Function (CPF) to capture the performance of different categories of protocols for different contexts. Having the general model, we then discuss its scalability for adding new protocols, categories, requirements, and performance criteria. Considering energy consumption and delay as the initial performance criteria of the model, we focus on deriving mathematical models for them. The results extracted from CPF match the well-known rules of thumb for MAC protocols, which verifies our model. We validate our models through a simulation study. We have also implemented the current CPF model as a web page to make the model available online.
Code: The Discrete Event Simulator and the code (in C++) are located here.