Database users sometimes want to ask queries that an automated system cannot answer directly from the data in their database, but that could be answered with human assistance. Such queries include requests for data that has not yet been added to the database and requests for subjective comparisons between data items. Conventional DBMSs do not support these operations, because they require information from outside the database’s context, but a human assistant could bridge this gap.
CrowdDB is a DBMS extension that uses crowdsourcing behind the scenes, to allow database users to make queries that look up missing data or make subjective comparisons between data tuples. CrowdDB automatically creates tasks for Amazon Mechanical Turk workers to carry out, through auto-generated user interfaces, that let the workers fill in missing tuple values or make binary comparisons between tuples. The results are then combined via majority vote for error correction. Any binary comparisons are merged as needed to produce sorted results, and query results are reported to the user. Crowdsourced answers are cached in the database to save time, in case they are needed again.
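The aggregation steps described above can be illustrated with a minimal Python sketch. It is not CrowdDB's implementation: `ask_workers` is a hypothetical callback standing in for a replicated comparison task, and the in-memory `cache` dictionary stands in for answers stored in the database.

```python
from collections import Counter
from functools import cmp_to_key


def majority_vote(answers):
    """Return the most common answer among replicated worker answers."""
    if not answers:
        raise ValueError("no answers to aggregate")
    winner, _count = Counter(answers).most_common(1)[0]
    return winner


def crowd_sort(items, ask_workers):
    """Sort items using majority-voted pairwise comparisons.

    ask_workers(a, b) is a hypothetical callback that returns a list of
    votes, each "a" or "b", naming the argument a worker would rank first.
    """
    cache = {}  # (a, b) -> -1 or 1, so each pair is crowdsourced only once

    def compare(a, b):
        if a == b:
            return 0
        if (a, b) not in cache:
            first = majority_vote(ask_workers(a, b))
            result = -1 if first == "a" else 1
            cache[(a, b)] = result
            cache[(b, a)] = -result  # the reverse question needs no new task
        return cache[(a, b)]

    return sorted(items, key=cmp_to_key(compare))
```

With binary votes, an odd number of replicas per comparison guarantees that the majority vote never ties. The sketch also assumes the voted comparisons are mutually consistent; real crowd answers may contain cycles that need extra handling.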
One critical part of CrowdDB is a small extension to SQL that is called CrowdSQL. This extension features new operators, CROWDEQUAL and CROWDORDER. CROWDEQUAL is used to determine if two values (typically strings) are equal or not. It can be used, for example, to determine if two names refer to the same entity, in a WHERE clause or equijoin. CROWDORDER is used to rank items based on a subjective quantity, such as relevance to a topic. CrowdSQL also includes CROWD columns, whose values are to be filled in by workers, as needed to report query results.
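As a rough illustration of how a crowd equality test might drive an equijoin, here is a small Python sketch. The function `crowd_equal` is a hypothetical stand-in for a CROWDEQUAL call, and the nested-loop strategy is an assumption made for brevity, not the plan CrowdDB actually generates.

```python
def crowd_equijoin(left, right, crowd_equal):
    """Nested-loop join that keeps pairs a crowd predicate judges equal.

    crowd_equal(a, b) is a hypothetical stand-in for a CROWDEQUAL call: it
    would post a comparison task and return the workers' verdict as a bool.
    """
    verdicts = {}  # frozenset({a, b}) -> bool, so each pair is asked once
    matches = []
    for a in left:
        for b in right:
            if a == b:
                matches.append((a, b))  # exact match: no crowd needed
                continue
            pair = frozenset((a, b))
            if pair not in verdicts:
                verdicts[pair] = crowd_equal(a, b)
            if verdicts[pair]:
                matches.append((a, b))
    return matches
```

Exact matches are resolved locally, so the crowd is only consulted for pairs that a literal comparison would reject.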
A major limitation of a crowdsourced database is the lack of guarantees such a system can offer the user. The authors admit that they do not yet understand how to predict the time, cost, and quality of results based on crowdsourcing hyperparameters, such as time of day, number of workers per task, and payment per completed task. Before CrowdDB can be useful, it will be necessary to show that common queries are served in bounded, reasonable time and cost, with accurate results.
The paper presents CrowdDB, a system that uses human input via crowdsourcing to process queries that neither a database nor a search engine can adequately answer. Some queries cannot be answered by a machine alone: they require additional information that humans can provide, such as data missing from the database. Human input is needed both for handling unknown or incomplete data and for making subjective comparisons. CrowdDB uses SQL both as a language for posing complex queries and as a way to model data.
Crowdsourcing platforms have recently been on the rise. Microtask crowdsourcing platforms such as Amazon’s Mechanical Turk provide the infrastructure, connectivity and payment mechanisms that enable hundreds of thousands of people to perform paid work on the Internet. CrowdDB exploits the extensibility of the iterator-based query processing paradigm to add crowd functionality to a DBMS. It provides physical data independence for the crowd, so that developers need not focus on which operations will be done in the database and which will be done by the crowd. User interface design is a key factor in enabling questions to be answered by people. The declarative, operator-based design also presents the opportunity to implement cost-based optimizations to improve query cost, time and accuracy. The architecture of CrowdDB accepts queries in CrowdSQL, an extension of standard SQL, and adds three components: Turker Relationship Management, User Interface Management and the HIT Manager.
The paper provides a comprehensive study of CrowdDB, supported by favorable experimental results. The authors’ microbenchmarks are presented graphically, with relevant experiments to support the model.
The relevance of crowdsourcing is increasing day by day, but this comes at a cost. Results from the crowd are sometimes biased and inaccurate, so answer quality is a major issue. The paper also says little about current demand for such a system. This is a new area of research, and a great deal of work is still required to corroborate the idea.
What is the problem addressed?
CrowdDB provides a crowdsourced query processing system that integrates the computational power of humans and computers.
The assumptions that RDBMSs make become restrictive as data creation and use become increasingly democratized. Moreover, some queries cannot be answered by machines alone, namely those involving unknown or incomplete data or subjective comparisons. Processing such queries requires human input for providing information that is missing from the database, for performing computationally difficult functions, and for matching, ranking, or aggregating results based on fuzzy criteria. It is therefore helpful to use human input via crowdsourcing to process queries that neither database systems nor search engines can adequately answer.
1-2 main technical contributions? Describe.
- By using SQL they create a declarative interface to the crowd, and strive to maintain SQL semantics so that developers are presented with a known computational model.
- CrowdDB provides Physical Data Independence for the crowd. That is, application developers can write SQL queries without having to focus on which operations will be done in the database and which will be done by the crowd. Existing SQL queries can be run on CrowdDB, and in many cases will return more complete and correct answers than if run on a traditional DBMS.
- User interface design is a key factor in enabling questions to be answered by people. Their approach leverages schema information to support the automatic generation of effective user interfaces for crowdsourced tasks.
- Because of its declarative programming environment and operator-based approach, CrowdDB presents the opportunity to implement cost-based optimizations to improve query cost, time, and accuracy.
1-2 weaknesses or open questions? Describe and discuss.
A “buffer” for managing crowdsourced answers would be an interesting problem to work on, and answer quality assessment and improvement will require the development of new techniques and operators.
This paper presents a new relational query processing system called CrowdDB, which can use crowdsourcing to answer queries that cannot otherwise be answered. In short, CrowdDB uses human input via crowdsourcing to process queries. CrowdDB uses SQL as a language to process complex queries, and also as a way to model data. The paper also presents experiments with CrowdDB on Amazon Mechanical Turk, which show that human input can indeed be leveraged to dramatically extend the range of SQL-based query processing. The paper first provides an overview of the problem and the design. Second, it presents background on crowdsourcing and the AMT platform. It then moves to the details of CrowdDB, including system design, the SQL extension, user interface generation and query processing. It also shows experimental results on Amazon Mechanical Turk. Finally, it discusses related and future work.
The problem here is that many queries cannot be answered by machines alone; processing such queries requires human input, for example for performing computationally difficult functions, matching, ranking, and aggregating. An RDBMS makes several key assumptions about the correctness, completeness and unambiguity of the data. When these assumptions do not hold, it returns incorrect or incomplete answers to user questions. For example, the same entity might be stored under different names (IBM vs. International Business Machines). Therefore, we need to explore how to leverage human resources to extend the capabilities of database systems so that they can answer such queries, and CrowdDB is a good solution.
The major contribution of the paper is that it provides design details and examples for the new design CrowdDB. By doing several performance experiments on different benchmarks, it shows that it is possible to employ human input via crowdsourcing to process queries that traditional systems cannot. Here we will summarize key elements of CrowdDB:
1. simple SQL schema and query extensions
2. new crowdsourcing query operators and plan generation techniques
3. automatically generating effective user interfaces
4. micro-benchmarks of the performance of individual crowdsourced query operators on AMT platform
One interesting observation: the paper is very innovative, and the idea of using a crowdsourcing platform is interesting. It shows many advantages of crowdsourcing, but it would be better if it also discussed the disadvantages, so that we could identify the cases where CrowdDB might not perform well.
The paper introduces CrowdDB, which uses human input via crowdsourcing to process queries that neither databases nor search engines can answer. It highlights various factors that affect CrowdDB, which works by relaxing the closed-world assumption through human input, using SQL both as a language for posing queries and as a way to model data.
WHAT PROBLEM DOES THIS PAPER SOLVE?
The paper describes the limitations of RDBMSs, which return no output for missing or incorrectly stored data. It provides various examples showing how entity resolution, the closed-world assumption and the extremely literal nature of relational databases pose problems for real-world queries. CrowdDB solves this problem by exploiting the extensibility of iterator-based query processing to implement crowd functionality and use human capabilities to find and compare data.
CrowdDB is implemented on Amazon Mechanical Turk, which makes it easy for people to perform tasks. It also uses CrowdSQL, an extension to SQL that helps in handling incomplete data and subjective comparisons, along with user interfaces that allow interaction between the workers and the requester.
Amazon Mechanical Turk:
Requesters use micro-tasks in AMT to post queries, which are answered by workers who volunteer for the tasks and are paid on completion. A HIT (Human Intelligence Task) can be replicated into multiple assignments, typically an odd number, which helps when taking a majority vote. HITs are grouped together based on the requester, title, description and reward. AMT provides two interfaces: the AMT interface for workers, and the requester interface, which captures the workers’ relationship and reputation with the requester. Requesters use the Mechanical Turk API to create HITs, get the assignments for a HIT, approve or reject assignments, and expire HITs.
Design Considerations and overview of CrowdDB:
The design of CrowdDB explores how well the metaphor of humans as computers holds up, highlighting factors such as performance and variability, affinity and learning, and the size of the worker pool, which affects parallelism and throughput. CrowdDB invokes the crowd for data that is not available in storage. The Turker Relationship Management system maintains the worker pool for the HITs provided by requesters, along with requester information. The User Interface Management system automatically generates UIs for HITs based on SQL annotations and constraints in the schema. The HIT Manager makes the API calls to post HITs, assess their status and obtain results, and it interacts with the storage engine, optimizer and queries.
CrowdSQL, User Interface, Query Processing and Heuristics:
STRENGTHS AND DRAWBACKS:
The paper provides numerous examples that make CrowdDB easier to understand, and it lists many advantages, such as the use of SQL to create a declarative interface to the crowd, physical data independence, a simple and flexible user interface, and cost-based optimizations to improve query cost, time and accuracy. It explains the challenges posed by the cost of unbounded queries, the uncertainty of tuples and issues with CNULL, and it offers solutions such as defining a budget, adding constraints and using a LIMIT clause. The main drawback, in my view, is that the paper does not convincingly explain the effects of using results from low-expertise workers in place of the cleaned and validated results of traditional databases.
This paper presents an idea that is dramatically different from everything else we have seen so far in this class. While all of the other papers have been about how to manipulate data using hardware, this paper addresses how you can utilize people to answer queries. Human input can be combined with relational query operators to allow users to write queries without even knowing whether the data they get back is crowdsourced or was already known.
This work is made possible by the rise in popularity of online "micro-task" sites, such as Amazon Mechanical Turk. These services allow requesters to pose questions to a (hopefully large) base of workers and get answers with a (again, hopefully) high degree of accuracy. The underlying problem that this system attempts to solve is that a database cannot hold all of the information it will ever need: users will want to run new queries that need new information. To get this new information, crowdsourcing is incredibly helpful, because humans are very good at finding new information (they can use Google, read Wikipedia, and use logic and reason), at pattern recognition, and at classification. For example, if you need to know all of the universities that a given professor has worked at, you can pose this question to the crowd. People will then look up the professor's home pages, resumes, CVs, publications, etc. (finding new information) and filter the lists by which jobs are relevant (classification) until they have come to a comprehensive answer.
One thing that I thought was very interesting, that I wouldn't have thought of before, is how the system has to manage its relationship with the turkers: you have to lean towards accepting jobs, to avoid angry turkers ranting on forums, and you have to pay enough and provide a clean UI with clear instructions, so that turkers actually want to complete the jobs.
While the idea is very cool, there is some inherent latency because of human response times. Each time a query is posed, a component of the system must create the "jobs" on Amazon Mechanical Turk, and then wait for enough workers to complete them that the answers can be trusted to be correct. This puts response times in the range of several minutes to hours, which means it is not practical to use in an interactive user environment.
Problem and solution:
The problem posed is that a relational database only works when all of its assumptions are met; otherwise it fails to provide correct answers. This is because relational databases are based on the “Closed World Assumption” and are extremely literal. As a result, some queries cannot be answered by a relational database or a computer, even though people could answer them easily. The proposed solution is CrowdDB, which brings in people doing paid work on the Internet. CrowdDB extends a traditional query engine to support human input by generating and submitting work requests to a microtask crowdsourcing platform. Human work can be used to find information that is missing from the database, or to compare data without a strictly literal standard.
The main contribution of CrowdDB is the implementation of CrowdSQL. This extension supports use cases that could not be handled before, such as missing data and subjective comparisons. It allows users to use CrowdDB in the same way as a traditional database. Incomplete data can live in a crowdsourced table, or an individual column can be marked as crowdsourced. Both crowdsourced columns and tables can be used in traditional SQL queries. In the DML semantics, a special value, CNULL, represents crowdsourced column values that have not yet been obtained. For comparisons, CROWDEQUAL and CROWDORDER ask for people’s judgement of equality and ranking. All of these comparison results are cached for future use.
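To make the role of CNULL more concrete, here is a small Python sketch under stated assumptions: a dedicated sentinel marks values that still have to be crowdsourced, while an ordinary `None` (playing the role of SQL NULL) is treated as a known, empty value. The helper `ask_crowd` is hypothetical and stands in for posting a fill-in task.

```python
class _CNull:
    """Sentinel for a crowd column value that has not been obtained yet."""

    def __repr__(self):
        return "CNULL"


CNULL = _CNull()


def fill_crowd_columns(row, crowd_columns, ask_crowd):
    """Return a copy of row with every CNULL crowd column filled in.

    ask_crowd(row, column) is a hypothetical callback that would post a
    task asking workers for the missing value and return their answer.
    """
    filled = dict(row)
    for column in crowd_columns:
        # A column missing from the row is treated like CNULL; None is left
        # alone because it models an ordinary NULL, not a pending value.
        if filled.get(column, CNULL) is CNULL:
            filled[column] = ask_crowd(row, column)
    return filled
```

Keeping CNULL distinct from NULL is what lets the system tell "nobody has asked yet" apart from "the answer is known to be empty", so only the former triggers a crowd task.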
Though the approach is a smart way to solve problems that are simple for people but hard to program, it still has weaknesses. The most important one is that it is hard to control the quality of answers: the skill level of workers cannot be guaranteed, so bad answers are possible. Another concern is that it may be very expensive if the dataset is large.
There are many problems that are hard for computers to solve, but easy for humans to solve. Some examples include interpreting the meaning of images, understanding natural language etc. The idea of crowdsourcing is that paying many humans to answer simple questions is more economical than developing computer systems that can do this automatically. Platforms like Amazon Mechanical Turk (AMT) allow requesters to pose these questions to human workers (called turkers), who are paid for each question they answer. CrowdDB is a system that integrates this kind of crowdsourcing into a SQL-based system.
CrowdDB is a relational database where columns or tables can be marked using the keyword “CROWD”. When any column or table marked like this is involved in a query, CrowdDB will attempt to initialize the values of these columns using crowdsourcing. Each tuple with missing information is submitted to AMT as a job, where a turker will try to fill in the missing information. CrowdDB also has its own crowdsourcing operators like CrowdCompare, which asks the turker to compare data based on some (perhaps) subjective criteria. CrowdDB takes in SQL queries, and has its own query optimizer and rewriter which translates the queries to combine static and crowdsourced data. This way, CrowdDB offers physical data independence in that the user does not need to know how exactly the database used crowdsourcing to get the answer.
One strength of CrowdDB is its automatic transformation of SQL queries into crowdsourcing tasks. If a user had to set up each crowdsourcing task by hand, this would be extremely time-consuming.
Querying using crowdsourcing does not always produce correct answers, as people may troll or make mistakes. In addition, people are not robots, and so they may not work for the requester if they had a bad experience in the past.
The cost of a query is not just time; it’s also the money paid to turkers. CrowdDB doesn’t yet have mechanisms for bounding a query by time or by money spent.
Also, what is the use case for asking the crowd to fill in missing data? Why not hire people to do data entry? They might give better answers and response time.
The paper proposes a new database system called CrowdDB. It uses crowdsourcing to answer queries that a traditional DBMS cannot answer. It is an attempt to overcome a limitation of machines, which rely on the closed-world assumption.
CrowdDB uses CrowdSQL, which is an extension of SQL for supporting crowdsourcing. It utilizes Amazon Mechanical Turk (AMT) whenever crowdsourcing is required. CrowdDB has almost everything needed to utilize crowdsourcing, ranging from a specialized query optimizer to user interface generation for AMT. The strength of the paper is that it clearly outlines the features necessary for including crowdsourcing in a traditional database, but I am personally skeptical about such an approach.
Crowdsourcing has become a buzzword in the era of big data. People have been excited about utilizing human intelligence to solve problems that machines cannot. Previously, this was impossible due to the lack of an infrastructure to gather and process crowdsourced data. It is now feasible to use crowdsourcing, as we can see from examples such as AMT. However, the experimental results in the paper do not convince me that this will be a big thing in the future. How many people will use such a system when a query takes minutes to return an answer? How big is the market for systems like CrowdDB?
I think crowdsourcing has a fundamental performance problem unless the system can ask people questions ahead of time, using some predictive measure. Even the paper mentions that there is a relatively small pool of workers for a given problem. In my opinion, automated approaches with little to no human intervention, such as deep learning, will have many more applications than crowdsourcing and will eventually take its place completely.
In conclusion, CrowdDB is a nice attempt to incorporate crowdsourcing into a DBMS for query processing, but I do not think it is likely to become popular in the future.
This paper introduces CrowdDB, a database that combines normal query processing with crowdsourcing. There exist tasks that humans are better at than computers, such as classifying pictures or comparing different forms of a word. However, databases have traditionally used only machines, and this has limited some of their applications. The authors of this paper extended database functionality by creating a database and SQL extension that take advantage of Amazon's Mechanical Turk. The idea is that a user can create data in the database and specify certain information, such as "pictures" or "urls", to be of a crowd type. When queries are performed on crowd types, any values that have not yet been initialized to a non-null value are automatically turned into HITs and assigned to workers. The workers determine the value of the field and the database fills in the information. There is some checking involved to verify its validity.
I thought this was a very interesting work with many avenues for exploration. The first is budgeting and database settings. The paper mentions that currently only the number of tuples returned by a query can be budgeted, since that is already built into the SQL standard, but functionality to also limit response time and the money spent or paid out per HIT would be useful as well. Furthermore, it would be interesting to have functionality that limits response time by relaxing the verification checks for HITs. This would be similar to accepting approximate answers. The other avenue for further improvement is lineage, where data that is generated can be traced back to a worker.
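To see why bounding the number of tuples also bounds spending, consider a back-of-the-envelope cost model in Python. It rests on simplifying assumptions that are mine, not the paper's: one HIT per crowdsourced tuple, a fixed number of assignments per HIT, a flat reward per assignment, and no platform fees. Amounts are kept in integer cents to avoid floating-point rounding.

```python
def worst_case_cost_cents(num_tuples, assignments_per_hit, reward_cents, limit=None):
    """Upper bound on crowd spending for one query, in cents.

    Assumes one HIT per tuple and a flat reward per assignment; `limit`
    models a LIMIT clause that caps how many tuples are ever crowdsourced.
    """
    if min(num_tuples, assignments_per_hit, reward_cents) < 0:
        raise ValueError("inputs must be non-negative")
    if limit is not None:
        num_tuples = min(num_tuples, limit)
    return num_tuples * assignments_per_hit * reward_cents
```

For example, 1,000 candidate tuples with 3 assignments each at 1 cent come to at most 3,000 cents, whereas the same query capped at 10 tuples costs at most 30 cents. A budget on time or on money directly would need a different mechanism, as the review points out.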
One funny observation the authors made was that their initial verification process was too strict for the workers and, as a result, their jobs started being blacklisted. While it is true that humans can specialize in jobs and favor certain listers, they can also do the opposite and avoid listers. This would be like a processor deciding that it no longer favored a user of a machine and refusing to work for that user. Such events can't happen in a computer-only environment, but they need to be considered in mixed environments of humans and machines.
This paper presents CrowdDB, a DBMS that utilizes human input via crowdsourcing to process queries that are hard to answer without human knowledge. However, crowdsourcing also introduces problems that are not present in a traditional DBMS. This paper seeks to solve these problems of crowdsourced query processing in CrowdDB.
CrowdDB answers queries using data stored in local tables when possible; otherwise it invokes the crowd to answer the query and stores the answer for future use. In other words, CrowdDB utilizes traditional techniques for relational query processing wherever possible. Crowd invocations follow a requester/worker relationship. In addition, the CrowdDB parser also parses CrowdSQL, and the system adds three Crowd operators: CrowdProbe, CrowdJoin, and CrowdCompare. The current CrowdDB compiler is based on a simple rule-based optimizer that implements several query-rewriting rules and heuristics.
This paper also introduces CrowdDB's semi-automatic generation of user interfaces from SQL schemas. The generated user interface can be seen as a wrapper of the crowd data sources.
The major contributions of this paper include:
1. It lists some major challenges in crowdsourced query processing systems.
2. It gives some initial solutions to address the challenges and introduces the system design of CrowdDB.
3. It conducts experiments using Amazon Mechanical Turk and presents the results to show the effectiveness of CrowdDB.
Motivation for CrowdDB:
CrowdDB obtains human input through crowdsourcing and uses SQL as a query processing language and as a way to model data. This is important because some queries cannot be answered by machines alone, and need humans to provide information that is missing from databases and to perform computationally difficult functions, matching, ranking, and aggregation on queries with fuzzy criteria. CrowdDB differs from traditional databases in that traditional query processing conceptually does not involve human input. It also differs in that human-oriented query operators additionally have to solicit, integrate, and cleanse crowdsourced data. Worker affinity, training, fatigue, motivation, and location are also added factors in the performance and cost of crowdsourced query processing systems.
Details on CrowdDB:
CrowdDB is built on Amazon Mechanical Turk (AMT), which supports microtasks and provides an API for requesting and managing work, so that the platform can be connected directly to the CrowdDB query processor. By using SQL, CrowdDB creates a declarative interface to the crowd and maintains SQL semantics to keep a known computational model for developers. CrowdDB also provides Physical Data Independence for the crowd, in that application developers can write SQL queries without thinking about which operations are done in the database and which the crowd does. Another key factor for CrowdDB is the user interface design, which is generated automatically from the schema information. CrowdDB also allows for the implementation of cost-based optimizations to improve query cost, time, and accuracy. The main challenges of crowd-enabled DBMSs come from the ambiguity of people’s work compared to that of computers, in terms of language comprehension and opinion, and from technical challenges in determining the most efficient organization of tasks submitted to the platform, quality control, incentives, payment mechanisms, etc. for specific crowdsourcing platforms such as AMT. CrowdDB proposes a simple SQL schema and query extension that allows the integration of crowdsourced data and processing, provides new crowdsourced query operators and plan generation techniques to combine crowdsourced and traditional query operators, automatically generates effective user interfaces for crowdsourced tasks, and is able to answer queries that traditional DBMSs cannot.
Strengths of the paper:
I liked that the paper provided a very comprehensive overview of CrowdDB, from the way that incomplete data is dealt with, to the effect of reward on responsiveness, to the automatic generation of the interface. I found it interesting that the accuracy and efficiency of CrowdDB are so strongly affected by the user interface, which is very different from other DBMSs that we have studied in class. It was also interesting to see the real-world example of the Golden Gate Bridge complex query, and the usefulness of the results obtained by CrowdDB.
Limitations of the paper:
I would’ve liked to see more explanation on how spamming and malicious behavior are detected and protected against. I would’ve also liked to have seen the experiments for response time upon varied HIT groups, responsiveness with varied reward, and worker affinity and quality be performed with real-world data.
This paper, titled "CrowdDB: Answering Queries with Crowdsourcing", introduces the idea that humans can be used to compute database queries formulated in SQL. This is a very interesting idea, as there are many types of computational problems that humans can easily solve, but which computers still struggle with or cannot compute at all. Classical relational database systems make what is called the closed world assumption, which is the assumption that anything that is true is known to be true. This just means that if you write an SQL query to retrieve information, you will get the correct answer that is stored in the database. When interfacing with an unknown set of human beings, the results may vary and lack consistency.
This paper is novel in creating a database system that uses traditional relational database technology and incorporates methods for human query processing. This is interesting because it allows CrowdDB users to answer more types of questions than traditional systems would be able to answer. Additionally, the automatic generation of user interfaces for Amazon Mechanical Turk makes it simple to retrieve answers automatically and to obtain answers to new types of queries.
The approaches presented in this paper have a few drawbacks, however. The automatic user interface generation will not work for all tasks. There are some tasks that will require access to certain types of information. Depending on the resources that crowd workers select, the answers will vary. For some tasks, it will then be important that parts of the interface or the instructions to the crowd workers are created manually. Additionally, when this paper was published there was not as much work on the privacy of crowdsourced tasks. This could be listed as a criticism of this paper's approach. However, more recent research from the past couple of years shows that, for some types of queries, it is possible to provide privacy while still leveraging crowdsourcing techniques.
Part 1: Overview
This paper presents the idea of building a database on human input via crowdsourcing and using this technique to process queries. The authors introduce CrowdDB, which uses SQL as the language for processing queries and modeling data, and which differs from traditional database systems in the uncertainty that comes from human input. Humans can get tired or lazy, or otherwise be unwilling to work. CrowdDB must solicit human inputs, integrate them and cleanse them. A traditional relational database relies on firm common assumptions that the data is correct, complete, and unambiguous. People, on the other hand, can answer some complex, skewed questions. CrowdDB utilizes the power of people to find new data as well as to compare data.
Crowdsourcing platforms create a marketplace in which requesters and workers trade tasks. In practice, Amazon Mechanical Turk provides microtasks, each of which typically takes a person no longer than one minute to complete. Human Intelligence Tasks (HITs), assignments, and HIT groups are the key components of Amazon Mechanical Turk. Crowdsourced query processing comes with several issues, including performance and variability, task design and ambiguity, affinity and learning, a small worker pool, and embracing an open world.
Part 2: Contributions
This paper analyzes the problem of complex human intelligence tasks, which are really hard for a machine to understand. CrowdDB is a complement to current traditional relational databases.
The structure of Amazon Mechanical Turk is analyzed as a real-life example. CrowdDB takes into consideration the major issues raised by crowdsourcing. The HIT manager takes care of dividing and modifying microtasks for workers. Conventional SQL is used for DML. CrowdDB provides CrowdProbe, CrowdJoin, and CrowdCompare as basic operators.
Part 3: Drawbacks
Humans are likely to make mistakes and tire easily, so results from human processing can be inaccurate. Crowdsourcing-based databases should consider the possibility that query processing fails. Humans have their strengths and weaknesses. Workers should be assigned the questions that suit them best, or at least ones they can accomplish.
We need to make sure there are no malicious workers who would randomly generate results.
The paper discusses CrowdDB, which uses human input via crowdsourcing to process queries that database systems cannot adequately answer. The problem addressed by the paper is that some queries cannot be answered by DBMS software alone. They require intelligent human input to fill in information missing from the database and to perform difficult functions such as matching and aggregating results based on fuzzy criteria. CrowdDB extends a traditional query engine with operators that solicit human input by generating and submitting work requests to a crowdsourcing platform. CrowdDB is built on the Amazon Mechanical Turk (AMT) platform. |
The paper proposes SQL schema and query extensions that enable the integration of crowdsourced data and processing. The authors designed CrowdSQL, an extension to SQL that supports cases involving missing data and subjective comparisons. CrowdDB allows any SQL column and any table to be marked with the CROWD keyword, signifying incomplete data that will be supplied by humans. In addition, to support subjective comparisons, CrowdDB has two functions: CROWDEQUAL and CROWDORDER. CROWDEQUAL takes two parameters and asks the crowd to decide whether the two values are equal, whereas CROWDORDER is used to rank or order results. Beyond the schema extension, CrowdDB extends the regular query processor to parse CrowdSQL and applies special heuristics to generate plans that include the additional Crowd operators. CrowdDB implements a rule-based optimizer.
In addition, the paper discusses methods for automatically generating effective user interfaces for crowdsourced tasks. The paper identifies the effectiveness of the user interfaces as key to the success of crowdsourcing. At compile time, CrowdDB creates templates to crowdsource missing information from tables with CROWD columns. At runtime, these templates are instantiated to provide information for a particular tuple.
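The compile-time/runtime split described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not CrowdDB's actual generator: the helper names `build_template` and `instantiate`, the HTML layout, and the example column names are all hypothetical.

```python
from html import escape
from string import Template

def build_template(table, known_cols, crowd_cols):
    """Compile time: derive one HTML form template from the schema.

    Known columns become $placeholders; CROWD columns become input fields.
    """
    rows = [f"<p>{col}: ${col}</p>" for col in known_cols]
    rows += [f'<p>{col}: <input name="{col}"></p>' for col in crowd_cols]
    body = "\n".join(rows)
    return Template(
        f"<form>\n<h3>Fill in the missing {table} data</h3>\n{body}\n</form>"
    )

def instantiate(template, tuple_values):
    """Runtime: fill the template with the known values of one tuple."""
    safe = {k: escape(str(v)) for k, v in tuple_values.items()}
    return template.substitute(safe)

# Hypothetical schema: two known columns and one CROWD column.
department_form = build_template("Department", ["university", "name"], ["url"])
print(instantiate(department_form, {"university": "UC Berkeley", "name": "EECS"}))
```

The template is built once per table, while `instantiate` runs once per tuple that still needs crowd input.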
The main strength of the paper is that it addresses a real problem that machines cannot solve effectively. At present, machines are good at number crunching but not at image recognition. So it is effective to use humans to assist the DBMS with functionality, such as image recognition, that humans can do easily. I like the fact that CrowdDB identifies the division of work between machine and human. In addition, the authors tried to make the crowdsourcing as effective as possible by providing effective user interfaces, which is important when seeking input from humans.
The main limitation of the paper is that there is no guarantee that humans will provide an answer in a correct and timely manner. Different factors affect the success of crowdsourcing: availability of human labor, internet connectivity, and other related issues. Consequently, the answer is only approximate, if it exists at all. Furthermore, with the advancement of artificial intelligence, it may be better to seek a solution from such systems to replace the work required of humans. I believe this is the kind of area where AI agents can be exploited. It would have been great if the authors had included a section discussing this possibility. As a result, I don’t see a strong future direction for crowdsourcing using humans.
In this paper, the design of CrowdDB is described. Some queries cannot be answered by machines only. This database system uses human input via crowdsourcing to process queries that neither database systems nor search engines can adequately answer. They report on an initial set of experiments using Amazon Mechanical Turk, and outline important avenues for future work in the development of crowdsourced query processing systems.
The major difference is that the database is designed for human-oriented query operators. The need becomes obvious in cases where existing systems produce wrong answers because they are missing information required to answer the question. For example, the input might not be exactly the same word as what is in the database. In that case, traditional databases won't be able to provide a correct answer. Some tasks are much more expensive for a computer to answer than for a human. CrowdDB is designed for these cases. The key question, then, is how CrowdDB leverages human resources. Two main human capabilities are valued: (1) finding new data and (2) comparing data.
Thus, for CrowdDB, they develop crowd-based implementations of query operators for finding and comparing data. Some minimal extensions to the SQL data definition and query languages are introduced to enable the generation of queries that involve human computation.
Involving human resources in query processing is challenging. Working out how to leverage such resources is the key strength of the paper.
Obviously, the example used in the paper (the "I.B.M, IBN" example) is no longer a problem. It has been cracked by search engines and even some input methods. Problems like this, which were expensive for computers to solve, will eventually be cracked. The key is to establish a platform that involves humans in discovering and then solving such problems in the first place.
This paper introduced CrowdDB, a database with crowdsourcing integrated.|
Modern databases are fast and efficient when queried under the closed world assumption, but incapable of handling fuzzy queries. On the other hand, handling a query with crowdsourcing might take a lot of work to configure. The main idea of this paper is to design a system that takes advantage of both declarative SQL and crowdsourcing. This enables fast processing for normal local queries, and hides many unimportant details of crowdsourcing handling when dealing with crowdsourced queries.
To achieve this goal, the keyword CROWD is added to standard SQL. When a table is created with this keyword on a column, that column is reserved for the crowd to answer. DML is also different in this database. The query writer can specify the interface for a crowdsourced query. When such a query is processed, any crowdsourced column is transformed into tasks and answered using a crowdsourcing platform. The answer is then returned to the user.
This paper is the first to consider answering fuzzy data questions using crowdsourcing in a database. It is very innovative and interesting.
I think CrowdDB lacks a mechanism for the query writer to specify how many assignments are needed for a HIT. Different queries might need different assignment counts to achieve the required correctness.
Another concern I have is time. As mentioned in the paper, when a column is marked as CROWD, a query on it could take a very long time. For a database, people expect to get an answer very quickly. So I wonder whether this product will gain any market share.
CrowdDB is a solution for queries that database systems cannot answer adequately; it uses human input to get a qualitative answer. The system uses SQL as the query language. However, if there are missing values or if the queries are subjective, the system offers CrowdSQL, an extension of SQL that enables crowdsourcing when required.|
Queries are sometimes written in such a way that CrowdSQL updates the tables as a side effect of crowdsourcing. Even if a query only asks for the telephone numbers of the professors in the Math department, crowdsourcing lets workers add math professors who are in the department but whose details were not yet in the table. One of the major pros of CrowdDB comes from subjective comparisons. One query that exemplifies this asks for the image that best represents the Golden Gate Bridge. This is definitely not something the system is capable of doing on its own, and having humans intervene is the best way to get an answer rather than no answer.
One of the advantages of CrowdDB is that the application developer can specify a bound in terms of time or quality, depending on which of the two matters more. Since CrowdDB can link to crowdsourcing when required, it provides the opportunity to build a community of requesters and workers who can together source and vouch for knowledge to keep this going. Since CrowdDB is a proper database system, it has the capability of optimizing a query based on cost, time, and accuracy, which is a possible plus.
One of the disadvantages of CrowdDB is that, for a crowd query, you probably always have to allow a certain amount of time for the workers to answer the query in a qualitative manner. Even after that, sometimes you may not have a guarantee of, or a majority for, the value of a given column. At this point, the paper specifies that a probe has to be sent out again, which I see as an additional cost that might sometimes not seem worth the effort. Since it is possible that even a majority answer is not the perfect one and requires requester intervention, it may sometimes seem like extra effort that could have been avoided.
Overall, CrowdDB seems like an innovative answer for specific use cases where human input forms a major part of the requirement and can provide the kind of quality that is missing in typical DBs.
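To make the extra cost of re-probing concrete, here is a minimal Python sketch of majority voting with a repeated probe. It is not the paper's implementation: the strict-majority rule, the `ask_crowd` callback, and the default of three assignments over at most two rounds are assumptions made for illustration.

```python
from collections import Counter

def majority_vote(answers):
    """Return the answer given by more than half of the workers, else None."""
    if not answers:
        return None
    value, count = Counter(answers).most_common(1)[0]
    return value if count > len(answers) / 2 else None

def resolve(ask_crowd, assignments=3, max_rounds=2):
    """Collect answers in rounds and stop once a strict majority exists.

    ask_crowd(n) is a hypothetical callback returning n worker answers.
    Returns (value, answers_paid_for); value is None if still unresolved.
    """
    answers = []
    for _ in range(max_rounds):
        answers.extend(ask_crowd(assignments))
        winner = majority_vote(answers)
        if winner is not None:
            return winner, len(answers)
    return None, len(answers)
```

Every extra round pays for another batch of assignments, and the result can still come back unresolved, which is exactly the cost concern raised above.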
CrowdDB: Answering Queries with Crowdsourcing review|
In this paper, the authors introduce a new challenge for RDBMSs: a traditional RDBMS makes strict assumptions about the correctness, completeness, and unambiguity of the data stored in the database, and cannot perform computationally difficult functions like matching, ranking, or aggregating results based on fuzzy criteria. Such queries cannot be answered by the machine alone, since the machine can only work in a closed world of information. For example, to execute the following query:
SELECT profile FROM department WHERE name ~= "CS";
The machine would need to be able to identify the possible names for the computer science department, which could appear as “michigan state university electrical and computer science department” or “school of computing and information technology”. However, since current AI is not as good as a real human at this, the best solution is still to resort to human assistance through crowdsourcing to answer those questions.
The major contribution is that the authors propose an SQL-like DML extension that allows the DBMS to use human input to help process those challenging queries. CrowdDB is built to demonstrate the usefulness of this idea. In CrowdDB, an application issues requests using CrowdSQL, a moderate extension of standard SQL. CrowdSQL supports user-defined crowd tables and crowd columns, and it also provides two subjective comparison functions for use in queries: CROWDEQUAL and CROWDORDER. CROWDEQUAL takes two parameters (an lvalue and an rvalue) and asks the crowd to decide whether the two values are equal. CROWDORDER is used whenever the help of the crowd is needed to rank or order results. The experimental results are quite encouraging: in total, the experiments involved 25,817 assignments processed by 718 different workers. Overall, they confirm the basic hypothesis of the work: it is possible to employ human input via crowdsourcing to process queries that traditional systems cannot.
However, the paper also has some weaknesses. The first is query latency. Since all crowdsourced HIT assignments have to be posted to workers and then wait for their responses, even tiny queries make the programmer wait minutes for results. From this perspective, CrowdDB would have no chance of surviving in the OLTP market. On the other hand, if CrowdDB can only support small OLAP jobs with long response times, there would be no need to integrate crowdsourcing into the SQL DML at all. Because the system's data processing capacity is so small, it would be easier to just create an online query form. The second drawback of the CrowdDB design is the limited ability of the workers. You can’t assume that all workers are as capable as well-paid programmers or data analysts; their knowledge is always limited. Even with the help of error detection, the chance of all workers giving a bad answer is high. Hence there is a need for a mechanism that evaluates how hard a single HIT is; if it is too hard, the workers are less likely to give good answers. In that case, the system should refuse to execute the doomed queries.
The authors investigate the problem of using crowdsourcing to fill in missing database entries and complete database operations (comparisons and joins). Unlike machines, humans are able to solve problems based on non-literal criteria. This is important as in many cases, data is not properly sanitized or normalized, and so humans are better suited to make comparisons than machines. CrowdDB extends SQL to allow for seamless data completions using crowd-provided data and allows users to use crowds to run comparisons and join operations. CrowdDB automatically generates user interfaces so that crowds can answer queries, but allows users to override the default forms. CrowdDB faces a number of challenges: it must deal with malicious crowds, crowds take longer times to respond to queries than machines, and often only a subset of queries made to crowds receives responses.|
The authors successfully identified that by adding a few keywords to SQL, database users could easily take advantage of crowds for running database queries and providing missing information.
Though CrowdDB simplifies the process of using crowdsourcing, there is room for improvement. The existing user interfaces are very literally taken from the database query. Instead, more advanced UIs could be used to receive better answers from users. As we saw with BlinkDB, approximation can greatly improve performance, and as crowds are unlikely to provide exact responses, certain queries could likely take advantage of approximation for improved performance. This would have an especially noticeable impact in crowdsourcing as response time tends to be long.
This paper introduces a new way to answer queries using crowdsourcing called CrowdDB. There are certain queries that we cannot answer with traditional relational databases. For example, if we wanted to query all of the pictures in a database where the picture is of the Golden Gate Bridge, the database would not know how to answer that query. However, human beings would not have a problem answering that query. The only problem is time. Another example of a query a computer may get incorrect is selecting all of the rows where the company name is Intel. There may be errors with how the company name was inserted into the database or the name may have been entered as Intel Inc or Intel Incorporated.|
CrowdDB introduces a crowdsourcing technique that can answer these queries correctly using human power. It modifies SQL so that queries involving crowdsourcing can be declared. Once a column is declared as crowdsourced, any query on that column goes through CrowdDB’s executor, which determines whether a new crowdsourcing job needs to be created. If the results were already crowdsourced, they are cached and reused in future queries. To create crowdsourcing jobs, CrowdDB still creates a query plan so that the user interface for the crowdsourcing question can be created.
I believe the following are the positives with the paper:
1. It explains how queries and crowdsourcing interact and how the crowdsourced data is cached and used in the future.
2. The authors outline all of the key words and phrases that we need in order to understand how their crowdsourcing interface works.
Even though the paper presents a new approach to answering queries and has a compelling argument as to why we need this new approach, I still think there are some weaknesses with the paper:
1. It does not describe which subset of people the crowdsourcing goes to. Does CrowdDB just randomly pick a certain number of people to answer the question about that query?
2. What if the majority of the population does not know about what is being queried? For example, some people may not know what landmarks are in which cities, so people may not get the correct answer for some queries.
This paper discusses CrowdDB, a database that uses Amazon Mechanical Turk (AMT) to answer queries that cannot be easily answered by computers. Relational databases operate under the Closed World Assumption. Anything that is not in the database in the proper format is considered to not exist. For example, if a user wanted to select the name and phone number of all professors in the math department at the University of Michigan, a database would return a null value for any phone number not entered in the database. While this phone number almost certainly exists, it is nearly impossible for a computer to reliably search the web and find this information. However, this simple lookup is a trivial task for a human. CrowdDB leverages this principle by allowing users to request certain information as requiring an answer from a crowdsourcing platform such as AMT.|
CrowdDB extends SQL to allow for several crowdsourcing operations. Users may tag columns or tables as crowd columns or crowd tables. When tuples are created without a specific value for a crowd column, a special CNULL value is inserted rather than the standard NULL. This denotes that the value should be obtained from the crowd when that tuple is referenced for the first time. In order to obtain the value, CrowdDB uses AMT’s API to create a Human Intelligence Task (HIT) that will be posted for workers to satisfy. For example, if a phone number is missing from a tuple, as in the example above, a HIT will be created that lists the currently stored information for the professor (e.g. name, department, university) and asks the AMT worker to enter the remaining information (e.g. phone number, office, etc.). Workers are typically paid a few cents for completing a small group of tasks such as this.
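The CNULL behavior described above, where a value is fetched from the crowd on first reference and then stored, can be sketched in Python as follows. This is a toy model rather than CrowdDB's code: the `CNULL` sentinel class, the `probe` helper, and the `ask_crowd` callback are hypothetical names, and rows are plain dictionaries.

```python
class _CNull:
    """Sentinel meaning 'unknown, ask the crowd on first use'."""
    def __repr__(self):
        return "CNULL"

CNULL = _CNull()

def probe(row, column, ask_crowd):
    """Return row[column], crowdsourcing and storing it if it is CNULL.

    ask_crowd(row, column) is a hypothetical callback that would post a
    HIT showing the known fields and return the workers' agreed answer.
    """
    if row[column] is CNULL:
        row[column] = ask_crowd(row, column)  # stored, so asked only once
    return row[column]

# Made-up example tuple; the phone number is filled in lazily.
professor = {"name": "A. Example", "department": "Math", "phone": CNULL}
```

Using a dedicated sentinel instead of `None` keeps the distinction the review draws between CNULL (should be crowdsourced) and an ordinary NULL.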
CrowdDB also adds several crowd operators to standard SQL syntax. For example, CROWDORDER creates a HIT that presents workers with several values and asks them which better fulfills some given criterion (e.g. which of these is a better picture of Michigan Stadium). This is typically used for tasks such as image recognition or qualitative ordering that are difficult for machines but easy for humans. CROWDEQUAL functions in a similar manner, presenting workers with two values and asking whether or not they are equivalent. This leverages the fact that humans are much better than computers at natural language processing (typically recognizing synonyms or abbreviations in CrowdDB applications).
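One way to picture how binary "which is better" answers turn into a ranking is the Python sketch below. It is an assumption-laden illustration, not the operator from the paper: `crowd_order` and the `ask_pair` callback are hypothetical, and the sketch assumes the crowd's preferences are consistent (transitive).

```python
from functools import cmp_to_key

def crowd_order(items, ask_pair):
    """Sort items best-first using binary crowd comparisons.

    ask_pair(a, b) is a hypothetical callback that returns whichever of
    a or b the crowd prefers. Each distinct pair is asked at most once.
    """
    cache = {}

    def compare(a, b):
        if a == b:
            return 0
        key = frozenset((a, b))
        if key not in cache:
            cache[key] = ask_pair(a, b)  # one HIT per distinct pair
        return -1 if cache[key] == a else 1

    return sorted(items, key=cmp_to_key(compare))
```

Because a comparison sort asks far fewer than all pairs, this keeps the number of HITs down, but if workers give contradictory answers the resulting order depends on which pairs happened to be asked.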
My chief concern with this paper was that they didn’t really address how expensive these queries are and how that is made visible to the user. Although AMT workers are typically paid only a few cents per HIT, this can add up when a query touches large amounts of data. The authors note that this cost must be exposed to users somehow, but I was left wondering how this cost will factor into applications other than ad hoc queries. For instance, if a user of some software built on CrowdDB types something into a search bar and triggers a crowd operator, they are likely not aware that this has a cost associated with it and that the query could take several hours to complete. This is something that will require much thought if CrowdDB is to become a viable option for anything other than ad hoc queries.
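To show how quickly "a few cents" adds up, here is a back-of-the-envelope cost estimate in Python. The formula and the example numbers are my own assumptions (the paper does not give this function); it ignores platform fees and any savings from cached answers.

```python
def estimate_crowd_cost(num_tuples, assignments_per_hit, reward_cents,
                        tuples_per_hit=1):
    """Worst-case reward cost in cents for one crowd query."""
    hits = -(-num_tuples // tuples_per_hit)  # ceiling division
    return hits * assignments_per_hit * reward_cents

# Hypothetical query: 1,000 tuples, 3 assignments per HIT, 2 cents each.
print(estimate_crowd_cost(1000, 3, 2) / 100, "dollars")
```

Under these made-up numbers a single query costs 60 dollars, which an end user typing into a search bar would have no way of knowing.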
This paper introduces CrowdDB, which uses human input via crowdsourcing to process queries that neither database systems nor search engines can adequately answer. In many situations, queries cannot be answered correctly by machines. Thus, CrowdDB uses human input through crowdsourcing to answer such queries more correctly.|
First, the paper talks about Crowdsourcing. A crowdsourcing platform creates a marketplace on which requesters offer tasks and workers accept and work on the tasks. Amazon Mechanical Turk (AMT) provides an API for requesting and managing work. The design considerations of CrowdDB include performance, variability, ambiguity, open or closed world and so on.
Second, the paper gives an overview of CrowdDB. An application issues requests using CrowdSQL, a moderate extension of standard SQL. CrowdDB answers queries using data stored in local tables when possible, and invokes the crowd otherwise.
The strength of the paper is that it describes CrowdDB in detail, including the motivation for human-input queries, the implementation, and the experimental results. This flow helps readers understand the ideas of CrowdDB.
The weakness of the paper is that it provides few examples when introducing the features of CrowdDB. I think it would be better to illustrate the ideas through some real query examples.
To sum up, this paper introduces CrowdDB, which uses human input via crowdsourcing to process queries that neither database systems nor search engines can adequately answer.
This paper is an introduction to CrowdDB, which is a crowdsourced database built on top of Amazon Mechanical Turk (AMT). The need for CrowdDB is twofold: making human comparisons (e.g., which picture is most relevant) and operating under an open-world, rather than closed-world, scenario where missing data doesn’t mean it is false or non-existent. |
CrowdDB relies on three main AMT concepts that are important for understanding how it works.
1. HIT = Human Intelligence Task. This is the smallest unit of work a worker can accept (e.g., voting on which picture is better).
2. Assignment. Every HIT is replicated into multiple assignments, and a worker can be given at most one assignment of a given HIT. This way, if one worker doesn’t respond, one of the other workers assigned to the HIT will. This is akin to how MapReduce uses replication as a safeguard: if one worker is not working, the other workers will still complete the HIT.
3. HIT Group. AMT groups HITs of similar types so that, hopefully, each worker will pick a HIT group of something they are good at or interested in.
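The three concepts in the list above can be modeled with a couple of small Python classes. This is only a toy data model for intuition, not AMT's API: the class names, fields, and the `accept` method are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class HIT:
    question: str
    max_assignments: int                          # how many workers may answer
    answers: dict = field(default_factory=dict)   # worker_id -> answer

    def accept(self, worker_id, answer):
        """Record one assignment; a worker may take a given HIT only once."""
        if worker_id in self.answers or len(self.answers) >= self.max_assignments:
            return False
        self.answers[worker_id] = answer
        return True

@dataclass
class HITGroup:
    title: str                                    # what kind of task this is
    hits: list = field(default_factory=list)      # similar HITs listed together

# Made-up example: one picture-voting HIT replicated into 2 assignments.
hit = HIT("Which picture is better?", max_assignments=2)
group = HITGroup("Picture voting", [hit])
```

Replicating a HIT into several assignments is what later allows the answers to be cross-checked, for example by majority vote.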
CrowdDB adds a few things to SQL syntax. The CROWD keyword can be used on any column or table to indicate it should be crowdsourced. It also introduces CROWDORDER and CROWDEQUAL, which support subjective ordering and equality comparisons, respectively. Because of these additions, the paper addresses the new query processing that is needed to combine the crowdsourced results with whatever standard DB results are obtained. One other quick tidbit about CrowdDB is that it automatically generates GUIs for the crowdsourced queries so that the humans can easily respond.
I did have a few complaints about problems that I do not think the paper addressed and that I would have liked to see covered. First off, I was not convinced about the need for crowdsourcing some of the queries they did. Some of them felt like standard information that could be stored in a DB, and failing to convince me significantly took away from what I envision the use of CrowdDB to be. It is still definitely an important technology, but I think it is useful for a few specific types of queries, not a very wide range of things. Secondly, they failed to address that query response time is relatively slow and very dependent on HIT assignment and worker completion. They touched on it briefly by changing rewards to incentivize workers to work faster, but that didn’t have particularly encouraging results in my opinion.
Overall, I think this was a fantastic paper. It was easy to read and a very important technology that they did a great job of explaining! There were few downsides that I mentioned above but in general I think this is a very interesting technology and I feel I understand it now! Definitely worth the read.
One of the fundamental limitations of database systems is the fact that they can only give answers about data that they have, in response to a query that is specified and tailored to the database. One approach is to use algorithmic means to approach a reasonable answer (e.g. semantic approximation based on input text), but this cannot achieve the accuracy of human interpretation (which is limited by sheer resource availability). For example, if somebody queried for a company name “Mecrosoft” in a database, it is obvious what the answer is, but a traditional RDBMS would return nothing (the entity resolution problem). In addition, there may be information that is not readily available to the database but is easy to look for in other sources such as search engines (a consequence of the closed world assumption). One solution to this has been Amazon’s Mechanical Turk, a system designed to distribute human intelligence and make it available for database computing. Amazon has exposed an API in order to make automating human intelligence tasks even more manageable and easy.|
Bearing this in mind, their design concerns for a database which crowd-sources query responses involved the following: accounting for performance and variability in effectiveness of AMT workers, between individuals and for a specific individual over time. In addition, the manner in which problems are presented can introduce ambiguity or subtle variations in response style. While computers are effectively interchangeable, workers are human beings that may choose to specialize in certain types of tasks, and have the choice of approaching one task over another. Lastly, availability of workers is limited, while availability of information for “open world” scenarios is essentially limitless.
Most of the implementation of crowdDB is built on CrowdSQL, a crowdsourced SQL extension; this provides a CROWD* keyword in order to interface a crowdsourced query response (crowdsourcing new records or entries if they do not exist, using human intelligence to compare non-intuitive values, etc). To explicitly make a crowdsourced response, a minimal extension to SQL is needed. However, since repeated crowdsourced queries can be expensive, they implement caching for query responses so stored responses can be used whenever possible (note that this is not necessarily valid since answers can change based on query history and cache behavior). Regarding the incomplete data problem, they give a discussion on the importance of a good UI for AMT workers; a clear UI is important for fast response times and quality of response. By leveraging the strict structure of SQL schemas, they were able to auto-generate interfaces for filling out response information as well as suggesting data for incomplete data entries. The UI was also able to express multiple relations in order to recursively complete data. The last large part of their system was their query processing system. This part was a bit confusing to me, but the key insights were they use “insightful query language” techniques (e.g. bounding disk I/O operations while retaining the ability to express many different queries) in order to address the open world problem. Their description of their query optimizer is similar to a traditional SQL query optimization, where a logical plan is created and reorganized/optimized with relevant CROWD operations (CrowdProbe, CrowdJoin), before it is formatted into a physical plan.
They had a few interesting graphs in their evaluation, some of which were more intuitively obvious (e.g. faster response times with more AMT workers in a group) than others. One thing I thought was interesting was the gap in “motivation” for completion rate from changing AMT compensation from 0.01 to 0.02/0.03 and then to 0.04. However, I wish they gave more comprehensive analysis on the actual quality of their query results (e.g. showing some complex queries and how the DB handled them vs. the AMT workers). In addition, the motivation for a crowdsourced database seems kind of weird, except as a proof of concept; who would want to pay so much out of pocket for keeping a DB like this running? I can see, however, using CrowdDB as a bootstrapping mechanism for filling incomplete data on a database before it can take off on its own. Otherwise it is an interesting concept and I can see how this can be a continued area of research in the future.
The paper explains the design and operation of CrowdDB, a database that uses human input via crowdsourcing to process queries. The motivation is that there are queries that neither database systems nor search engines can adequately answer, but that can be answered by utilizing two main capabilities of human computation: finding new data and comparing data. Therefore, why not try integrating humans and database systems? |
The paper starts with an explanation of CrowdDB and Amazon Mechanical Turk (AMT). The approach taken by CrowdDB is to exploit the extensibility of the iterator-based query processing paradigm to add crowd functionality to a DBMS. The paper addresses two challenges in building a crowd-enabled database: differences in how people work compared to computers, and finding the most efficient task organization. CrowdDB is built using AMT, which supports microtasks. In AMT, the smallest task unit is a HIT, which can be replicated into assignments, and similar HITs can be grouped into a HIT group. The paper also explains a bit about AMT’s APIs. The CrowdDB design considers performance & variability, task design & ambiguity, affinity & learning, the relatively small worker pool, and open vs. closed world. The CrowdDB architecture consists of Turker Relationship Management, User Interface (UI) Management, and the HIT Manager. Its SQL extension, CrowdSQL, is quite simple, both in execution and implementation. The paper then continues with user interface generation, which can vary according to the structure of the query and the schema itself. Next, it explains query processing in CrowdDB using Crowd operators (CrowdProbe, CrowdJoin, and CrowdCompare), physical query plan generation, and the heuristics. The paper then explains the experiments done with CrowdDB, covering micro-benchmarks (for simple queries) and complex queries. Two lessons are learned: (1) crowd resources involve long-term memory that can impact performance (requesters can track workers’ past performance and vice versa) and (2) UI design and precise instructions matter. The paper mentions several areas of related work, including traditional query processing, semi-automatic UI generation, quality control & response times, and previous uses of crowdsourcing in relational query processing.
In short, the main contribution of this paper is that it shows it is possible to integrate humans and database systems so that they work together in answering queries that are impossible for a machine to answer alone (i.e., comparison queries and queries that have no data to begin with). The paper shows how this is done by proposing simple SQL schema and query extensions to integrate crowdsourced data and processing. The paper also explains the design and operation of CrowdDB in detail, including methods for automatically generating effective UIs for crowdsourced tasks.
However, I am very interested to see whether changing the complexity of the problem or doing the task collectively would affect the quality of the answers given. Crowdsourcing can be very useful, but how useful? Finding new data and comparing data are things humans do well (compared to computers), but it would be very interesting to see how far crowdsourcing can help. Also, while the tasks used in this paper are simple, comparing the quality of the answers when tasks are done collectively should help show whether certain types of tasks are done better collectively.
The purpose of this paper is to introduce a system that combines crowd-sourced data with a traditional DBMS to be able to answer queries that cannot be answered by a simple DBMS or search engine due to missing data or required human input. |
The main technical contribution of this paper is CrowdDB and the components associated with it. One of the major components is CrowdSQL, which is an extension of standard SQL that incorporates features for crowdsourcing in specific operators. They add a “CROWD” keyword to the SQL DDL to allow for this crowdsourcing to be incorporated. This keyword can be used to annotate specific columns in a table or the table as a whole. Additionally, this paper adds two new built-in functions to CrowdSQL called CROWDEQUAL and CROWDORDER to allow for subjective comparisons to be made by human users. These features are implemented with new SQL operators that specifically solicit information from the crowd. Another main contribution of this paper is their automatic generation of user interfaces for crowdsourced tasks created using schema information. This automatic generation relies on database-user-created information stored as part of free-text annotations of CROWD tables and columns as well as information in the database schema such as data types and names.
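To make the schema-driven UI generation concrete, here is a minimal Python sketch that builds an HTML form from column names, data types, and a free-text annotation. The function name `generate_form`, the type-to-input mapping, and the HTML layout are all assumptions made for illustration; the paper’s actual templates are not reproduced here.

```python
from html import escape


def generate_form(table, columns, known, annotation=""):
    """Build a task form for one tuple.

    columns: list of (column_name, sql_type) pairs taken from the schema.
    known:   dict of values already stored; these are shown as read-only context.
    Columns missing from `known` become input fields for the worker.
    """
    # Hypothetical mapping from SQL types to HTML input types.
    type_to_input = {"INTEGER": "number", "STRING": "text"}
    title = annotation or "Please fill in the missing " + table + " data"
    lines = ["<form>", "<h3>" + escape(title) + "</h3>"]
    for name, sql_type in columns:
        if name in known:
            lines.append("<p>" + escape(name) + ": " + escape(str(known[name])) + "</p>")
        else:
            input_type = type_to_input.get(sql_type.upper(), "text")
            lines.append(
                '<label>' + escape(name)
                + ' <input type="' + input_type + '" name="' + escape(name) + '"></label>'
            )
    lines.append('<input type="submit">')
    lines.append("</form>")
    return "\n".join(lines)
```

For example, calling `generate_form("Department", [("university", "STRING"), ("name", "STRING"), ("phone", "INTEGER")], {"university": "UC Berkeley", "name": "EECS"})` would show the two known values as context and ask the worker only for `phone`.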
One of the main strengths of this approach is that existing SQL queries will run on CrowdDB without any edits. Your existing queries may even return better results if one of the operators has been rewritten to use data from the crowd in CrowdDB. This is essentially better results for free! (Are we guaranteed that the results will not be worse?) Another strength of the system is that it stores information gathered from the crowd in the database for future use.
As far as weaknesses go, I was a little thrown off by the long discussion about Amazon Mechanical Turk at the beginning of the paper. Giving background is always good, but I found the discussion of the specific AMT function calls that they use to be overboard and unimportant.
Paper Review: CrowdDB: Answering Queries with Crowdsourcing
Traditional relational databases have a lot of limitations. One major limitation is the “Closed World Assumption”: information that does not exist in the database is assumed to be either false or non-existent. Under this assumption, many queries cannot be answered correctly. This paper proposes a novel method to enhance the query-answering ability of a DBMS by taking advantage of crowdsourcing. The main idea is to extend SQL into a suitable interface so that AMT workers can easily interact with the DBMS and provide additional information to help process queries.
Although the idea presented in the paper sounds interesting and novel, it does not seem very practical, nor does it carry much technical novelty.
Utilization of crowdsourcing is a hot topic.
The introduction and related work sections are quite lengthy and wordy. SQL and crowdsourcing are hardly unfamiliar concepts in the literature.
Not many advanced techniques are introduced: despite the fancy words and attractive descriptions, the paper is after all proposing something quite straightforward. In a word, the proposed method uses the crowd as an additional database from which richer information is expected to be extracted to answer queries.
The paper does not clearly address the issue of unpredictable query processing time.
Crowdsourcing is very domain-limited; it is hardly possible to answer a query in a specialized field by asking a crowd that is simply not trained to do so.
Conceptually, I don’t find much of a difference between the work proposed in this paper and simply posting a question on question.com, or just Googling your question. Search for, say, “I.B.M”, and I’m pretty sure you will get what you want whether you type “I.B.M” or “International Business Machines”, the example used at the beginning of the paper.
This paper introduces CrowdDB, a crowdsourcing database system that uses human input to answer complex questions that a traditional database cannot answer. The paper identifies two kinds of questions that are particularly hard to answer in a traditional database system: 1. A query may return no answer even when it contains no mistake, e.g., synonyms in queries are hard for a database to identify without prior knowledge. 2. Conceptual ranking problems are hard to answer. To solve these problems, CrowdDB uses Amazon’s crowdsourcing infrastructure to bring humans into the loop. It defines CROWD columns and CROWD tables to provide physical data independence, and it provides the CrowdProbe, CrowdJoin, and CrowdCompare operators to incorporate human input into the logical and physical plans of the database. Moreover, the paper introduces an interface generator that generates the corresponding interfaces for humans to interact with the system.
1. This paper gives a simple extension to the SQL language for specifying data that needs to be retrieved through crowdsourcing, including the CROWD and CROWDORDER keywords. This simple extension forms an interactive interface that provides data independence for frontend users.
2. This paper defines two crowd data sources and three crowd operators, which provide data independence for the backend compute engine.
1. One big problem for a crowdsourcing database is that it is hard to get results in bounded time. Incorporating the human factor also introduces instability into the database system. Although a budget is mentioned as a way to constrain query time, further discussion of techniques for constraining query time would be worthwhile.
2. This system relies on a third-party crowdsourcing infrastructure instead of its own, which means it has less control over the system. Extra errors may be introduced by the third-party crowdsourcing infrastructure.
This paper introduces SciDB, a DBMS specifically designed for scientific problems. It uses multidimensional arrays as its data model to represent data for scientific applications. The paper also presents a thorough survey of scientific database requirements, both those common across applications and those specific to particular areas. It defines two kinds of operators to support database operations: 1. structural operators, such as subsample and reshape; 2. content-dependent operators, such as filter, aggregate, and content-based join. Moreover, it shows that users can define their own database operators and data types. At the end, the paper states that it plans to provide new benchmarks for scientific databases in the future.
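The difference between the two operator classes can be illustrated with a small Python sketch over a nested-list 2-D array. The function names, signatures, and exact semantics below (for instance, `filter_cells` keeping the array shape and replacing non-matching cells with `None`) are assumptions chosen for illustration and are not SciDB’s actual interface.

```python
def subsample(array, row_pred, col_pred):
    """Structural: select cells by their *indices*, never by their values."""
    return [
        [cell for j, cell in enumerate(row) if col_pred(j)]
        for i, row in enumerate(array)
        if row_pred(i)
    ]


def reshape(array, n_rows, n_cols):
    """Structural: same cells in row-major order, new dimensions."""
    flat = [cell for row in array for cell in row]
    if len(flat) != n_rows * n_cols:
        raise ValueError("reshape must preserve the number of cells")
    return [flat[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]


def filter_cells(array, pred):
    """Content-dependent: keep the shape, blank out cells whose value fails pred."""
    return [[cell if pred(cell) else None for cell in row] for row in array]


def aggregate_rows(array, fn):
    """Content-dependent: collapse each row to one value computed from its cells."""
    return [fn(row) for row in array]
```

Structural operators can be evaluated by looking only at dimension indices, while content-dependent operators have to read the stored cell values.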
1. This paper does a great survey of the requirements of scientific database systems, it shows
2. This paper defines two kinds of operators that incorporate two data sources to provide backend solutions for a variety of applications.
1. One of the major weaknesses is that the paper does not provide a complete evaluation of SciDB; according to the paper, a benchmark is still under development.
2. A lot of the work had not been done when the paper was published. Although it is exciting to see that SciDB could solve so many problems in scientific computing, it is not clear at this stage whether it can really achieve that.
This paper introduces CrowdDB, an RDBMS that integrates a traditional DBMS with crowdsourcing. A traditional DBMS makes the closed-world assumption for query processing: information that is not in the database is considered false or non-existent. Unlike a traditional DBMS, CrowdDB uses human input via crowdsourcing to process queries that neither database systems nor search engines can adequately answer. Humans are good at finding new data and making comparisons that are difficult or even impossible to encode in a computer program. CrowdDB leverages crowdsourcing for finding and comparing data, while relying on traditional relational query operators for data manipulation and processing. The authors propose simple SQL schema and query extensions to enable the integration of crowdsourcing. In the SQL DDL extension, users can mark either an attribute or an entire table as “CROWD”. Such data elements are set to CNULL by default. Once a query touches a CROWD element, CrowdDB translates the query operator into the corresponding crowdsourcing request and fills in the blank with the result. Another SQL extension adds subjective comparisons. For queries that are hard for a machine to answer, CrowdDB can call crowdsourcing services to answer them. For example, a query such as "Which picture better visualizes the Golden Gate Bridge" will be accepted and sent out for crowdsourcing.
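The CNULL behavior described above can be sketched in Python as follows. The `CNULL` sentinel object, the `crowd_probe` function, and the `ask_crowd` callback are hypothetical names for this illustration; in a real system the callback would post a task to the crowd, whereas here it is just any function the caller supplies.

```python
class _CNull:
    """Sentinel meaning 'not crowdsourced yet'; distinct from None (SQL NULL)."""

    def __repr__(self):
        return "CNULL"


CNULL = _CNull()


def crowd_probe(rows, crowd_columns, ask_crowd):
    """Fill CNULL values in place and return how many crowd requests were issued.

    rows:          list of dicts, one per tuple.
    crowd_columns: names of the columns declared as CROWD.
    ask_crowd:     callable (row, column) -> value, standing in for a crowd task.
    """
    requests = 0
    for row in rows:
        for col in crowd_columns:
            if row.get(col, CNULL) is CNULL:
                row[col] = ask_crowd(row, col)   # store the answer for later queries
                requests += 1
    return requests
```

Because answers are written back into the rows, running the same probe a second time issues no new requests, which mirrors the idea of keeping crowdsourced values in the database for reuse.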
The user interface is extremely important to query quality and response time in crowdsourcing, so CrowdDB also provides a few UI templates for application developers to use. CrowdDB implements all the common operators of relational algebra, plus a set of Crowd operators that encapsulate the instantiation of user interface templates at runtime and the collection of results obtained from crowdsourcing. The implementation of the traditional operators need not be changed; only a few new crowd-based operators need to be implemented. Currently, CrowdDB implements CrowdProbe, CrowdJoin, and CrowdCompare.
The main contribution of CrowdDB is integrating crowdsourced data and processing into a traditional RDBMS. It minimizes the changes to the RDBMS: the original relational operators need not be re-implemented. By applying a few simple SQL extensions and implementing crowdsourcing-related query operators, CrowdDB successfully brings human processing power into the DBMS.
One weakness of this paper is the high response time of crowdsourced queries. Crowdsourcing inherently has high latency, since a job is submitted to a job pool and must wait for people to pick it up. Even if the job is quickly assigned to a person, it takes seconds or even minutes to finish the task. This explains why the response times in the experiments are on the order of minutes. Compared to the response time of a traditional DBMS, which is several milliseconds, the high latency of CrowdDB will largely limit its usage.
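As a rough illustration of how a comparison-style crowd operator could plug into ordinary sorting, the Python sketch below asks several workers each pairwise question, takes the majority answer, and caches it. The names `crowd_order`, `ask_worker`, and `replication` are assumptions of this sketch and not CrowdDB’s implementation; `ask_worker` stands in for posting a comparison task and must return whichever of the two items the worker prefers.

```python
from collections import Counter
from functools import cmp_to_key


def majority(votes):
    """Return the most frequent vote."""
    return Counter(votes).most_common(1)[0][0]


def crowd_order(items, ask_worker, replication=3):
    """Sort hashable items best-first using majority-voted pairwise comparisons."""
    cache = {}

    def compare(a, b):
        if a == b:
            return 0
        if (a, b) not in cache:
            # Ask `replication` workers the same question and take the majority.
            votes = [ask_worker(a, b, i) for i in range(replication)]
            result = -1 if majority(votes) == a else 1
            cache[(a, b)] = result
            cache[(b, a)] = -result      # reuse the answer for the reversed pair
        return cache[(a, b)]

    return sorted(items, key=cmp_to_key(compare))
```

The sort itself stays a conventional one; only the comparison is delegated. Note that real crowd answers may be inconsistent across pairs (not transitive), which this sketch does not attempt to repair.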