This paper introduces l-Diversity, a privacy definition that goes beyond k-anonymity. The paper answers three questions: 1. Why is k-anonymity not good enough? 2. How does l-Diversity provide better privacy than k-anonymity? 3. How can l-Diversity be implemented efficiently? The paper first shows how someone can attack k-anonymity with the homogeneity attack, which exploits a lack of diversity among the sensitive values in a group of indistinguishable tuples. Also, k-anonymity does not protect against attacks based on background knowledge. Because of these drawbacks, a stronger definition of privacy that takes diversity and background knowledge into account is needed. The paper then introduces the l-Diversity approach in detail; its main idea is to require the values of the sensitive attributes to be well-represented in each group. By analyzing the advantages of l-Diversity and the weaknesses of Bayes-optimal privacy, several key insights are provided: 1. l-Diversity does not require knowledge of the full distribution of the sensitive and nonsensitive attributes. 2. l-Diversity does not require the data publisher to have as much information as the adversary; the parameter l protects against more knowledgeable adversaries, which is good since the data publisher should not make assumptions about what adversaries know. 3. Instance-level knowledge, which an adversary can use to rule out possible values of the sensitive attribute, is also covered. 4. l-Diversity simultaneously protects against different adversaries whose different background knowledge leads to different inferences. The implementation of l-diversity is briefly covered: based on l-diversity's monotonicity property, existing k-anonymity algorithms can be adapted to enforce l-diversity. Compared to k-anonymity, l-diversity protects against background knowledge and homogeneity attacks. Compared to Bayes-optimal privacy, l-diversity can be implemented efficiently thanks to the monotonicity property.
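The "well-represented" requirement, in its simplest (distinct l-diversity) form, can be checked mechanically by grouping records on their generalized quasi-identifier and counting distinct sensitive values per group. A minimal Python sketch, with column names and records that are illustrative rather than taken from the paper:

```python
from collections import defaultdict

def is_distinct_l_diverse(rows, quasi_cols, sensitive_col, l):
    """Distinct l-diversity: every q*-block (rows sharing the same
    generalized quasi-identifier values) must contain at least l
    distinct sensitive values."""
    blocks = defaultdict(set)
    for row in rows:
        key = tuple(row[c] for c in quasi_cols)
        blocks[key].add(row[sensitive_col])
    return all(len(values) >= l for values in blocks.values())

# Illustrative generalized table: the first block's records all share
# the disease "cancer", so the homogeneity attack succeeds there.
table = [
    {"zip": "130**", "age": "<30", "disease": "cancer"},
    {"zip": "130**", "age": "<30", "disease": "cancer"},
    {"zip": "1485*", "age": ">=40", "disease": "flu"},
    {"zip": "1485*", "age": ">=40", "disease": "cancer"},
]
print(is_distinct_l_diverse(table, ["zip", "age"], "disease", 2))  # False
```

Because the homogeneous first block carries only one sensitive value, the table fails 2-diversity even though its records are mutually indistinguishable.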
I think the results for l-diversity are very promising: it provides strong protection for the data while requiring little change to the algorithms data publishers currently use. One part I like about this paper is that it uses easy-to-understand examples to illustrate the shortcomings of k-anonymity, making that part easy to read. Another strength is that it lists the advantages of l-Diversity after the long mathematical definitions, which helped me get a high-level idea of the algorithm's advantages even when I did not follow all of the mathematical background. However, when introducing l-Diversity, the paper inevitably relies on a lot of probabilistic material that is hard to understand. |
k-anonymity is a popular definition in the data privacy field because many groups and organizations need to publish personal information about individuals while protecting their privacy. A k-anonymized dataset ensures that each record is indistinguishable from at least k-1 other records w.r.t. the identifying attributes. However, k-anonymity alone is not safe enough. The paper shows two possible attacks on k-anonymity. In the first, the attacker can discover the value of a sensitive attribute when the sensitive values in a group are not diverse enough. In the second, an attacker can break the protection by combining the published dataset with background knowledge from outside it. The paper analyzes these two attacks and proposes a more powerful privacy definition called l-diversity. Some of the strengths and contributions of this paper are: 1. Deriving the l-diversity definition from Bayes-optimal privacy makes the policy more practical. 2. l-diversity does not require the data publisher to know anything about the background knowledge the attacker has. 3. The paper implements l-diversity and conducts reasonable experiments comparing its performance with k-anonymous tables. Some of the drawbacks of this paper are: 1. Some of the principles the paper relies on lack strong support. For instance, the assumption that the attacker usually has more background knowledge than the data publisher is not entirely convincing. 2. The dataset used in the experiments is not diverse enough. 3. The current form of l-diversity can only protect a single sensitive attribute. |
It is important for organizations to provide strong privacy guarantees on their published data without revealing sensitive information. One of the most popular approaches is k-anonymity, which, however, cannot survive two simple attacks: the homogeneity attack and the background knowledge attack. Therefore, this paper proposes a new definition of privacy, called l-diversity, with a stronger privacy guarantee. l-diversity provides privacy even when the data publisher does not know what kind of knowledge the adversary possesses. The main idea behind l-diversity is the requirement that the values of the sensitive attributes be well-represented in each group. l-diversity was proposed to overcome the limitations of k-anonymity: as an extension of k-anonymity, it can ensure data privacy and avoid attribute disclosure even without identifying the adversary's background knowledge. The technique builds directly on the k-anonymity principle, adding a diversity requirement to each group. The main contributions of this paper are as follows: 1. It shows that k-anonymity is susceptible to homogeneity and background knowledge attacks, so a stronger definition of privacy is needed. 2. It introduces an ideal notion of privacy called Bayes-optimal privacy for the case where both the data publisher and the adversary have full (and identical) background knowledge. 3. It proposes a novel practical privacy definition, l-diversity, and provides a way to implement it efficiently. The main advantages of l-diversity are as follows: 1. It enforces a greater spread of sensitive attribute values within each group, thus increasing data protection. 2. It protects against attribute disclosure, an enhancement over k-anonymity. 3. The performance of l-diversity is slightly better than k-anonymity due to faster pruning by the l-diversity algorithm.
However, there are some limitations of l-diversity: 1. l-diversity can be redundant and laborious to achieve. 2. l-diversity is prone to attacks such as the skewness attack and the similarity attack, as it is inadequate to prevent attribute exposure arising from semantic relationships between the sensitive values. |
Problems & Motivations: Releasing databases to the public benefits many kinds of research. However, it may also introduce privacy problems. A database contains sensitive attributes that you do not want others to know; yet someone may infer which tuple represents you from the non-sensitive data and thereby obtain your sensitive data. The traditional way to measure the privacy level of a database is k-anonymity: a table satisfies k-anonymity if every record in the table is indistinguishable from at least k-1 other records with respect to every set of quasi-identifier attributes. However, a table that achieves k-anonymity may still risk leaking sensitive information, so k-anonymity does not guard privacy sufficiently. The paper presents two attacks on k-anonymity. The first focuses on homogeneity: in simple words, if all the records in a group that is indistinguishable on its non-sensitive data share the same sensitive value, then an attacker can still learn a volunteer's sensitive data. The second focuses on background knowledge: if the attacker has some background knowledge, he or she can use it to eliminate candidate values. Main Achievements: The authors first abstract the model. The privacy problem is, at bottom, a probability problem. Given a table T = {t1, t2, ..., tn} and a set of non-sensitive attributes Q (the quasi-identifier), whose attributes can be linked with external data to uniquely identify at least one individual in the general population, how can we process the data so that p(t[S] = s | t[Q] = q & background knowledge) is as uniform as possible? The model makes three assumptions about attackers: 1. They can access the published table. 2. They have "instance-level background knowledge". 3. They have "demographic background knowledge".
The principle (or privacy level) the authors propose is l-diversity, which is stronger than k-anonymity because, beyond grouping indistinguishable records, it requires every group to contain at least l well-represented values of the sensitive attribute. Drawbacks: 1. The paper seems to have many unnecessary mathematical expressions. For example, equation (3), n(q*, s') << n(q*, s), only expresses that almost all tuples share the same value s of some sensitive attribute S; plain English would be easier to understand. 2. The key insight behind the paper is straightforward, and I am confused about the purpose of constructing the model of posterior belief. I understand that this is the final value we want to minimize, but is there a clear correlation between it and the resulting principle? |
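The condition in equation (3) that the review paraphrases, n(q*, s') << n(q*, s), can indeed be stated plainly as a dominance check on a block's sensitive-value counts. A toy sketch; the 0.9 cutoff is my own illustrative choice, not a constant from the paper:

```python
from collections import Counter

def lacks_diversity(block_sensitive_values, threshold=0.9):
    """Flag a q*-block where one sensitive value s dominates, i.e.
    n(q*, s') << n(q*, s) for every other value s'. The 0.9 cutoff
    is an illustrative choice, not a constant from the paper."""
    counts = Counter(block_sensitive_values)
    return max(counts.values()) / sum(counts.values()) >= threshold

print(lacks_diversity(["cancer"] * 19 + ["flu"]))          # True: 95% cancer
print(lacks_diversity(["cancer", "flu", "heart", "flu"]))  # False
```

A block flagged this way gives an adversary a near-certain guess at the sensitive value, which is exactly the positive-disclosure risk the equation formalizes.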
As the amount of data being generated and collected continues to rise at increasing rates, so too does the motivation to conduct data analyses in order to generate meaningful insights. At the same time, however, there remains the challenge of preserving the anonymity of individuals, especially in areas such as health care records, without obfuscating the data so much that it becomes difficult to learn useful insights from it. This problem has been intensely studied, and one widely adopted approach has been the concept of k-anonymity, where a given record is indistinguishable from k-1 other records. This method is still vulnerable to a variety of attacks, which is exacerbated when attackers possess some background knowledge prior to attempting to identify sensitive attributes. In light of this, the paper's authors analyze the common factors behind several simple attacks on k-anonymity and introduce a new method called ℓ-diversity that aims to solve many of its shortcomings. As previously mentioned, k-anonymity is susceptible to a variety of attacks, two of which are presented in the paper. The first is called a homogeneity attack, in which all k indistinguishable members have the same sensitive attribute, allowing attackers to easily obtain the desired information (the sensitive attribute). The second is the background knowledge attack, where attackers can use external knowledge to rule out, or assign higher probabilities to, values of the sensitive attribute based on the values of the other attributes in the k-anonymous group. Given these vulnerabilities, it would be desirable to have a method whose output provides the attacker with very little additional information beyond their background knowledge, i.e. the prior and posterior beliefs should not differ by much.
This is a difficult goal to achieve, since the designer often does not have full knowledge of the probability distribution, nor do they know the full extent of the adversary's knowledge. The authors' ℓ-diversity method requires that each q*-block (a block of records sharing the same set of generalized non-sensitive attributes) contain at least ℓ different "well-represented" sensitive values, such that the ℓ most frequent values in the block occur at similar frequencies. This guards against the homogeneity attack, since the simple (yet common) case where all tuples in the block have the same sensitive value is eliminated. Additionally, by changing the value of ℓ, it is possible to change the strength of the data's resistance to the background knowledge attack: an attacker would need enough knowledge to rule out ℓ-1 possible values of the sensitive attribute in order to draw a definitive conclusion. Increasing ℓ increases the difficulty of that task without having to account for the specific types of knowledge potential attackers may have. Given this problem definition, the task of creating ℓ-diverse tables reduces to an optimization problem in the form of a lattice search, for which several well-known algorithms exist. The main strength of this paper is that it identifies shortcomings in a widely used anonymization technique and proposes a more robust method for the same task. In general, ℓ-diversity does not require full knowledge of the probability distribution over the attributes, nor does it need the data publisher to know what types of knowledge an adversary has; finally, instance-level knowledge is automatically covered by the definition.
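The "similar frequencies" idea is made precise in the paper's entropy ℓ-diversity instantiation: the entropy of the sensitive-value distribution within each q*-block must be at least log(ℓ). A short sketch with made-up values:

```python
import math
from collections import Counter

def is_entropy_l_diverse(block_sensitive_values, l):
    """Entropy l-diversity for one q*-block: the Shannon entropy of the
    sensitive-value distribution must be at least log(l), which forces
    the frequent values to occur at similar frequencies."""
    counts = Counter(block_sensitive_values)
    total = sum(counts.values())
    entropy = -sum((n / total) * math.log(n / total)
                   for n in counts.values())
    return entropy >= math.log(l)

print(is_entropy_l_diverse(["flu", "cancer", "heart"], 2))  # True
print(is_entropy_l_diverse(["flu", "flu", "flu"], 2))       # False
```

A homogeneous block has zero entropy, so it fails for any ℓ > 1; a uniform block over m values has entropy log(m) and passes for every ℓ ≤ m.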
Also, the experimental results show that ℓ-diversity can be run in almost the same amount of time as k-anonymity, on average, while still resisting homogeneity attacks that would compromise the same data under k-anonymity. One weakness of this paper is that it does not address the case where there are multiple sensitive attributes in play; the complexity increases due to potential interactions and information revealed by certain combinations of sensitive attributes. As the authors mention themselves, this is an area for future investigation. It would also be useful to see what effect these anonymization methods have on the accuracy and usefulness of the outputted data, i.e. how much does anonymizing the data reduce the utility of the generated tables? |
The purpose of this paper is to propose a new privacy definition called l-diversity, which protects against some of the vulnerabilities of k-anonymity. The paper discusses k-anonymity, where each record is indistinguishable from k-1 other records with respect to certain identifying attributes, and covers two privacy attacks against a k-anonymized dataset. The paper introduces the idea of microdata, which are tables of information about individuals. For privacy reasons, we do not want anyone to be able to uniquely identify an individual from this data. The data is sanitized by removing uniquely identifying attributes like name and social security number. However, combinations of other attributes can still uniquely identify an individual, as in the linking attack that used the "quasi-identifiers" birthdate, zip code, and gender, linked with external information, to identify the Governor of Massachusetts' sensitive medical information. k-anonymity protects against these linking attacks by creating more ambiguity among the quasi-identifiers. The paper then describes two attacks on k-anonymous data, which motivate the definition of l-diversity. The first is the homogeneity attack, where a sensitive attribute has little, if any, diversity of values: if your quasi-identifiers select a subset of the data, you may find that all records in it contain the same value of the sensitive attribute. The second is the background knowledge attack, where the subset selected by the quasi-identifiers retains some ambiguity, but background knowledge helps narrow down the possible values of a sensitive attribute. This paper's contribution is the l-diversity privacy principle, under which each value of a sensitive attribute is well-represented in every group of data. l-diversity is built on an ideal notion of privacy called Bayes-optimal privacy.
Bayes-optimal privacy involves modeling prior information as a probability distribution over the attributes in the data; Bayesian inference techniques can then be used to reason about privacy. The paper also defines positive and negative disclosure, where a value of a user's sensitive attribute can be identified or eliminated, respectively, with high probability. Building off of this, the authors introduce the uninformative principle: the published table should give the adversary little additional information beyond their prior belief. The authors then detail l-diversity, which is built on this notion of Bayes-optimal privacy. A q*-block is l-diverse if it contains at least l "well-represented" values of a given sensitive attribute, and a table is l-diverse if all of its q*-blocks are l-diverse. The value of l indicates how much background knowledge the data is safe from. I did not like how long this paper took to introduce its contribution. On the other hand, it was very self-contained: I did not feel that I needed any background knowledge coming in to understand the contributions. |
In this paper, the authors propose the idea of l-diversity, a notion intended to strengthen k-anonymity. The paper shows that k-anonymity is vulnerable to two kinds of attacks. One is the homogeneity attack: in a k-anonymous table, it is common for a group of tuples to all share the same value of the sensitive attribute. The other is the background knowledge attack, where the attacker has extra knowledge beyond the attributes provided in the table. l-diversity is proposed to address this: if the set of sensitive values corresponding to the records in an equivalence class contains at least l "well-represented" values, that equivalence class satisfies l-diversity, and if every equivalence class in the dataset satisfies l-diversity, the dataset satisfies l-diversity. Compared to the k-anonymity standard, datasets that conform to l-diversity significantly reduce the risk of attribute disclosure: for such datasets, in theory, an attacker has at most a 1/l probability of linking a specific user to their sensitive information through an attribute-disclosure attack. In general, datasets conforming to l-diversity are constructed by inserting interference data, but, like data generalization, inserting interference data can also result in information loss at the table level. The main contribution of this paper is that it provides a definition of l-diversity and an algorithm to implement it; the paper also discusses the potential attacks on k-anonymity. The weak point of l-diversity is that it is still vulnerable to other attacks. Skewness attack: if the distribution of sensitive values is skewed, l-diversity may be unable to prevent attribute disclosure. Similarity attack: if the distribution of sensitive values in an equivalence class satisfies l-diversity but the values are semantically similar or cohesive, the attacker may still learn important information. |
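The skewness attack mentioned above is easy to demonstrate with a simple frequency estimate of the adversary's belief (a simplification of the paper's full Bayesian formula; the data is made up):

```python
from collections import Counter

def posterior_belief(block_sensitive_values, target_value):
    """Adversary's belief that an individual in this q*-block has
    target_value, estimated as the published frequency of that value.
    This is a hypothetical frequency model, not the paper's formula."""
    counts = Counter(block_sensitive_values)
    return counts[target_value] / sum(counts.values())

# Two distinct values are present, so distinct 2-diversity holds, yet the
# distribution is so skewed that the adversary is 98% sure: the skewness attack.
skewed_block = ["HIV+"] * 49 + ["HIV-"] * 1
print(posterior_belief(skewed_block, "HIV+"))  # 0.98
```

This is why the skewed distribution is a residual risk: the diversity count is satisfied even though the adversary's belief is nearly certain.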
The paper starts by defining a problem: organizations want to publish microdata (tables with anonymized information about individuals). Obvious one-to-one identifiers, such as phone numbers or SSNs, are removed from the data, but combinations of other columns like zip code, age, and nationality may still identify an individual person. The mechanism historically used to protect against individuals being identified is k-anonymity: each record must be indistinguishable from at least k-1 other records with respect to these quasi-identifier attributes. Unfortunately, k-anonymity can still cause privacy concerns. In one case (the homogeneity attack), a k-anonymized table might show the same sensitive data for all k people. In the second case, some background knowledge may be used to identify a specific row. l-diversity attempts to prevent these attacks that are possible under k-anonymity. Before defining l-diversity, the authors define Bayes-optimal privacy, which unfortunately does not protect against all of the problems with k-anonymity. The definition of l-diversity is as follows: "a q*-block is l-diverse if it contains at least l 'well-represented' values for the sensitive attribute S." Essentially, on top of k-anonymity, l-diversity adds a constraint on the diversity of sensitive data allowed for each group. This solves our initial problems. The authors do a really great job explaining all of the possible attacks with clear examples. This makes it easy to understand a) the risks of k-anonymity and b) how l-diversity addresses those problems. By providing a running example with a set of "characters," the authors make a fairly theoretical topic approachable. They almost frame the problem like a story, with sentences like "but we are jumping ahead in our story." It makes the paper more fun to read!
This paper spent a significant amount of time describing the theoretical advantages of l-diversity, but did not give much elaboration when it came to implementation details. Primarily, it was simply framed in terms of how the solution built on existing solutions for k-anonymity. I understand that this paper was more theoretical in nature than other papers we have read for this class or that I have read myself in the past, but I think the authors could have gone into a bit more depth about the algorithms. I think this is especially true because the paper was published in a data engineering venue. |
In the paper "l-Diversity: Privacy Beyond k-Anonymity", Ashwin Machanavajjhala and colleagues give a detailed analysis of a powerful privacy definition called l-diversity. In the modern era, a great deal of sensitive information is used to accelerate research, discover trends, or allocate funds. Aggregation over this data is both necessary and vital for reaching conclusions, but uniquely identifying individuals within the dataset itself is a violation of privacy and unacceptable. Naively, one could remove columns such as name or SSN that uniquely identify an individual. However, this does not solve the issue: it is still possible to identify an individual through seemingly innocuous data (quasi-identifiers). To combat this, k-anonymity was established. k-anonymity is satisfied if every record in a table is indistinguishable from at least k-1 other records w.r.t. all sets of quasi-identifier attributes. This ensures that individuals cannot be uniquely identified by these devious linking attacks. Yet this definition still falls short against several attacks and does not guarantee privacy. Thus, l-diversity is defined and used to solve the issues k-anonymity faced. Notably, l-diversity provides privacy even when the data publisher does not know what kind of knowledge the adversary possesses. By requiring that the values of the sensitive attributes be well-represented in each group, l-diversity is a much more solid privacy definition than k-anonymity. Thus, exploring l-diversity is crucial for modifying current and future implementations of privacy-preserving databases. The paper is divided into several sections: 1) When does k-anonymity fail: There are two types of attacks: homogeneity attacks and background knowledge attacks. In a homogeneity attack, sensitive information can be extracted when it is the same for all individuals who share the same non-sensitive attributes.
Even though it is quite uncommon, it is still a situation that can occur. This signals that some sort of diversity is needed in order to combat this type of attack. The second attack, the background knowledge attack, narrows down an individual through non-sensitive data and uses real-life correlations to identify an individual's sensitive information. An attack like this is quite plausible. 2) Bayes-optimal privacy: In this model, we assume that the attacker knows the victim's quasi-identifier and the distribution of sensitive values conditioned on the quasi-identifiers. An attacker can then draw two types of conclusions: positive disclosure or negative disclosure. Positive disclosure is disastrous: the attacker correctly identifies the victim's sensitive value. Negative disclosure means the attacker has eliminated some of the possibilities for the victim. The problem stems from the fact that we don't know what distribution the attacker knows; we might not even know it ourselves! We also don't know the attacker's knowledge that is not modeled by quasi-identifiers. However, it is still possible to limit the attacker's belief that a data entry is associated with a certain sensitive value. 3) l-diversity: This operates under the assumption that the attacker has fewer than l-1 pieces of damaging information. l-diversity is guaranteed by ensuring that every q*-block has at least l well-represented sensitive values. There are two variations of l-diversity: entropy and recursive. Both variations ensure that an attacker cannot dismiss sensitive values to the point where a single value appears too frequently. 4) Implementation as an algorithm: k-anonymity has an important property called monotonicity that guarantees the correctness of all efficient lattice-search algorithms. However, Bayes-optimal privacy does not have this property. Both entropy and recursive l-diversity have the monotonicity property, and thus any algorithm used for k-anonymity can be applied to l-diversity.
Since all l-diversity tests are based on counts of sensitive values, checking the condition is very efficient. Much like other papers, this paper also has some drawbacks. The first drawback I noticed was in the experiments section. Machanavajjhala et al. used k-anonymity as a baseline for evaluating l-diversity, but I felt that other privacy mechanisms could also have been used to better situate l-diversity. Furthermore, I felt that the metrics used to compare the utility of the two methods were constructed in a way that gives l-diversity an edge: the average size of the q*-blocks and the discernibility metric are directly related to one another, so doing better on one implies doing better on the other. Another drawback I noticed was that the "related work" section could have been presented earlier as motivation for the creation of l-diversity; I didn't really feel that the exploits against k-anonymity were a major issue. Lastly, I felt that they could have discussed future work, perhaps concretely defining utility. |
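The monotonicity argument above can be illustrated with a toy generalization lattice over zip codes: once a generalization height satisfies distinct ℓ-diversity, every coarser height merges blocks and therefore still satisfies it, so a bottom-up lattice search can stop at the first success. Names and data here are illustrative, and this single-attribute lattice is far simpler than the paper's:

```python
from collections import defaultdict

def distinct_l_diverse(rows, height, l):
    """Check distinct l-diversity after generalizing the zip code by
    masking its last `height` digits (a toy generalization lattice)."""
    blocks = defaultdict(set)
    for zipcode, disease in rows:
        key = zipcode[: len(zipcode) - height] + "*" * height
        blocks[key].add(disease)
    return all(len(diseases) >= l for diseases in blocks.values())

def minimal_generalization(rows, l, max_height):
    """Bottom-up search: by monotonicity, once a height satisfies
    distinct l-diversity, every coarser height only merges blocks
    and therefore also satisfies it, so stop at the first success."""
    for height in range(max_height + 1):
        if distinct_l_diverse(rows, height, l):
            return height
    return None

rows = [("13053", "flu"), ("13068", "cancer"),
        ("14850", "flu"), ("14853", "heart")]
print(minimal_generalization(rows, 2, 5))  # 2: blocks "130**" and "148**"
```

The search returns the least generalized (and hence most useful) table that satisfies the constraint, which is exactly why monotonicity matters for efficiency.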
This paper describes l-diversity, an improvement to k-anonymity that allows an improved form of sharing unaggregated information without revealing sensitive data. Both are ways that some non-sensitive attributes of a dataset can be revealed without allowing an outsider to connect those attributes to a single person or to the sensitive, unrevealed attributes of the dataset. The goal is to allow some generalized data to be revealed without revealing anything sensitive. A generalized domain for a data value means gathering the values into groups (like ranges for a numeric type) and publishing only the group rather than the underlying data; a generalized dataset has this property for every revealed attribute. k-anonymity means that, given the revealed data, any single person's data is indistinguishable from that of at least k-1 other people. As such, people are grouped into groups of size k that have indistinguishable attributes. However, the paper argues that this is insufficient to protect sensitive data for two reasons. First, it's possible that the sensitive data for every group member is identical. In this case, it doesn't matter that the revealed data offers no distinctions, as any attacker can be certain what the sensitive data is. Secondly, the groups are only guaranteed to be indistinguishable based on the revealed data. A potential attacker may have additional background data on a single person that allows them to break the unity of the group and determine that person's sensitive data. The idea behind l-diversity is that an attacker has a prior belief about a person's sensitive attributes and a posterior belief formed after the data is revealed. Ideally, the difference between these beliefs should be small, so the attacker gains little information from the revealed data. To make something l-diverse, the same groups as for k-anonymity are used.
However, instead of each group requiring k indistinguishable entries, each group should have l well-represented sensitive values. As such, the attacker can't figure out which sensitive value is correct without ruling out l-1 possibilities. With multiple sensitive attributes, each attribute is considered individually, with all the other sensitive attributes treated as part of the quasi-identifier: for a group matching on all attributes except the one under consideration, that attribute should have at least l well-represented values. There is also the concept of recursive (c, l)-diversity, which says that the l-2 most common values (excluding the single most common one) could be removed without the most common value dominating the resulting distribution. This guarantees that even if an attacker can rule out several values, they still can't infer anything definitive from the resulting distribution. This paper was very good at explaining the weaknesses of k-anonymity. It also used clear formal language to describe all of the properties needed for l-diversity. In addition to the formalism, the paper gave intuitive definitions and examples of the concepts, which greatly helps understanding. On the downside, the formal notation can be hard to understand without a lot of work. There's not really a way to fix this, since precise notation is important when dealing with security concepts, but it does impede the flow of the paper. |
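The recursive variant sketched above is recursive (c, ℓ)-diversity in the paper: sort a block's sensitive-value counts as r1 ≥ r2 ≥ ... ≥ rm and require r1 < c·(rℓ + r(ℓ+1) + ... + rm). A sketch with made-up counts:

```python
from collections import Counter

def is_recursive_cl_diverse(block_sensitive_values, c, l):
    """Recursive (c, l)-diversity for one q*-block: with counts sorted
    r1 >= r2 >= ... >= rm, require r1 < c * (r_l + ... + rm). Even after
    an adversary rules out the l-2 next-most-common values, the most
    frequent value does not dominate what remains."""
    counts = sorted(Counter(block_sensitive_values).values(), reverse=True)
    if len(counts) < l:
        return False
    return counts[0] < c * sum(counts[l - 1:])

block = ["flu"] * 5 + ["cancer"] * 3 + ["heart"] * 2
print(is_recursive_cl_diverse(block, c=2, l=2))  # True:  5 < 2 * (3 + 2)
print(is_recursive_cl_diverse(block, c=2, l=3))  # False: 5 >= 2 * 2
```

The parameter c tunes how much dominance is tolerated, while ℓ fixes how many eliminations the adversary is assumed capable of.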
The paper presents l-diversity, a novel and powerful privacy definition. In recent years, the definition of privacy called k-anonymity has gained popularity. In a k-anonymized dataset, each record is indistinguishable from at least k-1 other records with respect to certain "identifying" attributes. However, k-anonymity has some subtle but severe privacy problems. First, an attacker can discover the values of sensitive attributes when there is little diversity in those attributes. Second, attackers often have background knowledge, and k-anonymity does not guarantee privacy against attackers who use it. These attacks are the homogeneity attack and the background knowledge attack. Based on these considerations, the authors observe that k-anonymity can create groups that leak information due to a lack of diversity in the sensitive attribute, and that k-anonymity does not protect against attacks based on background knowledge. To address this, the authors show that the notion of Bayes-optimal privacy naturally leads to a novel practical definition called ℓ-diversity. ℓ-diversity provides privacy even when the data publisher does not know what kind of knowledge the adversary possesses. The main idea behind ℓ-diversity is the requirement that the values of the sensitive attributes be well-represented in each group. I think this paper is helpful and convincing, as it makes accurate observations and validates its solutions with experiments. What I dislike is that the paper contains too many definitions and is hard to read.
k-anonymity is a definition of privacy in which each record in a dataset is indistinguishable from at least k-1 other records with respect to identifying attributes. This paper claims that a k-anonymized dataset has severe privacy problems and illustrates this point with two attacks; it then proposes a new, more powerful privacy definition called l-diversity. The first attack is the homogeneity attack, in which people can figure out sensitive attributes from non-sensitive attributes alone. Such cases are not uncommon. The second attack is the background knowledge attack, in which people can figure out sensitive attributes by combining background knowledge with non-sensitive attributes. Thus k-anonymity does not protect against attacks with background knowledge, and a stronger definition of privacy is needed. l-diversity provides privacy even when the data publisher does not know what kind of knowledge the adversary possesses. The main idea behind l-diversity is the requirement that the values of the sensitive attributes be well-represented in each group. The authors first introduce Bayes-optimal privacy, an ideal notion of privacy. It involves modeling background knowledge as a probability distribution over the attributes and using Bayesian inference techniques to reason about privacy. However, this approach is not practical and has limitations, such as the publisher's insufficient knowledge of the distribution. l-diversity is the practical version of the Bayes-optimal privacy definition, based on calculating the observed belief of the adversary. Positive disclosure can happen due to a combination of two factors: lack of diversity in the sensitive attributes and strong background knowledge. In the lack-of-diversity case, almost all tuples share the same value of the sensitive attribute. In the strong-background-knowledge case, no system can guard against attacks employing arbitrary amounts of background knowledge.
However, the system can still guard against instance-level knowledge. A further principle of l-diversity is that a q*-block is l-diverse if it contains at least l well-represented values for the sensitive attribute. To sum up, l-diversity requires neither knowledge of the full distribution of the sensitive and nonsensitive attributes nor that the data publisher have as much information as the adversary; instance-level knowledge is also automatically covered. The authors argue that l-diversity is practical, easy to understand, and addresses the shortcomings of k-anonymity with respect to background knowledge and homogeneity attacks. The strength of this paper is that it gives plenty of examples that help readers understand what l-diversity is capable of guarding against. The example attacks on k-anonymity are also helpful for understanding the need for l-diversity. The weakness of this paper is that the theoretical equations are quite hard to understand; more explanation of those equations would help. |
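To make the q*-block principle above concrete, here is a minimal sketch of the simplest possible instantiation, where "well-represented" is read as "distinct". This is only one reading; the paper makes "well-represented" precise via entropy and recursive variants. The example blocks and disease names are invented:

```python
def is_distinct_l_diverse(block, l):
    """A q*-block is distinct l-diverse if it contains at least l
    distinct sensitive values (the weakest reading of
    'well-represented')."""
    return len(set(block)) >= l

def table_is_l_diverse(blocks, l):
    """A table is l-diverse if every one of its q*-blocks is."""
    return all(is_distinct_l_diverse(b, l) for b in blocks)

# Invented example: two q*-blocks of sensitive (disease) values.
blocks = [
    ["Flu", "Cancer", "Flu"],            # 2 distinct values
    ["Flu", "Cancer", "Heart Disease"],  # 3 distinct values
]
print(table_is_l_diverse(blocks, 2))  # True
print(table_is_l_diverse(blocks, 3))  # False: the first block has only 2
```

Even this crude check already defeats the homogeneity attack for l ≥ 2, since no block can be uniform in its sensitive value.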
“ℓ-Diversity: Privacy Beyond k-Anonymity” addresses the challenge of keeping data anonymous against attacks powered by an adversary who combines background knowledge (i.e., knowledge of non-sensitive identifiers and/or common knowledge) with a lack of diversity in the relevant data records. This matters for data that either is publicly available or, if inappropriately accessed by an adversary, would reveal sensitive information. k-Anonymity is an earlier definition that provides some guarantees about data privacy: if a table is k-anonymized (i.e., certain data attributes are removed or generalized), then a given record’s quasi-identifier (i.e., its non-sensitive values) is indistinguishable from that of at least k-1 other records. This is a nice guarantee that the adversary cannot determine exactly which record corresponds to a particular person/item they are looking for, and in many cases it keeps sensitive information about that person/item hidden, but not in all cases. If all or a majority of the k records have the same value for the sensitive attribute, then the sensitive value is revealed for the person/item of interest. Relatedly, if the adversary has background knowledge that enables them to eliminate certain sensitive values (and therefore records) as the answer, then a lack of diversity in the remaining records’ sensitive values could also reveal the sensitive value for the person/item of interest. The new definition proposed in this paper, l-diversity, provides a stronger guarantee of privacy, such that an l-diverse table is not open to the above two attacks. If a table is l-diverse, then for a given block of records with the same quasi-identifier, there are at least l “well-represented” sensitive values amongst those records. This protects against attacks where the adversary has background information that eliminates l-2 or fewer sensitive values.
In essence, an l-diverse table has, per block, a large enough set of well-represented sensitive values that a sizable number (i.e., l-1 or more) of sensitive values must be eliminated before sensitive information is revealed. The paper also describes different instantiations of l-diversity (entropy, recursive, positive disclosure-recursive, and negative/positive disclosure-recursive) and how to implement the creation of an l-diverse table. Finally, the paper presents experimental results for a few different measures: 1) which k-anonymous tables correspond to l-diverse tables protected against homogeneity attacks, 2) the performance of generating k-anonymous and l-diverse tables (where k=l), which is comparable in general, and 3) the utility of k-anonymous and l-diverse datasets over a few different k=l parameters; in general, k-anonymity, entropy l-diversity, and recursive l-diversity have similar utility, except for larger values of k and l (k=l=8), where entropy l-diversity has much lower utility. I think it’s a clever approach, making guarantees about privacy based on the degree of diversity among the sensitive values of the data records. The experimental results show that there are datasets and values of k and l for which most k-anonymous tables are not protected against homogeneity attacks (and the authors show which l-diverse tables the immune k-anonymous tables correspond to). I also appreciate the utility experiments, because as privacy increases, utility decreases, so it’s important to see which k-anonymity and l-diversity instantiations provide the highest utility. Although the “homogeneity attack” experiment gives examples of how many l-diverse tables can be created for these two particular datasets, it would still be nice to know how easy it is to create an l-diverse dataset; in other words, for what kinds of datasets is it not possible to create an l-diverse table?
It would also be nice to know what value of l is appropriate or ideal for a given table size, number of possible sensitive values, or degree of diversity in a table. It is hard to see how, in practice, I would pick my value of l. |
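One way to build intuition for picking l is the entropy instantiation this review mentions: a table is entropy l-diverse iff every q*-block's sensitive-value entropy is at least log(l), so exp(entropy) acts as the largest l a given block supports. A minimal sketch under that reading (the example blocks are invented):

```python
import math
from collections import Counter

def effective_l(block):
    """Entropy l-diversity requires -sum(p * log p) >= log(l) in each
    q*-block; equivalently l <= exp(entropy). Return that bound: the
    largest l the block supports under the entropy instantiation."""
    n = len(block)
    counts = Counter(block).values()
    entropy = -sum((c / n) * math.log(c / n) for c in counts)
    return math.exp(entropy)

# A perfectly balanced block over 4 values supports l up to ~4 ...
print(effective_l(["a", "b", "c", "d"]))
# ... while a skewed block over the same 4 values supports less,
# which is why entropy l-diversity loses utility at large l: it
# forbids any one sensitive value from being too frequent.
print(effective_l(["a"] * 7 + ["b", "c", "d"]))
```

Scanning a table's blocks with a function like this would show the largest l the data can sustain before generalization must merge blocks, which speaks to the reviewer's question about choosing l.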
This paper proposes l-diversity, a new and powerful privacy definition for data. The paper first shows that k-anonymity is vulnerable to two kinds of attacks, so it is not a desirable approach to protecting privacy. The two attacks are the homogeneity attack and the background knowledge attack, and the paper gives a detailed explanation of both; the examples help readers understand how the attacks work. The first attack is interesting: it shows that when the sensitive attributes lack diversity, users can likely infer them by narrowing down the search space. The second attack is also reasonable: by integrating external knowledge, users can infer the sensitive attribute with some probability. After discussing the attacks, the paper proposes l-diversity, gives a formal definition, and shows through experimental evaluation that l-diversity is practical and efficient. The paper first discusses Bayes-optimal privacy, an ideal notion of privacy for the case in which both the data publisher and the adversary have full background knowledge. However, this ideal notion is not practical, since the data publisher is unlikely to possess all of that information. To tackle this problem, the paper proposes l-diversity; the main part of the paper gives the formal definition of the method, which I will not recount in detail. The strengths of the paper are very clear to me. It proposes a new privacy model with a mathematical definition, and the experiments seem sound; the definition of the model appears well designed and proven. I like that the experiments include a clear comparison with k-anonymity. The weakness of the paper, to me, is that it is a little boring to read so many definitions; more examples alongside the definitions would help readers understand the paper more quickly. |
In this paper, the authors propose a novel and powerful privacy definition called l-diversity, a stronger privacy definition beyond k-anonymity. The problem addressed in this paper is linking attacks. Microdata is an important source for data analysis; however, a person's private information may leak if that person can be uniquely identified in the microdata. One solution is to guarantee the k-anonymity of the data: in a k-anonymized dataset, each record is indistinguishable from at least k − 1 other records with respect to certain “identifying” attributes. However, the authors point out two simple attacks that expose a k-anonymized dataset to severe privacy problems; they give a detailed analysis of these two attacks and propose their solution. This problem is very important, because hacking microdata to identify individuals violates their privacy, which is neither moral nor safe, so a new mechanism is needed to avoid leaking personal information. The paper introduces the practical definition of l-diversity, which provides privacy without needing to know how much information the adversary possesses. First, the authors show that traditional k-anonymity is not secure against two kinds of attacks. They find that an attacker can discover the value of sensitive attributes when there is little diversity in those sensitive attributes, and that attackers often have background knowledge, against which k-anonymity does not guarantee privacy. The main contribution of this paper is the introduction of l-diversity, derived from Bayes-optimal privacy, which can prevent linking attacks. However, Bayes-optimal privacy itself faces some problems, such as the data publisher's insufficient knowledge and the unknown extent of the adversary's knowledge.
In the presence of multiple adversaries, Bayes-optimal privacy may also fail to preserve privacy. To deal with these problems, l-diversity comes to help. An equivalence class is said to have l-diversity if there are at least l “well-represented” values for the sensitive attribute. l-Diversity overcomes the shortcomings of Bayes-optimal privacy because it does not require knowledge of the full distribution of the attributes, and one can implement it easily by checking l-diversity wherever an existing algorithm checks k-anonymity. This paper has several strengths. First, its insight is great: the authors point out the potential problems of k-anonymity and propose a solution. Second, they express the idea of l-diversity with mathematical formulas in a fairly straightforward way, and the paper uses rich examples and a clear description of l-diversity, making it easy to follow and understand. Besides, the experiments section of the paper is thorough; it shows that l-diversity provides higher security than k-anonymity, and the results are convincing. Generally speaking, this is an interesting paper solving the k-anonymity problem. I also find some drawbacks. First, as the authors note in their future work, the attributes the method can handle are still quite limited: l-diversity handles a single sensitive attribute, but not multiple sensitive attributes or continuous sensitive attributes. Since multiple sensitive attributes are common in real situations, the current version of l-diversity may not be so practical. Secondly, regarding the tradeoff between utility and privacy, the paper pays more attention to privacy, so the concept of utility is not well developed.
Besides, another concern is that the Bayes-optimal solution is not practical under insufficient knowledge and multiple adversaries; the authors relax the constraint and solve the problem assuming strong background knowledge, so when only limited background knowledge is available, the method may not work as well. |
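The implementation point raised above, that l-diversity can be obtained by checking l-diversity wherever an existing algorithm checks k-anonymity, can be sketched as two interchangeable per-block predicates. This is a toy sketch with invented data; the recursive (c, l) predicate follows the paper's r1 < c(r_l + ... + r_m) condition on sorted sensitive-value frequencies:

```python
from collections import Counter

def satisfies_k_anonymity(blocks, k):
    """k-anonymity: every q*-block holds at least k records."""
    return all(len(b) >= k for b in blocks)

def satisfies_recursive_cl(blocks, c, l):
    """Recursive (c, l)-diversity: with sensitive-value frequencies
    sorted as r1 >= r2 >= ... >= rm, each block must satisfy
    r1 < c * (r_l + r_{l+1} + ... + r_m), so the most frequent value
    never dominates the tail."""
    for block in blocks:
        freqs = sorted(Counter(block).values(), reverse=True)
        if len(freqs) < l or freqs[0] >= c * sum(freqs[l - 1:]):
            return False
    return True

# Invented example blocks of sensitive values.
ok = [["Flu", "Flu", "Cancer", "Ulcer"],
      ["Flu", "Cancer", "Cancer", "Ulcer"]]
bad = [["Cancer", "Cancer", "Cancer", "Flu"]]  # one value dominates

print(satisfies_k_anonymity(ok, 4))       # True
print(satisfies_recursive_cl(ok, 2, 2))   # True: 2 < 2 * (1 + 1)
print(satisfies_recursive_cl(bad, 2, 2))  # False: 3 >= 2 * 1
```

Because the l-diversity instantiations are monotone under generalization, just like k-anonymity, an existing generalization-search algorithm can keep its overall structure and simply call the diversity predicate in place of the block-size predicate.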
This paper introduces l-diversity, a definition of privacy motivated by the weaknesses of k-anonymity, a privacy definition that guarantees that every record in a dataset is indistinguishable from at least k-1 other records with respect to a set of identifying attributes. The two weaknesses are: 1) if there isn’t sufficient diversity in a dataset, an outsider can infer individuals’ private information, because if everyone in a subset of a dataset has the same values for certain columns, not knowing the identifying columns doesn’t matter; 2) attackers with background knowledge have an increased ability to break the protection that k-anonymity offers, because hiding only identifying attributes is less useful if a person already knows the identifying attributes of a few individuals and can infer other information by comparing what they know to the public information in dataset records. L-diversity addresses this by modeling privacy using Bayesian statistical theory and by assuming that an adversary will have at least partial knowledge of a database. This background knowledge is modeled as a probability distribution; doing this alone isn’t enough, because a publisher usually won’t know what this distribution is or what the adversary knows, so a principle called l-diversity is introduced, which states that a block is l-diverse if it contains at least l “well-represented” values for each sensitive attribute; instantiating this can prevent the attacks on k-anonymous tables. The main contribution of the paper is the rigorous formal definition and analysis of this new definition of privacy, similar to the paper we read about the new definition of isolation levels. The paper also includes experimental analysis showing that l-diversity can provide good performance, which gives convincing evidence that l-diversity can be implemented realistically as well.
Another advantage is that existing k-anonymity algorithms can be adapted to produce l-diverse tables, which means at least a slight degree of backwards compatibility exists. One weakness is that the paper is pretty dense; most of this is due to its nature as a formal specification, so this might be an unfair thing to list as a “weakness”. |