Organizations would like to publish microdata containing information about their users without aggregation, in a way that protects individuals’ privacy. Unfortunately, even after identifiers such as social security numbers and names are removed, remaining pseudo-identifiers like zip code and date of birth can still allow individuals to be de-anonymized. Prior work has suggested approaches to further anonymize data releases, such as data swapping and permutation, or generalization. One such approach is k-anonymity. A database whose pseudo-identifiers have been generalized (e.g., by suppressing the last digits of each zip code) is considered k-anonymous if each group of records with equivalent pseudo-identifiers is at least k records in size.
There are two key problems with k-anonymity as a measure of privacy. First, homogeneous values for sensitive fields within a group of records having the same pseudo-identifier can allow attackers to infer the value for a person in that group. Second, prior knowledge held by an attacker can allow the attacker to deduce, from the set of sensitive values for a matching-identifier group, that a person of interest in that group probably has a certain sensitive value.
The authors propose l-diversity as an alternative measure of data set privacy protection. A data set is l-diverse, for l at least 2, if for every group of records with matching pseudo-identifiers, there are at least l “well-represented” values for the sensitive field. Two proposed measures of being well-represented are entropy and recursive (c,l)-diversity. It would be ideal to use Bayesian methods to produce a data set that minimizes an attacker’s information gain, while maintaining value for researchers. However, this is not feasible, because there is no way to find out potential attackers’ prior knowledge. L-diversity is a practical way to estimate what attackers can learn about individuals from a data set.
Even with the l-diversity measure, one cannot anonymize a database that has extremely low diversity in its sensitive values, because there will be low entropy in the sensitive values even after generalization. While (c,l)-diversity is easier to satisfy than entropy l-diversity in such cases, this does not cure the underlying problem that some data sets are too homogeneous to release while protecting individuals’ information and providing value to researchers.
The paper deals with the concept of k-anonymity and the authors’ attacks on it, which reveal its privacy problems. A large amount of person-specific data has been collected in recent years by governments and private entities. Data extracted by data mining techniques represent a key asset to society. Some information is supposed to be made public, but most people-specific information should be anonymized. This is usually done by removing personally identifying information like name, SSN, and email from the data. But this method is prone to linking attacks, which look for links in the data and re-identify the personal information. To deal with this type of attack, each combination of identifying attributes is made to appear in at least k records. This technique is called k-anonymity.
K-anonymity is subjected to two types of attacks by the authors: homogeneity attacks and background knowledge attacks. A homogeneity attack exploits groups whose records all share the same sensitive value, so that linking a person to the group reveals that value. A background knowledge attack uses additional knowledge the attacker already possesses to rule out sensitive values and zero in on the one that applies to the target. To reason about such attacks, a model termed Bayes-Optimal Privacy is used. It applies Bayesian inference to reason about privacy by modelling the background knowledge as a probability distribution over the attributes. Principles such as positive and negative disclosure help identify whether an adversary can correctly identify or eliminate the value of a sensitive attribute with high probability. But Bayes-optimal privacy requires knowledge of the attacker that the publisher does not have, making it impractical. The authors suggest the l-diversity principle, which bounds the observed belief of the adversary; an attack succeeds only given a lack of diversity in the sensitive attributes and/or strong background knowledge. Distinct l-diversity requires each equivalence class to have at least l well-represented sensitive values. It doesn’t prevent probabilistic inference attacks.
The authors are comprehensive in providing their solution of using l-diversity as a means of anonymizing data. The algorithms used provide a stronger level of privacy than k-anonymity routines. The discussion is supplemented by testing results as well.
The paper, though, is a little less convincing when the subject of testing is broached. The performance of l-diversity doesn’t match that of k-anonymity, and extensions such as entropy l-diversity are too restrictive to be practical.
What is the problem addressed?
Publishing data is an increasingly common practice that helps resource allocation and trend analysis. Ensuring that sensitive information is not disclosed is therefore an important problem. This paper discusses this problem, presents two simple attacks under which the previous privacy notion fails, and proposes an alternative privacy notion to circumvent those weaknesses.
Privacy has become more and more important as more and more information flows over networks. However, privacy differs from cryptography because of its more diverse settings, and the problem this paper tackles is one of them. Recoverability matters in two ways: one is ensuring privacy; the other is learning through partial information, which is the main topic of machine learning.
1-2 main technical contributions? Describe.
The paper presents two natural scenarios in which k-anonymity, the conventional privacy notion at the time, fails. One is called the homogeneity attack: even though a table is k-anonymous, the anonymity is useless if all records in a group have the same sensitive value, since the adversary can still figure out the sensitive information. The other is the background knowledge attack: the adversary has prior knowledge and, combined with the information in the table, can learn something from the conjunction of the two.
Basically, the previous two settings are problematic because k-anonymity fails to consider the correlation between nonsensitive and sensitive data. Therefore, the l-diversity notion this paper presents ensures that, given the partial information in the nonsensitive data, the distribution of the sensitive data has enough “diversity.” That is why the paper uses entropy, which is a measure of the diversity of a distribution.
1-2 weaknesses or open questions? Describe and discuss
I think the problem is not well defined. On one hand, we want to prevent the adversary from learning additional information. On the other hand, we still want to publish as much of the data as possible (otherwise we could simply not release any data, which would achieve privacy trivially). Therefore, a privacy measure should consider both how little information the adversary can derive and how much information the publication conveys.
This work seems like classical cryptography in that it lacks a concrete theoretical guarantee and a clear definition of the adversary.
The question is quite relevant to cryptography and information theory, and it would be helpful to adopt notions from them, for example zero knowledge. On the other hand, an easy alternative method to enhance privacy is adding noise (inserting dummy entries to skew the distribution). In short, I like the setting of the problem.
This paper presents a new framework called l-diversity, a novel and powerful privacy definition for databases (better than k-anonymity). In short, l-diversity gives stronger privacy guarantees for datasets than the well-known k-anonymity framework. The idea and framework are presented in this paper, and experiments show that the implementation is easy to carry out and that l-diversity is practical in real life. The paper first introduces k-anonymity and two simple attacks, the homogeneity attack and the background knowledge attack, which make k-anonymity vulnerable. Second, it presents the ideal privacy notion, Bayes-optimal. Then it moves to the details of l-diversity, including the model, notation, principle, instantiations, and implementation. Then it presents experiments on generating l-diverse tables and a performance evaluation. Finally, it discusses related and future work.
The problem here is that the known popular privacy framework k-anonymity has some vulnerabilities, which conflicts with the requirement of stronger privacy for datasets. It is important to maintain privacy in datasets such as medical records, voter registration, and customer data. k-Anonymity means that each record in a table is indistinguishable from at least k-1 other records. However, two attacks make it vulnerable: the homogeneity attack (small diversity) and the background knowledge attack (the attacker has knowledge of exact values of quasi-identifiers). Therefore, this paper presents a better framework called l-diversity to provide stronger privacy.
The major contribution of the paper is that it gives a clear discussion of dataset privacy and clearly presents the potential attacks on k-anonymity. It also discusses the ideal notion of privacy, Bayes-optimal, and how the practical l-diversity framework is derived from it.
One interesting observation: this paper is very innovative, and it shows that k-anonymity has some vulnerabilities. By developing a new framework, l-diversity, it improves dataset privacy. One possible weakness is that it only supports discrete sensitive attributes; it would be better if it could support continuous sensitive attributes. Also, it would be better if there were a standalone implemented program or embedded system.
This paper focuses on an interesting problem, not strictly related to database systems, but relating to the data that is stored. Large organizations, especially those funded by public taxpayer money, release data for public use. This can include public voter registration information, results from medical studies, university studies, census data, and customer data from corporations. While this data may be "anonymous" by itself, we can often link it with other data sources and find that we can uniquely identify people by combining tables from different data releases. Even some "non-identifying" information can be uniquely identifying - for example, 87% of the US population is uniquely identified by date of birth, gender, and five-digit zip code. We can combine these traits with similar information found in medical records to determine the medical history of people found in voting records - indeed, this happened with the governor of Massachusetts.
Anonymizing the data to some degree can help. This involves selecting some attributes as sensitive and some as non-sensitive. The sensitive ones are often the values of interest, but we want to make sure that those values cannot be connected back to an actual person. However, with background knowledge, this is possible. For example, if I know that a friend of mine who is Japanese was admitted to the hospital, I can check the data and see that people in his age, gender, and zip code group had either heart disease or a viral infection. However, Japanese have very low rates of heart disease, so I can assume he had a viral infection. This would be made even easier if everyone in the group had a viral infection; then I wouldn't even need any background info - this is known as a homogeneity attack.
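Both attacks can be sketched on a toy equivalence class. This is a minimal illustration, not the paper's code; the block contents and helper names are hypothetical:

```python
# Toy anonymized block: the attacker knows the target falls in this
# equivalence class (same age group / gender / zip prefix).
block = ["heart disease", "viral infection", "viral infection"]

def homogeneous(sensitive_values):
    # Homogeneity attack: if every record shares one sensitive value,
    # that value is disclosed to anyone who can place the target in the block.
    return len(set(sensitive_values)) == 1

def eliminate(sensitive_values, implausible):
    # Background knowledge attack: remove values the attacker knows are
    # implausible for the target (e.g., heart disease for a Japanese patient).
    return [v for v in sensitive_values if v not in implausible]

print(homogeneous(block))                        # False: two distinct values
print(set(eliminate(block, {"heart disease"})))  # only "viral infection" remains
```

The point of the sketch is that the second call collapses a two-value block to a single candidate, so k-anonymity alone gave no protection.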
The authors note all of these problems and propose a stronger form of anonymity. It uses the idea of Bayes-optimal privacy to provide a defense even in the presence of strong background knowledge. The results look promising, and the extra time spent making sure the data is anonymous is absolutely worth it - in the face of almost constant data leaks, protecting (even publicly released) data is important.
Problem and Solution:
The problem addressed in the paper is the linking attack. Microdata are an important source for research and analysis, but the private information of a person may be leaked if the person can be identified uniquely from the microdata. One solution is to ensure the k-anonymity of the data: a privacy rule requiring that every record share its quasi-identifier attributes with at least k-1 other records, so that no person can be uniquely identified by a linking attack. This paper provides the practical definition of l-diversity, which provides privacy without needing to know how much information the adversary possesses.
The contribution of the paper is that it puts forward l-diversity privacy, which prevents the linking attack. It is derived from Bayes-Optimal Privacy. However, Bayes-optimal privacy does not work well in practice, because the data publisher's knowledge is limited and the amount of the adversary's knowledge is unknown; with multiple adversaries, Bayes-optimal privacy may also fail. l-Diversity targets the two situations in which the linking attack succeeds: when the data set lacks diversity and when the attacker has a lot of background knowledge about the persons. However, l-diversity currently cannot support multiple sensitive attributes. That situation is more challenging because even if a table is l-diverse in each single attribute separately, l-diversity over all the attributes together may not be met.
Although the paper provides an interesting algorithm to solve the k-anonymity problem, there are still some weaknesses. One is that the attributes it can handle are still quite limited. When there are multiple sensitive attributes, it cannot guarantee l-diversity. In real situations, multiple sensitive attributes are common, so the current version of l-diversity is not practical for real use.
k-anonymity is a definition of privacy that protects against sensitive information leakage. In a k-anonymized dataset, each record is indistinguishable from at least k-1 other records with respect to certain “identifying” attributes.
This paper shows two simple attacks demonstrating that a k-anonymized dataset has severe privacy problems. (1) Attackers can discover the values of sensitive attributes when there is little diversity in those attributes. (2) Attackers often have background knowledge, and k-anonymity does not guarantee privacy against attackers using it; for example, the attacker can narrow the result down to a smaller range of values.
To deal with such problems, the paper introduces a new definition of privacy named l-diversity. An equivalence class is said to have l-diversity if there are at least l “well-represented” values for the sensitive attribute. The term “well-represented” means there are at least l distinct values for the sensitive attribute in each equivalence class and the most frequent value does not appear too frequently, and the less frequent values do not appear too rarely.
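The simplest reading of this requirement, distinct l-diversity, can be checked by grouping on the quasi-identifier and counting distinct sensitive values per equivalence class. The sketch below is mine, not the paper's; the table and column names are hypothetical:

```python
from collections import defaultdict

def distinct_l_diverse(rows, quasi_ids, sensitive, l):
    """Check distinct l-diversity: every equivalence class (rows sharing the
    same quasi-identifier values) must contain at least l distinct values
    of the sensitive attribute."""
    blocks = defaultdict(set)
    for row in rows:
        key = tuple(row[q] for q in quasi_ids)
        blocks[key].add(row[sensitive])
    return all(len(values) >= l for values in blocks.values())

# Hypothetical generalized records: zip codes truncated, ages bucketed.
table = [
    {"zip": "130**", "age": "<30",  "condition": "flu"},
    {"zip": "130**", "age": "<30",  "condition": "cancer"},
    {"zip": "148**", "age": ">=40", "condition": "flu"},
    {"zip": "148**", "age": ">=40", "condition": "flu"},
]
print(distinct_l_diverse(table, ["zip", "age"], "condition", 2))
# False: the second equivalence class has only one distinct condition
```

Note that the second block here is exactly the homogeneous case described above: it is 2-anonymous but leaks its single sensitive value.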
At the beginning of the paper, the authors describe two simple attacks that k-anonymity cannot protect against. Then the paper introduces a new definition of privacy which is stronger and safer. The paper uses experiments to show that l-diversity is practical and works efficiently.
(1) It is hard to implement if the sensitive attribute has a very small range of values.
(2) I think it is still dangerous if an individual’s information can be narrowed to a small range of values. For example, if a person’s record lies in a group in which 99% of the sensitive values are semantically “bad,” that person will be upset, and attackers can still learn something about them.
This paper introduces a new form of privacy protection for micro datasets called l-diversity. k-anonymity is a popular method for protecting the privacy of individuals within a dataset by ensuring that there are at least k-1 other individuals who share the same non-sensitive tuple values. However, this paper points out that k-anonymity fails to maintain privacy in the face of homogeneity and background knowledge attacks. For homogeneity, the main problem is that although the non-sensitive columns may be generalized, a uniform sensitive column can still reveal some users' sensitive values. Background knowledge attacks result from the attacker possibly having more specialized knowledge of the data that allows elimination or refinement of some values that are supposed to be secret. l-diversity seeks to solve these problems by ensuring that even if tuples are similar in their non-sensitive column fields, the sensitive column contains at least l different values. The paper shows that this scheme protects against both homogeneity and background knowledge attacks, as well as instance-level knowledge attacks.
l-diversity is a very interesting and simple solution to the problems of k-anonymity, although it is limited in that the algorithms in the paper only allow for l-diversity on a single, discrete column. To me, this suggests a proof of concept rather than a full work. Only being able to protect a single column is not very satisfying, since data may have many more columns that need to be protected, such as a credit card number and an SSN.
It was refreshing to see that the algorithms to compute k-anonymous tables can be reused for l-diversity. New works that solve problems of older works sometimes try to replace the existing framework as well. This paper instead takes advantage of existing algorithms, which means that current optimizations of those algorithms can probably be applied to l-diverse algorithms - beneficial if the optimizations are the results of refinement over the years.
Motivation for l-Diversity
The purpose of l-Diversity is to present data about individuals without revealing sensitive information about them. This is important because more and more organizations are publishing unaggregated information about individuals, which is valuable for deciding, for example, where to allocate funds, for medical research, and for trend analysis, but would be unacceptable if individuals could be traced back from their private information. The recently popular k-anonymity definition of privacy fails in that attackers can discover values of sensitive attributes when there is little diversity in them, and in that it does not guarantee privacy against attackers with background knowledge. l-Diversity is a stronger notion of privacy, with more defenses against attacks, and can be efficiently implemented.
The l-Diversity principle says that a q*-block is l-diverse if it contains at least l “well-represented” values for the sensitive attribute S, and a table is l-diverse if every q*-block is l-diverse. For values to be “well-represented,” the block must have at least l >= 2 different sensitive values such that the l most frequent values have roughly the same frequency; an adversary then needs l-1 damaging pieces of background knowledge to eliminate l-1 possible sensitive values and infer a positive disclosure.

Entropy l-Diversity says that a table is entropy l-diverse if, in every q*-block, the entropy of the distribution of sensitive values is at least log(l). This may be too restrictive when some positive disclosures (such as revealing that a patient has a “heart problem,” since that is common and not very identifying) are acceptable. Recursive (c,l)-Diversity is less conservative: in a given q*-block, let r(i) denote the number of times the ith most frequent sensitive value appears in that block. Then, for a constant c, the q*-block satisfies recursive (c,l)-diversity if r(1) < c(r(l) + r(l+1) + … + r(m)), and a table satisfies recursive (c,l)-diversity if every q*-block does. 1-diversity is always satisfied.

In more concise terms, a q*-block satisfies the condition if we can eliminate the most frequent values of S, excluding the disclosed ones, without making any remaining value too frequent in what is left. In other words, after we remove the sensitive values with counts r(1), …, r(y-1), the result is (l-y+1)-diverse. Positive Disclosure-Recursive (c,l)-diversity says that even though the l-1 most frequent values can be disclosed, we still do not want r(y) to be too frequent once l-2 of them have been eliminated. Negative/Positive Disclosure-Recursive (c1,c2,l)-Diversity is satisfied when the table meets pd-recursive (c1,l)-diversity and every s in W appears in at least c2 percent of the tuples in every q*-block.
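The recursive condition r(1) < c(r(l) + … + r(m)) reduces to a few lines over the sorted frequency counts of one q*-block. This is my own sketch; the sample block and counts are hypothetical:

```python
from collections import Counter

def recursive_cl_diverse(sensitive_values, c, l):
    """Recursive (c,l)-diversity for one q*-block: with r(1) >= r(2) >= ...
    the sorted frequency counts of the m distinct sensitive values,
    require r(1) < c * (r(l) + r(l+1) + ... + r(m))."""
    counts = sorted(Counter(sensitive_values).values(), reverse=True)
    m = len(counts)
    if m < l:
        return False  # fewer than l distinct values cannot qualify
    return counts[0] < c * sum(counts[l - 1:])

# Hypothetical block with frequency counts 5, 2, 2.
block = ["flu"] * 5 + ["cancer"] * 2 + ["ulcer"] * 2
print(recursive_cl_diverse(block, c=3, l=2))  # 5 < 3 * (2 + 2) -> True
print(recursive_cl_diverse(block, c=1, l=2))  # 5 < 1 * (2 + 2) -> False
```

The second call shows how the constant c tunes how dominant the most frequent value may be: the same block passes at c=3 but fails at c=1.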
Multi-Attribute l-Diversity is met when, for each sensitive attribute, the table is l-diverse treating that attribute as the sole sensitive attribute and folding the remaining sensitive attributes into the quasi-identifier.
l-Diversity counters the weaknesses of Bayes-optimal privacy because it does not require knowledge of the full distribution of the sensitive and nonsensitive attributes, does not require the data publisher to know as much as the adversary (the parameter l protects against more knowledgeable adversaries), covers instance-level knowledge, and protects against all forms of background knowledge without having to check which inferences can be made with which levels of background knowledge. To implement l-diversity, every time an algorithm would check for k-anonymity, check for l-diversity instead. This test is efficient because l-diversity is local to each q*-block and is based solely on counts.
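A minimal sketch of this "swap the check" idea: group the table into q*-blocks once, keep only per-block counts, and plug in either predicate. The function and column names are hypothetical, not from the paper's implementation:

```python
from collections import defaultdict, Counter

def blocks_by_quasi_id(rows, quasi_ids, sensitive):
    # Group the table into q*-blocks, keeping only sensitive-value counts;
    # both predicates below need nothing more than these counts, which is
    # why swapping the check into a k-anonymity algorithm is cheap.
    blocks = defaultdict(Counter)
    for row in rows:
        blocks[tuple(row[q] for q in quasi_ids)][row[sensitive]] += 1
    return blocks

def k_anonymous(counts, k):
    return sum(counts.values()) >= k   # block size at least k

def l_diverse(counts, l):
    return len(counts) >= l            # at least l distinct sensitive values

def table_satisfies(rows, quasi_ids, sensitive, check):
    return all(check(c) for c in
               blocks_by_quasi_id(rows, quasi_ids, sensitive).values())

rows = [
    {"zip": "130**", "condition": "flu"},
    {"zip": "130**", "condition": "flu"},
    {"zip": "148**", "condition": "flu"},
    {"zip": "148**", "condition": "cancer"},
]
print(table_satisfies(rows, ["zip"], "condition", lambda c: k_anonymous(c, 2)))  # True
print(table_satisfies(rows, ["zip"], "condition", lambda c: l_diverse(c, 2)))    # False
```

The same toy table is 2-anonymous yet not 2-diverse, which is exactly the gap the paper's homogeneity attack exploits.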
Strengths of the paper:
I liked that the experiments in the paper were on real-world data from the US Census and Lands End. It was intriguing (and concerning) that with the k-anonymized dataset, more than 1000 tuples had their sensitive value revealed. It was also surprising and good to see that l-diversity is low-maintenance to implement (replacing k-anonymity checks with l-diversity checks) and does not take more run time.
Limitations of the paper:
I would have liked to see more discussion on the implementation of l-diversity, and perhaps some sample code of how it is implemented. I would have also liked to see more discussion of what happens when l-diversity is not met in the data, and what steps can be taken to ensure privacy while still extracting useful information from the data in those cases.
This paper, titled "l-Diversity: Privacy Beyond k-Anonymity", was an interesting insight into a more security-related topic. It explores background knowledge and homogeneity attacks, which target k-anonymized, publicly released data. The paper explains an ideal solution to this problem called Bayes-optimal privacy. The authors then go on to explain why this is impossible in practice and that their own solution, l-diversity, can be used instead. This is an approach that requires data to be well distributed, such that even an adversary with background knowledge cannot de-anonymize the data.
This is a pretty good paper but its biggest drawback is related to its discussion of multiple sensitive attributes. Making data publicly available requires diversity in the data to retain anonymity. Having multiple sensitive attributes makes this exponentially more difficult because each new combination has to be well represented. This problem is mentioned in the paper but not thoroughly discussed. The reader should be told how difficult this problem is rather than (or in addition to) being told that "this problem may be ameliorated through tuple suppression and generalization on the sensitive attributes".
Despite this drawback this is a great paper. The authors describe the problem of homogeneity and background knowledge attacks on k-anonymized data. They provide the necessary theoretical background for understanding the ideal solution, its drawbacks, and a proposed solution for preventing these attacks. The authors then evaluate the algorithm they propose for l-diversity in terms of performance and utility which shows us that we can use this technique in practice to provide stronger privacy guarantees with publicly released data.
Part 1: Overview
This paper presents a way to measure data privacy called l-diversity, which is more powerful than k-anonymity. Ensuring that sensitive data are securely stored is very important. However, there is a trend among big organizations of publishing non-aggregated microdata. To prevent individuals from being identified from the microdata, uniquely identifying information such as SSNs should be removed from the tables. Attributes that can be linked with external data to uniquely identify individuals are called quasi-identifiers. The existing definition in the literature is k-anonymity: a table satisfies k-anonymity if every record in it is indistinguishable from at least k − 1 other records with respect to every set of quasi-identifier attributes.
They hide information behind entropy. A q*-block is l-diverse if it contains at least l well-represented values for the sensitive attribute, and a table is l-diverse if every q*-block is. However, entropy l-diversity may sometimes be too restrictive, according to their explanation. If some positive disclosures are acceptable, they claim to achieve even better utility. This reasoning led to a less conservative instantiation of the l-diversity principle called recursive (c,l)-diversity. For example, maybe a clinic should be allowed to disclose that some patient has a “heart problem,” because it is well known that most patients who visit that clinic have heart problems.
Part 2: Contribution
They give two examples in which simple attackers can extract information from k-anonymized data, which motivates the need for this paper: defining a stronger privacy measure. The definition is precise and clear.
They showed that their privacy measure actually withstands the homogeneity attack using the Lands End and Adult databases. They implemented the privacy-preserving data publishing system.
Part 3: Drawbacks
There are still gaps between l-diversity and the Bayes-optimal solution. The optimal solution is impractical given insufficient knowledge and multiple adversaries, which is why the authors relax its constraints and start from the assumption of strong background knowledge. There may be some more room for improvement here.
The paper discusses attacks on the k-anonymity privacy definition and proposes a new privacy definition called l-diversity, which addresses problems associated with k-anonymity. The problem is that publishing data online about individuals without revealing sensitive information is difficult. For example, a linking attack can reveal information by joining seemingly innocuous attributes, like age, birthdate, zip code, and medical condition, with another, less sensitive database that has additional attributes like name. This problem is addressed by k-anonymity, which generalizes the linking attributes so that each record is indistinguishable from at least k-1 other records. However, the authors showed that k-anonymity doesn’t guarantee privacy and demonstrated two possible attacks on it.
The first attack is called the homogeneity attack. The reason k-anonymity is prone to this attack is that it can create groups that leak information due to lack of diversity in the sensitive attributes. For example, if the anonymized records in a group all have the same sensitive medical condition, cancer, then this information is revealed. The second attack is called the background knowledge attack. Attackers can use background information to extract sensitive information. For example, if the anonymized record has either a viral infection or a heart disease condition, then based on the citizenship attribute it could be determined which one applies: if the citizenship is Japanese, since Japanese have an extremely low incidence of heart disease, the condition can be construed as a viral infection. k-Anonymity does not protect against such attacks based on background information.
The paper discussed an idealized notion of privacy called Bayes-optimal for the case where both the data publisher and the adversary have full (and identical) background knowledge. However, Bayes-optimal privacy is difficult to guarantee in practice. The authors proposed a practical new privacy definition called l-diversity, which addresses the attacks on k-anonymity. This method is based on the idea that at least l values of the sensitive attribute are well-represented within each group. Even using background information, it is difficult to determine the sensitive value, since several values remain plausible in each group.
The main strength of the paper is that it provides a novel technique that allows publishing information without revealing sensitive information. In addition, it reuses algorithms similar to k-anonymity’s with equal or better performance while addressing important attacks on that approach.
The main limitation of l-diversity is that it can lead to very large groups, which causes inefficiency. The authors attribute this to data skew, which they identify as an important avenue for future work. It would be interesting to see whether their future work addressed this issue.
This paper reveals some subtle but severe privacy problems of k-anonymity and proposes a novel and powerful privacy definition called l-diversity. Publishing data about individuals without revealing sensitive information or identifying a specific individual is important. In this paper, the authors propose two attacks on k-anonymity and show that it is susceptible to homogeneity and background knowledge attacks. The first is the homogeneity attack: k-anonymity can create groups that leak information due to lack of diversity in the sensitive attribute. The second is the background knowledge attack: k-anonymity does not protect against attacks based on background knowledge.
The key contribution of this paper is that it reveals the privacy issues of k-anonymized data sets and proposes a better privacy definition called l-diversity, together with an experimental evaluation showing that l-diversity is practical and can be implemented effectively.
The policy works fine for a single sensitive attribute but not for multiple sensitive attributes. Also, as mentioned in the paper, utility and privacy trade off against each other; too much attention to privacy can hurt the utility of the processed data set. The paper doesn't address enough how useful the data set remains after applying the l-diversity policy.
This paper introduced a new standard for publishing data called l-Diversity, which is stronger than k-Anonymity.
k-Anonymity is a widely used standard for privacy when publishing datasets containing sensitive data. But in this paper, the authors point out that it has at least two flaws. The first attack is the homogeneity attack: in a k-anonymous table there can be multiple entries with identical sensitive values, so although at least k entries match any given combination of non-sensitive data, information still leaks when all of them share the same sensitive value. The other is the background knowledge attack: people can be re-identified with the help of background knowledge.
So in this paper, the authors present l-diversity. The simplest form requires that for any quasi-identifier, there are at least l different sensitive values; this is called distinct l-diversity. For entropy l-diversity, a table qualifies when for every equivalence class E, Entropy(E) ≥ log(l). There is also recursive (c,l)-diversity.
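The entropy condition Entropy(E) ≥ log(l) is a short computation per equivalence class. This sketch is mine, with made-up class contents:

```python
from collections import Counter
from math import log

def entropy_l_diverse(sensitive_values, l):
    # Entropy(E) = -sum over s of p(s) * log p(s), computed on one
    # equivalence class E; the class passes if Entropy(E) >= log(l).
    counts = Counter(sensitive_values)
    n = len(sensitive_values)
    entropy = -sum(c / n * log(c / n) for c in counts.values())
    return entropy >= log(l)

# Three equally frequent values give entropy log(3) > log(2).
print(entropy_l_diverse(["flu", "cancer", "ulcer"], 2))        # True
# A skewed class (3 of 4 records identical) falls below log(2).
print(entropy_l_diverse(["flu", "flu", "flu", "cancer"], 2))   # False
```

The second call shows why entropy l-diversity is stricter than distinct l-diversity: the skewed class has two distinct values but still fails.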
This paper has two main contributions. The first is that it points out the weaknesses of k-anonymity. The second is that it presents a better standard for publishing sensitive datasets.
This paper didn’t provide much information on how to transform a dataset into an l-diverse one. I think it would be better if an automatic algorithm were provided.
L-Diversity brings up the very relevant situation where datasets with sensitive information such as medical records and results of experiments are released with k-anonymity. This paper points out two attacks that could result in data about specific individuals being identifiable from the anonymized data.|
The authors begin with the premise that data such as gender, date of birth, and zip code are publicly available information that can identify unique individuals; they call these values quasi-identifiers. They point out two kinds of attacks: a homogeneity attack, where the attacker matches a target's quasi-identifiers to a group whose sensitive values are all identical and thereby gains the sensitive information; and a background knowledge attack, where the attacker combines worldly knowledge with information he already has about the target in order to gain access to sensitive information. Therefore, k-anonymity is vulnerable to a lack of diversity within groups and to attacks based on background knowledge. One of the key points about l-diversity is that it is not susceptible to these attacks even when the data publisher is not aware of the kind of knowledge an attacker may possess. The authors specify a table T* that consists of q*-blocks; their requirement is that each block be l-diverse, i.e., contain l well-represented values for the sensitive attribute.
Even though this is typical of a security paper, one of the things I liked was that the authors have specified the ways in which their obfuscated table could be susceptible to attack based on knowledge and ways to solve them. I also thought that using Bayes-optimal privacy was a smart choice considering the initial premise they presented because they were able to represent both diversity and background knowledge effectively.
In the end, the authors mention situations where an l-diverse block could still violate the principle of l-diversity. A possible improvement that occurred to me is to break the published data into multiple parts, where the columns do not necessarily overlap for multiple sensitive values. In that manner, you would have two different sets of information to present, unless it is the kind of data that needs to be presented together. Another solution might be that you do not need to represent individuals with specific properties at all: ranges, maxima, or averages could be published for specific reporting requirements.
l-Diversity: Privacy Beyond k-Anonymity paper review|
In this paper, the authors describe the limitations of the k-anonymity privacy standard and show that it is vulnerable to an attacker who has background information about the sensitive data. K-anonymity is defined as follows: information for each person contained in the released table cannot be distinguished from that of at least k-1 other individuals whose information also appears in the release. If you try to identify a value in the sensitive column while knowing only the non-sensitive columns, you will find at least k rows in the table matching your query, each with its own sensitive value. However, there are two possible attacks:
Homogeneity Attacks: if a certain combination of non-sensitive columns corresponds to only one kind of sensitive value, then anyone who knows the non-sensitive columns can predict the sensitive value;
Background Knowledge Attacks: even if multiple sensitive values match the combination of non-sensitive columns, one can still use background knowledge to filter out impossible sensitive values and arrive at the correct one.
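The homogeneity attack above is easy to demonstrate mechanically; the records, quasi-identifier generalizations, and disease values in this sketch are hypothetical, not taken from the paper:

```python
from collections import defaultdict

# Hypothetical rows of a 3-anonymized table: the first two fields are the
# generalized quasi-identifiers, the last is the sensitive attribute.
records = [
    ("20-29", "130**", "heart disease"),
    ("20-29", "130**", "heart disease"),
    ("20-29", "130**", "heart disease"),
    ("30-39", "148**", "cancer"),
    ("30-39", "148**", "flu"),
    ("30-39", "148**", "heart disease"),
]

# Group records into q*-blocks by their quasi-identifier values.
blocks = defaultdict(list)
for age, zip_prefix, disease in records:
    blocks[(age, zip_prefix)].append(disease)

# A block whose sensitive values are all identical leaks them outright:
# anyone known to be in the block is known to have that disease.
leaky = [qid for qid, vals in blocks.items() if len(set(vals)) == 1]
print(leaky)  # the ("20-29", "130**") block reveals everyone's disease
```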
Because of the above shortcomings of the k-anonymity privacy standard, the authors put forward the new design of l-diversity, based on two privacy principles:
Positive Disclosure: Publishing the table T ⋆ that was derived from T results in a positive disclosure if the adversary can correctly identify the value of a sensitive attribute with high probability
Negative disclosure: Publishing the table T ⋆ that was derived from T results in a negative disclosure if the adversary can correctly eliminate some possible values of the sensitive attribute (with high probability)
And l-diversity states: in the case of positive disclosures, Alice wants to determine Bob’s sensitive attribute with very high probability. Equation 3 in the paper captures the case where almost all tuples have the same sensitive value, so the posterior belief is almost 1. To ensure diversity and guard against Equation 3, a q⋆-block is required to have at least l ≥ 2 different sensitive values such that the l most frequent values (in the q⋆-block) have roughly the same frequency. We say that such a q⋆-block is well-represented by l sensitive values.
To sum up, the authors also present data supporting the claim that the performance and utility differences between l-diversity and k-anonymity are minor, while l-diversity provides stronger privacy. However, l-diversity still has a major drawback: it can only ensure the privacy of one data column, and when more than one column is sensitive, there is no straightforward way around this.
In this paper, the authors present a new privacy definition called l-diversity to deal with the problems k-anonymity has, and provide a method to avoid potential problems.|
A table satisfies k-anonymity if every record in the table is indistinguishable from at least k-1 other records with respect to every set of quasi-identifier attributes; such a table is called a k-anonymous table. However, two simple attacks can compromise a k-anonymized dataset.
(1) Attackers can discover the values of sensitive attributes when there is little diversity in those attributes, for example when all k - 1 other indistinguishable records share the same value.
(2) Attackers often have background knowledge, and the authors show that k-anonymity does not guarantee privacy against attackers using background knowledge. For example, an attacker can narrow the sensitive value down to a smaller range of possibilities.
The paper introduces a new privacy definition called l-diversity, a form of group-based anonymization that preserves privacy in data sets by reducing the granularity of the data representation. In the anonymized data, each group must have at least l “well-represented” values for the sensitive attribute to guarantee privacy against attackers' background knowledge.
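Under the simplest reading of "well-represented" (distinct values), the per-group requirement described above might be sketched as follows; the example values are invented:

```python
def distinct_l_diverse(sensitive_values, l):
    """Weakest instantiation of l-diversity for one group: the group
    must contain at least l distinct sensitive values."""
    return len(set(sensitive_values)) >= l

print(distinct_l_diverse(["flu", "flu", "cancer"], 2))  # True
print(distinct_l_diverse(["flu", "flu", "flu"], 2))     # False
```

The paper's entropy and recursive (c,l) instantiations strengthen this by also constraining how evenly the values are distributed.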
The paper introduces the weaknesses of k-anonymity using two simple attack examples, then introduces the new privacy definition to deal with these problems. l-diversity is easy to understand, and the paper uses experimental results to show that it is practical and can be implemented efficiently.
l-diversity can protect against homogeneity attacks, but it cannot protect privacy when the sensitive attribute values are distinct but semantically similar. There should be some strategy for semantically analyzing the meaning of sensitive values.
This paper discusses the issue of anonymity in public datasets. These datasets are often released for research or business purposes and may contain sensitive information about individuals. To prevent adversaries from identifying sensitive data about individuals, techniques that provide k-anonymity have been used. However, the authors demonstrate viable attacks against k-anonymity that take advantage of homogeneity of the data and background information. The authors introduce a new definition of privacy called "L-diversity". L-diversity protects against these kinds of attacks and the authors show that it can be used in practice.|
The authors classify attributes into two groups: sensitive and non-sensitive. The intuition behind L-diversity is that if an adversary queries the data using non-sensitive information, they should get sensitive attribute values that form a distribution with at least "l" well-represented values. This prevents the data from being homogenous and requires that the attacker be able to eliminate "L-1" values in order to discover sensitive data.
L-diversity is useful because it protects against attacks that are used against k-anonymity, the level of protection it provides is intuitively understandable, and it is usable in practice. However, as the authors admit, L-diversity is not rigorously defined in the mathematical sense as the phrase "well-represented" is not clearly defined. While they provide two "instantiations" of the principle, they don't have a mathematically rigorous generalization that captures both. It is also not clear whether alternative views of the data can be provided under l-diversity. The authors provide a way to process a dataset to provide L-diversity, but this may require discarding data tuples or attributes from the dataset. If alternative views of the data are generated by dropping different sets of tuples or attributes, can these be combined to undermine the privacy provided by L-diversity?
This paper describes why k-anonymity is not enough in modern databases and introduces the concept of l-diversity. A study has shown that about 87% of the population of the United States can be uniquely identified using characteristics such as gender, date of birth, zip code, etc. Therefore, a concept called k-anonymity was established. A table is k-anonymous if every record in the table is indistinguishable from at least k – 1 other records in the table. However, there are still attacks that can be carried out on k-anonymous tables. If Alice and Bob know each other and one day Bob falls ill, Alice is able to find out what illness Bob has through background information that she has on Bob.|
To counteract this phenomenon, we can create an l-diverse table. First, we define a q*-block to be a block whose non-sensitive attributes generalize to q*. Then, we can create an l-diverse table by making each q*-block in the table l-diverse. For a block to be l-diverse, there must be l well-represented values for the sensitive attribute within the block. Now, Alice will not be able to find out Bob’s disease through a process of elimination because there are too many possibilities in every q*-block.
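A table-level check along these lines could look like the following sketch, using the distinct-values reading of well-represented; the column names and rows are illustrative assumptions, not from the paper:

```python
from collections import defaultdict

def is_l_diverse_table(rows, qid_cols, sensitive_col, l):
    """Check table-level l-diversity under the simplest ("distinct")
    reading of well-represented: every q*-block must carry at least
    l distinct sensitive values. Rows are dicts keyed by column name."""
    blocks = defaultdict(set)
    for row in rows:
        key = tuple(row[c] for c in qid_cols)   # the q* value of this row
        blocks[key].add(row[sensitive_col])
    return all(len(values) >= l for values in blocks.values())

table = [
    {"age": "2*", "zip": "130**", "disease": "flu"},
    {"age": "2*", "zip": "130**", "disease": "cancer"},
    {"age": "3*", "zip": "148**", "disease": "flu"},
    {"age": "3*", "zip": "148**", "disease": "heart disease"},
]
print(is_l_diverse_table(table, ["age", "zip"], "disease", 2))  # True
```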
The positives of this paper are the following:
1. It provides a real world example about what attacks can occur on k-anonymous tables and provides a solution to resolve it.
2. The experiments provide a comparison between the k-anonymous tables and the l-diverse tables.
Even though the paper gives a good new way of releasing tables that can be anonymous, I still have a few concerns about the paper:
1. If the table is big enough, isn’t k-anonymity enough? With high probability, there will be multiple values for the sensitive data and l-diversity isn’t needed.
2. The paper has a lot of math about what k-anonymity and l-diversity are, but not a lot of examples. I would have liked to see more examples about the differences between the two.
This paper discusses l-diversity as an improvement over the level of privacy offered by k-anonymity. Privacy is an important consideration in data analysis. Companies and researchers want to be able to access census information and medical records in order to identify trends and patterns in data, but this data frequently contains sensitive information that needs to be protected. One of the first approaches to this was k-anonymity, a scheme in which all immediately identifying information is removed from data, and fields such as zip code and birth-date, which can act as quasi-identifiers, are generalized until the quasi-identifying fields of each tuple are identical to the quasi-identifying fields in at least k-1 other tuples in the table. The authors refer to a block of tuples whose quasi-identifying fields are identical as a q* block. Unfortunately, this technique is still prone to homogeneity attacks, in which a lack of diversity in sensitive fields causes information to be leaked, and background knowledge attacks, in which attackers with instance-level background knowledge can match people to their sensitive information with reasonable probability by using prior knowledge to rule out other tuples that are identical except for the sensitive field.|
In order to strengthen the privacy of released data, the authors of this paper introduce l-diversity. If a q* block of a table contains at least l ≥ 2 sensitive values that appear with roughly equal frequency, the authors call this block well-represented by l sensitive values and say that it is l-diverse. A table is considered to be l-diverse if each q* block is l-diverse. An individual piece of relevant background information allows an attacker to rule out one of the possible values of the sensitive condition within a q* block. Therefore, l-diversity provides a stronger level of privacy than k-anonymity by requiring an attacker to have at least l-1 pieces of relevant background information to be able to match someone to the sensitive information in the tuple representing them with reasonable probability. The authors point out that it is impossible to fully protect information from background attacks, but, by requiring l-diversity, they eliminate the possibility of homogeneity attacks and make background information attacks more difficult.
Throughout the paper, the authors define several different types of diversity, namely entropy diversity, recursive (c, l)-diversity, positive disclosure-recursive (c, l)-diversity, and negative/positive disclosure-recursive (c1, c2, l)-diversity. These different types of diversity describe the level of protection against various forms of information leakage, such as positive disclosure, in which an attacker is able to positively identify a sensitive attribute with high probability, and negative disclosure, in which an attacker is able to eliminate some possible values of the sensitive attribute with high probability. The authors also note that we must consider cases in which there are multiple sensitive attributes. In these cases, each sensitive attribute must be treated individually, considering all other sensitive attributes as quasi-identifiers for the purpose of preserving privacy and diversity for every individual sensitive field.
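Of the variants listed, recursive (c, l)-diversity can be sketched directly from its definition: with the value frequencies sorted as r_1 ≥ … ≥ r_m, require r_1 < c(r_l + … + r_m). The example values below are invented:

```python
from collections import Counter

def recursive_cl_diverse(sensitive_values, c, l):
    """Recursive (c, l)-diversity for one q*-block: with frequencies
    r_1 >= ... >= r_m, require r_1 < c * (r_l + ... + r_m), i.e. the
    most frequent value must not dominate the tail of the distribution."""
    freqs = sorted(Counter(sensitive_values).values(), reverse=True)
    if len(freqs) < l:
        return False                      # fewer than l distinct values
    return freqs[0] < c * sum(freqs[l - 1:])

values = ["flu"] * 5 + ["cancer"] * 3 + ["ulcer"] * 2
print(recursive_cl_diverse(values, 3, 2))  # 5 < 3*(3+2): True
print(recursive_cl_diverse(values, 1, 2))  # 5 < 1*(3+2) fails: False
```

Raising c makes the condition easier to satisfy, which matches the reviews' observation that (c,l)-diversity is more forgiving than the entropy variant for skewed data.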
My primary concern with this paper was that many of the different ways of measuring diversity (entropy vs. recursive vs. positive disclosure recursive, etc.) were covered very quickly and in very little detail. Even if readers were to spend a great deal of time examining the equations for each, they would be unlikely to understand all the reasons to use one measure of diversity over another. Additionally, I found it interesting that, when the authors measured the utility of k-anonymous vs. l-diverse datasets, they often found that the l-diverse datasets had lower utility, having traded some usefulness for privacy. This is intuitive, but indicates that this scheme could be further refined in a way that provides the added privacy of l-diversity without sacrificing utility.
This paper introduces l-diversity, a privacy definition that can provide more secure tables. The motivation for proposing l-diversity is that a popular definition called k-anonymity can be attacked in some situations. Thus, this paper discusses the situations in which k-anonymity is not secure, and then the new l-diversity definition.|
First, the paper talks about k-anonymity and two attacks on this definition. In a k-anonymous dataset, each record is indistinguishable from at least k-1 other records with respect to certain identifying attributes. However, two attacks can compromise a k-anonymous dataset: the homogeneity attack and the background knowledge attack. In the homogeneity attack, k-anonymity can create groups that leak information due to a lack of diversity in the sensitive attributes; in the background knowledge attack, datasets can be attacked using outside knowledge the adversary already holds. Thus, k-anonymity is not secure enough to prevent attacks.
Second, the paper introduces a new secure-dataset definition called l-diversity, which has several advantages: it no longer requires knowledge of the full distribution of the sensitive and nonsensitive attributes, it does not require the data publisher to have as much information as the adversary, and instance-level knowledge is automatically covered.
The strength of the paper is that it provides detailed proofs of the principles behind the l-diversity definition, which makes the authors' ideas and the underlying details clearer to readers. In addition, before introducing l-diversity, the paper first discusses the situations in which k-anonymity is not secure, which lets readers understand the motivation for l-diversity.
The weakness of the paper is that it provides few real dataset examples to illustrate l-diversity when describing its principles. While the mathematical proofs provide theoretical foundations, I think it would be better to include some dataset examples to illustrate the idea.
To sum up, this paper proposes l-diversity, which is a privacy definition that can provide more secure tables.
This paper is an introduction to l-Diversity, which is a privacy metric supposedly better than k-anonymity. K-anonymity is meant to provide anonymity such that every record is indistinguishable from k-1 other records. The purpose of this is so that you can store data in a table and still have the information you need, but an attacker cannot figure out who each row is about. As you can imagine this is helpful in many settings, one example is in the healthcare field. |
K-anonymity is achieved by hiding certain data fields and obscuring others to encompass a wider range of results, in order to keep the sensitive data columns separate from the individuals they correspond to. The paper has a great example of this at the top of page 2. The problem this paper starts out with is that k-anonymity is vulnerable to two types of attacks: homogeneity attacks and background knowledge attacks. Homogeneity attacks are when all the entries in a group share the same sensitive value, so with limited knowledge you can be assured which sensitive value corresponds to an individual. Background knowledge attacks work somewhat the same way but often require more background knowledge of the individual to distinguish them. For example, if age is grouped into buckets and zip codes are cut off after the first 3 digits, but you know someone’s age and zip code and potentially other obscured information about them, you can eliminate enough rows to figure out which sensitive data they correspond to. Those are the problems with k-anonymity that this paper points out and sets out to solve.
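The bucketing-and-truncation style of generalization described above can be sketched as follows; the field layout and cut-offs are illustrative assumptions, not the paper's algorithm:

```python
def generalize(record):
    """Generalize quasi-identifiers the way the review describes:
    bucket age into decades and keep only the first 3 zip digits.
    The (age, zipcode, disease) layout is hypothetical."""
    age, zipcode, disease = record
    decade = (age // 10) * 10
    age_bucket = f"{decade}-{decade + 9}"
    zip_prefix = zipcode[:3] + "**"
    return (age_bucket, zip_prefix, disease)

print(generalize((27, "13068", "flu")))  # ('20-29', '130**', 'flu')
```

Applying this to every row makes records within the same bucket/prefix combination indistinguishable, which is exactly what both k-anonymity and l-diversity build on.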
L-diversity is pretty similar to k-anonymity if you ask me. It is basically obscuring fields again, such that there are l well-represented values for the sensitive attribute S in every q* block. A table is l-diverse if every q* block that makes up the table is l-diverse. Figure 3 on page 7 of the paper is a pretty good example of l-diversity, and if you look at it and think of how much information you would need to identify something sensitive, it is more than in the k-anonymity example over the same data, but I am somewhat skeptical that’s not just a poor obscuring job in the k-anonymity example.
I was not super impressed with the results in this paper, but perhaps I am missing something. I was expecting l-diversity to be better by at least an order of magnitude, given that k-anonymity was somewhat slandered in the beginning of the paper and they claimed to have a solution. It was not, however, an order of magnitude better, and by most metrics not even close; they were pretty comparable actually. Also, I’m not good with theory papers generally, and this paper had a fair bit of mathematical theory behind it.
Overall I think this was an interesting paper. I’m not sure the issue it is addressing is a massive issue, and I’m also not sure that their solution is really much better. I guess it’s an improvement, so that’s good, but not major enough to get me excited, and I was not thoroughly convinced that the issues of k-anonymity are exploited often anyway. Overall, this was a solid paper and worth the read, but I’m not particularly convinced it is a game changer (though its high citation count would disagree with me).
This paper appears to be different from many of the papers we read before; the motivation behind this publication is to showcase the insecurity of “k-anonymity,” a concept that relies on each individual’s record being indistinguishable from at least k-1 other individuals’ records. For databases that do not explicitly name records, the fact remains that many individuals can be uniquely identified with attributes such as DOB, zip code, gender, etc. Attackers with background knowledge of intended targets can use their own information to leverage an attack to gain information on a specific individual. The authors of this paper demonstrate the insecurity of this concept and propose a new solution called l-diversity to extend k-anonymity with further security.|
The first breach of privacy comes from a homogeneity attack: in a k-anonymous table, while certain sanitized attributes may hide information, a lack of diversity in some sensitive attribute means that if its value is the same across a set of k-anonymized records, the sensitive value for ALL k records can be exactly predicted. The second attack, the background knowledge attack, leverages information (e.g. statistical information) about quasi-identifier attributes to narrow the search space for the desired sensitive attribute. The example they give is that “heart disease occurs at reduced rates in Japanese patients,” which lets an attacker rule out heart disease for a suspected Japanese victim and narrow down their actual condition. Given these two attacks, there needs to be a better method to ensure privacy even with these possibilities.
Thus they begin to build their model under the assumption that an attacker has access to i) the anonymized table T*, ii) instance-level background knowledge (i.e. the “neighbor” example), and iii) demographic background data (i.e. the “Japanese demographic” example). Their first goal is to reason about privacy with Bayesian inference techniques. For example, an attacker’s prior (estimates given only background knowledge) should not be much different from the posterior (estimates that can be boosted given access to the anonymized table T*). If the attacker can eliminate a range of possibilities or guess the value of some sensitive attribute, then information is disclosed and the point is lost. However, exactly implementing Bayes-optimal privacy is difficult in practice, since the data is assumed to be a random sample of the population and it may be difficult to replicate the distribution of some sensitive value across an entire population.
My understanding is that this reason, coupled with the difficulty of modeling instance-level background knowledge, is why a data source cannot simply populate T* with dummy values to “insert” diversity into the dataset. However, I am unsure whether or not that idea is immediately flawed; it seems like a good one to me. The idea they propose, however, is a redefinition of how to partition a k-anonymous table. A table T* would be l-diverse if each partition q* has l “well-represented” values within the partition. Then, if an attacker has access to some partition q*, they need l-1 different pieces of damaging background knowledge to even force a positive disclosure (pin down the sensitive value), and negative disclosure is resolved inherently.
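The l-1 pieces-of-background-knowledge argument can be illustrated with a toy sketch, assuming each instance-level fact rules out exactly one candidate sensitive value; the disease names are invented:

```python
def remaining_candidates(block_values, ruled_out):
    """Sensitive values an attacker still cannot distinguish after
    applying instance-level background knowledge, where each known
    fact eliminates one candidate value from the q*-block."""
    return set(block_values) - set(ruled_out)

block = {"flu", "cancer", "heart disease"}  # a 3-diverse q*-block

# One fact leaves two candidates: no positive disclosure yet.
print(remaining_candidates(block, {"cancer"}))
# l-1 = 2 facts finally pin the value down.
print(remaining_candidates(block, {"cancer", "flu"}))
```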
The first variation, the naive “distinct l-diversity,” does not prevent probabilistic inference attacks, which I think is why they go on to describe the other instantiations of l-diversity. To prevent probabilistic inference attacks, they want the entropy of the sensitive-value distribution in each block to be at least log(l). However, making a table entropic with factor log(l) may be too expensive and inefficient in practice. Recursive (c,l)-diversity essentially requires that the sensitive values of a q* block not be dominated by a single value: the most frequent value must occur fewer than c times as often as the combined count of the l-th through least frequent values. However, this may reveal some information regarding certain attributes (which may be okay, they say, if the attribute is something like “healthy”). I am not sure how I feel about this point, since saying “revealing some attributes is okay” basically goes back on the whole point of securing T* in the first place. Though I can see its benefit in that it gives you some choice over which subset of attributes to make insecure.
I think this paper made some interesting points, especially in the evaluation of k-anonymity as an insecure platform for data privacy. L-diversity does look like a more secure privacy methodology as well, but I wonder if there is a tradeoff between having to manually partition and enforce l-diversity and simply inserting dummy records to avoid information breaching. Regardless, the reasoning they gave for bayesian privacy made sense, but the paper discusses mostly in the case of one attribute; what about correlation between sensitive attributes, and information that can be gained from identifying causal factors in sanitized data?
This paper proposes a new privacy definition called “l-diversity”. In the data privacy field, the k-anonymity method is still vulnerable to breaches. While k-anonymity emphasizes maintaining the anonymity of the data in the table, l-diversity stresses maintaining a diverse distribution of generalized data throughout the table. This paper shows the weaknesses of the k-anonymity method and presents the l-diversity method (and how it actually guarantees more privacy than k-anonymity). |
The paper starts by explaining how k-anonymity is vulnerable to the homogeneity attack (where the adversary knows several pieces of non-sensitive information which, when combined, form a significant clue) and the background knowledge attack (where the adversary knows general conditions that apply to the person whose data is being breached). The paper then introduces its initial model and notation. After that, it describes Bayes-optimal privacy, an ideal notion of privacy that cannot be achieved in practice because it requires complete information about the world (the distribution of sensitive values, knowledge of the adversary’s level of knowledge, instance-level knowledge, and the possibility of multiple adversaries). Then the paper defines the l-diversity principle, in which “a q*-block is l-diverse if it contains at least l ‘well-represented’ values for the sensitive attribute S. A table is l-diverse if every q*-block is l-diverse”. l-diversity bypasses the Bayes-optimal limitations because it no longer requires knowledge of the full distribution of sensitive values, no longer requires the data publisher to have all the adversary’s knowledge or the instance-level knowledge, and no longer needs to account for the possibility of multiple adversaries. Next, it discusses implementing privacy-preserving data publishing, explaining that Bayes-optimal privacy does not satisfy the required monotonicity property. Experiments with real-world data (the Adult Database and the Lands End Database) show how l-diversity performs (in terms of functionality and performance) compared to k-anonymity. The paper briefly mentions the topic of multiple sensitive attributes, which is open for future work.
There are several contributions in this paper. First, it shows that k-anonymity can be breached due to a lack of diversity in the sensitive attributes. Second, and most important, it explains the concept of l-diversity, which not only gives stronger privacy guarantees but is also practical and easy to use (note that it is easy to adapt a k-anonymity algorithm to l-diversity). With l-diversity, it is also possible to achieve security while bypassing the limitations of Bayes-optimal privacy.
However, what if the data distribution is not uniform? We cannot expect to find uniformly distributed data in the real world. While the experiments used real-world data, the paper does not explain the distribution of the data beyond the diversity domain size. Also, in the example, I still do not understand the generalization decision for the age group: when it comes to data ranges, how did the algorithm come to that decision?
The purpose of this paper is to show that there are major security holes in an existing method, called k-anonymity, thought to protect sensitive user data. The authors then present a new privacy definition called l-diversity, along with an implementation and performance comparisons. |
The technical contributions of this paper are numerous. First, it presents two cases in which an existing, highly used method is vulnerable to attacks and sensitive data compromise. It then presents a formal definition of the problem. Additionally, the paper goes beyond the Bayes-optimal policy, which is not feasible in a real system due in part to the fact that the policy lacks monotonicity. They present l-diversity, which ensures that every block of data we define has at least l well-represented values of the sensitive attribute. Additionally, the method they present to create these blocks with l-diversity has numerous advantages, in that it handles all cases of instance-level knowledge that an attacker might have and does not require the data publisher to know as much as the adversary might. Finally, this paper contributes an implementation of l-diversity as well as performance results, in terms of defending against attacks as well as runtime performance.
I think the paper is very strong. We expect these kinds of security principles to have a strong grounding in theory (and be provably effective), and this paper does an excellent job of presenting concepts from the theory up. Though it is heavily grounded in theory, the paper also presents the reader with several real-world examples so as not to lose them in the symbols and proofs. Additionally, all of the assumptions made in various sections of the paper are clearly stated and justified to the reader. I think this is key in a security-based system, since any assumption that differs from what might happen in the real world is crucial to acknowledge.
As far as weaknesses go, I don’t think this paper has many. I wish there were more of a wrap-up discussion in the conclusion section, but that isn’t a weakness of the concept itself, only of the paper’s formatting. I think l-Diversity is a strong concept that is provably better than k-anonymity.
This paper mainly proposes l-Diversity, a better privacy definition than the widely used k-Anonymity. k-Anonymity uses quasi-identifiers to protect private data. The paper introduces two attack examples against k-Anonymity to demonstrate its vulnerability: the first, called the Homogeneity Attack, exploits k-Anonymity’s lack of diversity in the sensitive attribute; the second, called the Background Knowledge Attack, exploits the fact that k-Anonymity does not protect against attacks based on background knowledge. Moreover, this paper gives several definitions for the notation of the privacy model, including basic notation, the anonymized table, domain generalization, etc. Besides k-Anonymity, it also introduces the privacy definition of Bayes-optimal privacy as another point of comparison for the proposed l-Diversity definition. Bayes-optimal privacy has many limitations because it is built on two simplifying assumptions; one limitation is insufficient knowledge, since it is hard to know the sensitive and non-sensitive distributions of the data and the knowledge held by adversaries. l-Diversity mainly overcomes this weakness: it does not require the data publisher to know the adversaries’ information. The paper defines a table as l-diverse if every block it contains has l well-represented values for the sensitive attribute, where “well-represented” is defined according to the desired level of anonymity. To implement l-diversity, entropy, recursive diversity, or positive/negative disclosure-recursive diversity can be used. This paper also gives a complete set of experiments measuring the difference between l-diversity and k-anonymity, showing that l-diversity addresses all the attack issues while having comparable performance to k-Anonymity.
1. This paper introduces l-diversity, a privacy definition that offers better protection than k-anonymity and is more practical than Bayes-optimal privacy.
2. This paper gives a thorough introduction to its background and related work, with comprehensive coverage of k-anonymity and Bayes-optimal privacy. Moreover, it carefully defines the key notation, which is very helpful for understanding the subsequent sections.
3. This paper conducts a solid experimental comparison between l-diversity and k-anonymity.
1. One weakness of l-diversity is that its anonymization performance is worse than k-anonymity’s, especially for the entropy-based instantiation.
2. l-Diversity still depends on the distribution of the sensitive values for its diversity, so the definition becomes weak when the data is skewed.
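The skew weakness can be illustrated numerically (my own sketch, not from the paper): when one sensitive value dominates the data, the entropy of every block is capped by the entropy of the whole table, so no partitioning can reach the log(l) threshold.

```python
import math
from collections import Counter

def block_entropy(sensitive_values):
    """Shannon entropy (natural log) of a block's sensitive-value distribution."""
    counts = Counter(sensitive_values).values()
    n = len(sensitive_values)
    return -sum((c / n) * math.log(c / n) for c in counts)

# 98 of 100 records share one sensitive value: entropy ~0.11, far below
# log(2) ~0.69, so this table cannot be made entropy 2-diverse at all.
skewed = ["healthy"] * 98 + ["hiv", "cancer"]
print(block_entropy(skewed) < math.log(2))  # True
```

This is why the recursive (c,l)-diversity variants exist: they relax the entropy requirement, but as the review notes, they do not cure the underlying homogeneity of the data.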
This paper discusses the privacy problems of k-anonymized data sets and proposes an enhanced privacy definition called l-diversity. l-Diversity preserves privacy by reducing the granularity of the data representation, sacrificing some effectiveness of data management and mining algorithms in order to gain privacy. It builds on the k-anonymity model, which guarantees that each record is indistinguishable from at least k−1 other records. However, k-anonymity can be compromised by at least two attacks: the homogeneity attack and the background knowledge attack. The homogeneity attack exploits the case where all values of the sensitive attribute within a set of k records are identical; in such cases, even though the data has been k-anonymized, the sensitive value for the set of k records can be predicted exactly. The background knowledge attack takes advantage of relations between one or more quasi-identifier attributes and the sensitive attribute to narrow the set of possible sensitive values. l-Diversity mainly leverages the technique of domain generalization: the ultimate goal is to conceal each individual tuple within an appropriately constructed group, in such a way that an attacker cannot easily reason about the participation of individuals in the group. The paper first introduces an ideal privacy notion called Bayes-optimal privacy and shows that it naturally leads to the practical definition of l-diversity. l-Diversity provides privacy even when the data publisher does not know what kind of knowledge is possessed by the adversary. The main idea behind l-diversity is the requirement that the values of the sensitive attribute be well-represented in each group.
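The “well-represented” requirement can also be instantiated as recursive (c,l)-diversity, which the paper defines as follows: with the sensitive-value counts of a block sorted descending as r_1 ≥ … ≥ r_m, the block is (c,l)-diverse if r_1 < c · (r_l + … + r_m). A minimal sketch of this check (my own code, parameters chosen for illustration):

```python
from collections import Counter

def recursive_cl_diversity(sensitive_values, c, l):
    """Recursive (c, l)-diversity for one block: the most frequent sensitive
    value must be outweighed by c times the tail starting at the l-th count."""
    counts = sorted(Counter(sensitive_values).values(), reverse=True)
    if len(counts) < l:
        return False  # fewer than l distinct values can never qualify
    return counts[0] < c * sum(counts[l - 1:])

# Counts are [5, 2, 1]: 5 < 3 * (2 + 1), so (3, 2)-diverse...
block = ["flu"] * 5 + ["hiv"] * 2 + ["cancer"]
print(recursive_cl_diversity(block, c=3, l=2))  # True
# ...but 5 < 3 * 1 fails, so not (3, 3)-diverse.
print(recursive_cl_diversity(block, c=3, l=3))  # False
```

Intuitively, larger c makes the condition easier to satisfy, which is why the recursive variant is more forgiving than entropy l-diversity on skewed data.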
The strength of this paper is its observation of the privacy weaknesses of k-anonymity. The paper reveals that k-anonymity can be compromised by the homogeneity attack and the background knowledge attack, giving insight into the directions in which k-anonymity can be improved.
Though the authors claim that l-diversity can be implemented efficiently, the performance of the anonymization process is not satisfactory.