My goal here is to propose a research challenge for Large Language Models (LLMs), a major AI technology these days. This challenge should (a) use AI methods to provide power to individuals, making their lives better in a tangible way; and (b) drive research in LLM applications to improve the state of the art in AI. A bonus will be if (c) progress in this area results in a successful, profitable, and influential enterprise.
Everyone who explores websites encounters Privacy Policies, which explain what a website's owners do with the personal data collected from its users. These policy statements are typically very hard to read, due both to dense and difficult legal language and, often, to dense and difficult fonts. Getting to the interesting content of a website typically requires agreeing to the Privacy Policy, so the vast majority of users simply click the “Agree” button without reading the actual Privacy Policy. The website often states that clicking “Agree” creates a legally binding contract, but it is not clear whether this is legally correct.
Nonetheless, many people (including me) are uncomfortable clicking “Agree” to a contract they have not read, possibly exposing themselves to some unknown liability. My life would be better, in a small but tangible way, if I could use an AI tool to read each Privacy Policy for me and advise me (in a way I can trust) whether it is OK to click “Agree”. This is subgoal (a) from the first paragraph.
The research challenge is to build and train an LLM that would read and understand Privacy Policies. We’ll call this LLM-based system the Privacy Policy Expert (PPE). Many PPE tasks overlap with tasks involved in understanding the closely related Terms of Service (ToS) document, suggesting that after building the PPE, we go on to build a ToSE.
The first major task of the PPE is to identify commonalities among Privacy Policies. To a human reading these policies, it is clear that most clauses and paragraphs are perfectly acceptable. I promise not to use their website to break laws, to do harm to others, or to damage their property or resources. They promise to do their best to satisfy my interest in their website without assuming any liability for any failures on their part.
The research challenge in this first task is to identify these commonalities across significant differences in the legal language used. The PPE should also be able to summarize and explain these to the user in clear and easily understood language, at various levels of detail according to the user’s questions. This requires genuine understanding and reasoning with legal language and terminology, beyond what is achievable by recognizing common sequences of words. The need to do this understanding and reasoning pushes the state of the art in AI, which is subgoal (b) in the first paragraph above.
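To make this first task concrete, here is a minimal sketch of how the commonality step might look. It assumes an instruction-following LLM behind a hypothetical call_llm() helper; the clause categories and prompts are illustrative assumptions, not a specification of the PPE.

```python
# A minimal sketch of the commonality task. call_llm() is a hypothetical
# placeholder for whatever LLM the PPE would use; the category list and
# prompts are illustrative assumptions, not part of the original proposal.

COMMON_CATEGORIES = [
    "data collected", "purpose of use", "third-party sharing",
    "cookies and tracking", "data retention", "user rights",
    "security practices", "limitation of liability",
]

def call_llm(prompt: str) -> str:
    """Placeholder: wire this to a real instruction-following model."""
    raise NotImplementedError

def split_into_clauses(policy_text: str) -> list[str]:
    # Crude segmentation: treat each blank-line-separated paragraph as a clause.
    return [p.strip() for p in policy_text.split("\n\n") if p.strip()]

def label_clause(clause: str) -> str:
    # Map a clause onto a known-common category, or "OTHER" if none fit.
    prompt = (
        "Classify this privacy-policy clause into exactly one of these "
        f"categories: {', '.join(COMMON_CATEGORIES)}. "
        "Answer OTHER if none apply.\n\n" + clause
    )
    return call_llm(prompt).strip()

def summarize_policy(policy_text: str, detail: str = "two sentences") -> str:
    # Plain-language summary at a user-chosen level of detail.
    prompt = (
        f"Summarize this privacy policy in {detail}, using plain, "
        "non-legal language a typical web user can understand:\n\n"
        + policy_text
    )
    return call_llm(prompt)
```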
The second major task of the PPE is to identify anomalies in a Privacy Policy. Often, these are unusual provisions within the policy. One legendary Privacy Policy included a deeply buried paragraph suggesting that the reader send a message to a certain email address to be entered into a lottery for a $10,000 prize. One reader did read that paragraph, sent the email, and a few months later received a check for $10,000, as the only entrant to that lottery. (I doubt this will be repeated, but wouldn’t it be nice to have a PPE to find that paragraph and suggest that you read it?)
Far more likely, the Privacy Policy may include a paragraph saying that the owner promises to protect your personal information, but that in case of bankruptcy and sale of the website, your personal information is among the assets to be sold to cover debts, and the buyer will have no obligation to protect it. Or that any information you enter on the website becomes the intellectual property of the owner. Or something else that might violate state, national, or international laws protecting users, including you. Identifying such cases contributes to subgoal (a) from the first paragraph.
Of course, the ability to recognize such cases across differences in legal language is another contribution to subgoal (b). This requires the ability to understand the “legal truth” of a particular statement, not just whether that statement commonly appears in the training data. Critically reading statements for their correctness is an important capability that we want from AI, that goes beyond what LLMs today can provide.
Fortunately, we don’t have to make this major advance in AI understanding for the PPE to be useful. We expect the PPE to identify anomalies that require human examination and understanding. Assume that 90% or more of a typical Privacy Policy consists of perfectly reasonable commonalities. It is a big step forward for the PPE to highlight the remaining 10% for the user to examine and consider.
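As a sketch of what this anomaly step could look like in practice, the fragment below flags clauses whose embedding is far from every clause in a reference corpus of known-common clauses, so a human can examine the remainder. The embedding model name and the similarity threshold are assumptions for illustration, not tuned values.

```python
# A minimal sketch of anomaly flagging, assuming a reference corpus of
# known-common clauses has already been assembled. The model name and the
# 0.6 threshold are illustrative assumptions, not tuned values.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def flag_anomalies(clauses: list[str], common_clauses: list[str],
                   threshold: float = 0.6) -> list[str]:
    clause_vecs = model.encode(clauses, normalize_embeddings=True)
    common_vecs = model.encode(common_clauses, normalize_embeddings=True)
    flagged = []
    for clause, vec in zip(clauses, clause_vecs):
        # Cosine similarity to the closest known-common clause.
        best = float(np.max(common_vecs @ vec))
        if best < threshold:
            flagged.append(clause)  # unusual: hand this to the user
    return flagged
```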
And here we start getting into the role of the enterprise that creates, develops, and maintains the Privacy Policy Expert (PPE). Once the PPE reaches a level of performance that users find useful, each user and each Privacy Policy provides feedback to the developers that directs attention to specific legal language, terminology, and constraints that need attention. This is a virtuous feedback cycle that improves the PPE.
Some anomalies require developer attention to improve the technology. Other anomalies may require legislative or regulatory attention from authorities. By collecting data about the appearance of such clauses, the enterprise can provide valuable evidence of the need for particular regulations or legislation that might otherwise fly completely below the “radar” of public perception. This begins to address subgoal (c) in the first paragraph.
Since the training data consists of Privacy Policies, much of it is openly available on the World Wide Web. Researchers at Pennsylvania State University [Srinath-de-23] estimate that there are at least 3 million Privacy Policies available online. These researchers also estimate that a user visiting a given website has only about a 34% chance of finding the URL for a privacy policy, and that perhaps 3% of the URLs found do not successfully link to a privacy policy.
Based in part on this 34% availability, and on other estimates of the number of websites on the Internet, Perplexity AI estimates that there may be upward of 150 million Privacy Policies on the Internet. This may well be an overestimate.
Nonetheless, this is a useful starting point.
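For what it is worth, the arithmetic behind such an estimate is simple. The total-website count below is purely an assumption for illustration; it does not come from the sources above.

```python
# Back-of-envelope check on the 150 million figure. The website count is
# a hypothetical assumption, not from Srinath et al. or Perplexity AI.
assumed_websites = 450_000_000
discoverability = 0.34            # the 34% estimate cited above
print(f"{assumed_websites * discoverability:,.0f}")  # ≈ 153,000,000
```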
In 2018, Harkous et al. addressed important aspects of this problem using deep neural network methods, before the LLM revolution. They created and applied a “privacy-centric language model” and obtained promising results. But now, with well-developed LLM technology available, much more is likely to be possible.
More recent work builds on increasingly well-developed LLM technologies.
I believe that the research challenge of implementing and deploying a Privacy Policy Expert (PPE) is an attractive “low-hanging fruit” that could help people significantly, could help move artificial intelligence research forward, and could create an impactful enterprise. Success at this challenge will require creativity, insight, intelligence, and effort. It will not require the vast computational or data resources that limit certain kinds of research to extremely large companies.
If you can do this, you will change the world, in a small but significant way.
The Privacy Policy Expert is "low-hanging fruit" in a very large orchard. Once you have assembled the expertise and resources to harvest this specific fruit, what's next?
The obvious next step is similar expertise in the Terms of Service that also appear on many websites, and that most people also click through without reading or understanding. Terms of Service contain a wider variety of clauses than Privacy Policies do, but the expertise and resources developed for Privacy Policies are a very good start toward this second goal.
After this, there are a truly vast number of specialized legal agreements: loan agreements, lease agreements for cars and apartments, non-disclosure agreements, plea-bargain agreements, and on and on. These are often written in language that ordinary people without legal training can't understand, and often printed in fonts and formats that discourage reading, much less careful and thoughtful reading. People facing these agreements, often under significant pressure of time, money, and other commitments, simply sign without reading. Just like we "agree" to Privacy Policies.
You could immediately start work on the larger problem of building an AI attorney, but you would quickly be overwhelmed. The PPE is offered as a "starter project" with a much lower barrier to entry, and the possibility of creating a product quickly that would actually help real people. Good luck!