Proposal for Senseval Scoring Scheme

I. Dan Melamed & Philip Resnik
----------------------------------------------

1. A principled evaluation metric can be derived by assigning
   probabilities over sense tags output by WSD algorithms.  Algorithms
   that output multiple tags but do not assign probabilities can be
   treated as assigning uniform probabilities over the tags that they
   output.  E.g. an algorithm that considers senses A and B as
   possible, but eliminates senses C, D and E for a word with 5 senses
   in the reference inventory is really saying:

        sense   prob.
        -----   -----
          A      .5
          B      .5
          C       0
          D       0
          E       0

2. Given a probability distribution over sense tags and a single known
   correct tag, the algorithm's score should be the probability that
   the algorithm assigns to the correct tag.  Note that the
   exact-match criterion falls out as a special case of this metric,
   where the algorithm selects exactly one sense, which is equivalent
   to assigning 100% of the probability mass to it.

3. Given multiple possible correct tags for a given word token, the
   algorithm's score should be the sum of ALL probabilities that it
   assigns to ANY of the correct tags.  The premise here is that it is
   impossible to tell whether a multi-sense annotation was intended as
   disjunctive or conjunctive, so algorithms should be given the
   benefit of the doubt.  E.g. the annotator tags a word with senses A
   and B:

        algorithm's output      score
        ------------------      -----
        A                        1
        .5 A; .5 B               1
        .3 A; .7 C               .3
        .3 A; .4 B; .3 C         .7
        A, B, C                  2/3

4. The probabilistic treatment of sense tags can be extended to handle
   tree-structured tagsets, such as HECTOR, if the structure is
   interpreted as an IS-A hierarchy.  E.g., if sense 3.2 is a
   sub-sense of sense 3, then any word token of sense 3.2 *also* IS-A
   token of sense 3.  (Further extensions exist for more general DAGs,
   such as WordNet, but they don't concern us here.)
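As a minimal sketch (not part of the proposal itself), the flat-tagset
scoring of points 1-3 can be written in a few lines of Python.  The
function names `score` and `uniform` are illustrative choices, and
exact fractions are used so the table values come out exactly:

```python
from fractions import Fraction

def score(output, correct):
    """Score a WSD algorithm's output against a set of correct tags.

    `output` maps each sense tag the algorithm produced to the
    probability it assigned to that tag; `correct` is the set of
    manually annotated senses.  Per point 3, the score is the total
    probability mass the algorithm placed on any correct tag.
    """
    return sum(p for tag, p in output.items() if tag in correct)

def uniform(tags):
    """Per point 1: model a bare tag list as a uniform distribution."""
    return {t: Fraction(1, len(tags)) for t in tags}

# The examples from point 3, where the annotator chose senses A and B:
correct = {"A", "B"}
print(score(uniform(["A"]), correct))                              # 1
print(score({"A": Fraction(1, 2), "B": Fraction(1, 2)}, correct))  # 1
print(score({"A": Fraction(3, 10), "C": Fraction(7, 10)}, correct))  # 3/10
print(score(uniform(["A", "B", "C"]), correct))                    # 2/3
```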
   The same scoring criterion can be used for structured tagsets as
   for unstructured ones: What is the probability that the algorithm
   assigns to any of the correct tags?  The complication for
   structured tagsets is that it is not obvious how to compare tags
   that are in a parent-child relationship.  This problem can be
   solved by defining two kinds of probability distributions:

        Pr(occurrence of parent sense | occurrence of child sense)
        Pr(occurrence of child sense | occurrence of parent sense)

   The first one is easy: in a tree-structured IS-A hierarchy,
   Pr(occurrence of parent node | occurrence of child node) = 1.  The
   second one is harder, unfortunately; in general, these ("downward")
   probabilities are unknown.  In the absence of prior knowledge about
   sense distributions over particular sense-tree branches, the
   maximum entropy principle dictates that we assign a uniform
   distribution over Pr(occurrence of child sense | occurrence of
   parent sense) for each sense.  Fortunately, this is not such a bad
   assumption.  It will be false in most individual cases, but if we
   evaluate WSD algorithms by averaging performance over many
   different word types, most of the biases should come out in the
   wash.

   Now, how do we use these conditional probabilities for scoring?
   Treat each non-leaf sense tag as underspecified.  E.g. if sense 3
   has just the two subsenses 3.1 and 3.2, then tagging a word with
   sense 3 is equivalent to giving it a probability of one half of
   being sense 3.1 and one half of being sense 3.2, given our
   assumption of uniform downward probabilities.  This applies both to
   the tags in the output of WSD algorithms and to the manual
   (correct, reference) annotations.

   Example: Suppose our sense tree for a given word has senses 1 and
   2, which are not subdivided in any way.  Sense 3 is divided into
   3.1 and 3.2, as above, and 3.1 is further subdivided into 3.1a and
   3.1b.  There is also a sense 4, which is split into 4.1, 4.2
   and 4.3.
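The uniform downward probabilities for this example tree can be
sketched as follows (a hypothetical helper, not part of the proposal):
a non-leaf tag's probability mass is split evenly among its children,
recursively, yielding Pr(leaf | tag) for every fully specified sense.

```python
from fractions import Fraction

# The example sense tree: children of each subdivided sense.
# Senses 1, 2, 3.2, 3.1a, 3.1b, 4.1, 4.2 and 4.3 are leaves.
CHILDREN = {
    "3":   ["3.1", "3.2"],
    "3.1": ["3.1a", "3.1b"],
    "4":   ["4.1", "4.2", "4.3"],
}

def leaf_distribution(sense):
    """Pr(leaf | sense) under uniform downward probabilities.

    An underspecified (non-leaf) tag splits its mass evenly among its
    children, recursively, until only leaf senses remain.
    """
    kids = CHILDREN.get(sense)
    if not kids:                        # already a fully specified sense
        return {sense: Fraction(1)}
    dist = {}
    share = Fraction(1, len(kids))
    for kid in kids:
        for leaf, p in leaf_distribution(kid).items():
            dist[leaf] = dist.get(leaf, Fraction(0)) + share * p
    return dist

print(leaf_distribution("3")["3.1a"])   # 1/4, i.e. Pr(3.1a | 3) = .25
print(leaf_distribution("4")["4.2"])    # 1/3, i.e. Pr(4.2 | 4)
```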
   Under our assumption of uniform downward probabilities, we start by
   deducing:

        Pr(3.1 | 3)    = .5
        Pr(3.1a | 3.1) = .5     (and so)  Pr(3.1a | 3) = .25
        Pr(4.2 | 4)    = 1/3,   etc.

   If any of the conditionals above are reversed, then the probability
   is always 1, e.g. Pr(3 | 3.1a) = 1.  Next, we apply these
   probabilities to find Pr(any correct sense | algorithm's assigned
   senses), as in the following example cases:

        manual annotation   algorithm's output   score
        -----------------   ------------------   -----
        2                   3                    0
        3                   3                    1
        3                   3.1                  1
        3                   3.1b                 1
        3.1                 3                    .5
        3.1 & 3.2           3                    .5 + .5 = 1
        3.1a                3                    .25
        3.1a & 4.2          4                    1/3 (= Pr(4.2 | 4))
        3.1a & 4.2          3.1                  .5
        3.1a & 4.2          3.1 & 4.2            .5*.5 + .5*1 = .75
        3.1a & 4.2          3.1 & 4              .5*.5 + .5*(1/3) = .41666

5. This scoring scheme depends only on the tree structure of the
   hierarchy, and not on the types of nodes in it.  In particular,
   questions of whether part-of-speech and other syntactic
   distinctions should be part of the sense inventory are orthogonal
   to the issue addressed here.

==========================================================

References:

This proposal incorporates some ideas from section 3.1 of

        Philip Resnik and David Yarowsky, "A perspective on word sense
        disambiguation methods and their evaluation", position paper
        presented at the ACL SIGLEX Workshop on Tagging Text with
        Lexical Semantics: Why, What, and How?, held April 4-5, 1997,
        in Washington, D.C., USA, in conjunction with ANLP-97.

        Available from
        http://www.umiacs.umd.edu/~resnik/papers/siglex97_perspective.ps
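Appendix: the worked examples in section 4 can be reproduced with a
short self-contained script.  This is one way to operationalize the
scheme (names and helper structure are illustrative, not prescribed by
the proposal): expand both the manual annotation and the algorithm's
output to leaf distributions under uniform downward probabilities,
then sum the output mass that lands on any leaf compatible with a
correct tag.

```python
from fractions import Fraction

# The example sense tree from section 4; unsubdivided senses are leaves.
CHILDREN = {
    "3":   ["3.1", "3.2"],
    "3.1": ["3.1a", "3.1b"],
    "4":   ["4.1", "4.2", "4.3"],
}

def leaves(sense):
    """Pr(leaf | sense) under uniform downward probabilities."""
    kids = CHILDREN.get(sense)
    if not kids:
        return {sense: Fraction(1)}
    out = {}
    for kid in kids:
        for leaf, p in leaves(kid).items():
            out[leaf] = out.get(leaf, Fraction(0)) + p / len(kids)
    return out

def score(output_tags, correct_tags):
    """Pr(any correct sense | algorithm's assigned senses).

    Bare tag lists get uniform probabilities (points 1 and 3).  A leaf
    counts as correct if some correct tag's leaf distribution gives it
    nonzero mass; this makes reversed conditionals score 1, e.g. the
    manual tag 3 accepts the more specific output 3.1b.
    """
    correct_leaves = set()
    for t in correct_tags:
        correct_leaves |= set(leaves(t))
    total = Fraction(0)
    for t in output_tags:
        for leaf, p in leaves(t).items():
            if leaf in correct_leaves:
                total += p / len(output_tags)
    return total

# A few rows of the scoring table in section 4:
print(score(["3.1b"], ["3"]))             # 1
print(score(["3"], ["3.1a"]))             # 1/4
print(score(["3.1", "4.2"], ["3.1a", "4.2"]))  # 3/4
print(score(["3.1", "4"], ["3.1a", "4.2"]))    # 5/12  (= .41666...)
```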