Word sense disambiguation test results:

The first 200 bytes of enhanced gloss text for each noun in WordNet 3.0 were used as input text to check the correctness of disambiguation. The word itself, as well as the headword of its synset, was removed from the gloss text before the test was run. All person-related nouns were also removed: mother, father, man, woman, child, etc. The enhancement in this case was to include the gloss text of all hyponyms of each noun -- up to 4 levels deep. All gloss text was concatenated into a single long paragraph before the first 200 bytes were selected. No nouns whose lexeme contained special characters or blanks were tested -- the tools I used for this experiment simply ignore all characters outside the [a-zA-Z] set.

There are 63080 noun word senses in the test, including nouns with only one word sense. There were 6721 disambiguation failures in total, a success rate of about 89%. The tables used for disambiguation were created entirely automatically, except for the manual selection of 58 subject matter corpora which served as a concept space universe. The documents populating these corpora were extracted entirely from Wikipedia and amount to a recursive dump of all pages referenced by the top-level subject word -- down 4 levels deep. More on this below.

Of the 6721 failures, 3190 seem to be the result of an as yet undiagnosed program bug that leaves no headword in the disambiguation tables. These may or may not be correctable flaws. If they are, fixing them would bring the success rate to 94%, but who knows what the problem will ultimately be. The input test data set has 25630 noun word senses which are not word sense 1. There were 3451 failures among this subset, a success rate of about 87%. Of these, 1524 are due to the missing-headword bug, and if I can fix them all, that would raise the success rate among polysemous words to about 92%.

Sense disambiguation approach:

No machine learning is used in this approach. Instead, tables were built from super-enhanced sense gloss words; the process for doing the super-enhancement is discussed below. The super-enhanced gloss text is converted into two different kinds of tables, and both tables ultimately contribute to the final disambiguation process. To disambiguate nouns in a block of text, the non-person nouns and verbs from a given document sample are used to sort the word senses based on the first of the two tables. The highest three senses are then re-sorted using data from the other table, and the highest ranking sense from this second sort is returned as the final sense selection. Obviously, if the number of senses is less than 3, the first selection process did not contribute to the result.

The two kinds of tables are as follows. Table 2, used in the final stage of selection, is nothing more than a word frequency table for the well-known words in the super-enhanced sense glosses. Well-known words are those that appear in WordNet. Actually, the well-known word set comes from several sources, not just WordNet, and has been hand edited to provide additional information, such as the base form of words like running -- its base form is run. The final stage of selection is to compare the words in the test fragment to the word frequency tables of the (at most) 3 candidates that survive stage 1 of the selection process.
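To make that final stage concrete, here is a minimal Python sketch. The function names and table layouts are mine, not the tool's, and since the exact scoring rule is not spelled out above, a simple frequency-weighted overlap of well-known base forms stands in for it; fragment_words would hold the base forms of the non-person nouns and verbs taken from the sample text.

    def score_against_table2(fragment_words, sense_freq_table):
        # Score one stage-1 candidate by comparing the test fragment's words to
        # the sense's Table 2 word frequency table.  The real scoring rule is
        # not documented above, so a frequency-weighted overlap is assumed.
        total = sum(sense_freq_table.values()) or 1
        return sum(sense_freq_table.get(w, 0) / total for w in fragment_words)

    def final_stage(fragment_words, candidates, table2):
        # candidates: the (at most) 3 senses that survive stage 1.
        # table2: hypothetical dict mapping a sense id to its word frequency table.
        return max(candidates,
                   key=lambda s: score_against_table2(fragment_words, table2[s]))

With fewer than 3 senses the same call still works; stage 1 simply passes every sense through and this comparison decides alone.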
Table 1 is more complicated to explain. For every sense of every noun in WordNet, Table 1 contains a direction in concept space, expressed as a vector of similarities between the word sense's super-enhanced gloss text and the words in a corpus of documents discussing each concept. There are 58 concept-specific document collections. The words in a sense's super-enhanced gloss text are compared to the words in each collection to produce a cosine similarity measure between 0 and 1. Thus, Table 1 holds, for each sense of each noun in WordNet, a vector of 58 numbers that express how that word sense relates to the universe of concepts defined by the 58 selected corpora. The first stage of the selection process is to compare the concept space direction of each sense of a word to the concept space direction of the comparand text, selecting at most the top 3 closest matches.

The subject matter corpora used to compute the directions in concept space are as follows:

airplane animal automobile beauty building cat church city clothing dog education emotion employment entertainment environment fitness food government group health holiday land meal mosque ocean office organization person pet plant politics power racism recreation religion road science ship society synagogue temple ugliness vegetable warfare

The text in these corpora was extracted from Wikipedia and then enhanced by recursively extracting text for the pages referenced as links by the Wikipedia pages -- up to 4 levels deep. Also, documents that heavily affected the uniqueness of the corpora were removed prior to table generation. That is, if a document appeared in two subject corpora and caused the cosine similarity of the two subjects to be at least 0.5, then the document was removed from one or both of the corpora. Other manual filtering of documents was done along the same lines.
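Here is a rough Python sketch of how a Table 1 entry and the first stage of selection could be computed. The word frequency tables are plain word-to-count dicts, and all names are illustrative; how the tool actually compares two concept-space directions is not stated above, so an ordinary cosine of the 58-number vectors is assumed.

    import math

    def cosine(freq_a, freq_b):
        # Cosine similarity between two word frequency tables (word -> count),
        # giving a value between 0 and 1.
        dot = sum(n * freq_b.get(w, 0) for w, n in freq_a.items())
        norm_a = math.sqrt(sum(n * n for n in freq_a.values()))
        norm_b = math.sqrt(sum(n * n for n in freq_b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    def concept_vector(gloss_freqs, corpus_freqs):
        # Table 1 entry for one sense: its direction in concept space, i.e. one
        # cosine similarity per subject corpus (58 of them in this experiment).
        return [cosine(gloss_freqs, corpus) for corpus in corpus_freqs]

    def vec_cosine(a, b):
        # Similarity of two concept-space directions (assumed comparison rule).
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    def stage1_select(fragment_freqs, sense_vectors, corpus_freqs, keep=3):
        # First selection stage: rank a word's senses by how closely their
        # concept-space direction matches that of the comparand text and keep
        # at most the top 3.
        frag_vec = concept_vector(fragment_freqs, corpus_freqs)
        ranked = sorted(sense_vectors.items(),
                        key=lambda item: vec_cosine(frag_vec, item[1]),
                        reverse=True)
        return [sense for sense, _ in ranked[:keep]]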
Super enhancing word sense glosses:

Word sense glosses were originally extracted from WordNet directly. Person words, the word defining the sense, and the headword of the sense's synset were removed from the glosses. The next phase is to enhance the glosses by including the glosses of the hyponym (child) words, again excluding the top-level word and the synset's headword from the result. This file is 215072124 bytes in size.

Word frequency tables were then built for each sense of each word. With these tables, cosine similarity comparisons could be made to words in documents extracted from Wikipedia. A corpus related to each word (but not necessarily each word sense) was built by asking Wikipedia for its page on that word and then further extracting all pages linked from that page -- up to 4 levels deep. Each document was stripped of its HTML formatting and the remaining text was broken into separate paragraphs.

To super-enhance a word sense's gloss information, two stages are performed. In stage 1, only the normal documents from Wikipedia are processed: paragraphs are selected using cosine similarity, and their text is included in the gloss information. The process is done repeatedly so that the enhanced gloss grows in a direction defined by the previously selected paragraph contents. In stage 2, the special "list" pages from Wikipedia are processed. When a list page is deemed appropriate using cosine similarity, the entire document is included in the gloss. For example, if you want the names of all dog breeds to show up in the sense tables for sense 1 of the word dog, you need to get the list of breeds from somewhere, and in Wikipedia, list pages hold this kind of bulky tabular information. The super-enhanced file is 303979475 bytes in size.
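A rough Python sketch of the two super-enhancement stages follows. The similarity threshold, the number of passes, and the stopping rule are assumptions -- the description above only says that paragraphs (stage 1) and whole list pages (stage 2) are pulled in by cosine similarity. The cosine() helper is repeated from the Table 1 sketch so the block stands alone.

    import math
    import re
    from collections import Counter

    SIM_THRESHOLD = 0.5   # assumed cutoff; the actual criteria are not given above

    def words(text):
        # Crude tokenizer matching the [a-zA-Z]-only behaviour described earlier.
        return [w.lower() for w in re.findall(r"[a-zA-Z]+", text)]

    def cosine(freq_a, freq_b):
        # Same word-frequency cosine as in the Table 1 sketch above.
        dot = sum(n * freq_b.get(w, 0) for w, n in freq_a.items())
        na = math.sqrt(sum(n * n for n in freq_a.values()))
        nb = math.sqrt(sum(n * n for n in freq_b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def super_enhance(gloss_text, normal_paragraphs, list_pages, passes=3):
        # Stage 1: repeatedly scan the ordinary Wikipedia paragraphs, appending
        # any paragraph close enough to the current gloss, so the gloss grows in
        # the direction set by what was selected earlier.
        # Stage 2: append whole "list" pages that look relevant, so bulky
        # tabular content (e.g. all the dog breed names) lands in the tables.
        gloss_freqs = Counter(words(gloss_text))
        taken = set()

        for _ in range(passes):                                   # stage 1
            for i, para in enumerate(normal_paragraphs):
                if i in taken:
                    continue
                if cosine(gloss_freqs, Counter(words(para))) >= SIM_THRESHOLD:
                    gloss_text += "\n" + para
                    gloss_freqs.update(words(para))
                    taken.add(i)

        for page in list_pages:                                   # stage 2
            if cosine(gloss_freqs, Counter(words(page))) >= SIM_THRESHOLD:
                gloss_text += "\n" + page
                gloss_freqs.update(words(page))

        return gloss_text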