We only extract data, and if there is no explicit data on whether a word is a lemma, it's hard to tell what 'should' and 'shouldn't' count as one from Wiktionary's perspective. There is no way to do this formally with Wiktionary data, because non-lemma forms have their own entries as well. The closest you can get is by looking for
FYI, this is my current regular expression pattern for determining whether a word is not a lemma. I may be too conservative with the "alt-of" tag: here I only count alt-of entries that are misspellings. Perhaps I should just include all alt-of tags. Advice?
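For illustration, here is a rough sketch of the same tag-based idea written as plain Python rather than a regex. It assumes each JSON object has a `senses` list whose items may carry `tags` (such as "form-of", "alt-of", "misspelling") and `form_of` / `alt_of` fields. The function names and the `all_alt_of` switch are hypothetical, and treating an entry as a non-lemma only when every one of its senses is flagged is a design assumption, not something Wiktionary or wiktextract defines.

```python
import json


def sense_is_non_lemma(sense, all_alt_of=False):
    """Return True if a single sense looks like a pointer to another word."""
    tags = set(sense.get("tags", []))
    # Inflected forms: tagged "form-of" or carrying a form_of target.
    if "form-of" in tags or sense.get("form_of"):
        return True
    # Alternative forms: conservative by default (misspellings only).
    if "alt-of" in tags or sense.get("alt_of"):
        return all_alt_of or "misspelling" in tags
    return False


def is_lemma(entry, all_alt_of=False):
    """Treat an entry as a lemma unless every sense is flagged as non-lemma."""
    senses = entry.get("senses", [])
    if not senses:
        # Assumption: with no senses, nothing marks the entry as a non-lemma.
        return True
    return not all(sense_is_non_lemma(s, all_alt_of) for s in senses)


def lemmas_from_jsonl(lines, all_alt_of=False):
    """Yield the 'word' of each lemma entry from an iterable of JSON lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if is_lemma(entry, all_alt_of):
            yield entry.get("word")
```

Calling `is_lemma(entry, all_alt_of=True)` corresponds to counting every alt-of sense as a non-lemma, which would also drop entries that are ordinary alternative spellings rather than errors. An entry that mixes a form-of sense with a regular sense stays a lemma under this sketch.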
In WordNet, every LexicalEntry is a lemma. In wiktextract, a line (JSON object) in the file may or may not be a lemma, and there are many fields that could be examined to determine this. Is there a "cookbook" or "recipe" that answers these kinds of questions? For example, to which parts of speech does the lemma/non-lemma distinction apply? Are names (proper nouns) considered lemmas? Can compound words (hyphenated, space-delimited, etc.) or phrases be lemmas? Which fields, tags, keywords, etc. in the JSON object define this categorization of a word?
Thanks!