Wikidata:Property proposal/WordNet 3.1 Synset Id
WordNet 3.1 Synset Id
Originally proposed at Wikidata:Property proposal/Authority control
Description | Synset identifier in Princeton’s WordNet Version 3.1 |
---|---|
Represents | WordNet (Q533822) |
Data type | External identifier |
Domain | item |
Allowed values | \d{8}\-[nvarsp] |
Example 1 | dog (Q144) → 02086723-n |
Example 2 | pawl (Q55629301) → 03907626-n |
Example 3 | hot dog (Q181055) → 07692347-n |
Source | https://wordnet.princeton.edu/ |
Planned use | There is already word-sense disambiguation software that produces WordNet synsets. This property would allow such software to target Wikidata. |
Number of IDs in source | 117,000 |
Expected completeness | eventually complete (Q21873974) |
Formatter URL | http://wordnet-rdf.princeton.edu/id/$1 |
Robot and gadget jobs | See below |
See also | Interlingual Index ID (P5063), BabelNet ID (P2581) |
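The allowed-values pattern and the formatter URL can be checked mechanically. The following is a minimal Python sketch, assuming the pattern is meant to match the whole identifier and that `$1` in the formatter URL is replaced by it. The function names are illustrative and not part of any Wikidata tooling.

```python
import re

# Pattern from the "Allowed values" row. re.ASCII restricts \d to 0-9,
# and fullmatch() below anchors the pattern to the whole string.
SYNSET_ID = re.compile(r"\d{8}-[nvarsp]", re.ASCII)

# Formatter URL from the proposal; "$1" stands for the identifier.
FORMATTER_URL = "http://wordnet-rdf.princeton.edu/id/$1"


def is_valid_synset_id(value: str) -> bool:
    """Return True if value looks like an 8-digit offset plus a part-of-speech letter."""
    return SYNSET_ID.fullmatch(value) is not None


def synset_url(value: str) -> str:
    """Expand the formatter URL for a syntactically valid identifier."""
    if not is_valid_synset_id(value):
        raise ValueError(f"not a WordNet 3.1 synset identifier: {value!r}")
    return FORMATTER_URL.replace("$1", value)


if __name__ == "__main__":
    # The three examples from the proposal table.
    for example in ("02086723-n", "03907626-n", "07692347-n"):
        print(example, synset_url(example))
```

This only checks the shape of an identifier; it does not confirm that the synset exists in WordNet 3.1.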
Motivation
WordNet is a substantial and widely used set of concepts. Having a mapping between Wikidata and WordNet would assist those who want to use Wikidata for word sense disambiguation.
I note two previous proposals from two years ago: Wikidata:Property_proposal/Wordnet_synset_ID and Wikidata:Property proposal/WordNet ID. The first of those gained significant support, but was withdrawn because of issues about how WordNet ids changed in different versions, and because of questions about the part-of-speech being required to convert the id into a URL. The second proposal was opposed because its relationship to the prior proposal was unclear and because of the forthcoming integration with Wiktionary.
Regarding the issue with versions, this proposal solves that problem by being specific to one version, and uses the "offsets" for that version. (Arguably, we ought to have properties for both 3.0 and 3.1, the two major versions in use today, but only 3.1 is proposed here.)
The issue of the part-of-speech being part of the URL is resolved by making it part of the identifier.
This property is not directly related to the lexicographical data because these identifiers are for lexical concepts, which are better modelled as Wikidata items. WordNet has a separate namespace for lemmata, e.g. http://wordnet-rdf.princeton.edu/lemma/dog .
As can be seen in this query, we already have 188 mappings to WordNet 3.1 using exact match (P2888). While these are usable, it is better to have a specific property. These can be used to populate the property. Mappings between WordNet 3.0 and 3.1 and with other resources like ILI and BabelNet are available from various sources.
I anticipate being able to populate the property with hundreds or thousands of values through a few rounds of QuickStatements that carry out the obvious steps of deriving values from the existing data. This might need to be rerun from time to time, but probably does not require a bot.
Bovlb (talk) 18:23, 5 August 2020 (UTC)
- Regex and example 3 tweaked per Eihel below. Bovlb (talk) 15:47, 24 August 2020 (UTC)
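One way such a QuickStatements round could be prepared is sketched below in Python. It rests on several assumptions that are not established by the proposal: that the existing exact match (P2888) values use the same URL shape as the proposed formatter URL, that the mappings have already been exported as (item, URL) pairs, and that the new property's ID is known (`Pxxxx` is a placeholder, since no ID exists until the property is created). Short offsets are left-padded with zeros as a precaution.

```python
import re

# Assumption: P2888 values look like the proposed formatter URL.
# Offsets of fewer than 8 digits are accepted here and padded below.
URL_PATTERN = re.compile(
    r"^https?://wordnet-rdf\.princeton\.edu/id/(\d{1,8})-([nvarsp])$", re.ASCII
)

# Placeholder only: the real property ID is assigned on creation.
PROPERTY_ID = "Pxxxx"


def normalize(url):
    """Extract a zero-padded synset identifier from a URL, or None if unrecognised."""
    match = URL_PATTERN.match(url.strip())
    if match is None:
        return None
    offset, pos = match.groups()
    return f"{offset.zfill(8)}-{pos}"


def quickstatements_rows(mappings, property_id=PROPERTY_ID):
    """Build tab-separated QuickStatements (V1) rows from (item ID, URL) pairs."""
    rows = []
    for qid, url in mappings:
        synset_id = normalize(url)
        if synset_id is None:
            continue  # skip URLs this sketch does not recognise
        rows.append(f'{qid}\t{property_id}\t"{synset_id}"')
    return rows


if __name__ == "__main__":
    sample = [
        ("Q144", "http://wordnet-rdf.princeton.edu/id/02086723-n"),
        ("Q181055", "http://wordnet-rdf.princeton.edu/id/7692347-n"),
    ]
    print("\n".join(quickstatements_rows(sample)))
```

Rows that do not match the assumed URL shape are skipped rather than guessed at, so they can be reviewed by hand.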
Discussion
Notified participants of WikiProject Linguistics
- Comment since this is a dictionary, would it not make more sense to (also) connect Lexemes such as dog (L1122) to WordNet? Second, in this discussion the argument seems to be that Interlingual Index ID (P5063) is a better fit for Wikidata than Wordnet since ILI provides stable identifiers. For example http://rdf.cltl.nl/ili/i46360 already links to both Wordnet 3.0 and 3.1 so I wonder whether it would make more sense to focus on ILI only and have Wordnet through ILI? --Hannes Röst (talk) 21:21, 5 August 2020 (UTC)
- @Hannes Röst, Mahir256: While I have no objection to attaching WordNet's lemmata to our lexemes, it is my view that WordNet synsets (ontolex:LexicalConcept in RDF) are more appropriately mapped to items. Bovlb (talk) 16:50, 10 August 2020 (UTC)
- @Hannes Röst: ILI is an excellent resource, but I note that https://github.com/globalwordnet/ili/blob/master/ili-map-pwn31.tab has only 117,610 lines whereas WordNet 3.1 has 175,979 synsets, so it is clearly not complete. I see no reason we cannot have explicit linkage to a widely-used (and widely-cited) resource just because incomplete indirect linkages exist. Do we not aspire to be a hub for such things? And WordNet identifiers are stable if we fix the version number, as this proposal does. Bovlb (talk) 16:50, 10 August 2020 (UTC)
- Comment In a similar vein to Hannes's comment, wouldn't WordNet synsets be better connected to lexeme senses via exact match (P2888)? Mahir256 (talk) 00:20, 6 August 2020 (UTC)
- @Mahir256: Lexeme senses are already linked to Wikidata items. While we could also link lexeme senses to external resources, my view is that the item is a better locus for asserting ontological equivalence. (See my notes about P2888 below.) Bovlb (talk) 16:50, 10 August 2020 (UTC)
- Comment See this previous proposal - it was well supported, but the proposers ended up going with the exact match (P2888) property instead; is that still sufficient for this? ArthurPSmith (talk) 19:35, 6 August 2020 (UTC)
- @ArthurPSmith, Mahir256: Regarding P2888, I don't think it's appropriate/useful to use that property for adding hundreds of thousands of URL links to a single resource when we could be creating a new property instead. For one thing, it does not make for efficient SPARQL if multiple namespaces share the same property. As noted above, we only have 188 WordNet links using P2888, so I don't think it is serving anyone's purposes. Cheers, Bovlb (talk) 16:50, 10 August 2020 (UTC)
- Ok, I'm convinced! Support ArthurPSmith (talk) 17:46, 10 August 2020 (UTC)
- Support last time I checked and tried to figure out why we didn't have this it somehow left me puzzled. --- Jura 20:57, 10 August 2020 (UTC)
- Support I am convinced. --Hannes Röst (talk) 15:30, 13 August 2020 (UTC)
- Support --SynConlanger (talk) 16:11, 14 August 2020 (UTC)
- Neutral (initially oppose). Hello @Bovlb:
- According to [1] (to keep as a reference), the regex should be of the form \d{8}\-[nvarsp]
- Example 3 leads nowhere. The synset is found under 07692347-n (http://wordnet-rdf.princeton.edu/id/07692347-n).
- If the binding is mainly for WD items, then by force of circumstance there will be far more XXXXXXXX-n identifiers than identifiers with the other part-of-speech options (v, a, r, s, p). For example, the lemma reflective (s and a) is absent from the Qs on WD, although 3 lexemes are present. What do you propose to change that? Cordially. —Eihel (talk) 17:20, 22 August 2020 (UTC)
- @Eihel: I have fixed the regex (thanks for the link) and example 3 as you suggest. (Example 3 worries me slightly, because it suggests that some interface I used at the time was not zero-padding the offsets, but I cannot reproduce it today. Need to check for that when coding sweeps.) Your third point is a good one. We have the same issue on our side with reflective (L41580) which, unlike reflectivity (L228164), has no item for this sense (P5137) claim. Similar issue for reflect (L5825) and (say) 00632042-v. Do you know if we have any plan to link these lexeme senses to items, or otherwise denote their semantics? Bovlb (talk) 16:00, 24 August 2020 (UTC)
- Hello Bovlb, In view of my history, I am a "Beotian" for the lexicographical side of WD and, precisely, I expected that you would give me a reply. But indeed, it seems more interesting to me to capture most of the site IDs. With this in mind, I'm not blocking goodwill: I'm changing my opinion, but I'm not "excited" to include this property only for Qs. (I was "expeditious" on a large number of proposals since yesterday (creation, closure, etc.), because the new proposals were queuing to display correctly. You see me sorry.) For example, you always have the possibility of putting "on hold" in the status field, without omitting to include an explanatory message in the Discussion section, such as: {{Wait}} requesting advice from the lexico specialists blahblah. Looking forward to reading you. —Eihel (talk) 03:43, 25 August 2020 (UTC)
- ps. If the proposal changes significantly, you need to notify everyone involved of this section. —Eihel (talk) 03:49, 25 August 2020 (UTC)