Wikidata:Property proposal/URL match pattern
URL match pattern
Originally proposed at Wikidata:Property proposal/Generic
Description | regex pattern of a URL from which an external ID may be extracted. The qualifier "URL match replacement value" can override the default \1. Use non-capturing groups "(?:)" where needed. |
---|---|
Data type | String |
Domain | property |
Example 1 | IMDb ID (P345) → (one of multiple values) https:\/\/www\.imdb\.com\/(?:title|name|news)\/([a-z0-9]+)(\/.*)?
|
Example 2 | PubMed ID (P698) → https:\/\/pubmed\.ncbi\.nlm\.nih\.gov\/(\d+)(-[^\/]*)?\/
\1 |
Example 3 | ISNI (P213) → https?:\/\/www\.isni\.org\/(\d{4})(| |%20)(\d{4})(| |%20)(\d{4})(| |%20)(\d{4})
\1 \3 \5 \7 |
Example 4 | ZVG number (P679) → http:\/\/gestis-en\.itrust\.de\/nxt\/gateway\.dll\/gestis_en\/0+([1-9]\d+)\.xml.*
\1 |
Example 5 | CricketArchive player ID (P2698) → https:\/\/cricketarchive\.com\/Archive\/Players\/\d+\/\d+\/(\d+)\.html
\1 |
Example 6 | Fandom article ID (P6262) → https:\/\/([a-z0-9\.-]+)\.(wikia|fandom)\.com\/wiki\/(.*)
\1:\3 |
Example 7 | Geni.com profile ID (P2600) → https:\/\/www\.geni\.com\/(?:profile|people)\/[^\/]+\/(\d+)(#.*)?
\1 |
See also |
URL match replacement value
Description | (qualifier only) optional qualifier to overwrite the default \1 |
---|---|
Data type | String |
Example 1 | see above |
Motivation
This will provide a way to extract property and ID from a given URL. A future tool or gadget may benefit from this. GZWDer (talk) 23:46, 26 February 2020 (UTC)
Discussion
Comment Here's an example of how this would look on Fandom article ID (P6262):
URL match pattern: https:\/\/([a-z0-9\.-]+)\.(wikia|fandom)\.com\/wiki\/(.*)
URL match replacement value (qualifier): \1:\3
If a tool wanted to automatically generate a Fandom article ID (P6262) from the URL https://minecraft.fandom.com/wiki/Sheep, for example, it would match the regex specified on the property against that URL. There are three capturing groups in the regex. The first one is ([a-z0-9\.-]+) and matches "minecraft", the second one is (wikia|fandom) and matches "fandom", and the third one is (.*) and matches "Sheep". The URL match replacement value allows these capturing groups to be put together: \1:\3 turns into minecraft:Sheep, since \N is replaced with the value of the Nth capturing group.
--SixTwoEight (talk) 01:52, 4 March 2020 (UTC)
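The matching steps described above can be sketched in Python as follows. This is a minimal illustration, not the implementation of any existing Wikidata tool: `extract_id` is a hypothetical helper, and requiring the pattern to match the whole URL (`re.fullmatch`) is an assumption, since the proposal does not say how patterns are anchored. The pattern and replacement value are copied from Example 6.

```python
import re

# Pattern and replacement value copied from Example 6 (Fandom article ID, P6262).
FANDOM_PATTERN = r"https:\/\/([a-z0-9\.-]+)\.(wikia|fandom)\.com\/wiki\/(.*)"
FANDOM_REPLACEMENT = r"\1:\3"


def extract_id(url, pattern, replacement=r"\1"):
    """Return the external ID extracted from url, or None if the URL does not match.

    Hypothetical helper for illustration only. Assumes the pattern must
    match the entire URL; the replacement defaults to \\1 as in the proposal.
    """
    match = re.fullmatch(pattern, url)
    if match is None:
        return None
    # Match.expand substitutes \N with the text of the Nth capturing group.
    return match.expand(replacement)


fandom_id = extract_id(
    "https://minecraft.fandom.com/wiki/Sheep", FANDOM_PATTERN, FANDOM_REPLACEMENT
)
print(fandom_id)  # minecraft:Sheep
```

A URL that does not match the pattern yields `None`, so a tool could try the patterns of several properties in turn and keep the first one that matches.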
- Support —Eihel (talk) 09:55, 17 May 2020 (UTC)
- @Ivan_A._Krestinin: what do you think? Currently these are mostly defined in autofix templates. --- Jura 15:10, 15 July 2020 (UTC)
- Comment, I just wanted to suggest the same, but with reusing applies if regular expression matches (P8460).
Here is what I thought about:
- This is specifically targeted to Wikidata:Entity Explosion. I was slightly disappointed, when I understood that due to tech reasons this extension does not support links like:
- https://movieplayer.it/personaggi/douglas-adams_26898/ (Movieplayer person ID (P4782) 26898)
- https://www.muziekweb.nl/Link/M00000364497/POPULAR/Douglas-Adams (Muziekweb performer ID (P5882) M00000364497)
- https://www.fantascienza.com/catalogo/autori/NILF10014/douglas-adams/ (Vegetti Catalog of Fantastic Literature NILF ID (P2191) 10014)
- Also, ping @99of9:. --Lockal (talk) 09:28, 6 October 2020 (UTC)
- Comment Thanks for the ping Lockal. Something like this will almost certainly help. I don't have time right now to get my head around the alternatives, but I'm very glad to see this. --99of9 (talk) 11:17, 6 October 2020 (UTC)
- @GZWDer:, sorry for the interruption, what do you think about my suggestion above (about reusing applies if regular expression matches)? Also, inverting pattern and replacement creates a somewhat cleaner representation when different URL patterns use the same replacement (which is impossible the other way around: a single URL pattern can never use multiple replacement patterns). If you agree, could you update your proposal, please? --Lockal (talk) 07:58, 13 October 2020 (UTC)
- Support Yes, I can see this being very useful, thanks for proposing! ArthurPSmith (talk) 18:19, 6 October 2020 (UTC)
- Support I think I prefer this over Wikidata:Property proposal/URL extractor regular expression but I prefer either over nothing. BrokenSegue (talk) 03:54, 19 November 2020 (UTC)
- Lockal interesting, but Replacement for external id extraction would be \1 in the vast majority of cases? Wouldn't it? --Shisma (talk) 19:13, 19 November 2020 (UTC)
- Shisma, ok, I did not think about a default value. And even without a default value, the more I look, the more I agree that the original proposal is more intuitive. There is no point in arguing about it, so I have struck through my comment above. Pattern and replacement are just a pair (or a tuple in general); until Wikibase implements a "typed tuple" datatype, any model will be ugly anyway. --Lockal (talk) 10:04, 8 December 2020 (UTC)
- Support per discussion at Wikidata:Property proposal/URL extractor regular expression. --- Jura 12:31, 21 November 2020 (UTC)
- I added some formatting to the samples above. BTW, should we make "$2" the default replacement value? This could make the qualifier "replacement value" optional. See Property_talk:P973 for some regexes. --- Jura 17:44, 21 November 2020 (UTC)
- @Jura1: Thanks for clarifying the proposal. Why 2 and not 1? A 2 can always be made into a 1 using non-capturing regex groups. Also I assume you meant \2, not $2. BrokenSegue (talk) 17:50, 21 November 2020 (UTC)
- At Property_talk:P973, it sometimes ended up being \2. Either because I didn't know better or because non-capturing groups aren't supported. I suppose we should pick either $2 or \2. --- Jura 17:59, 21 November 2020 (UTC)
- @BrokenSegue: It seems that (?:) is supported by Krbot. I added that the qualifier is optional if it's the default \1. --- Jura 08:20, 1 December 2020 (UTC)
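The point that "a 2 can always be made into a 1" with non-capturing groups can be illustrated with a short Python sketch. The non-capturing pattern is the one from Example 7; the capturing variant and the sample URL are made up for this comparison and are not taken from any property.

```python
import re

# Hypothetical capturing variant: the alternation occupies group 1, so the ID is \2.
CAPTURING = r"https:\/\/www\.geni\.com\/(profile|people)\/[^\/]+\/(\d+)(#.*)?"
# Pattern from Example 7: (?:...) groups without capturing, so the ID is \1.
NON_CAPTURING = r"https:\/\/www\.geni\.com\/(?:profile|people)\/[^\/]+\/(\d+)(#.*)?"

# Made-up URL for illustration only.
URL = "https://www.geni.com/people/Some-Person/123456"

id_from_group_2 = re.fullmatch(CAPTURING, URL).expand(r"\2")
id_from_group_1 = re.fullmatch(NON_CAPTURING, URL).expand(r"\1")

# Both patterns extract the same ID; only the group number differs.
print(id_from_group_2, id_from_group_1)
```

With the non-capturing form the default replacement \1 suffices, so the qualifier can be left off.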
- Support --Shisma (talk) 13:22, 21 November 2020 (UTC)
- Support with the caveat that we should only have one of this or Wikidata:Property proposal/ID pattern and not both. I look forward to something like this for a tool that can do something like Mix'n'match (Q28054658) but without needing to upload an external database (which is often unavailable for copyright reasons, etc.). Of course it would require some sort of URL space crawling, but it could still be done by users just wandering through such sites. --Uzume (talk) 03:11, 12 December 2020 (UTC)
- @Uzume: indeed! I will withdraw the ID pattern proposal as soon as this proposal passes. (or should I do it now?) --Shisma (talk) 08:53, 12 December 2020 (UTC)
- @ArthurPSmith: any updates on this? -Shisma (talk) 13:45, 19 December 2020 (UTC)
- @Shisma: I created them. Please feel free to help complete them. @BrokenSegue, GZWDer, 99of9, Lockal: please make good use of the properties --- Jura 13:50, 19 December 2020 (UTC)
- I added 90 patterns from my archive. I will try to implement the property into my extension this weekend. Thank you all 🥳 --Shisma (talk) 15:01, 19 December 2020 (UTC)