Wikidata:Property proposal/ID pattern

Id pattern

Originally proposed at Wikidata:Property proposal/Generic

   Withdrawn
Description: A replacement pattern used to form an external ID. To be used with applies if regular expression matches (P8460).
Data type: String
Domain: property
Allowed values: a valid replacement pattern, with $1 for the first capture group, $2 for the second, and so forth
Example 1: Twitter username (P2002) → [id build pattern] → $1
applies if regular expression matches (P8460): /^https?:\/\/(?:mobile\.)?twitter\.com\/(?:intent\/user\?screen_name\=)?(?!hashtag|home|explore|notifications|messages|i)([0-9A-Za-z_]{1,15})/
Example 2: subreddit (P3984) → [id build pattern] → $1
applies if regular expression matches (P8460): /^https?:\/\/www\.reddit\.com\/r\/([^\/?#]+)\/
Example 3: MusicBrainz artist ID (P434) → [id build pattern] → $1
applies if regular expression matches (P8460): /^https?:\/\/musicbrainz\.org\/artist\/([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})/
Example 4: MusicBrainz artist ID (P434) → [id build pattern] → $1
applies if regular expression matches (P8460): /^https?:\/\/www\.bbc\.co\.uk\/music\/artists\/([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})/
Example 5: Fandom article ID (P6262) → [id build pattern] → $1:$2
applies if regular expression matches (P8460): /https?:\/\/([a-z0-9\.-]+)\.fandom\.com\/wiki\/([^\s#\?]+)/
Example 6: Fandom article ID (P6262) → [id build pattern] → $2.$1:$3
applies if regular expression matches (P8460): /https?:\/\/([a-z0-9\.-]+)\.fandom\.com\/([\w]+)\/wiki\/([^\s#\?]+)/
Source: list
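
As a rough illustration of how a consumer might combine a replacement pattern with its regular expression, here is a minimal Python sketch. The function name build_id and the sample URLs are assumptions made for this example and are not part of the proposal. The regexes are the Fandom ones from Examples 5 and 6, written with the dot before "fandom" escaped.

```python
import re
from typing import Optional

def build_id(url: str, match_regex: str, id_pattern: str) -> Optional[str]:
    """Hypothetical helper: return the ID built from `url`, or None if no match."""
    m = re.search(match_regex, url)
    if m is None:
        return None
    # Replace every $n in the pattern with the text of capture group n.
    # A group that did not participate in the match contributes an empty string.
    return re.sub(r"\$(\d+)", lambda ref: m.group(int(ref.group(1))) or "", id_pattern)

# Regexes from Examples 5 and 6 (dot before "fandom" escaped).
FANDOM_REGEX = r"https?:\/\/([a-z0-9\.-]+)\.fandom\.com\/wiki\/([^\s#\?]+)"
FANDOM_LANG_REGEX = r"https?:\/\/([a-z0-9\.-]+)\.fandom\.com\/([\w]+)\/wiki\/([^\s#\?]+)"

# Illustrative URLs, chosen only for this sketch.
print(build_id("https://starwars.fandom.com/wiki/Yoda", FANDOM_REGEX, "$1:$2"))
# starwars:Yoda
print(build_id("https://starwars.fandom.com/de/wiki/Yoda", FANDOM_LANG_REGEX, "$2.$1:$3"))
# de.starwars:Yoda
```

The sketch translates the $n placeholders by hand rather than relying on a particular regex engine's replacement syntax, since the proposal does not say which engine consumers would use.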

Motivation

I am currently working on a browser extension that, among other things, displays Wikidata entities for websites the user visits. To do that, it must know which websites are associated with which external identifier on Wikidata. For instance:

The URL https://twitter.com/timberners_lee contains the Twitter handle timberners_lee, which in Wikidata is associated with Tim Berners-Lee (Q80).

Currently, the extension uses a static list of regular expressions that only a Git contributor can expand. A Wikidata property would make it much easier to contribute entries to this list. Other extensions could certainly use it too.

It is crucial that the expression returns only a single capture group, and that this group contains only the ID. All other groups must be non-capturing. --Shisma (talk) 16:57, 17 November 2020 (UTC)
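
The lookup described above might look roughly like the following Python sketch. It is not the extension's actual code: the MATCHERS list and the resolve function are hypothetical names, and the two entries reuse the Twitter and Reddit expressions from the examples in this proposal.

```python
import re
from typing import Optional, Tuple

# Hypothetical static list: (property ID, regex with exactly one capturing group).
MATCHERS = [
    ("P2002", r"^https?:\/\/(?:mobile\.)?twitter\.com\/(?:intent\/user\?screen_name\=)?"
              r"(?!hashtag|home|explore|notifications|messages|i)([0-9A-Za-z_]{1,15})"),
    ("P3984", r"^https?:\/\/www\.reddit\.com\/r\/([^\/?#]+)\/"),
]

def resolve(url: str) -> Optional[Tuple[str, str]]:
    """Return (property ID, external ID) for the first matching entry, else None."""
    for prop, pattern in MATCHERS:
        compiled = re.compile(pattern)
        # Enforce the "single capture group" requirement from the motivation.
        if compiled.groups != 1:
            raise ValueError(f"{prop}: expected exactly one capturing group")
        m = compiled.match(url)
        if m:
            return prop, m.group(1)
    return None

print(resolve("https://twitter.com/timberners_lee"))  # ('P2002', 'timberners_lee')
print(resolve("https://www.reddit.com/r/wikidata/"))  # ('P3984', 'wikidata')
```

With a Wikidata property, the MATCHERS list could be loaded from property statements instead of being hard-coded, which is the point of the proposal.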

Discussion

  •   Support--Trade (talk) 19:10, 17 November 2020 (UTC)
  •   Support this would be helpful for programs I am currently working on. BrokenSegue (talk) 20:05, 17 November 2020 (UTC)
  •   Comment Isn't this something that external users can infer from the URL formatter property (plus the regex property if it is also present)? (I think @99of9: might have some interesting thoughts on this proposal.) Mahir256 (talk) 01:54, 18 November 2020 (UTC)
    • @Mahir256: Here is one example where this approach wouldn't work. Let's take Twitter username (P2002):
      1. We convert https://twitter.com/$1 into a regular expression:
        https:\/\/twitter\.com\/$1
      2. Then we replace $1 with [0-9A-Za-z_]{1,15} wrapped in a capture group:
        https:\/\/twitter\.com\/([0-9A-Za-z_]{1,15})
    • Now this looks like a sensible regular expression until you notice that it believes hashtag is a Twitter user. I tried this approach and found that neither formatter URL (P1630) nor format as a regular expression (P1793) is designed or used in a way that yields a useful regular expression by that scheme. If you wish, I can give you more examples. --Shisma (talk) 07:59, 18 November 2020 (UTC)
  •   Comment Entity Explosion (Q98398855) is not that far from what you are trying to do. Thierry Caro (talk) 16:25, 18 November 2020 (UTC)
    • Looking at the source of that extension, it seems to do what Mahir256 suggests, which is known to be buggy and unreliable (but probably works 90% of the time). BrokenSegue (talk) 18:26, 18 November 2020 (UTC)
  • There are similar proposals at Wikidata:Property_proposal/URL_match_pattern. --- Jura 19:27, 18 November 2020 (UTC)
    • Hmmm, that proposal seems better in that it allows multiple matchers. But it seems worse because it can't handle multiple regexes per property. It's disheartening that it has been stuck for months... BrokenSegue (talk) 20:26, 18 November 2020 (UTC)
    • I don't see why it couldn't have multiple regexes. If it lingers there, it's probably because its proposer lost interest. --- Jura 02:09, 19 November 2020 (UTC)
      • @Jura1: Oh, I misunderstood the proposal. Yeah that proposal seems strictly better than this one now. I'll move my support there. BrokenSegue (talk) 03:53, 19 November 2020 (UTC)
        • @BrokenSegue: I think the template on that page should be updated if it is to take into account applies if regular expression matches (P8460), which was created in the meantime. --- Jura 04:55, 19 November 2020 (UTC)
          • @Jura1: Yes I agree. Seems like a trivial change but I don't want to unilaterally alter that submission and @GZWDer: is no longer active it seems. Maybe @Shisma: will alter this proposal to match? BrokenSegue (talk) 15:16, 19 November 2020 (UTC)
            • @BrokenSegue: you may alter this proposal. --Shisma (talk) 18:17, 19 November 2020 (UTC)
            • Sorry, I don't understand how applies if regular expression matches (P8460) relates to this. --Shisma (talk) 18:57, 19 November 2020 (UTC)
              • @Shisma: The proposal is that we make a new property that explains how to use the output of regex capture groups. So this proposal would change to a property that takes the value \1:\3 and has a qualifier applies if regular expression matches (P8460): https:\/\/([a-z0-9\.-]+)\.(wikia|fandom)\.com\/wiki\/(.*). If that regex matches, you take the capture groups and plug them into the value to produce the identifier. BrokenSegue (talk) 19:29, 19 November 2020 (UTC)
                • @BrokenSegue: That would even solve some edge cases 👍. But most properties will be set to \1, right? --Shisma (talk)
                  • @Shisma: yeah that's my understanding. BrokenSegue (talk) 16:49, 20 November 2020 (UTC)
                    • Well, it seems counter-intuitive, but it's actually better. --Shisma (talk) 17:26, 20 November 2020 (UTC)
  • @BrokenSegue: and @Trade: I updated the proposal. Please check if you still support it. Feel free to make changes --Shisma (talk) 17:44, 20 November 2020 (UTC)
    • Sorry for the back and forth, but looking at it now, I actually prefer the initial version. It makes the use case clear even if the format is similar to P8460. Supposedly we could have an optional ID pattern, but I find it dubious to make it the main value, especially as the only use case is a Wikidata property that IMHO shouldn't have been defined that way. The initial version also seems to make it clearer how to include formatting variations. BTW, I think Krbot's Autofixes don't use the leading "/" and seem to add "^" directly. To make a long story short, I will support the one at Wikidata:Property_proposal/URL_match_pattern. --- Jura 12:30, 21 November 2020 (UTC)
      • @Jura1: I'd say the initial version is almost identical (without the replacement pattern) to Wikidata:Property proposal/URL match pattern, isn't it? I don't mind either way and support both proposals, but only one of them should pass. --Shisma (talk) 13:25, 21 November 2020 (UTC)
      • @Jura1: Can you clarify? You're supporting the version of that proposal that uses <replacement value> as a delimiter to stuff two pieces of data into one field? Or one of the modified proposals? BrokenSegue (talk) 17:17, 21 November 2020 (UTC)
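
The Twitter example raised earlier in this discussion can be reproduced with a short Python sketch. It builds a matcher from the formatter URL (P1630) and format as a regular expression (P1793) values quoted in that comment, then compares it with the curated expression from Example 1. The function name naive_matcher and the test URLs are assumptions made for this illustration.

```python
import re

FORMATTER_URL = "https://twitter.com/$1"  # formatter URL (P1630) value quoted above
ID_FORMAT = "[0-9A-Za-z_]{1,15}"          # format as a regular expression (P1793) value quoted above

def naive_matcher(formatter_url: str, id_format: str):
    """Hypothetical helper: escape the formatter URL and put the ID format in a capture group."""
    prefix, suffix = formatter_url.split("$1", 1)
    return re.compile(re.escape(prefix) + "(" + id_format + ")" + re.escape(suffix))

naive = naive_matcher(FORMATTER_URL, ID_FORMAT)

# Curated expression from Example 1 of this proposal.
curated = re.compile(
    r"^https?:\/\/(?:mobile\.)?twitter\.com\/(?:intent\/user\?screen_name\=)?"
    r"(?!hashtag|home|explore|notifications|messages|i)([0-9A-Za-z_]{1,15})"
)

profile_url = "https://twitter.com/timberners_lee"
hashtag_url = "https://twitter.com/hashtag/wikidata"  # illustrative non-profile URL

print(naive.search(profile_url).group(1))    # timberners_lee
print(naive.search(hashtag_url).group(1))    # hashtag  (false positive)
print(curated.search(profile_url).group(1))  # timberners_lee
print(curated.search(hashtag_url))           # None
```

The derived expression has no way to know which path segments are reserved by the site, which is why a hand-written expression with a negative lookahead behaves differently here.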

@BrokenSegue, Jura1: Are there any advantages or disadvantages that one proposal might have over the other? To me it appears like this: --Loominade (talk) 10:22, 23 November 2020 (UTC)

Proposal comparison
Match pattern               | ID pattern
Pro: more intuitive         | Pro: re-uses existing property
Pro: has a default value    |
@Loominade: that is also my understanding. The match pattern one also allows for a "default" case of "\1" but I'm not sure that matters very much. I really don't care which we go with. BrokenSegue (talk) 17:44, 23 November 2020 (UTC)
Actually, I don't like re-using applies if regular expression matches (P8460) for this. The other proposal was marked as ready yesterday. Perhaps I should withdraw. --Shisma (talk) 18:15, 24 November 2020 (UTC)