Wikidata:Property proposal/applies if regular expression matches

applies if regular expression matches id

Originally proposed at Wikidata:Property proposal/Authority control

Description: the statement is only true if the ID matches this regex
Data type: String
Domain: property
Allowed values: valid regular expression with at least one capture group
Example 1
Fandom article ID (P6262)
formatter URL (P1630): https://$2.fandom.com/$1/wiki/$3
if regex: ([\w]+).([\w-]+):([^\s])
Example 2
Fandom article ID (P6262)
formatter URL (P1630): https://$1.fandom.com/wiki/$2
if regex: ([\w-]+):([^\s])
Example 3
Fandom article ID (P6262)
formatter URL (P1630): https://www.fandom.com/index.php?title=w:c:$1
if regex: no value
object has role (P3831) → fallback
Example 4
P6623 (P6623)
formatter URL (P1630): https://$1.gamepedia.com/$2
if regex: ([\w-]+):([^\s])
Example 5
P6623 (P6623)
formatter URL (P1630): https://tools.wmflabs.org/wikidata-externalid-url/?p=6623&id=$1
Example 6
IMDb ID (P345)
formatter URL (P1630): https://www.imdb.com/title/$1/
if regex: tt\d{7,8}
See also
  • formatter URL (P1630): web page URL; URI template from which "$1" can be automatically replaced with the effective property value on items. If the site goes offline, set it to deprecated rank. If the formatter URL changes, add a new statement with preferred rank.
  • third-party formatter URL (P3303): URI template from which "$1" can be automatically replaced with the effective property value on items; for sites other than the primary issuing body of the identifier concerned
  • format as a regular expression (P1793): regex describing an identifier or a Wikidata property. When using on property constraints, ensure syntax is a PCRE

Motivation

Fandom article ID (P6262) and P6623 (P6623) use third-party services to resolve IDs to URLs, but I think this could be done entirely on Wikidata if it were possible to pass multiple variables to the formatter URL (P1630). This is a proposal to do that.

We'd need a qualifier holding a regular expression.

  1. The regex is used to determine which formatter URL to use. It must therefore not match if the supplied ID does not contain the required number of variables.
  2. The regex is also used to extract the variables from the ID and pass them to the formatter URL.
  3. As a fallback, an external resolver may be used if no regex matches the ID. This fallback should be highlighted somehow; for this proposal I chose no value. Formatter URLs whose regular expression matches must be preferred.

--Shisma (talk) 21:00, 16 February 2020 (UTC)
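The three rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Wikibase does or would implement it: the function name `resolve_url` and the list layout are hypothetical, `None` stands in for "no value", and the patterns are adapted from Examples 1 to 3 (the dot is escaped and the last group is given a `+`) so that they can be matched against the whole ID.

```python
import re

# Hypothetical in-memory form of the formatter URL statements from Examples 1-3.
# Each entry pairs a URL template with its regex qualifier; None stands in for
# "no value", i.e. the fallback resolver.
# Assumption: patterns are adapted from the examples (escaped dot, "+" on the
# last group) so that a whole-ID match is possible.
FANDOM_FORMATTERS = [
    ("https://$2.fandom.com/$1/wiki/$3", r"([\w]+)\.([\w-]+):([^\s]+)"),
    ("https://$1.fandom.com/wiki/$2", r"([\w-]+):([^\s]+)"),
    ("https://www.fandom.com/index.php?title=w:c:$1", None),
]


def resolve_url(identifier, formatters):
    """Pick the formatter whose regex matches the whole ID and fill in $1..$n."""
    fallback = None
    for template, pattern in formatters:
        if pattern is None:
            # Rule 3: remember the fallback, but prefer any matching regex.
            fallback = template
            continue
        match = re.fullmatch(pattern, identifier)
        if match:
            # Rules 1 and 2: the regex selects the template, and its capture
            # groups supply the variables.
            return re.sub(
                r"\$(\d+)",
                lambda m: match.group(int(m.group(1))),
                template,
            )
    if fallback is not None:
        # No regex matched: pass the whole ID to the fallback as $1.
        return fallback.replace("$1", identifier)
    return None


print(resolve_url("de.starwars:Yoda", FANDOM_FORMATTERS))
print(resolve_url("starwars:Yoda", FANDOM_FORMATTERS))
```

The sketch does no URL encoding and raises an error if a template refers to a group the regex does not define; a real implementation would have to handle both.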

Discussion

Do you intend to engage the Wikidata developers so this can be supported in the UI, or how otherwise would you envision this to be actually used? ArthurPSmith (talk) 15:31, 18 February 2020 (UTC)
@Shisma: Also maybe relevant - see Phabricator Task T150939 ArthurPSmith (talk) 19:27, 20 February 2020 (UTC)
@ArthurPSmith: It is not clear to me what the link is between this property proposal and the Phabricator task. Does it mean this property can be created even if the Phab task is not fixed? Pamputt (talk) 05:55, 17 June 2020 (UTC)
@Pamputt: I'm not sure the property will be much use without having something like the phab task actually looked at and getting some input from developers on feasibility of this approach. But on the other hand the outline presented here seems well-thought-out, so it would at least provide some input on how the phab task could be done. So I don't have a problem with the property being created soon. ArthurPSmith (talk) 18:38, 17 June 2020 (UTC)
@Shisma, ArthurPSmith, Tinker Bell, Germartin1, Jura1: Done: applies if regular expression matches (P8460) Pamputt (talk) 13:10, 18 July 2020 (UTC)
@Shisma, ArthurPSmith, Tinker Bell, Germartin1, Jura1, Pamputt: :( It would have been good to quickly ping the dev team about this. It doesn't look like we can implement this, for security reasons among others. (We can't just work with arbitrary regular expressions, and the constraints check we have is an exception.) Property:P345#P1630, for example, now makes it considerably harder for third parties to work with the data if there is no preferred formatter statement. What do we do? --Lydia Pintscher (WMDE) (talk) 17:42, 20 July 2020 (UTC)
@Lydia Pintscher (WMDE): I'm not sure I follow the security concern here - do you have an example of a regular expression that could cause a security problem? Can we restrict the types of regular expressions to avoid the problem somehow? ArthurPSmith (talk) 17:53, 20 July 2020 (UTC)
https://www.regular-expressions.info/catastrophic.html We have solved it for the constraint checks by using the query service because it has functionality to prevent this when evaluating a regex. For formatting this will not be possible this way. --Lydia Pintscher (WMDE) (talk) 18:00, 20 July 2020 (UTC)
I think a very limited collection of regexes should be OK though. If we disallow any nesting of groups and any quantifiers on groups (i.e. (..(..)) or (..)* or (..)+ or (..)? are all forbidden), would that still pose a problem? ArthurPSmith (talk) 18:10, 20 July 2020 (UTC)
Maybe also no lazy matching (*?, +? etc) and require the regex to match the entire identifier string (implicit ^ at start and $ at end). That should keep things pretty efficient I think. ArthurPSmith (talk) 18:19, 20 July 2020 (UTC)
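For illustration, the restrictions suggested in the two comments above could be checked with a simple scan of the pattern text. This is a rough sketch under assumptions, not an existing Wikibase or PHP facility: the names `violates_restrictions` and `match_restricted` are hypothetical, `{n,m}` on a group is treated like `*`, `+` and `?`, and passing the check does not prove that a pattern is free of slow backtracking.

```python
import re

# Characters that start a quantifier when they directly follow a group.
# Assumption: "{" (as in "(..){2,3}") is treated like "*", "+" and "?".
QUANTIFIER_START = "*+?{"


def violates_restrictions(pattern):
    """Return a reason string if the pattern breaks a restriction, else None."""
    depth = 0
    i = 0
    n = len(pattern)
    while i < n:
        ch = pattern[i]
        if ch == "\\":
            i += 2  # skip the escaped character
            continue
        if ch == "[":
            # Skip the whole character class; its contents are literals.
            i += 1
            if i < n and pattern[i] == "^":
                i += 1
            if i < n and pattern[i] == "]":
                i += 1  # a leading "]" is a literal member of the class
            while i < n and pattern[i] != "]":
                i += 2 if pattern[i] == "\\" else 1
            i += 1  # step past the closing "]"
            continue
        if ch == "(":
            depth += 1
            if depth > 1:
                return "nested group"
        elif ch == ")":
            depth -= 1
            if i + 1 < n and pattern[i + 1] in QUANTIFIER_START:
                return "quantifier on group"
        elif ch in "*+?}":
            # A "?" right after a quantifier makes it lazy.
            if i + 1 < n and pattern[i + 1] == "?":
                return "lazy quantifier"
        i += 1
    return None


def match_restricted(pattern, identifier):
    """Reject disallowed patterns, then match against the entire identifier."""
    reason = violates_restrictions(pattern)
    if reason is not None:
        raise ValueError(f"pattern rejected: {reason}")
    # fullmatch gives the implicit ^ at the start and $ at the end.
    return re.fullmatch(pattern, identifier)


print(violates_restrictions(r"(a+)+"))
print(violates_restrictions(r"([\w-]+):([^\s]+)"))
```

The scan is deliberately conservative (a literal `}` followed by `?` is also rejected), and patterns such as several adjacent `\w*` runs would still pass it while backtracking heavily.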
@ArthurPSmith: Well, Wikibase can’t just trust that the regexes will be safe, and I’m not convinced it’s easy to detect whether they are or not. (T214378 proposed an even more limited subset of regexes, and that hasn’t gone anywhere, either.) --Lucas Werkmeister (WMDE) (talk) 14:33, 23 July 2020 (UTC)
@Lucas Werkmeister (WMDE): Is there maybe a PHP library available that can handle just the simplest regexes without the exponential growth problem? It seems like a very common issue! ArthurPSmith (talk) 15:15, 23 July 2020 (UTC)
@ArthurPSmith: The closest thing is probably RE2 (Q7299973). More generally, solutions to this problem are the subject of the RFC T240884, but I don’t know when that will move forward, and without it we can’t implement support for this new property. (And, like Lydia says, I would’ve preferred to testify this before the property was created :/ ) --Lucas Werkmeister (WMDE) (talk) 12:24, 24 July 2020 (UTC)
@Lucas Werkmeister (WMDE): Ah, RE2 was exactly the sort of thing I was thinking of, it's unfortunate there's no native PHP support. Would it be helpful to chime in on T240884 that there are other reasons we might want this for Wikidata use? ArthurPSmith (talk) 14:57, 24 July 2020 (UTC)