User talk:Pintoch

Previous discussion was archived at User talk:Pintoch/Archive 1 on 2017-06-21.

SourceMD after Magnus's changes?

Trilotat (talkcontribs)

Apparently Magnus made some edits to the tool. Can you confirm you're satisfied and can restore the tool? Thanks.

Pintoch (talkcontribs)

Hi! I have no idea actually, I am not the one to satisfy! @ArthurPSmith, Sic19: do you expect these changes to solve all the duplicate creation issues you reported?

Trilotat (talkcontribs)

In the interest of full disclosure, my efforts to repair the remaining duplicate DOIs are limited to the journal Geology. I didn't want to appear to be trying to lull you into a false sense of completion.

Sic19 (talkcontribs)

It is a fix for the duplicate DOIs.

The ORCID duplicates will still be created if SourceMD is used as it was before the block. The problem is perhaps more the way SourceMD was being used than the tool itself. We could discuss other options, such as regular maintenance tasks or conditions on the usage of SourceMD, but the problem is not resolved.

ArthurPSmith (talkcontribs)

The problem with duplicate DOIs should be fixed now, yes. Between myself and @Trilotat, I believe all of the bad DOIs have now been corrected and the associated duplicates merged. If SourceMD is currently blocked, then yes, I think it deserves to be put back into play; most of what it did was always perfectly fine, and it's now going to be even better.

Trilotat (talkcontribs)

ArthurPSmith, I worked on fixing GEOLOGY journal's duplicate DOIs only, no other journal. I don't want to misrepresent my efforts.

Pintoch (talkcontribs)

@Sic19: given Arthur's opinion above, would it be appropriate to unblock the tool? I cannot propose to narrow it down to certain users as some edits are done via QuickStatementsBot without mentioning the responsible user in the edit comment (which should probably be fixed, IMHO).

Sic19 (talkcontribs)

Yes, it would be good to get batch mode running again and I have no objections to the block being removed. We need to monitor the ORCID constraint violations for problems.

There are other data quality issues which I believe are mainly SourceMD related. I haven't fully scoped out the problems, but here are some examples with labels containing <sup>, <sub>, and <i>. Another problem is � characters in the title and author name string of SourceMD imports. These are not blocking issues, though.
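
A minimal sketch of how a batch could be screened for the two label problems mentioned above before upload. The helper name `label_problems` and the sample labels are hypothetical; this is not part of SourceMD, and it only looks for the three tags and the U+FFFD replacement character named in this thread.

```python
import re
from typing import List

# Matches opening or closing <sup>, <sub> and <i> tags (the markup named above).
HTML_TAG = re.compile(r"</?(?:sup|sub|i)>", re.IGNORECASE)

# U+FFFD is the character shown when text was decoded with the wrong encoding.
REPLACEMENT_CHAR = "\ufffd"


def label_problems(label: str) -> List[str]:
    """Return the list of problems found in one label (hypothetical helper)."""
    problems = []
    if HTML_TAG.search(label):
        problems.append("html-markup")
    if REPLACEMENT_CHAR in label:
        problems.append("replacement-character")
    return problems


if __name__ == "__main__":
    # Made-up sample labels, for illustration only.
    for sample in ["H<sub>2</sub>O in basalt", "Caf\ufffd study", "Plain title"]:
        print(sample, label_problems(sample))
```

A check like this could be run over the titles and author name strings of a batch, so that flagged rows are fixed by hand before the import.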

How can we encourage SourceMD users to check their batches and fix any problems?

Pintoch (talkcontribs)

The tool is unblocked now.


OpenRefine author/title reconciliation

Jheald (talkcontribs)

Hi Pintoch! I've been doing a bit of author/title matching against VIAF, LoC, and ISNI using a nest of Perl scripts, to try to match authors from the MARC 100 field of the catalogue entry of a book to VIAFs, LCNAFs, ISNIs, and Wikidata items. (For cases where the book does not currently have a Wikidata item, as a step towards creating one).

There are various reconciliators that try to do author matching against these services, eg:

How big a job would it be to create an author/title reconciliator, rather than just an author conciliator?

And also, to extend what these conciliators do, to be able to retrieve foreign IDs from these services (eg LoC IDs from VIAF), in the way that eg the Wikidata conciliator can add columns for the values of Wikidata properties based on a match?

Is there enough support in the community that this could be offered eg as a student project for a Digital Humanities student? Or would writing/adapting an OpenRefine reconciliator be rather too big an ask?

The British Library quite liked the rough samples from my Perl scripts, but they're a bit close to the metal; whereas an OpenRefine reconciliator could be something that anybody could use. What would be your instincts on this?

Pintoch (talkcontribs)

Hi Jheald,

That sounds like a great project! Currently, we badly need a solid implementation of the reconciliation API that can easily be configured for many data sources. Conciliator is designed just for that, but the author seems to be a bit short on time to update it. He has started to implement the data extension API (which is required for the "Add column from reconciled values" operation) but it is not ready for prime time yet. It would definitely be a very nice project for anyone who is not too daunted by Java - the scope should be manageable for a student.

I cannot work on this directly myself at the moment but I would be happy to help anyone if they have trouble finding their way in the current landscape.

RobertAR1995 (talkcontribs)

Thank you for the information; it could be useful for my future contributions.

Pintoch (talkcontribs)

You are welcome! I will delete the item then.

Reply to "Thanks"
Jheald (talkcontribs)

4 !votes in the Salvador Dali discussion, two of them very skeptical, none of the skeptical points addressed.

Don't you think property creation might have been premature here?

Pintoch (talkcontribs)

Hi Jheald,

Sorry if this seems premature to you! Formally speaking, I can only see two votes there, both of which are support votes.

To make your skepticism about the proposal clearer, it would be great if you could use the {{Oppose}}, {{Wait}} or {{Comment}} templates next time - that really helps in assessing the status of a proposal.

The property was marked as ready for creation by Thierry Caro and my creation was based on his assessment, so I suggest you direct your complaints to him too. I personally have no opinion about this particular property, but I suspect it would not get deleted if nominated at WD:PFD (but that should not deter you from trying of course).

It might be worth starting a more general discussion about the use of external ids to link to arbitrary websites - I also have mixed feelings about promoting random URLs to authority control identifiers. But that's beyond the scope of this particular case.

Cheers

Jheald (talkcontribs)

Since Thierry Caro was the property's proposer, it was hardly for him to assess whether it was 'ready'. I don't think this was appropriate. Jheald (talk) 13:22, 11 March 2019 (UTC)

Pintoch (talkcontribs)

Sure! Then I think WD:PFD is the way to go.

Pigsonthewing (talkcontribs)

"Formally speaking, I can only see two votes there"

That's the problem; it's not a poll, and the result is not decided by simply counting votes.

Pintoch (talkcontribs)

Sure. Again I agree that it was premature to mark the proposal as ready, which has misled me into creating it. To prevent this from happening in the future, I will stop creating properties now.

Reply to "Salvador Dali item premature?"
Simon Villeneuve (talkcontribs)

Hi,

I installed OpenRefine on my computer (Windows version) and started exploring it. Unfortunately, despite reading the tutorials and watching a few videos, I can't get it to work.

Example: I produce the .csv file from the following query. I get an "item" column in OpenRefine. I wanted to create a schema, but I can't drag "item" into the "item" box. So I tried reconciling the columns against taxa, but that takes too long. I tried again to reconcile automatically, and that gives me a column with this kind of address: https://www.wikidata.org/wiki/Http://www.wikidata.org/entity/Q160482. Even so, a small green bar appears under my "item" column name in the "Schema" tab. I tried again to drag it into the "enter an item or drop a column" box, and it still doesn't work.

Can you help me, doctor?

Pintoch (talkcontribs)

Hi!

When you have a column of item URIs (typically from a SPARQL query), you do indeed need to reconcile it before you can use it in a schema. There are two ways to do that:

  • reconcile the column normally (in that case it is better not to restrict reconciliation to a particular type, a priori): reconciliation will recognize the URIs and turn them into reconciled cells. But it can indeed take a while.
  • reconcile with the new "Use values as identifiers" operation. Before doing that, though, you need to transform your column so that it contains only QIDs, not URIs. You can use the expression value.split('/')[-1] (to take the last value in the list obtained by splitting the URI at each /). That takes two steps, but it should always be faster than reconciling. I realize it is a bit frustrating to have to do this first transformation, but I have not yet found a clean way to simplify it. I am opening an issue to ask for other opinions: https://github.com/OpenRefine/OpenRefine/issues/1953
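
For readers who want to see what the split expression does outside OpenRefine, here is a minimal Python sketch of the same transformation. The function names `qid_from_uri` and `looks_like_qid` are hypothetical, and the QID check is an added assumption, not something the expression itself performs.

```python
import re


def qid_from_uri(uri: str) -> str:
    """Keep the last segment of the URI, like value.split('/')[-1]."""
    return uri.split("/")[-1]


def looks_like_qid(value: str) -> bool:
    """Assumed sanity check: a Wikidata item ID is 'Q' followed by digits."""
    return re.fullmatch(r"Q[1-9][0-9]*", value) is not None


if __name__ == "__main__":
    uri = "http://www.wikidata.org/entity/Q160482"
    qid = qid_from_uri(uri)
    print(qid, looks_like_qid(qid))
```

Splitting on every `/` and taking the last element works for entity URIs of this shape because the QID is always the final path segment.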

I hope that is clearer. :)

Simon Villeneuve (talkcontribs)

Indeed!

I tried to strip the URI with a regular expression, but it did not work.

Where should I use the expression value.split('/')[-1]?

Pintoch (talkcontribs)

Click the column menu, choose "Edit cells" -> "Transform...", and there you can enter an expression that transforms your values.

Pintoch (talkcontribs)
Reply to "OpenRefine"
Davidpar (talkcontribs)
GZWDer (talkcontribs)

This parameter (used in property proposals) is not used in property documentation (see Module:Property documentation), and should not be added to property talk pages. Extant properties should be classified using instance of (P31) only.

Pintoch (talkcontribs)

Thanks for letting me know! It is not very useful indeed. I will adapt my script accordingly.

Trilotat (talkcontribs)

Bonjour. I suggested a change to P5824 so that the retraction is placed nearer the top of a retracted article. Perhaps it's worth considering. I'm not savvy on how to make the recommendation, so I thought I'd put it here. Regards.

Pintoch (talkcontribs)
Trilotat (talkcontribs)

"Displayed higher up" is what I'm suggesting. I'll go propose it there. Merci.

VIGNERON (talkcontribs)

Hello Pintoch,

I am trying to use OpenRefine (version 3) and it seems to partly work.

More concretely, there is a whole bunch of bus stations with a wrong value for P131. For example, Q56710762 just has located in the administrative territorial entity (P131) = France (Q142). However, the French description contains the INSEE code of the commune. I figured OpenRefine was the perfect tool to fix this.

So I run a SPARQL query that gives me the list of items to fix (bus stations with Q142). I put these items into Clipboard and create a project. I do "reconcile" and "Use values as identifiers" (that works), then "Add columns from reconciled values" on "Dfr" (that works well too), then "Add column based on this column" with the expression "substring(value,length(value)-5)" to keep only the INSEE code (93073, also spot on), and finally I reconcile that column to get Tremblay-en-France (Q242497). This whole part works, and I have to admit it is pretty great!
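
As an illustration of the extraction step only, here is a small Python sketch that keeps the last five characters of a description, which is what the substring expression above does. The function name and the sample description are made up for this example; the real French descriptions on those items may be worded differently, and the sketch assumes the code is always the final five characters.

```python
def insee_code_from_description(description: str) -> str:
    """Keep the last five characters, like substring(value, length(value)-5)."""
    return description[-5:]


if __name__ == "__main__":
    # Hypothetical description text; only the trailing code comes from the thread.
    sample = "arrêt de bus (exemple fictif) 93073"
    print(insee_code_from_description(sample))
```

The extracted code can then be reconciled against communes, as described in the post, to reach the matching item.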

However, although I do see "Tremblay-en-France", it is still 93073 that is stored in the cell when I click "edit" (but maybe that is normal?). And if I do a CSV export, I get "Tremblay-en-France", whereas I would expect the QID Q242497.

Then, when I do "Edit Wikidata Schema", I run into a problem. Drag-and-drop does not work (unlike what File:OR-WD-editing-tutorial-drag-subject.gif shows): however much I drag, it does not drop, so the menu never appears.

Do you have any idea what is going on or what is blocking it?

Pintoch (talkcontribs)

For the first problem (the cell values not changing after reconciliation), that is indeed the expected behaviour. You can configure the export so that reconciled cells are exported with their identifier or with their reconciled name by using the "Custom Tabular Exporter" (in the "Export" menu), which lets you configure all of that.

For the second problem, it is less clear. Is the column underlined in green in the schema tab? In the main view (the one showing the table), does the column have a green bar under its name? If need be, if you can export your project ("Export" -> "Export project") and send it somewhere, I can investigate in more detail.

Nomen ad hoc (talkcontribs)

Hello Pintoch! Thanks for being so quick to create accepted properties after a week of discussion. I would just like to point one thing out: I no longer sign my proposals, since I use the |proposed by= field. Could you please keep notifying me? Or would you prefer that I add a signature anyway?

Pintoch (talkcontribs)

Hi! Yes, the simplest thing is for you to add a signature anyway: it has the advantage of showing when the property was proposed (precisely to assess whether a week has passed). Otherwise, my script currently uses links to talk pages to notify people: as long as you put User talk:Nomen ad hoc somewhere, you will be notified.

Nomen ad hoc (talkcontribs)

Thanks for the clarification! See you soon :)
