User:Salgo60/ExternalIdentifiers

One way to design a system to be a good external identifier in Wikidata

A small attempt to write down a checklist / best practice - Salgo60 (talk) 10:38, 14 November 2020 (UTC)

  1. have persistent unique IDs for container objects such as parishes, countries and places, so that Wikidata can link to them with "same as"
    1. container objects shall have a landing page, and the persistent unique ID should be visible to the UI user; compare Alvin Söderala
    2. all landing pages should be reachable with GET, i.e. you can address the page with a URL and don't need POST; we have that problem with the SCB Regina database, see T200700
  2. to be a good member and reach level 5 (link your data to other data to provide context), the system should show "same as" links to external authorities. A small first step is a "same as" link to the Wikidata Q-number
    1. I hope museums will get better place identifiers. We are trying to connect to Gotlands museum (P7068) and Malmö Museer ID (P8773); what we see is that they have a place but don't state a "same as", so we don't understand which administrative level is meant. E.g. if they say "Söderala" we don't know if it is Söderala (Q2673411), Söderala parish (Q10688474), Söderala church parish (Q10688470) or Söderala (Q21779139)
      1. identifiers for streets. My understanding is that en:Lantmäteriet has no persistent unique ID for objects like streets, see blog
  3. objects should have version history and support for merges by supporting redirects from the old item to the "new" item
    1. this should also be supported by the API; compare Wikidata owl:sameAs and a query for merges
  4. they shall have a SPARQL endpoint and/or JSON access so that we can easily check differences between Wikidata and the external system, see e.g. SKBL, Nobel prize...
    1. good documentation of the API, e.g. using Swagger; see ISOF, JobTech, Nobelprize
  5. have timestamps for created and changed
  6. nice to have
    1. a change API like Wikidata
    2. support for a Query language like WDQS
    3. linking back to Wikipedia pages/Wikidata e.g. VIAF, Swedish Litteraturbanken
  7. deleted items should be easy to find; compare the problems Europeana has with Wikidata items that get deleted
  8. support for more languages
    1. SKBL supports Swedish and English by changing the URL, e.g. Greta Garbo JSON sv / en --> we now support templates in both en:Wikipedia and sv:Wikipedia --> 9 million visitors to those Wikipedia pages this year, sv / en
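
Several of the checklist points above (IDs that are never reused, redirects after merges, timestamps for created and changed, deleted items that stay findable) can be sketched as a small in-memory model. This is only an illustration under stated assumptions: the names `Record`, `Registry`, `merge`, `delete` and `resolve` are hypothetical and do not describe how any of the systems mentioned here are built.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, Optional


def _now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class Record:
    """One object in a hypothetical external system."""
    id: str                            # persistent, never reused
    label: str
    same_as: Optional[str] = None      # e.g. a Wikidata Q-number
    created: datetime = field(default_factory=_now)
    changed: datetime = field(default_factory=_now)
    redirect_to: Optional[str] = None  # set when merged into another record
    deleted: bool = False              # tombstone: record stays findable


class Registry:
    """Hypothetical store that keeps old IDs resolvable forever."""

    def __init__(self) -> None:
        self._records: Dict[str, Record] = {}

    def add(self, record_id: str, label: str,
            same_as: Optional[str] = None) -> Record:
        if record_id in self._records:
            raise ValueError(f"ID {record_id!r} is already used")
        rec = Record(id=record_id, label=label, same_as=same_as)
        self._records[record_id] = rec
        return rec

    def merge(self, old_id: str, new_id: str) -> None:
        """Merge old_id into new_id by leaving a redirect behind."""
        if old_id == new_id:
            raise ValueError("cannot merge a record into itself")
        old = self._records[old_id]
        self._records[new_id]  # raises KeyError if the target is unknown
        old.redirect_to = new_id
        old.changed = _now()

    def delete(self, record_id: str) -> None:
        """Mark as deleted instead of removing, so the ID still resolves."""
        rec = self._records[record_id]
        rec.deleted = True
        rec.changed = _now()

    def resolve(self, record_id: str) -> Record:
        """Follow redirects from an old ID to the current record."""
        rec = self._records[record_id]
        seen = {rec.id}
        while rec.redirect_to is not None:
            rec = self._records[rec.redirect_to]
            if rec.id in seen:
                raise ValueError("redirect loop")
            seen.add(rec.id)
        return rec
```

With such a model, a user who only has the old ID still ends up at the current record, and a deleted record answers with a tombstone instead of "not found".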

Example of possibilities and problems we find

  1. Persistent
    1. Graves at www.svenskagravar.se. Quote svenskagravar: "they are persistent IF we don't reload all graves". They have now reloaded several times --> i.e. the IDs are not persistent
    2. A Swedish site containing local history material, "Sveriges Hembygdsförbund", upgraded to a modern platform and also "upgraded all ids" --> we needed to delete all linked items, see T248875, and start from scratch
    3. Europeana was an external identifier in P727 (P727), but the lesson learned was that it was not persistent, so they implemented a new approach and I created Europeana entity (P7704). The lesson learned, as you see below, is that the new approach has quality problems
  2. Quality: Europeana copied 160 000 items from DBpedia/Wikidata for artists BUT they haven't done the homework of connecting the right objects to the right artists; instead they used text strings and guessed --> bad quality, see T243764
  3. Error reporting: when connecting two domains you find problems/errors, e.g. Wikidata indicates many duplicates in the Uppsala University Alvin database, but we have no easy way to report errors, and they don't use Wikidata / Phabricator where we track issues; see list duplicates or Task T243764
  4. Uniqueness - the Swedish National Archives has NAD, i.e. IDs for archives. We have reported that they are not unique, and now we see some redesign using en:GUID to fix this; see also Task T200046
    1. disambiguation pages: if a namespace gets several items that can be described with the same name, create disambiguation pages. In the new design of NAD from the Swedish National Archives it looks like they skip this, which from a user perspective is a nightmare if you just have the old ID
  5. Lack of a helpdesk where we get a unique helpdesk ID when we ask a question / report an issue. We have this problem with the Swedish National Archives, SCB, ISOF, "Lantmäteriet" .... Swedish "Naturvårdsverket" has unique numbers but no easy way to see the status, e.g. 2018NV38321
    1. workarounds
      1. be active on Wikipedia ==> then we can ping them, discuss issues, agree on how we solve things and get feedback on errors in Wikipedia/Wikidata
      2. GitHub: Litteraturbanken and SKBL are active on GitHub and have issue trackers we use
        1. Litteraturbanken spraakbanken/littb-frontend/issues
        2. SKBL spraakbanken/skbl-portal/issues
      3. Create tasks in Phabricator.wikimedia.org see my backlog
  6. Easy way to ask questions and see what questions others have asked. A good example is Libris; most other institutions don't have this
  7. Easy way to subscribe to an issue and get a notification when it's moved to production. We have this in Phabricator, used by Wikidata; also see change stream
  8. Data round trip: as we now support linked data on pictures, it's getting more and more important to have a data round-trip approach, i.e. changes in Wikidata need to be tracked and taken care of in both systems so that we can keep both systems in sync. Today we try to fix that ad hoc, but it would be better if we agreed on a "framework"/"model". Examples of what we do today:
    1. JSON and structured data
      1. Nobelprize.org Notebook
      2. Swedish female biographies Notebook
      3. The Swedish Literature Bank Notebook
    2. Web pages with no API: we web-scrape and compare with Wikidata
      1. Swedish National Archive SBL Notebook
      2. Graves Uppsala Notebook
      3. Swedish Academy Notebook
    3. WikiTree, a genealogy site with 22 million profiles and 180 000 connections to Wikidata via WikiTree person ID (P2949)
      1. they regularly check the quality of the family tree against > 250 rules, where Wikidata accounts for a number of the checks; see Data doctors report
  9. GET/PUT: we need an easy way of linking using a URL. E.g. SCB Regina is designed so that a record can only be accessed using POST, which doesn't work with Wikipedia; see T200700
  10. Clean URLs not using redirects: in a perfect world everything is linked data and Web 2.0, and data is presented as data. As a workaround to use the power of Wikidata, an Australian researcher has created d:Wikidata:Entity_Explosion --> we can get old platforms like SBL, LibrisXL... to use Wikidata for finding "same as". If we install this web browser extension (see video) we get the magic of Wikidata, and we also see how we get problems with e.g. Alvin, which has a redirect and a rather "noisy" URL
  11. Active agile product management and an easy way to discuss / get updated on changes (see video about agile product owner). In Wikidata we have
    1. Prioritized open backlog: everyone can register and ask questions / subscribe
    2. Weekly status updates Wikidata:Status_updates/2020_11_16 / all
    3. Telegram groups Wikidata and Wikidata Sweden.....
    4. Project chats Wikidata:Project_chat / Wikidata Swedish plus on all pages e.g. property Dictionary of Swedish National Biography ID (P3217) you have Property_talk:P3217
    5. A meeting every second year that is available online, e.g. Wikidata:WikidataCon_2019/Program, example featured talks 2017 / 2019
    6. We also have more research-oriented meetings like wikidataworkshop 2 Nov 2020, keynote
      1. Research papers about Wikidata
  12. Missing vision statements and sharing of your future development. Example: we try to connect to the Europeana network and we see a lack of quality; it would be of great help if they shared the next step they will take. Without information it looks like they have given up. We have the same "challenge" with the Swedish Riksdagen (blog/video), where we have no understanding of the vision for classification, nor of small things such as whether they will support who is the substitute for a position. Today we have heard they are moving in the direction of using Eurovoc, and we need to read documents to find who is the substitute for a specific position. Is that the vision?
    1. Public prioritized backlog: the best pattern for success is to have a prioritized backlog open for questions and subscription; see the usage of Phabricator for the Wikidata project - video about active product management
    2. Epics: share your epics, for example Wikidata's
      1. Improve Search Suggestions with NLP
      2. Growth: Newcomer tasks 1.0
      3. Better support for References in Content Translation
      4. Structured data backlog
      5. Feedback processes and tools for data-providers
  13. good tools for measuring uptime of the service and its usage; compare Wikidata Grafana Dashboard
    1. tools for measuring Wiki pageviews, e.g. article Greta Garbo, sv:Wikipedia articles linking Svenskt kvinnobiografiskt lexikon, same for en:Wikipedia
  14. New technologies need new skills
    1. The Swedish LIBRISXL project started in 2012 to build a linked data library system, see video
      1. In 2019 they reported that they see no gains from linked data, see report "Leaving Comfort Behind: a National Union Catalogue Transition to Linked Data"
        1. I have tried to ask questions about why they have odd "same as" links or whether they have a vision for keywords of books, and my feeling is that they haven't educated librarians about linked data. I think a better approach may be to hire people with semantic knowledge / knowledge engineers; skilled data scientists may also be needed...
        2. Now in 2020 it looks like they are stopping the project and closing down forum tools
    2. Europeana started with linked data in 2012 and today, in 2021, has big problems with metadata quality
      1. The lesson learned is that quality linked data should be created at the source. Europeana tried the approach of guessing "same as", see blogpost --> even if it's free, the quality is so bad that en:Wikipedia doesn't link to them, see T243764.
        1. The interesting approach they took was adding Wikidata Q-numbers to the artists in Europeana, but they have very big problems with the challenge of connecting the right object with the correct artist.... The root cause is that they lack en:Semantic interoperability and move things as "strings not things"
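
The data round trip described in this list comes down to comparing two sets of "same as" links: what the external system states and what Wikidata states. Below is a minimal sketch of such a comparison, assuming both sides have already been fetched into plain dictionaries keyed by the external ID (for instance from a JSON API on one side and a SPARQL query on the other). The function name `compare_links` and the report keys are hypothetical, chosen only for this illustration.

```python
from typing import Dict, List, Optional


def compare_links(
    external: Dict[str, Optional[str]],
    wikidata: Dict[str, str],
) -> Dict[str, List[str]]:
    """Compare "same as" links keyed by external ID.

    external: external ID -> Q-number stated by the external system,
              or None if the record has no "same as" link.
    wikidata: external ID -> Q-number of the Wikidata item that
              carries this external ID.
    """
    report: Dict[str, List[str]] = {
        "missing_in_wikidata": [],  # external links to Wikidata, Wikidata does not link back
        "missing_in_external": [],  # Wikidata links, external has no "same as" or no record
        "mismatch": [],             # both link, but to different items
        "unlinked": [],             # external record without a link on either side
    }
    for ext_id in sorted(set(external) | set(wikidata)):
        ext_q = external.get(ext_id)
        wd_q = wikidata.get(ext_id)
        if ext_q is None and wd_q is None:
            report["unlinked"].append(ext_id)
        elif wd_q is None:
            report["missing_in_wikidata"].append(ext_id)
        elif ext_q is None:
            report["missing_in_external"].append(ext_id)
        elif ext_q != wd_q:
            report["mismatch"].append(ext_id)
    return report
```

Because the comparison works on identifiers ("things") rather than on names ("strings"), an entry under `mismatch` is a concrete error that can be reported to either side, which is the kind of feedback loop that guessing "same as" from text strings cannot give.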