Wikidata:Requests for permissions/Bot/Descriptioncreator

The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.

Not done. No movement on this task for a month; if you want to come back to it, feel free to reopen the discussion. ·addshore· ^{talk to me!} 10:30, 9 August 2013 (UTC)

Descriptioncreator

Descriptioncreator (talk • contribs • new items • new lexemes • SUL • Block log • User rights log • User rights • xtools)
Operator: Noimnotabot (talk • contribs • logs)

Task/s: Adding descriptions to items.

Function details: --Noimnotabot (talk) 13:44, 7 June 2013 (UTC) Most Wikipedia articles start with a line of the form "[Topic] is/was a/an [description]." I am planning to create a bot that parses those descriptions from Wikipedia and adds them to Wikidata items that don't have a description yet (provided they parse and meet the criteria). Some Wikipedia descriptions may be too long to use as Wikidata descriptions; those won't be added.

Which languages do you plan to support? And maybe the bot's username should contain the word 'bot'. Would you also like to do some test edits? -- Bene* ^t_a^l_k 14:45, 7 June 2013 (UTC)

I plan to support English first; I may add support for other languages later if it proves useful. Should I make another account with 'bot' in its name?--Noimnotabot (talk) 14:52, 7 June 2013 (UTC)

Also, could you please provide us with some examples of descriptions you'd add? How will you be figuring out what portion of the description to copy? Will you be building in any filters to avoid leads that may have been vandalized? — PinkAmpers&^{(Je vous invite à me parler)} 15:38, 7 June 2013 (UTC)

For example from the beginning of the article Karaoke: "Karaoke (カラオケ?, bimoraic clipped compound of Japanese kara 空 "empty" and ōkesutora オーケストラ "orchestra")[1] (/ˌkæriˈoʊki/ or /ˌkærəˈoʊki/; Japanese: [kaɽaoke] ( listen)) is a form of interactive entertainment or video game in which amateur singers sing along with recorded music (a music video) using a microphone and public address system." it will parse the description "form of interactive entertainment or video game in which amateur singers sing along with recorded music (a music video) using a microphone and public address system". I have code for that now (basically a regex). It does not filter yet. Can you elaborate on that point? Is there a bot that does some kind of filtering? I think it may be required to manually check all the descriptions to be sure.

As said in Help:Description, "descriptions should be long enough to allow people to easily grasp what the entry's label refers to, and no longer than that". So your example is definitely too long. -- Bene* ^t_a^l_k 17:36, 7 June 2013 (UTC)

OK, what maximum would you suggest? It would still be nice even if only 1% of the Wikipedia descriptions could be converted.--Noimnotabot (talk) 19:04, 7 June 2013 (UTC)

Two things: First of all, yes, as Bene* said, descriptions must be short. This is also something of a legal issue, since Wikidata items are in the public domain, meaning that copying substantial chunks of Wikipedia articles can constitute a copyright violation. Secondly, umm, well, you'd get some false positives, but you could just try filtering out common swear words and insults. — PinkAmpers&^{(Je vous invite à me parler)} 21:27, 7 June 2013 (UTC)

Oh, thank you. I didn't know about the license issue or about the choice of CC0 for Wikidata. Maybe the idea is not that good after all? Or do you think that using, for example, a maximum of 8 words per article would not be a copyright violation? I guess some filtering could be used, yes, like the first ClueBot.--Noimnotabot (talk) 22:48, 7 June 2013 (UTC)
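A minimal sketch of the two checks raised above: a cap on the number of words and a crude swear-word filter. The cap of 8 words is taken from the question above and is not an agreed limit. The names `acceptable`, `MAX_WORDS`, and `BLOCKLIST` are hypothetical, and the blocklist entries are placeholders for a real word list.

```python
# Hypothetical cap, taken from the question above; not an agreed limit.
MAX_WORDS = 8
# Placeholder entries; a real filter would need a much longer list.
BLOCKLIST = frozenset(["stupid", "idiot"])


def acceptable(description, max_words=MAX_WORDS, blocklist=BLOCKLIST):
    """Return True if the description is non-empty, short, and has no blocklisted word."""
    words = description.split()
    if not words or len(words) > max_words:
        return False
    # Strip surrounding punctuation so "stupid," still matches "stupid".
    normalized = set(w.strip(".,;:!?()\"'").lower() for w in words)
    return normalized.isdisjoint(blocklist)
```

With these settings, `acceptable("third planet from the Sun")` passes, while the Karaoke description quoted earlier is rejected for length. As noted above, a word filter like this will produce false positives.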

Descriptions provide disambiguation, which in many cases is sorely needed, but I'm a bit skeptical of the approach. If you want to write the code in less than six months and you want to keep a sufficiently low level of error as well as reasonably short descriptions, you'll probably end up throwing away 99% of items as inconclusive or impossible to parse satisfactorily. You might be able to remedy the problem by focusing on specific categories of articles and using basic infobox data. If you start in en:Category:Italian politicians, chances are you'll find Italian politicians. The problem is that you don't necessarily want to describe them simply as Italian politicians because they could be much more famous for something else. But if you also find the politician infobox and you see that the first sentence is simply "Michele Coppino (1 April 1822 - 25 April 1901) was an Italian politician." (note the absence of additional occupations in the sentence) then you can quite confidently write "Italian politician" as the description. It might be a little too conservative but it's not too hard to code and it would probably still get a lot of work done. Pichpich (talk) 04:45, 8 June 2013 (UTC)

I don't think there's anything wrong with having a vaguer-than-ideal description, if the alternative is no description at all. The only issue is finding the best way to parse the articles. — PinkAmpers&^{(Je vous invite à me parler)} 16:29, 8 June 2013 (UTC)

If you're not careful, you might end up describing someone who was mayor of his village for 3 months and sold millions of records as simply "Italian politician" instead of "Italian singer" or "Italian singer and politician". The latter two would be ok but the first is not just vague, it's completely misleading. Pichpich (talk) 23:08, 8 June 2013 (UTC)

However, I think this case would be extremely rare and we could live with those few mistakes. Or are there some more frequent examples? -- Bene* ^t_a^l_k 10:03, 9 June 2013 (UTC)

I can only provide anecdotal evidence but I found a lot of cases of musicians (typically singers) that were described simply as, say, "American actress" despite the fact that their acting career was very limited. I can't remember the most egregious examples but for instance, KLBot2 added the Spanish label "actriz estadounidense" to Dolly Parton. This isn't horrible: Dolly Parton actually had a significant acting career. Nevertheless, that description is a failure since Parton's singing career is much more important (she's arguably the most successful female country singer of all time). I'm not sure how KLBot2 generated that description but it might be instructive to see how it derived it. Sometimes, it's important to list two (or more) occupations in the description. For instance, it's reasonable to assume that many people are only aware of one half of Sonny Bono's career so a description that only mentions one should be considered a failure and possibly worse than no description. Pichpich (talk) 15:31, 9 June 2013 (UTC)
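The conservative check Pichpich describes above could be sketched roughly as follows: accept a description only when the first sentence has the bare form "Name (dates) is/was a/an X." and X equals the description expected from the category or infobox. The names `SIMPLE_LEAD` and `simple_description` are hypothetical, and the expected description is assumed to come from some earlier category or infobox step that is not shown here.

```python
import re

# Bare lead sentence: name, optional parenthesised dates, is/was, article,
# then a description with no comma, semicolon, or parenthesis before the full stop.
SIMPLE_LEAD = re.compile(
    r"^[^()]+?(?:\s*\([^()]*\))?\s+(?:is|was)\s+(?:an?|the)\s+([^,.;()]+)\.$"
)


def simple_description(first_sentence, expected):
    """Return the description only if the sentence says exactly `expected` and nothing more."""
    m = SIMPLE_LEAD.match(first_sentence.strip())
    if m is None:
        return None
    description = m.group(1).strip()
    # Extra occupations ("singer and politician") make the text differ from
    # `expected`, so such items are skipped rather than mislabelled.
    if description.lower() != expected.lower():
        return None
    return description
```

For the Michele Coppino sentence quoted above and the expected value "Italian politician", this returns "Italian politician". A sentence that lists several occupations returns None, so cases like Sonny Bono are left without a description rather than given a partial one.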

Comment I think the best way for us to evaluate the bot request would be if you could give us an example list like:

*[[Q###]] - description
*[[Q###]] - description

of maybe 100 or so items. This would give you an opportunity to work on the regex, and let us see what kind of descriptions would come out, rather than an abstract example. Thanks! Legoktm (talk) 07:11, 9 June 2013 (UTC)
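A list in the format requested above could be produced with a small helper along these lines. The function name `format_review_list` is hypothetical, and the input is assumed to be (item number, description) pairs collected by the bot.

```python
def format_review_list(pairs):
    """Render (item number, description) pairs as wikitext list lines."""
    return "\n".join(
        "*[[Q{0}]] - {1}".format(qid, description) for qid, description in pairs
    )
```

The result can be pasted straight into the discussion page for review.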

You might also publish the regex you'd use so that the community can improve it together. -- Bene* ^t_a^l_k 10:03, 9 June 2013 (UTC)

I did some work on the parser. I now use a somewhat customized mediawiki-parser to convert wikitext to plain text. Thank you for all the suggestions. This is the result that I get now (without filtering or limiting the number of words):

[[wikidata:Q1]] prevailing cosmological model that describes the early development of the Universe
[[wikidata:Q2]] third planet from the Sun
[[wikidata:Q3]] characteristic that distinguishes objects that have signaling and self-sustaining processes from those that do not
[[wikidata:Q4]] permanent cessation of all biological functions that sustain a particular living organism
[[wikidata:Q5]] discipline of anthropology
[[wikidata:Q8]] well-known symbol of happiness*Happiness* is a mental or emotional state of well-being characterized by positive or pleasant emotions ranging from contentment to intense joy
[[wikidata:Q13]] superstition and related to a specific fear of Friday the 13th
[[wikidata:Q15]] world\'s second-largest and second-most-populous continent
[[wikidata:Q16]] crowned lion holding a red maple leaf
[[wikidata:Q18]] continent located in the Western Hemisphere
[[wikidata:Q19]] getting of reward for ability by dishonest means
[[wikidata:Q21]] country that is part of the United Kingdom
[[wikidata:Q23]] first President of the United States (1789\u20131797)
[[wikidata:Q24]]   fictional character and the lead protagonist of the Fox television series _24_
[[wikidata:Q25]] country that is part of the United Kingdom and the island of Great Britain
[[wikidata:Q28]] landlocked country in Central Europe
[[wikidata:Q42]] English writer
[[wikidata:Q44]] alcoholic beverage produced by the saccharification of starch and fermentation of the resulting sugar
[[wikidata:Q46]] world\'s second-smallest continent by surface area
[[wikidata:Q48]] world\'s largest and most populous continent
[[wikidata:Q49]] continent wholly within the Northern Hemisphere and almost wholly within the Western Hemisphere
[[wikidata:Q51]] fifth-largest continent in area after Asia
[[wikidata:Q52]] collaboratively edited
[[wikidata:Q53]] caffeinated carbonated mate-extract beverage made by the Loscher Brewery (_Brauerei Loscher_) near M\xfcnchsteinach
[[wikidata:Q54]] Engrish (broken English) phrase that became an Internet phenomenon or meme
[[wikidata:Q56]] image combining a photograph of a cat with text intended to contribute humour
[[wikidata:Q57]] 1987 song by British singer Rick Astley
[[wikidata:Q58]] general term for the organs with which male and hermaphrodite animals introduce sperm into receptive females during copulation
[[wikidata:Q59]] server-side scripting language designed for web development but also used as a general-purpose programming language
[[wikidata:Q61]] capital of the United States
[[wikidata:Q64]] capital city of Germany and one of the 16 states of Germany
[[wikidata:Q66]] American multinational aerospace and defense corporation
[[wikidata:Q67]] aircraft manufacturing subsidiary of EADS
[[wikidata:Q68]] general purpose device that can be programmed to carry out a finite set of arithmetic or logical operations
[[wikidata:Q69]] municipality in the district of Del\xe9mont in the canton of Jura in Switzerland
[[wikidata:Q70]] _Bundesstadt_ (federal city
[[wikidata:Q71]] second most populous city in Switzerland (after Zurich) and is the most populous city of Romandy
[[wikidata:Q72]] largest city in Switzerland and the capital of the canton of Zurich
[[wikidata:Q73]] protocol for live interactive Internet text messaging (chat) or synchronous conferencing
[[wikidata:Q74]] village in the East Riding of Yorkshire
[[wikidata:Q75]] global system of interconnected computer networks that use the standard Internet protocol suite (_TCP/IP_) to serve several billion users worldwide
[[wikidata:Q76]] 44th and current President of the United States
[[wikidata:Q77]] country in the southeastern part of South America
[[wikidata:Q78]] third largest in Switzerland with a population of 500
[[wikidata:Q80]] British computer scientist
[[wikidata:Q81]] root vegetable
[[wikidata:Q82]] software built into the operating systems that use them
[[wikidata:Q83]] free wiki software application
[[wikidata:Q85]] capital of Egypt and the largest city in the Arab world and Africa
[[wikidata:Q86]] non-specific symptom
[[wikidata:Q88]] independent city in the Commonwealth of Virginia
[[wikidata:Q89]]  pomaceous fruit of the apple tree
[[wikidata:Q90]] capital and most populous city of France
[[wikidata:Q91]] 16th President of the United States
[[wikidata:Q94]] Linux-based operating system<ref name="AndroidOverview"'
[[wikidata:Q95]] American multinational corporation specializing in Internet-related services and products
[[wikidata:Q98]] largest of the Earth\'s oceanic divisions

And this is the main code in use now:

import pywikibot
import re
import time
import mediawiki_parser

templates = {}
namespaces = {}
interwiki = {}

from mediawiki_parser.preprocessor import make_parser
preprocessor = make_parser(templates)

from mediawiki_parser.text import make_parser
parser = make_parser(interwiki, namespaces)


site = pywikibot.Site("en", "wikipedia")
q = 1
qend = 100
def stripNewLine(article):
    return article.replace('\\n', '')

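def parseDescription(article):
    # Match "<anything> is/was a/an/the <description>" from the start of the text.
    # Group 9 is the description: everything up to the first comma or full stop.
    m = re.match( r"((?!is)|(?!was).)* ((is)|(was)) ((a)|(an)|(the)) ([^,.]*)(.*)$", article, re.UNICODE)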
    if m:
        return m.group(9)
    else:
        return False
for i in range(q,qend):
    data = pywikibot.DataPage(i)
    if data.exists():
        dictionary = data.get()
        page = pywikibot.Page(site, dictionary['links']['enwiki'])
        if page.exists():
            source = page.get()[:5000]
            try:
                preprocessed_text = preprocessor.parse(source)
                output = stripNewLine(str(parser.parse(preprocessed_text.leaves())))
                m = parseDescription(output)
                if m:
                    print str(data) + " " + m
            except Exception:
                # Skip pages that the parser cannot handle.
                pass

--Noimnotabot (talk) 22:45, 9 June 2013 (UTC)

Quick review: Can you strip out <ref/> tags? Also maybe strip the underscores? Legoktm (talk) 20:24, 20 June 2013 (UTC)

Or, for that matter, all tags? I don't think we should have tags (or even templates) in descriptions. Also, I advise wrapping the dictionary['links']['enwiki'] lookup in a try block that catches KeyError and continues in such a case, since not all items have a sitelink for enwiki. Hazard-SJ ✈ 20:56, 22 June 2013 (UTC)
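The two review points above (stripping tags and underscores, and tolerating a missing enwiki sitelink) might look something like the sketch below. The names `clean_description` and `enwiki_title` are hypothetical. The item is assumed to be a plain dictionary shaped like the `dictionary` in the posted code, and the underscores are assumed to be italic markers left by the text renderer.

```python
import re


def clean_description(text):
    """Remove <ref> tags, other HTML-like tags, and italic underscores."""
    # Self-closing references such as <ref name="x" />.
    text = re.sub(r"<ref\b[^>]*/>", "", text, flags=re.IGNORECASE)
    # Paired references, dropped together with their content.
    text = re.sub(r"<ref\b[^>]*>.*?</ref\s*>", "", text, flags=re.DOTALL | re.IGNORECASE)
    # An unterminated <ref ... fragment left at the end by truncation.
    text = re.sub(r"<ref\b[^>]*$", "", text, flags=re.IGNORECASE)
    # Any other complete tag; the text between tags is kept.
    text = re.sub(r"</?[a-zA-Z][^>]*>", "", text)
    # _word_ italic markers, but not underscores inside identifiers.
    text = re.sub(r"(?<!\w)_([^_]+)_(?!\w)", r"\1", text)
    # Collapse runs of whitespace left behind by the removals.
    return " ".join(text.split())


def enwiki_title(item):
    """Return the enwiki sitelink of an item dictionary, or None if it has none."""
    try:
        return item["links"]["enwiki"]
    except KeyError:
        return None
```

In the main loop, an item for which `enwiki_title` returns None would simply be skipped with `continue`. Templates are not handled by this sketch.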

Also, is there any reason why your bot's username doesn't comply with the proposal? Hazard-SJ ✈ 04:18, 28 June 2013 (UTC)

I could do a rename to 'DescriptioncreatorBot' if it is really necessary. Vogone talk 21:21, 7 July 2013 (UTC)

The above discussion is preserved as an archive. Please do not modify it. Subsequent comments should be made in a new section.