Wikidata:Report a technical problem/WDQS and Search/Archive/2023

This is a subindex of archived discussions from Wikidata:Report a technical problem/WDQS and Search. To place a page in Category:WDQS and Search archive, place {{Archive|category=WDQS and Search archive}} at the top of the page.

Moving this conversation over as it pertains to special:search. - Mohammed Sadat (WMDE) (talk) 16:31, 20 October 2022 (UTC)[reply]

Can I change the related settings myself, or should I ask the sysops?

I searched for some Korean words on Wikidata, but many results only include matches separated by spaces, as in many European languages. Korean words are not delimited by spacing alone, and the same is true of other East Asian languages such as Chinese and Japanese. For example, searching for "北京" (Beijing) will also bring up "北京站" (Beijing station), "北京大学" (Beijing university), etc.

But the current Wikidata search system works very differently from a typical Korean search system. For example, I want to search for "제넷", which means "genet", but many of the relevant names are written without spaces: "유럽제넷" (common genet), "케이프제넷" (Cape genet), "서발린제넷" (servaline genet), etc. When I search for "제넷", the search engine only brings back entries where it appears as a separate, spaced word, like "제넷 골드스타인" (Jenette Goldstein).

This makes searching in Korean very difficult. Another problem: to search effectively, you have to try the word with each postposition attached. If I just search for "포유류" (mammals), I don't get many results. Korean is an agglutinative language, so in this kind of environment people have to attach many different postpositions to a word to search properly, like "포유류의" (of mammals), "포유류에" (in mammals), "포유류에게" (to mammals), etc.

This search system is very inefficient for Korean. I would like Korean search to work the same way as Chinese and Japanese, but I can't find the related settings. Other languages that behave like Korean could benefit from the same change. I want to make Wikidata a more convenient site for future users. Thank you for reading this long post! ―파란여우 (BlueFox) (토론 (talk)) 13:02, 15 October 2022 (UTC)

The same thing happens in Japanese. For example, if you search for "ウルシ", "ヤマウルシ" and "ツタウルシ" will not appear in the search results, presumably because there is no space between the words. I don't understand why similar-looking words like ウィリアム・ウールジー show up, while words that clearly contain "ウルシ" are not found.--Afaz (talk) 03:39, 28 October 2022 (UTC)[reply]
Thank you for your input. It seems that whether a language separates words with spaces changes how search behaves. I suspect other languages written without spaces run into similar problems, and I hope this changes as soon as possible. ―파란여우 (BlueFox) (토론 (talk)) 06:12, 6 November 2022 (UTC)[reply]
I work on the Wikimedia Search Platform team and I have some answers and explanations to share, but not any great solutions at the moment.
The short answer is that right now you can somewhat improve search in Korean by choosing Korean as your interface language. However, only Korean descriptions are processed in Korean. Because of an error, Korean labels are not processed in Korean (all labels are processed in English at the moment). We are working on fixing that, which should improve search in Korean and other languages.
The (much, much) longer answer is below.
Handling multilingual data in search is a complex problem. Identifying languages is hard, in general, especially on short strings like most search queries (even though some writing systems, like Korean, are less ambiguous than others). On Wikidata, your query is processed according to your current interface language, which you can change either in your preferences, with the language selector at the top of each page, or temporarily with the "uselang" URL parameter (e.g., adding "&uselang=ko" to the end of most search URLs). "uselang" is useful for sharing links!
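For example, a search URL like the one below (a rough sketch; the search term is just the illustration used in the next paragraph) forces Korean-language processing regardless of your own preferences, which makes it handy for sharing:
https://www.wikidata.org/wiki/Special:Search?search=포유류에&uselang=ko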
You can see the effect of language selection by searching for 포유류에 ("in mammals"). If your interface language is English you get 7 results. If your interface language is Korean you get 129 results. The top results do match the search string 포유류에 exactly, as exact matches are usually weighted more heavily, but plenty of others match on just 포유 (that's the search term that's left after processing 포유류에 in Korean, much like in English, where the search in mammals would be reduced to the search term mammal).
Right now, only the text in the language-specific description is processed in that language. Language-specific labels should be processed in their language, but are not because of a configuration error. Thanks to your question here, we will add that to our tasks to work on.
Not having labels processed in the right language is a big problem, because descriptions usually don't have the name of the item in them. So you may find "mammals" or "birds" in Korean because those words are in the description, but not a specific mammal or bird because its name is not in the description—it's in the label.
As for the difference between Chinese, Japanese, and Korean searching: if you have, say, English as your interface language, queries in those languages will be processed as English text. The key step is tokenizing a query (or description or other text to be searched). Tokenization breaks up the text into tokens—which are usually words, but not always. We use Elasticsearch (which is in turn based on Lucene) as our search engine for on-wiki search, and the English processing uses the so-called "standard" tokenizer. It is good at breaking up English text and text from many European languages, but it is inconsistent in how it treats East Asian languages (and none of the options are great).
For Chinese (ideographic characters), the standard tokenizer breaks the text into individual characters, so 维基百科 ("Wikipedia") is broken into four tokens: 维, 基, 百, 科. That's good and bad, because it certainly will match 维基百科 in a longer string without spaces, but it will also match the four characters in a different order, or unconnected and spread out through a larger text (though closer together matches are preferred).
For Japanese and Korean, the standard tokenizer breaks on spaces, punctuation, and other places where it would break English. So, longer sentences or phrases without spaces are incorrectly treated as a single long word. On English Wikipedia, this isn't too bad because there isn't a lot of text in Japanese and Korean, and it tends to be names or short phrases. But it's clearly not great for Wikidata.
If you switch your interface language to Chinese or Korean, there are language-specific tokenizers that can fairly accurately break text into words (they aren't perfect, but they are usually better than breaking everything into individual characters or treating a sentence as one long word). There is also additional processing, like removing stop words (usually function words that don't have much meaning; in English they include the, a, an, to, in, of, is, etc.), or converting between Traditional and Simplified Chinese.
In Japanese, things are not as good. Text is broken into overlapping bigrams by the "CJK" language analyzer ("Chinese/Japanese/Korean"). So, ウィキペディア ("Wikipedia") is tokenized as ウィ, ィキ, キペ, ペデ, ディ, ィア. This is a compromise between individual characters (which find too much) and all-one-word tokens (which find too little). The main drawback is that it creates bigrams across word boundaries, which are spurious. We used to use the same approach for Chinese and Korean—and I know that some Chinese searchers would add word boundaries to their queries to avoid the spurious cross-word bigrams. We've since found better tokenizers (and other components) for those languages. Next month I'm going to investigate something better for Japanese, too. If that goes well it will be rolled out to Japanese-language wikis and eventually to Wikidata, too. —TJones (WMF) (talk) 23:22, 16 November 2022 (UTC)[reply]
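To make the tokenizer behavior described above concrete, here is a rough sketch using Elasticsearch's _analyze API (this assumes a local Elasticsearch instance at localhost:9200 to experiment with; it is not something you can run against the Wikimedia search cluster):

# standard tokenizer: each ideographic character becomes its own token
curl -s -H 'Content-Type: application/json' 'http://localhost:9200/_analyze' \
  -d '{"analyzer": "standard", "text": "维基百科"}'
# → tokens: 维, 基, 百, 科

# CJK analyzer: overlapping bigrams
curl -s -H 'Content-Type: application/json' 'http://localhost:9200/_analyze' \
  -d '{"analyzer": "cjk", "text": "ウィキペディア"}'
# → tokens: ウィ, ィキ, キペ, ペデ, ディ, ィア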

@AquAFox, Afaz: The changes to reindex labels have been made, and I've verified that the Korean examples above return some of the previously missing results (if your interface language is Korean). Some Japanese examples should be better, too, particularly when you are looking for a substring of a longer label. So, individual words and short phrases should work better. More complex queries may not work as well because we are still using the bigrams, as described above. (The work on better tokenization for Japanese has been delayed.) TJones (WMF) (talk) 16:52, 29 March 2023 (UTC)[reply]

command line tool for big SPARQL queries

On Linux I've been using curl -s -H "Accept: text/csv" -G 'https://query.wikidata.org/sparql' --data-urlencode query="$(< overview.rq)" for some time to get my data out. I've now made my query more complicated, so it's now 76 lines and 5,000 characters, and I'm getting "414 Request-URI Too Large" responses. I tried the wdq command line tool, but it didn't like anything but the most trivial queries, and wikidata-dl returned IDs, not a table. How can I run my query from the command line and get a CSV file? Vicarage (talk) 13:41, 1 January 2023 (UTC)[reply]

Apparently most servers have an 8 kiB limit on GET requests. The solution is to use a POST request, where the default limit on nginx is 1 MiB, so drop the "-G" parameter. E.g.: curl -H "Accept: text/csv" --data-urlencode "query@myquery.rq" 'https://query.wikidata.org/sparql', where myquery.rq is the name of a text file with your query in it. Infrastruktur (talk) 16:12, 1 January 2023 (UTC)[reply]
Thanks, that works. Vicarage (talk) 17:22, 1 January 2023 (UTC)[reply]
@Vicarage: Unrelated, but please remember to always set a user agent header as well (i.e. something like -H 'User-Agent: ...'). Lucas Werkmeister (WMDE) (talk) 10:13, 4 January 2023 (UTC)[reply]
OK Vicarage (talk) 15:44, 4 January 2023 (UTC)[reply]
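Putting the two suggestions above together, a minimal sketch of the full command (the User-Agent name and contact details are placeholders to replace with your own):

curl -s \
  -H 'Accept: text/csv' \
  -H 'User-Agent: MyOverviewTool/1.0 (https://example.org/; user@example.org)' \
  --data-urlencode 'query@overview.rq' \
  'https://query.wikidata.org/sparql' > overview.csv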

Auto completion

In the browser-based "Wikidata Query Service", auto completion has stopped working for me. Control-Space no longer produces searchable hints. It worked fine until about the third week of December.

I also tested the command to comment-out a line (control /), and that still works as expected.

MacBook Pro tested with latest version of Chrome and Safari.

Has anyone else experienced this? Brainbout (talk) 19:54, 2 January 2023 (UTC)[reply]

@Brainbout: currently works for me on MacOS 11.7.2 and Safari 16.2. (And works on firefox; don't have chrome). --Tagishsimon (talk) 20:15, 2 January 2023 (UTC)[reply]

cirrussearch-too-busy-error

Moving this conversation over as it pertains to cirrussearch. -Mohammed Sadat (WMDE) (talk) 12:30, 9 January 2023 (UTC)[reply]

Consistently getting "cirrussearch-too-busy-error" at the moment, after about 30 something requests. Running a script which uses cirrussearch to gather statistics for reports. It respects maxlag, and uses aiohttp. Even if I reduce the number of requests that can be done in parallel from 4 to 1, I get this error. What's going on? Infrastruktur (talk) 19:37, 2 January 2023 (UTC)

@Infrastruktur: "cirrussearch-too-busy-error" means that you hit the circuit breaker that prevents running too many concurrent requests in the datacenter. This system is not configured to limit on a per-user/IP basis but on a per-"type-of-query" basis, i.e. we allow X concurrent requests per "type-of-query" for the whole datacenter. Errors are reported here under the "Pool Counter Rejections/seconds" graph. Sadly we are rejecting quite a few searches in the "Search" pool these days; the "Search" pool is used for fulltext searches (e.g. using action=query&list=search). There's not much we can do, as it means that the service is too busy, and the only thing you can do on your side is wait and retry. If you believe that it is your tool triggering such errors, I would strongly suggest double-checking how you are performing the requests (because such errors affect all users). Please don't forget to follow the meta:User-Agent_policy. (Note that maxlag is not related to the load on the search servers and can't be used to predict such problems.) DCausse (WMF) (talk) 18:43, 11 January 2023 (UTC)[reply]

We believe the rejections you are encountering might be exacerbated by some automated clients coming from AWS. The attached ticket aims to mitigate this by grouping all requests from known public cloud services into their own pool. DCausse (WMF) (talk) 08:07, 12 January 2023 (UTC)[reply]
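For reference, a sketch of the kind of full-text search request that goes through the "Search" pool (the search term and User-Agent are placeholders); if the response contains cirrussearch-too-busy-error, the only real option is to wait a while and retry:

curl -s \
  -H 'User-Agent: MyStatsTool/1.0 (https://example.org/; user@example.org)' \
  'https://www.wikidata.org/w/api.php?action=query&list=search&srsearch=mammal&srlimit=10&format=json'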

Coolidge Auditorium (Q115608572) strange behavior in query

Hi. The item for Coolidge Auditorium is not working correctly in queries.

SELECT ?work ?workLabel ?firstPerfPlace ?firstPerfPlaceLabel ?date
WHERE {
  # works commissioned by (P88) the given item
  ?work wdt:P88 wd:Q113611943 .

  OPTIONAL {
    # place (P4647) and date (P1191) of first performance, where known
    ?work wdt:P4647 ?firstPerfPlace .
    ?work wdt:P1191 ?date .
  }

  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
Try it!

The label for Q115608572 is not showing correctly in ?firstPerfPlaceLabel. It shows the Q number even though an English language label exists.

Also, if I run this simple query, the Coolidge Auditorium item is not included in the results.

This item was deleted then restored in January 2023, so perhaps that is causing this error.

Thanks for your help. Metadatum (talk) 22:17, 2 February 2023 (UTC)[reply]

It - Coolidge Auditorium (Q115608572) - does seem to be a borked record in WDQS.
SELECT ?work ?workLabel ?predicate ?value
WHERE {
  VALUES ?work { wd:Q115608572 }
  OPTIONAL { ?work ?predicate ?value. }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
Try it!
--Tagishsimon (talk) 08:12, 3 February 2023 (UTC)[reply]
Thanks for the report; indeed this entity was not properly recovered in WDQS after it was undeleted. The reason is that the internal events carrying the information that the entity was restored stopped being emitted. Additionally, the inconsistency-detection mechanism that should have restored this entity after subsequent edits were made did not work as expected. The former issue should be solved; the latter requires more investigation. Please see the attached tickets for more details. DCausse (WMF) (talk) 17:02, 7 February 2023 (UTC)[reply]

Can we have the EU graph whitelisted?

I couldn't find it in the list. The endpoint is at https://data.europa.eu/sparql Infrastruktur (talk) 16:29, 27 February 2023 (UTC)[reply]

I filed a ticket to allow this endpoint. DCausse (WMF) (talk) 09:51, 6 March 2023 (UTC)[reply]

New federated query service

Moving this question over as it pertains to the WDQS. -Mohammed Sadat (WMDE) (talk) 11:23, 13 April 2023 (UTC)[reply]

I want to know about the procedure for registering a new SPARQL endpoint service (external source).

According to this page there is a list of federated endpoints on mediawiki.org, and it includes the one I need (https://opendata.aragon.es/sparql), but I think it doesn't work from the Wikidata endpoint.

e.g. query

Response is:

Service URI https://opendata.aragon.es/sparql is not allowed Dportolesr (talk) 15:35, 24 March 2023 (UTC)
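For reference, a federated query of roughly this shape (the triple pattern is just a placeholder), run against the WDQS endpoint from the command line, is what produces the error above; it should start working once the endpoint is on the allowlist:

curl -s \
  -H 'Accept: text/csv' \
  -H 'User-Agent: FederationTest/1.0 (user@example.org)' \
  --data-urlencode 'query=SELECT * WHERE { SERVICE <https://opendata.aragon.es/sparql> { ?s ?p ?o } } LIMIT 10' \
  'https://query.wikidata.org/sparql'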

@Dportolesr: the mw:Wikidata_Query_Service/User_Manual/SPARQL_Federation_endpoints page should be changed only after the endpoint has been enabled; I believe someone added this endpoint directly there without asking for it to be enabled. I removed it for now and created phab:T334823 to add it. Once the ticket is resolved, please feel free to add this endpoint to that page again. Thanks!
Concerning the procedure for requesting a new endpoint, sadly it remains unclear and fuzzy. A couple of years ago Wikidata:SPARQL_federation_input was used, but it stopped being monitored properly and the discussions to establish a new procedure stalled. Practically speaking, as a maintainer of WDQS, I think that filing a Phabricator ticket with the tag Wikidata Query Service, or posting a question here, is the process that has proven to work. DCausse (WMF) (talk) 08:20, 17 April 2023 (UTC)[reply]