Wikidata:Wikidata Concepts Monitor


[Note] User feedback for this system is collected here. Visit the WDCM Journal for the latest findings on Wikidata usage across the Wikimedia projects.

The Wikidata Concepts Monitor (WDCM) (also known as Q42376073) is an analytical tool that lets you browse and build an understanding of how Wikidata is used across the Wikimedia projects.

This page provides a non-technical overview of the WDCM; the technical details are found on the corresponding Wikitech page.

This page is not intended as a mere project presentation. An interested reader will find all the essential facts about the project listed here, true. However, the real intention is to show you how to use the Wikidata Concepts Monitor statistical system to discover the wonderful, immensely complex universe of Wikidata usage across some 800 client projects in the Wikimedia ecosystem. The WDCM is designed to become a path towards discovery: by following the examples listed here, not only can you learn to work with a system that might improve your understanding of Wikidata, but you could also find yourself involved in adventurous attempts to learn and discover more of it. The page includes several examples of WDCM usage and provides two more elaborate use cases. We recommend that you follow the path laid out by the examples and the use cases at the beginning of the page before starting your own research with the WDCM.

For those interested in technical details. Visit the WDCM Wikitech Page: it should provide enough information on how the WDCM works. In a nutshell, the current version of the WDCM system is developed in R and Python, and supported by Apache Hive, Apache Spark, and Apache Sqoop to enable Big Data processing of the wbc_entity_usage tables that provide client-side tracking of Wikidata usage over sister projects. MariaDB provides the back-end support for the WDCM dashboards, while the dashboards themselves are built in the RStudio Shiny framework and hosted by an open source version of the RStudio Shiny Server. The WDCM Engine scripts perform many data pre-processing procedures before the machine learning phase takes over to deliver the results to the front-end, utilizing Latent Dirichlet Allocation and t-SNE among other algorithms. The front-end data visualizations are developed primarily in {ggplot2}, {visNetwork}, and {rBokeh}.

WDCM loves R (Q206904): the lingua franca of data science (Q2374463) and the main development language of the WDCM.

Usage: the outer life of Wikidata

On 14 November 2017, in the early hours CET, the following SPARQL (Q54871) query

SELECT (count(*) AS ?count) WHERE { ?item wdt:P31/wdt:P279* wd:Q13442814. }

run from our Wikidata Query Service resulted in a count of 9,301,454 items under scholarly article (Q13442814). However, the Wikidata Concepts Monitor (Q42376073), which is a piece of statistical software developed in R, reports from its Usage Dashboard that the semantic category - or concept - of a scientific article is used on only 1,762 distinct pages across all editions of Wikipedia.

The concept of a scholarly article (Q13442814) has a life of its own: it is a part of (P361) peer reviewed proceedings (Q16735857), scientific journal (Q5633421), and scientific publication (Q591041); it has cause (P828) academic writing (Q4119870); it is the equivalent class (P1709) to ScholarlyArticle from schema.org; it is different from (P1889) an academic journal article (Q18918145). The description of what it means to be a scientific article in Wikidata constitutes its inner life as a Wikidata item. On the other hand, the fact that it is used on only 1,762 distinct pages across Wikipedia tells us something about its outer life: simply, what happens to a scientific article when it goes out and starts playing a role on the pages of Wikipedia and its sister projects.

WDCM Example 1. In order to reproduce this result from the WDCM, (1) visit the WDCM Usage Dashboard, (2) select Tabs/Crosstabs while at the Dashboard tab, select _Wikipedia in the Search projects field, (3) select Scientific Article in the Search categories field, (4) click Apply Selection, and (5) download the .csv file (that you can open directly in Libre Office Calc, Microsoft Excel, and similar applications) by clicking the Data (csv) button beneath the Categories chart to the left.

WDCM Usage Dashboard: which Wikipedia projects make the most use of scholarly articles (Q13442814)?

Of cats and mice

There is an individual, Brunhilda, and a category, house cat (Q146), of which Brunhilda is a P31 (or an instance of, if you prefer). In Wikidata, both are items. In cognitive science and cognitive psychology, one would typically say that there is a concept of a cat - a mental representation of a class of furry animals - of which Brunhilda is an exemplar (being a concept itself, of course). Concepts are considered to be extremely complicated mental entities that form the building blocks of human thought. They are said to be stored in our individual semantic memories. However, what exactly it is that we store about concepts - what forms the basis of psychological semantics - and how concepts are used is critically determined by discourse: how concepts relate to the empirical world that surrounds us (a question partly answered by semantics) and how we use them to exchange with others what we know about the World and what we want to do with it (pragmatics).

Wikidata is an attempt to build an abstract formal world of entities and relations that is powerful enough to express many possible truths about the Universe (Q1) in a rather flexible way. It is populated by new pieces of knowledge every day, evolving on its own. The usage of Wikidata in encyclopedic articles on Wikipedia (or on any other pages of its sister projects) is a different story altogether. Tom Cat (Q1839152) and Jerry (Q1962394) are a cat and a mouse, certainly related by their belonging to the very abstract category of animal (Q729) ("kingdom of multicellular eukaryotic organisms"). At the same time, Tom is a fictional cat (Q27303676), while Jerry is a fictional mouse or rat (Q24668268). Imagine that we were able to discover whether the writers of Wikipedia use Tom and Jerry together more often (a) in discussions of animated films, children, or the entertainment industry, or (b) in the context of multicellular eukaryotic organisms and theoretical biology in general.

Well, that is exactly what WDCM is meant for. While we could (at least in theory) read a potentially large number of Wikipedia articles in search of Tom and Jerry, doing qualitative content analysis along the way, and decide what the most typical context of usage for these two fictional mammals is, WDCM runs its statistical machinery over big data (Q858810) to learn about many contexts of usage for many semantic concepts in order to help us answer questions like this one. In its operation, WDCM is fully dependent upon the nature of the data that we provide to it: it counts how often and where we use particular Wikidata items to feed its statistical engine, which means that in the end it provides a reflection of the interests, strategy, spontaneous associations, and reasoned usage of concepts on the part of the large community of editors on Wikimedia projects.

Diving into the context of usage

Let's use WDCM to discover more on how people make use of human (Q5) across the Wikimedia projects.

WDCM Example 2. For example, you can now visit the WDCM Semantics Dashboard, and then (1) select the category Human in the Select Semantic Category field on the Semantic Models tab, (2) Select Semantic Topic: Topic 1, and discover that there is one potentially interesting context of usage of the Wikidata items in the category Q5 (human) that encompasses mostly "important historical figures and politicians". Now, if you only change Topic 1 to Topic 2 in the Select Semantic Topic field, you will discover another interesting context of usage that can be loosely described as "celebrities", where you will find many popular singers and actors, among others. Take a look at Topic 3: what context of Q5 usage does it represent?

WDCM Semantics Dashboard, Category: Human, Topic: 3.

What have we learned? The WDCM has discovered these different contexts of Q5 item usage by inspecting how their usage is distributed across many Wikimedia pages and projects. We can already tell that some projects and communities were interested in writing about historical events and politics, on the one hand, while others focused their attention more on fun, entertainment, and the arts, on the other: Topics 1 and 2 of Q5 usage. Take a look at Topic 3: you will find many scholars and writers there, most of them French (at least among the most important items in this semantic topic). Now, scroll down to the bottom of the dashboard page to discover which Wikimedia projects seem to be the most prominent in this context of Q5 item usage (that would be the French and Breton Wikisource), and think: does it make sense?

WDCM Semantics Dashboard, top projects in Category: Human, Topic: 3.

There are important lessons to learn about Wikidata usage here already. While the prominence of the French Wikisource makes sense in respect to the usage of Q5 items in this context, as does the prominence of the Breton (Q12107) (a severely endangered language spoken in Brittany, France) Wikisource, the prominence of the Czech (ranked third), Estonian, or Russian Wikisource projects does not necessarily make sense prima facie. This finding tells us the following: there must be a community of editors out there who have shown a particular interest in French culture on the respective projects. If you go to the Project Semantics tab on this WDCM dashboard, enter only rowikisource in the Select Projects field, and click Apply Selection, a chart will be produced from which you can learn that the dominance of this context of Q5 usage on rowikisource is some 7.85%, compared to Topic 1, which reaches almost 87%. First lesson: on many occasions, a particular semantic context of Wikidata usage is really strong in only a few projects. Second lesson: Wikidata usage is determined not only by what one would expect to be logical in some sense - be that logic of a purely formal-semantic or of a cultural nature - but by what happens in the contributing communities as well. One consistent editor who is interested in a particular topic, not to mention a group of them, can change the semantic context of a project significantly.

WDCM Example 3. Go to the WDCM Semantics Dashboard, and (1) Select Category: Geographical Object from the Semantic Models tab, Select Semantic Topic: Topic 6. The first chart produced by the dashboard will tell you that items located in China play a significant role in this context of Wikidata usage of the semantic category of geographical objects. Now (2) scroll down to the bottom of the dashboard page and have a look at the Wikimedia projects in which this context of usage is important: zhwiki, zhwikisource, iiwiki... Surprised? Sometimes, the results that WDCM brings back receive a straightforward, direct interpretation, like this one. A context of Wikidata usage that is characterized by geographical entities in China is found, and it is mainly projects in languages characteristic of Chinese culture that make use of Wikidata items in that context. However, most of the time the situation is much more complicated, and a myriad of factors influencing the nature of the context and its distribution across the projects must be taken into account to understand its origin and make sense of it.

Wikidata pragmatics

Put simply: while the inner life of Wikidata is about its structure,

its data model, or its ontology (Q324254), and

the introduction of new items, properties, statements, qualifiers, references, labels... alongside the debate on how the latter (should) relate to the former and what possible instantiation of the permissive Wikidata structure reflects the empirical world in the most desirable way,

the outer life of Wikidata is all about the way our communities use all this structured, collaboratively developed knowledge across the roughly 800 projects that currently track its usage on their pages.

Wikidata is a symbolic system, and as such its definition must encompass both semantics and syntax. However, there is a third component of any natural symbolic system: its pragmatics. In an analogy to the study of natural language, where pragmatics is defined as "... a subfield of linguistics and semiotics that studies the ways in which context contributes to meaning...", WDCM is meant to become our method to study how the editors map the content and the formal structure of Wikidata to the page content of Wikimedia projects. As a consequence of having such a method at our disposal, we can begin to learn how Wikidata is used, i.e. how the meaning of the knowledge it stores gets altered by its contextual usage across the pages and projects - the usage that is mediated through the minds of Wikimedia contributors.

Very important things about this system

If you think you could make use of the WDCM and enjoy learning about Wikidata usage in the Wikimedia universe, you probably need to prepare yourself to encounter a rather complex world of findings and reports. We hope for the WDCM system to become a path towards discovery. However, the path is not straightforward. WDCM is the first step towards building an understanding of the highly complicated structure of Wikidata usage. This system can help you discover which Wikidata client projects are similar and in what respect, which semantic categories of items are used more or less frequently across the projects, how items connect in respect to how similarly they are used by our communities, what the most popular items per project are, and many more (hopefully) interesting things. If used properly and with understanding, it can be your navigation tool in the immensely interesting and complex field of Wikidata user behavior.

In general, here is what you should always keep in mind while browsing the WDCM dashboards:

WDCM does not study all of Wikidata

The current version of the WDCM does not include all Wikidata items. This is due less to technical constraints than to methodological ones, which are discussed briefly below. Further reading: elsewhere on this page we discuss the WDCM Taxonomy, the principle for selecting the items whose usage is tracked across our projects and which are subject to WDCM analyses.

WDCM is agnostic with respect to the structure of Wikidata

WDCM is agnostic in respect to the structure and contents of Wikidata in itself. Example: visit the Semantics Dashboard and, on the Semantic Models tab, Select Semantic Category: Architectural Structure, Select Semantic Topic: Topic 5. In the first chart you will discover that Project Gutenberg (Q22673) is highly ranked (fourth position) in respect to its importance in this context of usage of architectural structures. It's not WDCM, it's the way people use Wikidata: Project Gutenberg (Q22673) is an instance of (P31) digital library (Q212805), which is a subclass of (P279) library (Q7075), which is a subclass of (P279) building (Q41176), which is in turn a subclass of (P279) architectural structure (Q811979), thus making Project Gutenberg (Q22673) an architectural structure (Q811979). In other words: we don't care about how you use Wikidata as long as we can study its usage.
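
The chain above can be followed mechanically. The sketch below is only an illustration: it hard-codes the four statements quoted in this section into two small dictionaries (a toy stand-in for Wikidata, not a live lookup), and the function name `is_in_category` is hypothetical. It mimics a P31/P279* path: one instance-of step, then any number of subclass-of steps.

```python
# Toy subset of Wikidata, restricted to the statements quoted in the text.
INSTANCE_OF = {  # P31
    "Q22673": ["Q212805"],   # Project Gutenberg -> digital library
}
SUBCLASS_OF = {  # P279
    "Q212805": ["Q7075"],    # digital library -> library
    "Q7075": ["Q41176"],     # library -> building
    "Q41176": ["Q811979"],   # building -> architectural structure
}


def is_in_category(item, category):
    """Return True if `item` reaches `category` via P31 followed by zero or more P279."""
    frontier = list(INSTANCE_OF.get(item, []))
    seen = set()
    while frontier:
        cls = frontier.pop()
        if cls == category:
            return True
        if cls in seen:
            continue  # guard against cycles in the class hierarchy
        seen.add(cls)
        frontier.extend(SUBCLASS_OF.get(cls, []))
    return False


print(is_in_category("Q22673", "Q811979"))  # architectural structure
print(is_in_category("Q22673", "Q5"))       # human
```

Under these toy data the first call succeeds and the second does not, which is all it takes for Project Gutenberg to be counted among architectural structures.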

What influences the nature of the semantic contexts that WDCM discovers

Well, this is rather important if you plan to understand what WDCM can do for you.

The core algorithm. In order to discover the semantic topics (i.e. contexts of Wikidata usage) across some 800 Wikimedia projects and a selection of semantic categories of items from Wikidata, the WDCM employs a standard algorithm used in text-mining and Natural Language Processing known as Latent Dirichlet Allocation (LDA). While understanding the mathematical and computational details of the way LDA works is not essential for a WDCM user, reading through the less technical Wikipedia page on Topic Models - a general class of mathematical models used in text categorization - might prove to be helpful. For those who do the reading: the only difference is that we don't use the classic term-frequency matrix, but a project-item usage frequency matrix instead. The nature of this algorithm, of course, heavily influences what semantic contexts of Wikidata usage will be discovered.
The nature of the Universe. Of course, discovering that projects written in the languages of China are ranked highly in a semantic context characterized by items characteristic of Chinese culture is exactly what one would expect to happen. WDCM tends to group similarly used things together. Only from time to time will its results match your everyday expectations about the Universe quite precisely. However, WDCM will do an even more important thing for you by showing you what information you are missing in order to fully understand the world of Wikidata (if that is possible at all).
Idiosyncratic phenomena. Let's study the following example for a while. It introduces a rather unusual situation from which we can learn how the nature of the WDCM system in itself introduces additional constraints on the interpretation of its results. Note: this is a very complicated example, but be prepared to encounter many similar things during your journey into Wikidata usage with the WDCM, so it is highly recommended to study it.
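
To make the phrase "project-item usage frequency matrix" more concrete, here is a minimal sketch in which projects play the role of documents and items the role of terms. The usage records are invented for illustration, and the function name `build_usage_matrix` is hypothetical; this is not the WDCM Engine code.

```python
# Hypothetical usage records: (project, item, number of distinct pages using the item).
usage = [
    ("enwiki", "Q5", 120),
    ("enwiki", "Q146", 3),
    ("frwikisource", "Q5", 40),
    ("tawiki", "Q146", 1),
]


def build_usage_matrix(records):
    """Build a dense project x item matrix of usage counts (rows: projects, columns: items)."""
    projects = sorted({project for project, _, _ in records})
    items = sorted({item for _, item, _ in records})
    row = {project: r for r, project in enumerate(projects)}
    col = {item: c for c, item in enumerate(items)}
    matrix = [[0] * len(items) for _ in projects]
    for project, item, count in records:
        matrix[row[project]][col[item]] += count
    return projects, items, matrix


projects, items, matrix = build_usage_matrix(usage)
for project, counts in zip(projects, matrix):
    print(project, dict(zip(items, counts)))
```

A matrix of this shape is what a topic model would receive in place of the usual document-term matrix; at the scale of all client projects it would presumably be stored in a sparse format rather than as nested lists.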

WDCM Example 4. Go to the WDCM Semantics Dashboard, and (1) Select Category: Event from the Semantic Models tab, Select Semantic Topic: Topic 4. The first chart produced by the dashboard will tell you that the item 2014 Indian general election in Tamil Nadu (Q15894105) plays the most prominent role in this context, followed by a list of items mostly about Giro d'Italia (Q33861) (?!!) whose importance in this semantic topic (look at the x-axis!) is far, far lower than that of the first-ranked item. What in the world do the elections in India have to do with the Giro d'Italia - a cycling road race held in Italy? Scroll down to the bottom of the dashboard page to learn about the Wikimedia projects in which this context of usage is important, and you will see that only tawiki and arwiki (again, take a look at the x-axis of the plot) are significantly interested in this topic, followed by a list of projects that barely make use of it (itwiki and trwiki ranked among the highest). So, the first thing that we learn is that this context presents something rather specific. We have inspected Wikidata and found out that we can explain why itwiki and trwiki are highly ranked in this semantic context: they consistently make use of many of the Giro d'Italia items from Wikidata. However, it remains unclear what brings together 2014 Indian general election in Tamil Nadu (Q15894105), Gulf War (Q37643), and many Giro d'Italia items. From the WDCM Usage Dashboard, we have used the Project Report section on the Usage tab to find out that the top projects in this semantic context are indeed found among the Wikimedia projects that make use of Wikidata the most (tawiki ranked 16th, arwiki holding 301st place - not bad, almost in the upper third of projects in respect to Wikidata usage, itwiki in 13th position, and trwiki in 39th place, all in respect to the total usage of Wikidata per project). Thus, the finding is not merely a consequence of having sparse data.
Finally, we were able to understand the nature of this semantic context only by looking under the hood of the WDCM to find out on how many distinct pages on tawiki the item 2014 Indian general election in Tamil Nadu (Q15894105) was used, and the number is: 9950, a rather high usage statistic. For some reason, the community around tawiki was very focused on this event at some point in time. The WDCM has discovered this fact by means of statistical learning and separated this context of Wikidata usage into a semantic topic of its own, in order to mark that something very specific but highly representative of the tawiki project has happened: in fact, the item under discussion is the 5th ranked Wikidata item in respect to its usage on this project. Now, the question: why haven't the first four most frequently used items on tawiki influenced the result in this way? We first visited the WDCM Semantics Dashboard: on the Project Semantics tab, select tawiki, and you will learn that it scores 100% in this very semantic context. Next, we again went under the hood of the WDCM and inspected the source data to find out that the item 2014 Indian general election in Tamil Nadu (Q15894105) is the only Wikidata item on tawiki with any significant usage in the semantic category of Events at all. And that is the message that the WDCM had for us: there is a very specific context (Topic 4 in Events) in which a single item (Q15894105) from a particular category (Event) is almost exclusively used in a particular project (tawiki). Again, a question: what do the Gulf War and the Giro d'Italia have to do with all this? The answer is: probably nothing. Under the theoretical model employed in WDCM, the one upon which the LDA algorithm is based, all items from a particular semantic category play some role in each of the discovered semantic topics (i.e. contexts of usage).
In other words, no matter how specific a particular semantic context is, and the one under discussion is quite specific, all items must fit into it and be represented by some importance score in it (actually, it's the probability of them being used in this context). The various Giro d'Italia items and the Gulf War simply turned out to be at the top of the procession of a large number of very, very small item importance scores following the importance of 2014 Indian general election in Tamil Nadu (Q15894105) in this highly specific context. Indeed, the conclusion is the following: (a) there is a highly specific context describing the dominant usage of one single item from the Event category on tawiki, and (b) the rest of the information in this context can be treated as a statistical artifact of little importance for the interpretation of the finding.
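
Why every item receives some score in every topic can be illustrated with the smoothed estimate commonly used with LDA, where a small Dirichlet prior `beta` is added to each item's count before normalizing. The sketch below is a textbook-style illustration under stated assumptions, not the WDCM Engine's computation: apart from the 9950 figure quoted above, the counts are invented, the value of `beta` is arbitrary, and the function name is hypothetical.

```python
def topic_item_probabilities(counts, beta=0.01):
    """Smoothed item probabilities within one topic: (n_item + beta) / (n_total + V * beta)."""
    total = sum(counts.values()) + beta * len(counts)
    return {item: (n + beta) / total for item, n in counts.items()}


# Usage assigned to one topic. Only the first count comes from the text;
# the remaining counts are hypothetical.
counts = {
    "Q15894105": 9950,  # 2014 Indian general election in Tamil Nadu
    "Q33861": 2,        # Giro d'Italia (hypothetical count)
    "Q37643": 1,        # Gulf War (hypothetical count)
    "Q1656682": 0,      # an item never used in this context (hypothetical)
}

probs = topic_item_probabilities(counts)
for item, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{item}: {p:.6f}")
```

Even the item with a zero count ends up with a strictly positive probability, so something always occupies the second and third ranks of a topic, however little it has to do with the dominant item.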

Complicated? Well, Wikidata usage in itself is a behavioral phenomenon of immense complexity. The WDCM can help you reduce that complexity a bit and navigate through it, but it won't do the research and thinking part on your behalf. Do not expect that this system will explain the patterns of Wikidata usage in any way. It was built as a methodological tool, a measurement instrument, a portal to access the data and categorize them in the statistically most convenient way before they are presented to you. The Hubble Space Telescope (Q2513) helps us to observe the Universe, but the results that we obtain from its observations undergo careful and painful processing and discussion on the part of the scientific community in order to build theories and hypotheses about the physical world. Ask yourself: what is more complicated, the physical universe, guided by the laws of physics, or the semantic universe, guided by the online interaction of billions of human beings with all their different cultural backgrounds, education, cognitive styles, information that they can access, points of view, and interests? The WDCM can make observations of this immense complexity and provide some means to help you reduce it to a (hopefully) manageable proportion. However, you still need to view it and use it only as an instrument while doing the research part on your own. We hope that this call is challenging enough.

To summarize: the system will produce a finding based on whatever data on Wikidata usage it has, and you have to inspect the results carefully to understand whether they make sense, how specific they are, or whether they simply present a "statistical artifact". The specifics of particular projects, as illustrated in the previous WDCM Example, do not end here. For the same tawiki project, visit the WDCM Semantics Dashboard, and select Semantic Category: Human and Topic: Topic 8. You will discover another semantic topic that is of practical importance only for understanding tawiki. The moral of the story: Wikidata usage is not about what you expect the editors to do from the perspective of your own conceptual organization of the Universe, but about what different individuals and communities do with Wikidata across the Wikimedia projects. The WDCM can bring you back many interesting results on the latter, and very little on the former - only up to some degree of match between the mind of a semanticist or a formal ontologist and what people actually do with Wikidata out there.

The characteristics that shape how editors and communities use Wikidata. Obviously, this is what the game is all about. It is probably impossible to list all the factors that influence the patterns of Wikidata usage across 800 projects, which include some of the most dynamic places on the Internet ever.

- Historical influences, such as whether a particular event has shaped the culture or the educational system of a particular sociolinguistic community (which predominantly runs one or more projects) to organize its concepts in a specific way, which is then reflected in a particular pattern of Wikidata usage.

- The interests and motivations of a particular editor, of course: if there is an editor whose interests in golf and Italian literature are consistent over time, and given that the editor shows a degree of persistence in their use of Wikidata, there is no end to what they can do, including changes in the pattern of Wikidata usage that will be reflected in the discovery of new semantic topics (i.e. contexts of Wikidata usage).

- Access to knowledge and local/cultural variations in the organization of knowledge: community A believes that all real-world phenomena of a certain category X should be linked to particular Wikidata items, while community B is equally consistent in creating links towards another set of Wikidata items when writing about the same phenomena. As a result, the Wikidata usage patterns of these communities can lead the WDCM to discover semantic contexts that cannot be interpreted in a straightforward way, but must be interpreted as a mixture of two, potentially conflicting, interpretations of some particular domain of knowledge.

- Automated input: again, there is no limit to what a bot can do. Its pattern of Wikidata usage will reflect the structure of the underlying algorithm, which in turn reflects the knowledge and beliefs of its authors, a situation that is difficult to investigate.

- Access, ease of use, and availability of sources: if community A, which works in a particular language or set of languages, has access to sources in some other languages, it will probably use Wikidata in a way that reflects this fact, regardless of the fact that its knowledge might not fully match what is found in the sources available to it.

This list is certainly not exhaustive.

The definition of the key WDCM statistic. The current Wikidata item usage statistic is defined as the count of the number of distinct pages in a particular client project where the respective Wikidata item is used. This definition is motivated by the current constraints in Wikidata usage tracking across the client projects (see Wikibase/Schema/wbc_entity_usage). With more mature Wikidata usage tracking systems, the definition will become subject to change. However, once the definition of the key statistic changes, the results of the WDCM statistical learning procedures will necessarily change too.

The WDCM system

The WDCM system encompasses two components, of which the second one is meant for its users to interact with: (1) The WDCM Engine, which encompasses a set of R/HiveQL/SQL scripts that collect the data while providing ETL and machine learning until they are ready to feed the WDCM Dashboards databases, and (2) The WDCM Dashboards, a set of (hopefully) user-friendly dashboards where data and the results of their statistical modeling can be visualized and downloaded. This page is about the second (2) component of the system. If you are interested in learning about the WDCM Engine, the Wikitech page should be telling enough.

The WDCM system is developed by Goran S. Milovanović, Data Scientist, Wikimedia Deutschland, with the help of many people who prepared complex ETL procedures and productionized the system, such as Dan Florin Andreescu, Software engineer, Wikimedia Foundation, and Adam Shorland, Software Developer, Wikimedia Deutschland. Lydia Pintscher, Product Manager of Wikidata, Wikimedia Deutschland, supervised the development of the system and contributed the currently used WDCM Semantic Taxonomy that the system relies on. The software development of the WDCM system is supervised by Tobias Gritschacher, Engineering Manager, Wikimedia Deutschland, while Jan Dittrich, UX Design / Research, Wikimedia Deutschland, supervises the UI/UX aspects. The write-ups of previous experiences in managing Shiny Dashboards by Mikhail Popov and the team that built our Discovery Dashboards were very helpful in the development of the WDCM Dashboards. So were, of course, the enlightening discussions with Aaron Halfaker, Research Scientist, Wikimedia Foundation, and his team.

In order to use the WDCM system in the way it was meant and designed to be used, i.e. with a clear understanding of what it is built for and why it was built that way, you probably need to learn about some important WDCM definitions (and the constraints that dictated them) first. You can do that by reading through the Definitions section of the WDCM Wikitech Technical Documentation. Do not panic, please: it is written in a language that a non-technical person who does not necessarily care about Data or Cognitive Science can understand.

Obviously, the current version of the WDCM system focuses on Wikidata item usage. The current version of the system does not track or analyze the usage of properties, qualifiers, and so on.

Of course, any ideas or contributions are welcome. If you have anything in mind, please visit the talk page and edit it.

WDCM definitions

The following terms are used frequently on the WDCM dashboards and have a specific meaning in the context of this system:

The definition of the key WDCM statistic. The current Wikidata item usage statistic is defined as the count of the number of distinct pages in a particular client project where the respective Wikidata item is used. Thus, the current definition ignores the usage aspects (L, S, X, O, T) completely. This definition is motivated by the current constraints in Wikidata usage tracking across the client projects (see Wikibase/Schema/wbc_entity_usage). With more mature Wikidata usage tracking systems, the definition will become subject to change.

The term Wikidata usage volume is reserved for the total Wikidata usage (i.e. the sum of usage statistics) in a particular client project, a group of client projects, or semantic categories.
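
The two definitions above can be sketched in a few lines. The example assumes rows shaped like the `wbc_entity_usage` table (`eu_entity_id`, `eu_aspect`, `eu_page_id`), with a project name added as a first field for illustration only; the rows themselves are invented and the function names are hypothetical. This is not the WDCM Engine code.

```python
from collections import defaultdict

# Hypothetical rows: (project, eu_entity_id, eu_aspect, eu_page_id).
rows = [
    ("tawiki", "Q15894105", "S", 1),
    ("tawiki", "Q15894105", "L", 1),  # same page, different aspect: counted once
    ("tawiki", "Q15894105", "S", 2),
    ("tawiki", "Q5", "T", 2),
    ("itwiki", "Q33861", "X", 7),
]


def usage_statistics(rows):
    """Usage statistic: number of distinct pages per (project, item), ignoring aspects."""
    pages = defaultdict(set)
    for project, entity, _aspect, page_id in rows:
        pages[(project, entity)].add(page_id)
    return {key: len(page_ids) for key, page_ids in pages.items()}


def usage_volume(stats, projects):
    """Usage volume: sum of the usage statistics over a group of projects."""
    return sum(n for (project, _item), n in stats.items() if project in projects)


stats = usage_statistics(rows)
print(stats[("tawiki", "Q15894105")])
print(usage_volume(stats, {"tawiki"}))
print(usage_volume(stats, {"tawiki", "itwiki"}))
```

Collecting page IDs in a set is what makes the aspect column irrelevant: a page that uses the same item through several aspects still contributes a single page to the statistic.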

By a Wikidata semantic category we mean a selection of Wikidata items that is operationally defined by a respective SPARQL query, returning a selection of items that intuitively match a natural, human semantic category. The structure of Wikidata does not necessarily match any intuitive human semantics. In WDCM, an effort is made to select the semantic categories so as to match the intuitive, everyday semantics as much as possible, in order to assist anyone involved in analytical work with this system. However, the choice of semantic categories in WDCM is not necessarily exhaustive (i.e. the categories do not necessarily cover all Wikidata items), nor are the categories necessarily mutually exclusive. The Wikidata ontology is very complex and the product of the work of many people, so there is an optimization price to be paid in every attempt to adapt or simplify its present structure to the needs of a statistical analytical system such as WDCM. The current set of WDCM semantic categories is thus not normative in any sense and can become subject to change at any moment, depending upon the analytical needs of the community. The currently used WDCM Taxonomy of Wikidata items encompasses the following 14 semantic categories: geographical feature (Q618123), organization (Q43229), architectural structure (Q811979), human (Q5), Wikimedia Internal, which encompasses Wikimedia category (Q4167836), Wikimedia disambiguation page (Q4167410), and Wikimedia template (Q11266439), work of art (Q838948), book (Q571), gene (Q7187), scholarly article (Q13442814), Chemical Entities, which encompass chemical element (Q11344), chemical compound (Q11173), and chemical substance (Q79529), astronomical object (Q6999), thoroughfare (Q83620), event (Q1656682), and taxon (Q16521). All respective SPARQL queries used to fetch the item IDs from Wikidata in the respective categories have the same form: wdt:P31/wdt:P279*.
In other words, they look for all the instances of a particular class of items and search the whole data structure through sub-class relations until the most abstract, target level of categorization is reached.
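As an illustration of the query form mentioned above, the following minimal sketch builds such a query for one category. The function name build_category_query and the exact SELECT layout are assumptions made for this example, not the actual WDCM code; the sketch also assumes an endpoint (such as the Wikidata Query Service) where the wd: and wdt: prefixes are predefined.

```python
def build_category_query(class_qid: str) -> str:
    """Return a SPARQL query for all items that are instances of the given
    class or of any of its sub-classes (the wdt:P31/wdt:P279* pattern).

    Hypothetical helper for illustration only.
    """
    if not (class_qid.startswith("Q") and class_qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item ID: {class_qid!r}")
    return (
        "SELECT ?item WHERE {\n"
        f"  ?item wdt:P31/wdt:P279* wd:{class_qid} .\n"
        "}"
    )


# Example: the query for the category human (Q5).
print(build_category_query("Q5"))
```

The property path reads as "instance of (P31), followed by zero or more subclass of (P279) steps", which is why items of very specific sub-classes still end up in the broad target category.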

WDCM Overview Dashboard: the 14 semantic categories of Wikidata items that are encompassed by the current edition of the WDCM Taxonomy. Each bubble represents a Wikidata semantic category. These categories represent one possible way of categorizing Wikidata items. The size of the bubble reflects the volume of Wikidata usage from the respective category. If two categories are found in proximity, the projects that tend to use the one also tend to use the other, and vice versa.

By project type we mean: Wikipedia, Commons, Wikivoyage, Wiktionary, Wikiquote, etc.

WDCM Use Cases

While the Overview Dashboard presents - as the name suggests - only the most robust, top-level patterns of Wikidata usage, and is meant as a sort of a "big picture" presentation of current Wikidata usage, the Usage and the Semantics Dashboard are built having in mind the needs of a particular user who is interested in some specific semantic categories and projects. The WDCM Items Dashboard is a planned component of the system that will enable the user to access the statistics and structural properties of Wikidata usage for particular items. The following two use cases illustrate the ways in which WDCM could be used to learn about some specifics of Wikidata usage from a viewpoint of a fictional but motivated user. Both use cases rely on the functions enabled by the Usage and the Semantics Dashboard. All WDCM dashboards have a Navigate WDCM tab, from which any component of the system can be reached. Also, they all have a Description tab, where a detailed explanation of the dashboard's functionality is found.

Use Case A: Comparing the Large Wikipedias

In this use case, we want to compare the English, French, German, and Russian Wikipedias in respect to their Wikidata usage. We break the whole journey into several steps and analyses across the WDCM dashboards.

WDCM Example 5, Step 1. Our first destination is the WDCM Usage Dashboard. On the dashboard's landing page (the Usage Tab under the Dashboard Tab), the right column is dedicated to the study of particular projects. Under Project Report, select enwiki in the Search projects: field. The dashboard will start generating the results; please be patient. Now we can scroll down and inspect, one by one, the results reported in respect to Wikidata usage on the English Wikipedia. The first generated report gives us an overview of Wikidata usage in the particular project. The bar plot to the right represents the volume of Wikidata usage in each of the semantic categories that are currently encompassed by the WDCM analyses. We can see that the items from the categories of Geographical Object, Human, Organization, Taxon, Work of Art, and Wikimedia are predominantly used in the English Wikipedia. The summary text to the left of the bar plot says: "enwiki has a total Wikidata usage volume of 6335820 items (4.4% of total Wikidata usage across the client projects). In terms of Wikidata usage, it is ranked 5/789 among all client projects, and 4/301 in its Project Type (Wikipedia)." Let's see what it means to have a Wikidata usage volume of 6335820 items in the context of WDCM. The current definition of the WDCM item usage statistic is the following: the count of the number of pages in a particular client project where the respective Wikidata item is used. That means that WDCM has counted, for each item of interest, the number of distinct pages in this project that use (one or many times) a particular item, and then summed up these numbers to obtain 6,335,820.
This sum accounts for approximately 4.4% of the total sum that one would obtain from all Wikimedia projects under consideration, and makes the English Wikipedia the fifth most prominent project in terms of Wikidata usage among all Wikimedia projects, as well as the fourth among 301 Wikipedia projects - where Wikipedia is its project type, of course. The chart provided immediately below gives a context to this ranking of the project under discussion. By repeating this for the French, Russian, and German encyclopedias, we discover that the French Wikipedia accounts for some 2.69% of total usage and is ranked sixth among all the Wikimedia projects, the German Wikipedia accounts for 1.85% of total usage volume and is ranked twelfth, while the Russian Wikipedia accounts for 6.26% and is ranked third.
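The definition of the item usage statistic and of the usage volume given in Step 1 can be sketched in a few lines. This is a toy illustration only: the (page, item) records, the page names, and the function names are assumptions made for the example, and the real system works from the usage tracking tables rather than from an in-memory list.

```python
from collections import defaultdict

# Hypothetical (page, item) usage records for one client project.
# A repeated pair means the item is used more than once on that page.
records = [
    ("Page_A", "Q30"), ("Page_A", "Q30"), ("Page_A", "Q5"),
    ("Page_B", "Q30"), ("Page_C", "Q5"), ("Page_C", "Q183"),
]


def item_usage(records):
    """Item usage statistic: number of distinct pages that use each item."""
    pages_per_item = defaultdict(set)
    for page, item in records:
        pages_per_item[item].add(page)  # a set ignores repeated use on a page
    return {item: len(pages) for item, pages in pages_per_item.items()}


def usage_volume(records):
    """Usage volume of the project: the sum of all item usage statistics."""
    return sum(item_usage(records).values())


usage = item_usage(records)
print(usage)                  # Q30 and Q5 on two pages each, Q183 on one
print(usage_volume(records))  # 2 + 2 + 1 = 5
```

Note that Q30 appears twice on Page_A but is counted once for that page, which is exactly the "one or many times" clause of the definition.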

WDCM Usage Dashboard: the pattern of Wikidata usage across the 14 semantic categories for the German Wikipedia (dewiki).

In this first step we have obtained elementary information and rankings of four projects from the WDCM. A careful analyst might have spotted additional important differences between these four projects by inspecting the first bar plot, where semantic categories of Wikidata items are compared for their usage. For example, the Russian Wikipedia tends to use more items from the Architectural Structure category than the English Wikipedia, and far fewer items from the Wikimedia category. While the English Wikipedia uses more items from the category Human than from Geographical Objects or Organizations (similar to its German counterpart, dewiki), its French sister shows exactly the opposite pattern of usage.

WDCM Usage Dashboard: the rank of the English Wikipedia in respect to Wikidata usage.

An interested analyst might already have discovered two additional visualizations in the right column of the dashboard page: the Semantic Neighborhood interactive network and the top 30 Wikidata items plot. While the former will be discussed at a later point, the latter is rather straightforward: it reports the 30 most frequently used Wikidata items for the currently selected project. In the English Wikipedia, the top 5 are: United States of America (Q30), house mouse (Q83310), brown rat (Q184224), Danio rerio (Q169444), and Drosophila melanogaster (Q130888) - a Wikidata user community with a huge interest in biology indeed - while in the German Wikipedia we find: United States of America (Q30), Germany (Q183), United Kingdom (Q145), France (Q142), and black-and-white (Q838368) (monochrome form in visual arts).

WDCM Usage Dashboard: the most popular Wikidata items in the French Wikipedia.

However, comparing projects in this way on the Usage Dashboard is tedious. Let's find out whether the WDCM Usage Dashboard can provide better means of comparing projects.

WDCM Example 5, Step 2. We are at the WDCM Usage Dashboard again, but this time we visit the Tabs/Crosstabs Tab and carefully read the introductory instructions for its usage (provided at the top of the Tabs/Crosstabs dashboard tab). In the Search Projects: field we enter ruwiki, enwiki, dewiki, and frwiki, and select all categories in the Select categories: field; click Apply Selection. After some computation and plot rendering, the Dashboard will provide a new set of charts. The first two provide the total Wikidata usage volume per project and per selected category. The third one, immediately below, is the least interesting: we have selected four projects from the same project type (Wikipedia), so we can only learn that the total Wikidata usage volume in the current project selection is around 21.9 million distinct item/pages. However, the next chart, the large Project x Category cross-tabulation chart, is very informative: it provides an overview of Wikidata usage volume (y-axis) for each project (x-axis) in each semantic category (every sub-panel represents one category).

WDCM Usage Dashboard: the Project x Semantic Category cross-tabulation.

WDCM Example 5, Step 2 (continued). We could have just selected _Wikimedia (note: mind the underscore) under the Select projects: field to retrieve the complete Wikidata usage statistics for all projects of the project type Wikipedia. Since the number of selected projects here is high, the WDCM Usage Dashboard will visualize only the results for the top 30 projects in respect to the total Wikidata usage per project. However, each chart on the Tabs/Crosstabs tab is accompanied by a Data (csv) button: click the button to download the full dataset for the selection, irrespective of what is visualized. So, if you are interested in the big picture of Wikidata usage across the Wikipedia projects that make the most use of it, here it goes:

WDCM Usage Dashboard: Wikidata usage in the 14 semantic categories for the top 30 Wikipedia projects in respect to their total Wikidata usage volume.

We have now learned how to start working with the WDCM Usage Dashboard and how to compare projects in respect to their total Wikidata usage volume, or their Wikidata usage in particular semantic categories. We have learned nothing yet about the semantic contexts of Wikidata usage that were discussed in the introductory examples on this page; let's take a look.

Use Case B: Connecting Communities

The following example focuses on the Semantics Dashboard: the main WDCM tool for studying the contexts of Wikidata usage in various semantic categories. We will have to dive a bit deeper into the underlying logic of WDCM in order to understand the way it discovers the semantic context of Wikidata usage.

WDCM Example 6, Step 1. Visit the WDCM Semantics Dashboard and navigate to the Similarity Maps Tab under Dashboard. In the field Select semantic category pick Human. An interactive plot will be generated, presenting a semantic map. Each bubble in the map represents one Wikidata client project (i.e. one Wikimedia project). Projects of different types (Wikipedia, Commons, Wiktionary, Wikiquote, etc.) have different colors, with the color legend provided to the right of the map, alongside the tools to interact with it. Hovering over any bubble will reveal the respective project name and Wikidata usage details.

WDCM Semantics Dashboard: the semantic map for the category Human (Q5). Each bubble represents a project. The closer two projects are found, the more similarly they tend to use the items from this semantic category. The size of the bubble represents the total Wikidata usage volume in the respective project.

The semantic map that we have just generated from the WDCM Dashboard helps to inspect the similarity structures in respect to Wikidata usage in particular categories. The similarity of projects is represented in these maps by distance in a two-dimensional plane: the closer the projects are found, the more similar their usage of items from the particular category (human (Q5), in this case). In order to understand what the WDCM Semantics Dashboard does, we need to take at least a quick look at the inner workings of the WDCM system.

How does the WDCM discover the similarity structures in Wikidata usage? As already explained, the elementary Wikidata item usage statistic in WDCM is the count of distinct pages in a particular project that make use of the respective item (once or more than once in a page). What follows is that the pattern of Wikidata usage in a particular project can be described by an array of numbers (a vector), with each number representing the usage count for a particular item. In WDCM, each of the considered semantic categories is analyzed separately. We first select only the items from a particular category (for example human (Q5), represented on the map above), then produce their usage counts for each Wikimedia project under consideration, and obtain a matrix in which the rows are indexed by Wikimedia projects (i.e. each row represents one project), and the columns by the Wikidata items from the selected semantic category (i.e. each column represents an item). The cells of the matrix are filled with item usage counts. Such matrices can be modeled by Latent Dirichlet Allocation (LDA), a standard unsupervised learning algorithm in text-mining and Natural Language Processing, which essentially amounts to the following:

We assume that the matrix of counts is produced in the following way:

- there is a set of semantic topics, each topic representing the probabilities with which the Wikidata items under consideration are used when the respective topic itself is in use

- each project is represented by a mixture of all semantic topics: that is, each project is characterized by the importance of each of the hypothesized semantic topics in it (note: technically, a project is described by a probability distribution over the semantic topics; in turn, each semantic topic is a probability distribution over items)

- the hypothesized process that generates Wikidata usage in a particular project is the following: (1) randomly pick a semantic topic (say, Celebrities, from human (Q5)) according to the probability of the respective topic being selected in a given project, (2) from the selected semantic topic, randomly pick an item, according to the probability of the respective item to be selected in that semantic topic, (3) "use" the item in the project under consideration.
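The hypothesized generative process listed above can be simulated directly. The sketch below is only a toy illustration of the three steps: the topic names, item names, probabilities, and the function simulate_usage are all made up for the example and do not come from any fitted WDCM model.

```python
import random

# Hypothetical toy model: two semantic topics over four items of one category.
topic_item_probs = {
    "topic_1": {"item_a": 0.70, "item_b": 0.20, "item_c": 0.05, "item_d": 0.05},
    "topic_2": {"item_a": 0.05, "item_b": 0.05, "item_c": 0.30, "item_d": 0.60},
}
# Hypothetical topic mixture of a single project.
project_topic_probs = {"topic_1": 0.8, "topic_2": 0.2}


def simulate_usage(project_topic_probs, topic_item_probs, n_uses, seed=0):
    """Generate n_uses item usages for one project and return item counts."""
    rng = random.Random(seed)
    topics = list(project_topic_probs)
    topic_weights = [project_topic_probs[t] for t in topics]
    counts = {}
    for _ in range(n_uses):
        # (1) pick a topic according to its probability in the project
        topic = rng.choices(topics, weights=topic_weights)[0]
        # (2) pick an item according to its probability in that topic
        items = list(topic_item_probs[topic])
        item_weights = [topic_item_probs[topic][i] for i in items]
        item = rng.choices(items, weights=item_weights)[0]
        # (3) "use" the item in the project
        counts[item] = counts.get(item, 0) + 1
    return counts


print(simulate_usage(project_topic_probs, topic_item_probs, n_uses=1000))
```

Repeating the simulation for many projects, each with its own topic mixture, would fill one row of the projects x items matrix per project.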

What the LDA algorithm attempts is to reverse engineer this hypothesized generative process that populates the projects x items matrix for a given number of semantic topics. WDCM runs LDA many times for each semantic category, inspecting solutions across a wide range of semantic topics, until it finds the most satisfactory one according to some quite complex criteria of statistical learning (see: Bayes factor). Once the optimal solution is selected, the algorithm returns two matrices that we use in all further WDCM analyses and visualizations:

the project x semantic topic matrix, in which each semantic topic has a weight (i.e. a probability) in each project

the item x semantic topic matrix, in which each Wikidata item has a weight (i.e. a probability) in each semantic topic.

Now each semantic topic is represented by a vector of probabilities (i.e. a probability distribution) across all items present from a particular Wikidata semantic category, representing how likely it is that a particular item will be used when that semantic topic is active in the hypothesized generative process, and each project is represented by a vector of probabilities (i.e. a probability distribution) across the semantic topics. Caveat: WDCM does not model all items from any of the Wikidata semantic categories under consideration. We select a large number of the most frequently used items from a category, simply because modeling items that are rarely used would not improve the quality of the LDA solutions in any respect.

Given that semantic topics and projects are represented by probability distributions, we can apply distance and divergence measures to them (such as the Hellinger distance or the Kullback–Leibler divergence), providing the basis for their visualizations. However, the obtained hyperspaces first need to undergo dimensionality reduction in order to be represented in 2D or 3D space. The above semantic map, for example, is obtained from the t-distributed stochastic neighbor embedding (t-SNE) dimensionality reduction to 2D from the projects x topics hyperspace in the Wikidata category of human (Q5), following the LDA modeling of the category as described. This dimensionality reduction method is very good at conserving the local similarity structures found in the original hyperspaces, and we can see how bubbles representing Wikimedia projects tend to cluster according to how similarly they use Wikidata items in this semantic category.
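The two measures named above are easy to state in code. The sketch below uses the standard textbook definitions; the topic distributions of the two projects are made-up numbers, and the subsequent t-SNE step is not shown.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions.

    Symmetric and bounded between 0 (identical) and 1 (disjoint support).
    """
    return math.sqrt(
        sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2
    )


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q), in nats.

    Not symmetric; assumes q is non-zero wherever p is non-zero.
    """
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


# Hypothetical topic distributions of two projects (three topics each).
project_a = [0.7, 0.2, 0.1]
project_b = [0.1, 0.3, 0.6]

print(hellinger(project_a, project_b))
print(kl_divergence(project_a, project_b), kl_divergence(project_b, project_a))
```

Because the Hellinger distance is symmetric while the Kullback-Leibler divergence is not, the former yields a proper pairwise distance matrix directly, whereas the latter would need to be symmetrized first if a distance matrix is required.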

WDCM Example 6, Step 2. Use the zoom tool from the toolbox to the right of the semantic map for the category Human, and select the cluster of projects in the upper left corner of the map.

WDCM Semantics Dashboard: a close-up of a cluster of projects grouped in respect to the similarity of their Wikidata item usage from Q5 (Human).

By inspecting the projects in this cluster, we find itwiki, nlwiki, plwiki, enwiki, and ptwiki, among others.

WDCM Example 6, Step 3. Change the dashboard tab to Project Semantics and select itwiki, nlwiki, plwiki, enwiki, and ptwiki; hit Apply Selection and wait until the plot is updated. In the resulting chart, look for the panel that represents the semantic category Human. You should be able to learn from there that these projects receive a significant influence from Topics 1, 2, 3, and 8, out of the total of eight topics in this category of Wikidata items. Topic 3 seems to be the most influential topic in this selection of projects.

WDCM Semantics Dashboard: the relative contribution of the semantic topics from the 14 LDA models (one model per each of the 14 semantic categories encompassed by WDCM) to Wikidata usage in enwiki, plwiki, ptwiki, itwiki, and nlwiki.

Note: the LDA models of particular semantic categories do not encompass the same number of semantic topics; if they did, it would be a coincidence. The x-axis of this plot always spans the number of topics found in the LDA model of the semantic category that has the largest number of topics; this is for reasons of consistent data representation only. For example, we can see that the LDA model of the Wikimedia category encompasses only four topics, in contrast to the LDA model of the Geographical Object category, which encompasses ten.

WDCM Example 6, Step 4. Now we know that the most important semantic topic in the category Human for enwiki, plwiki, itwiki, ptwiki, and nlwiki is Topic 3 (43.56% relative importance). Change the dashboard tab to Semantic models, select Human in the Select Semantic Category field, and Topic 3 in the Select Semantic Topic field. The first chart on the dashboard page will provide an insight into the 50 most important Wikidata items in this topic of the category Human. The second visualization (scroll down) is an interactive semantic network: each of these top 50 items is represented by a node and points towards the Wikidata item that is most similarly used to it in the respective category of items. The semantic network can help in the interpretation of the topic under consideration. Again: the similarity derived from the WDCM is the similarity in respect to the pattern of Wikidata usage, not necessarily in respect to your expectations based on the meaning of the respective Wikidata items.

WDCM Semantics Dashboard: the semantic network of the top 50 items from Topic 3 of the LDA model in the category Human (Q5).

Interesting things happen in Topic 3: a lot of professional cyclists, then Johann Sebastian Bach (Q1339) and Nikolai Chernykh (Q318611), a Soviet and Ukrainian astronomer, and others. This is another opportunity to remind ourselves of how complex Wikidata is: neither errors on behalf of Wikipedia editors and Wikidata users, nor shadows of doubt about WDCM being broken, explain the semantics of Topic 3 in the category Human. The only plausible hypothesis is that a community of editors interested in cycling is doing a great job across many projects; in fact, they may represent exactly the group of editors from whom other Wikidata users could learn a lot.

WDCM Example 6, Step 5. Finally, the last plot on the dashboard page presents the 50 Wikimedia projects that are most influenced by this topic in the semantic category Human (Q5).

WDCM Semantics Dashboard: the top 50 Wikimedia projects most prominent in Topic 3 of the Human (Q5) LDA model.

We have now learned something about the existence of a specific trend of Wikidata usage in the category human (Q5) for five Wikimedia projects (itwiki, plwiki, ptwiki, nlwiki, enwiki) from the WDCM Semantics Dashboard.

Q: What do we do with these results?

A: This is what WDCM is meant for:

We have just identified a set of projects whose editors seem to share the same interests in a particular category.

Why not ask whether these editors are connected, and whether they could collaborate and exchange their knowledge and experience?

We have also identified a set of projects in which this same semantic context is important, beyond the five projects that we were initially interested in.

Ask how much they use Wikidata in the respective semantic category, and connect editors from the less developed and the more developed projects so that they can collaborate.

Focus on solving the Wikidata cycling puzzle.

WDCM Dashboards

This section provides a brief description of all currently available WDCM dashboards. The same information can be found in the Description tab of each dashboard.

WDCM Overview Dashboard

Introduction

The WDCM Overview Dashboard presents the big picture of Wikidata usage; other WDCM dashboards go into more detail. The Overview Dashboard provides insights into (1) the similarities between the client projects in respect to their use of Wikidata, (2) the volume of Wikidata usage in every client project, (3) Wikidata usage tendencies, described by the volume of Wikidata usage in each of the semantic categories of items that are encompassed by the current WDCM edition, (4) the similarities between the Wikidata semantic categories of items in respect to their usage across the client projects, (5) the ranking of client projects in respect to their Wikidata usage volume, and (6) the Wikidata usage breakdown across the types of client projects and Wikidata semantic categories.

Wikidata Usage Overview

The similarity structure in Wikidata usage across the client projects is presented. Each bubble represents a client project. The size of the bubble reflects the volume of Wikidata usage in the respective project. Projects that are similar in respect to the semantics of their Wikidata usage are grouped together.

The bubble chart is produced by performing a t-SNE dimensionality reduction of the client project pairwise Euclidean distances derived from the Projects x Categories contingency table. Given that the original higher-dimensional space from which the 2D map is derived is rather constrained by the choice of a small number of semantic categories, the similarity mapping is somewhat imprecise and should be taken as an attempt at an approximate big picture of the client projects similarity structure only. More precise 2D maps of the similarity structures in client projects are found on the WDCM Semantics Dashboard, where each semantic category first receives an LDA Topic Model, and the similarity structure between the client projects is then derived from project topical distributions.
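The first step described above, deriving pairwise Euclidean distances between client projects from the Projects x Categories contingency table, can be sketched as follows. The table values and project names are invented for illustration, and the sketch assumes raw usage counts are compared without any normalization, which may differ from what WDCM actually does; the t-SNE step that follows is not shown.

```python
import math
from itertools import combinations

# Hypothetical Projects x Categories contingency table:
# each row holds a project's usage volume in three semantic categories.
table = {
    "project_a": [120, 30, 5],
    "project_b": [110, 35, 10],
    "project_c": [5, 200, 90],
}


def pairwise_euclidean(table):
    """Euclidean distance for every unordered pair of projects."""
    return {
        (a, b): math.dist(table[a], table[b])
        for a, b in combinations(sorted(table), 2)
    }


for pair, distance in pairwise_euclidean(table).items():
    print(pair, round(distance, 2))
```

With only a handful of categories per project, many quite different projects end up with similar rows, which is one way to see why the resulting map can only be an approximate big picture.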

While the Explore tab presents a dynamic {rBokeh} visualization alongside the tools to explore it in detail, the Highlights tab presents a static {ggplot2} plot with the most important client projects marked (NB. only the top five projects (of each project type) in respect to Wikidata usage volume are labeled).

Wikidata Usage Tendency

The similarity structure in Wikidata usage across the semantic categories is presented. Each bubble represents a Wikidata semantic category. The size of the bubble reflects the volume of Wikidata usage from the respective category. If two categories are found in proximity, the projects that tend to use the one also tend to use the other, and vice versa. Similarly to the Usage Overview, the 2D map is obtained by performing a t-SNE dimensionality reduction of the category pairwise Euclidean distances derived from the Projects x Categories contingency table.

Wikidata Usage Distribution

These plots help to build an understanding of the relative range of Wikidata usage across the client projects. In the Project Usage Rank-Frequency plot, each point represents a client project; Wikidata usage is represented on the vertical axis and the project's usage rank on the horizontal axis, while only the top project (per project type) is labeled. The highly skewed, asymmetric distribution reveals that a small fraction of client projects accounts for a large proportion of Wikidata usage.

In the Project Usage log(Rank)-log(Frequency) plot, the logarithms of both variables are represented. A power-law relationship holds true if this plot is linear. The plot includes the best linear fit; however, no attempts were made to estimate the underlying probability distribution.
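The linear fit on the log(Rank)-log(Frequency) plot can be reproduced with ordinary least squares. The sketch below is a generic illustration, not the dashboard's code: the function name loglog_fit and the synthetic usage volumes are assumptions, and the input must contain at least two strictly positive values.

```python
import math


def loglog_fit(usage_volumes):
    """Least-squares line through the points (log rank, log usage volume).

    Returns (slope, intercept). Rank 1 is the project with the highest usage.
    """
    ranked = sorted(usage_volumes, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(volume) for volume in ranked]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic volumes that follow an exact power law: volume = 1000 / rank.
volumes = [1000 / rank for rank in range(1, 21)]
slope, intercept = loglog_fit(volumes)
print(round(slope, 3), round(math.exp(intercept), 1))
```

A slope close to a constant across the whole rank range is what a power law would produce; a good-looking straight line on this plot is suggestive only, which is consistent with the remark that no attempt is made to estimate the underlying distribution.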

Client Project Types

Project types are provided in the rows of this chart, while the semantic categories are given on the horizontal axis. The height of the respective bar indicates the Wikidata usage volume from the respective category in a particular client project type.

Client Projects Usage Volume

Use the slider to select the percentile rank range of the Wikidata usage volume distribution across the client projects to show. The chart will automatically adjust to present the selected projects in increasing order of Wikidata usage, showing at most the top 30 projects. NB. The percentile rank of a score is the percentage of scores in its frequency distribution that are equal to or lower than it. For example, a client project that has a Wikidata usage volume greater than or equal to that of 75% of all client projects under consideration is said to be at the 75th percentile, where 75 is the percentile rank.
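The percentile rank definition used by the slider can be written down in a few lines. This is a sketch under stated assumptions: the project names and volumes are invented, and the selection helper select_by_percentile (including its limit of 30) only mirrors the behaviour described in this section, not the dashboard's actual implementation.

```python
def percentile_rank(value, values):
    """Percentage of scores in `values` that are equal to or lower than `value`."""
    at_or_below = sum(1 for v in values if v <= value)
    return 100.0 * at_or_below / len(values)


def select_by_percentile(usage, lower, upper, limit=30):
    """Projects whose percentile rank lies in [lower, upper], in increasing
    order of usage volume, keeping at most the `limit` highest of them."""
    volumes = list(usage.values())
    selected = [
        (project, volume)
        for project, volume in usage.items()
        if lower <= percentile_rank(volume, volumes) <= upper
    ]
    selected.sort(key=lambda pair: pair[1])
    return selected[-limit:]


# Hypothetical usage volumes of four client projects.
usage = {"project_a": 10, "project_b": 20, "project_c": 30, "project_d": 40}
print(percentile_rank(30, list(usage.values())))  # 3 of 4 scores are <= 30
print(select_by_percentile(usage, lower=50, upper=100))
```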

In effect, you can browse the whole distribution of Wikidata usage across the client projects by selecting the lower and the upper limit in terms of usage percentile rank.

Wikidata Usage Browser

A breakdown of Wikidata usage statistics across client projects and semantic categories. To the left, a table presents a Client Project vs. Semantic Category cross-tabulation. The Usage column in this table is the Wikidata usage statistic for a particular Semantic Category x Client Project combination (e.g. the Wikidata usage in the category "Human" in the dewiki project). To the right, the total Wikidata usage per client project is presented (i.e. the sum of Wikidata usage across all semantic categories for a particular client project; e.g. the total Wikidata usage volume of enwiki).
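The two tables described above are related by a simple aggregation, sketched below. The rows, project names, and numbers are invented for the example, and the helper names crosstab and totals_per_project are assumptions.

```python
from collections import defaultdict

# Hypothetical (project, category, usage) rows; the numbers are made up.
rows = [
    ("project_a", "Human", 120),
    ("project_a", "Taxon", 30),
    ("project_b", "Human", 80),
    ("project_b", "Organization", 15),
]


def crosstab(rows):
    """Client Project x Semantic Category table of usage values."""
    table = defaultdict(lambda: defaultdict(int))
    for project, category, usage in rows:
        table[project][category] += usage
    return {project: dict(categories) for project, categories in table.items()}


def totals_per_project(table):
    """Total usage per client project, summed across all categories."""
    return {project: sum(cats.values()) for project, cats in table.items()}


table = crosstab(rows)
print(table["project_a"]["Human"])  # one Semantic Category x Client Project cell
print(totals_per_project(table))    # the right-hand table
```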

WDCM Usage Dashboard

Introduction

The WDCM Usage Dashboard focuses on providing detailed statistics on Wikidata usage in particular sister projects or selected subsets thereof. Three pages that present analytical results in this dashboard receive a description here: (1) Usage, (2) Tabs/Crosstabs, and (3) Tables.

Usage

The Usage tab provides elementary statistics on Wikidata usage across the semantic categories (left column) and sister projects (right column).

To the left, we first encounter a general overview of Basic Facts: the number of Wikidata items that are encompassed by the current WDCM taxonomy (in effect, this is the number of items that are encompassed by all WDCM analyses), the number of sister projects that have client-side Wikidata usage tracking enabled (currently, that means that the Wikibase/Schema/wbc entity usage table is present there), the number of semantic categories in the current version of the WDCM Taxonomy, and the number of different sister project types (e.g. Wikipedia, Wikinews, etc.).

The Category Report subsection allows you to select a specific semantic category and generate two charts beneath the selection: (a) the category top 30 projects chart, and (b) the category top 30 Wikidata items chart. The first chart will display 30 sister projects that use Wikidata items from this semantic category the most, with the usage data represented on the horizontal axis, and the project labels on the vertical axis. The percentages next to the data points in this chart refer to the proportion of total category usage that takes place in the respective project. The next chart will display the 30 most popular items from the selected semantic category: item usage is again placed on the horizontal axis, item labels are on the vertical axis, and item IDs are placed next to the data points themselves.

The Categories General Overview subsection is static and allows no selection; it introduces two concise overviews of Wikidata usage across the semantic categories of Wikidata items. The Wikidata Usage per Semantic Category chart provides semantic categories on the vertical axis and item usage statistics on the horizontal axis; the percentages tell us about the proportion of total Wikidata usage that the respective semantic category carries. Beneath, the Wikidata item usage per semantic category in each project type chart provides a cross-tabulation of semantic categories vs. sister project types. The categories are color-coded and represented on the horizontal axes, while each chart represents one project type. The usage scale, represented on the vertical axes, is logarithmic to ease the comparison and enable practical data visualization.

To the right, an opportunity to inspect Wikidata usage in a single Wikimedia project is provided. The Project Report section allows you to select a single Wikimedia project and obtain results on it. The first section that will be generated upon making a selection provides a concise narrative summary of Wikidata usage in the selected project, alongside a chart presenting an overview of Wikidata usage per semantic category. The next chart, Wikidata usage rank, shows the rank position of the selected project among other sister projects in respect to the Wikidata usage volume. Beneath, a more complex structure, Semantic Neighbourhood, is given. In this network, or a directed graph if you prefer, each project points towards the one most similar to it. The selected project has a different color. The results are relevant only in the context of the current selection: only the selected project and its 20 nearest semantic neighbours are presented. Once again: each project points to the one which utilizes Wikidata in a way most similar to it. The top 30 Wikidata items chart presents the top 30 Wikidata items in the selected project: item labels are given on the vertical axis, Wikidata usage on the horizontal axis, and the item IDs are labeled close to the data points themselves.
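The structure of the Semantic Neighbourhood graph, where each project points to the one most similar to it, can be sketched as follows. This is an illustration under assumptions: the project profiles are invented, Euclidean distance between profiles stands in for whatever similarity measure the dashboard actually uses, and at least two projects are required.

```python
import math


def nearest_neighbours(selected, profiles, k=20):
    """The k projects whose profiles are closest to the selected project's."""
    others = [name for name in profiles if name != selected]
    others.sort(key=lambda name: math.dist(profiles[selected], profiles[name]))
    return others[:k]


def neighbourhood_edges(selected, profiles, k=20):
    """Directed edges within the neighbourhood of the selected project:
    every member points to the member that is most similar to it."""
    members = [selected] + nearest_neighbours(selected, profiles, k)
    sub = {name: profiles[name] for name in members}
    return {name: nearest_neighbours(name, sub, k=1)[0] for name in members}


# Hypothetical usage profiles (e.g. topic distributions) of three projects.
profiles = {
    "project_a": [0.9, 0.1],
    "project_b": [0.8, 0.2],
    "project_c": [0.1, 0.9],
}
print(neighbourhood_edges("project_a", profiles, k=2))
```

Since edges are computed only among the members of the neighbourhood, the same project may point to a different neighbour when another project is selected, which matches the remark that the results are relevant only in the context of the current selection.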

Tabs/Crosstabs

Here we have the most direct opportunity to study the Wikidata usage statistics across the sister projects. A selection of projects and semantic categories will be intersected, and only results in the scope of the intersection will be returned. The charts should be self-explanatory: the usage statistic is always represented on the vertical axis, while the horizontal axis and sub-panels play various roles depending on whether a category vs. project or a category vs. project type cross-tabulation is provided. Data points are labeled in million (M) or thousand (K) pages (see the Wikidata usage definition above). While charts can display only a limited number of data points, relative to the size of the selection, each of them is accompanied by a Data (csv) button that will initiate a download of the full respective data set as a comma-separated file.

Tables

The section presents searchable and sortable tables and cross-tabulations with self-explanatory semantics. Access the full WDCM usage datasets from here.

WDCM Semantics Dashboard

Introduction

The WDCM Semantics Dashboard is probably the central and the analytically most complicated of all WDCM Dashboards. Here we provide only the necessary basics of distributional semantics needed in order to understand the results of semantic topic modeling presented on this WDCM dashboard. A user who needs to dive deep into the similarity structures between the Wikimedia sister projects in respect to their Wikidata usage patterns will most probably have to do some additional reading first. However, the Dashboard simplifies the presentation of the results as much as possible to make them accessible to any Wikidata user or Wikipedia editor who is not necessarily involved in Data or Cognitive Science. Reading through the WDCM Semantic Topic Models section in this page is highly advised to anyone who has never met semantic topic models or distributional semantics before.

WDCM Semantic Topic Models

Suggested Readings

Distributional Semantics. In Wikipedia. Retrieved October 24, 2017, from https://en.wikipedia.org/wiki/Distributional_semantics

Topic model. In Wikipedia. Retrieved October 24, 2017, from https://en.wikipedia.org/wiki/Topic_model

Latent Dirichlet allocation. In Wikipedia. Retrieved October 24, 2017, from https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

Dimensionality reduction. In Wikipedia. Retrieved October 24, 2017, from https://en.wikipedia.org/wiki/Dimensionality_reduction

While Wikidata itself is a semantic ontology with pre-defined and evolving normative rules of description and inference, Wikidata usage is essentially a social, behavioral phenomenon, suitable for study by means of machine learning in the field of distributional semantics: the analysis and modeling of statistical patterns of occurrence and co-occurrence of Wikidata item and property usage across the client projects (e.g. enwiki, frwiki, ruwiki, etc.). WDCM thus employs various statistical approaches in an attempt to describe and provide insights from the observable Wikidata usage statistics (e.g. topic modeling, clustering, dimensionality reduction, all beyond providing elementary descriptive statistics of Wikidata usage, of course).

Wikidata Usage Patterns. The “golden line” that connects the reasoning behind all WDCM functions can be non-technically described in the following way. Imagine observing the number of times each item in a set of N particular Wikidata items was used across some project (enwiki, for example). Imagine having the same data for other projects as well: for example, if 200 projects are under analysis, then we have 200 counts for each of the N items in the set, and the data can be described by an N x 200 matrix (items x projects). Each column of counts, representing the frequency of occurrence of all Wikidata entities under consideration in one of the 200 projects under discussion - a vector, obviously - represents a particular Wikidata usage pattern. By inspecting and statistically modeling the usage pattern matrix - a matrix that encompasses all such usage patterns across the projects - or the derived covariance/correlation matrix, many insights into the similarities between Wikimedia projects (or, more precisely, the similarities between their usage patterns) can be found.
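A toy version of the usage pattern matrix and of one entry of the derived correlation matrix may make the idea more concrete. Everything in the sketch below is invented for illustration (three items, three projects, made-up counts), and Pearson correlation is used as one plain choice of similarity between usage patterns; it requires that neither pattern is constant.

```python
import math

# Hypothetical items x projects usage matrix: one row of counts per item,
# one column per project, in the order given by `projects`.
projects = ["project_a", "project_b", "project_c"]
usage_matrix = {
    "item_1": [10, 20, 0],
    "item_2": [0, 0, 8],
    "item_3": [5, 10, 1],
}


def usage_pattern(matrix, projects, project):
    """The column of counts for one project: its Wikidata usage pattern."""
    j = projects.index(project)
    return [row[j] for row in matrix.values()]


def pearson(x, y):
    """Pearson correlation between two usage patterns of equal length."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


a = usage_pattern(usage_matrix, projects, "project_a")
b = usage_pattern(usage_matrix, projects, "project_b")
c = usage_pattern(usage_matrix, projects, "project_c")
print(pearson(a, b))  # same pattern at twice the volume
print(pearson(a, c))  # a different pattern
```

The first pair illustrates that two projects can differ greatly in usage volume and still share the same usage pattern, which is the kind of similarity the semantic analyses are after.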

In essence, the technology and mathematics behind WDCM rely on the same set of practical tools and ideas that support the development of semantic search engines and recommendation systems, only applied to a specific dataset that encompasses the usage patterns of tens of millions of Wikidata entities across the client projects.

Dashboard: Semantic Models

Each of the 14 currently used semantic categories in the WDCM Taxonomy of Wikidata items receives a separate topic model. Each topic model encompasses two or more topics, or semantic themes. Here you can select a semantic category (e.g. "Geographical Object", "Human") and a particular topic from its model. The page will produce three outputs: (1) the Top 50 items in this topic chart, which presents the 50 most important items in the selected topic of the selected category's topic model, (2) the Topic similarity network, which presents the similarity structure among the 50 most important items in the selected topic, and (3) the Top 50 projects in this topic chart, which presents the 50 Wikimedia projects in which the selected topic plays a prominent role in the selected semantic category.

Dashboard: Project Semantics

Make a selection of Wikimedia projects here and hit Apply Selection. The Dashboard will produce a series of charts, one for each Wikidata semantic category that is present in your selection of projects, and compute the relative importance (%) of each topic in the given selection for each semantic category. Keep in mind that category-specific semantic models do not necessarily encompass the same number of topics (in fact, they rarely do); also, Topic n in one category is not the same thing as Topic n in another category.
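How a relative importance (%) of this kind could be derived is sketched below. This is an assumption made for illustration, not the dashboard's documented computation: it supposes that each project has a vector of topic weights within one semantic category, sums those weights over the selected projects, and normalizes the totals to percentages. The weights and the function name `topic_importance` are hypothetical.

```python
# Hypothetical topic weights for ONE semantic category (NOT real WDCM output).
# Each project has one weight per topic of that category's topic model.
topic_weights = {
    "enwiki": [0.6, 0.3, 0.1],
    "frwiki": [0.2, 0.5, 0.3],
    "ruwiki": [0.1, 0.1, 0.8],
}


def topic_importance(weights, selection):
    """Relative importance (%) of each topic across the selected projects.

    Assumed rule: sum the topic weights over the selection, then
    normalize the totals so that they add up to 100.
    """
    n_topics = len(next(iter(weights.values())))
    totals = [0.0] * n_topics
    for project in selection:
        for topic, weight in enumerate(weights[project]):
            totals[topic] += weight
    grand_total = sum(totals)
    return [100.0 * total / grand_total for total in totals]


shares = topic_importance(topic_weights, ["enwiki", "frwiki"])
for topic, share in enumerate(shares, start=1):
    print(f"Topic {topic}: {share:.1f}%")
```

Because every semantic category has its own topic model, such a computation would be repeated per category, and the resulting percentages are only comparable within one category.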

Dashboard: Similarity Maps

When you select a semantic category, the dashboard presents a 2D map of the similarities between Wikimedia projects, computed from the semantic model of the selected category only. Here you can see how similar or different the sister projects are with respect to their usage of Wikidata items from a single semantic category.

User Feedback

Any feedback on the use of WDCM is welcome and will be highly appreciated.

You can leave your comments here on the project talk page, or directly contact Goran S. Milovanović, Data Scientist, Wikimedia Deutschland, who currently maintains the WDCM system.

In the near future, cognitive walkthroughs will be organized with volunteers to improve the usability of the WDCM dashboards. Other means of collecting user feedback will be used as well.

You may also want to follow the WDCM-related discussions on the Wikidata mailing list.

If you would like to discuss or help improve the technical aspects of the WDCM system, leaving a comment on the WDCM Wikitech page is the way to go.

How to Contribute

User feedback is essential to the development of analytical systems such as WDCM, so sharing your experience with WDCM is of the utmost importance. The place to do so is the project talk page.

If you are an R developer and a Cognitive/Data Scientist willing to contribute to the development of the WDCM System, contact Goran S. Milovanović, Data Scientist, Wikimedia Deutschland.

The most useful contribution we can imagine at the moment is sharing your experience in interpreting the WDCM results that you obtained for whatever analytical purposes you may have had.

If you have an idea for contributing to the WDCM system in a way that is not listed here, contact the system developer or simply leave a comment here on the project talk page.