Wikidata:Requests for comment/Domain name as data

An editor has requested the community to provide input on "Domain name as data" via the Requests for comment (RFC) process. This is the discussion page regarding the issue.

If you have an opinion regarding this issue, feel free to comment below. Thank you!

Wikidata appears to have no way to store the domain name associated with an item. There are many properties for URLs. But, while related to a URL, a domain name is a different value and, in some ways, more significant data.

Unlike an URL, a domain name is:

Regulated and registered (with a registration date).
Bought, sold, and owned by an organization or person.
Immutable (URLs change: e.g., http to https).
Always accessible for active websites (URLs can become 404s).
Used for multiple protocols: web, email, ftp, etc.

Many editors, in thousands of items, have entered the domain name as an Alias. But several other editors have told me that domain name doesn't fit the criteria for Alias well. So where in Wikidata do we store this critical data for many thousands of entities: its domain name (and domain registration date)?

Question How do you define a domain name? If it's mere removal of "http(s)://" and what follows after the ".com/", then it's something that can be easily derived from URL and then I don't see why that should be stored independently... Please explain Vojtěch Dostál (talk) 19:32, 14 January 2024 (UTC)[reply]

Answer @Vojtěch Dostál: A domain name is a string that identifies a realm of administrative autonomy, authority or control. It consists of a top-level domain (TLD) and a second-level domain. Domain names identify Internet resources, such as computers, networks, and services. Owners of domain names can map URLs to access specific locations on those resources.

An URL (Uniform Resource Locator) is a reference to a resource that specifies its location on a computer network and a mechanism for retrieving it (e.g., a web, ftp, email or other server). It may contain the domain name along with any string as a subdomain (e.g., www., apps.).

The domain name can be parsed from the URL (using a more complicated regex than you suggest and provided the URL isn't a redirect) but it is a different value and different datatype than an URL. — Hearvox (talk) 20:07, 14 January 2024 (UTC)[reply]
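As a rough illustration of the parsing mentioned above, here is a minimal Python sketch that pulls the registrable domain out of a URL. It assumes the URL includes a scheme, and it uses a tiny hand-written set of multi-label suffixes (`MULTI_LABEL_SUFFIXES`, a hypothetical stand-in for the full Public Suffix List), so it is not a complete solution. It also does not follow redirects or handle IP-address hosts.

```python
from urllib.parse import urlsplit

# Hypothetical, deliberately tiny stand-in for the Public Suffix List.
# A real implementation would load the full list; this set is an assumption.
MULTI_LABEL_SUFFIXES = {"co.uk", "ac.uk", "com.au"}


def registrable_domain(url: str) -> str:
    """Return the registrable domain (e.g. 'example.com') for a URL."""
    # .hostname is lower-cased and has any port or userinfo removed.
    host = urlsplit(url).hostname
    if not host:
        raise ValueError(f"no host found in {url!r}")
    labels = host.rstrip(".").split(".")
    # Keep three labels when the last two form a known multi-label suffix,
    # otherwise keep the last two (second-level domain + TLD).
    keep = 3 if ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES else 2
    return ".".join(labels[-keep:])


print(registrable_domain("https://www.whitehouse.gov"))
print(registrable_domain("http://apps.example.co.uk:8080/index.html"))
```

The suffix lookup is the reason a single regex tends to fall short: whether a host such as apps.example.co.uk reduces to example.co.uk or co.uk depends on registry policy, not on the shape of the string.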

Companies register large numbers of domain names, as variants to avoid passing off, as blockers, and for possible future projects, so the registrations are often commercially sensitive. How public is this information from the registrars now, and how many entries would be obfuscated through brokers? And for big organisations, the registration could be some way from the public website brand. Official_website has the advantage of being a single public entry per object. Recording them would need a new property, which I'd be sympathetic towards. Aliases are a bad solution. – The preceding unsigned comment was added by Vicarage (talk • contribs) at 20:40, 14 January 2024 (UTC).[reply]

Answer 2, an imperfect analogy: An entity often has a 'street address' value, from which you can parse the city. Still, editors always add the city separately in a 'located in the administrative territorial entity' statement. The city, like the domain-name, is administrative; the street address, like the URL, is a location in that admin area.

A domain name is a human-memorizable string that finds the numerically addressed Internet resource, via the Domain Name System registry (which mostly stores domain names, not URLs). Things like URLs and email addresses use a domain-name to access something at a specific location within that domain, e.g., send a webpage or receive an email.

Also unlike an URL, a domain name has a registration date: the day that domain became accessible on the internet. A new property, as @Vicarage suggests, would allow that date as a qualifier (reg. date, BTW, can be an indicator of site reliability). However, reg. date would often be inaccurate for URLs that are https: that protocol was adopted long after many domains were registered. (There's lots of other domain-name specific data, such as ownership, registrar, and webhost, none of which would accurately apply to an URL.)

IMO, the wiki that "acts as central storage for the structured data" of all Wikimedia should devote a property to storing domain-name (and its reg. date) as their own discrete values. — Hearvox (talk) 16:37, 15 January 2024 (UTC)[reply]

On a practical note, how would you discover the information? I checked my personal vanity domain and also the club I run in the UK, and in both cases got details of a broker, not the organisation. I then checked cnn.com, and got a broker again. The point of official_website or email domains is that organisations declare their connection to them and their use. Vicarage (talk) 05:52, 16 January 2024 (UTC)[reply]

@Vicarage: Registration date is discoverable in the WhoIs databases. Webhost is often detectable via the name server. (When broker, not owner, is listed, we'd determine ownership by inspecting site, as we do for an URL.) So the whois for cnn.com tells us it was registered in 1993, owned by TBS, and webhost-ed by AWS.


Creation Date: 1993-09-22T04:00:00Z
Registrant Organization: Turner Broadcasting System, Inc.
Name Servers: ns-1086.awsdns-07.org
Registrar: NOM-IQ Ltd dba Com Laude

— Hearvox (talk) 19:51, 22 January 2024 (UTC)[reply]
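To show how fields like those quoted above could be turned into structured values, here is a small Python sketch that parses "Key: Value" lines and reads the creation date. The field names come from the excerpt in this comment. Real WHOIS output varies by registry and registrar, so the names and the date format are assumptions, and `parse_whois` is a hypothetical helper.

```python
from datetime import datetime

# Sample text copied from the cnn.com excerpt quoted in this thread.
SAMPLE = """\
Creation Date: 1993-09-22T04:00:00Z
Registrant Organization: Turner Broadcasting System, Inc.
Name Servers: ns-1086.awsdns-07.org
Registrar: NOM-IQ Ltd dba Com Laude
"""


def parse_whois(text: str) -> dict[str, list[str]]:
    """Collect 'Key: Value' lines; a key may appear more than once."""
    fields: dict[str, list[str]] = {}
    for line in text.splitlines():
        # Split on the first colon only, so timestamps keep their colons.
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields.setdefault(key.strip(), []).append(value.strip())
    return fields


record = parse_whois(SAMPLE)
# Assumes the ISO-like timestamp format seen in the sample.
created = datetime.strptime(
    record["Creation Date"][0], "%Y-%m-%dT%H:%M:%SZ"
).date()
print(created.isoformat(), record["Registrant Organization"][0])
```

The creation date is the value that would feed a registration-date qualifier. The name-server field is only a hint about hosting and would need separate checking.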

Authoritative name servers for a domain are not necessarily indicative of who's hosting a particular website. There are a lot of layers in-between those. Elizium23 (talk) 20:02, 22 January 2024 (UTC)[reply]

I’m not sure why you’ve created an (as of now, unlisted) RfC, this seems like a property proposal. Apart from data duplication, I also see a modeling issue: White House (Q35525)official website (P856)"https://www.whitehouse.gov" makes sense but White House (Q35525)"domain name""whitehouse.gov" seems somewhat weird because “whitehouse.gov” really isn’t a statement about the White House but rather some sort of technical concept that is only indirectly linked to the White House. --Emu (talk) 17:21, 15 January 2024 (UTC)[reply]
I was counseled by Wikidata editors in Wikidata:Project_chat to create an RFC. As I'm new to Wikidata, I followed their advice.

The domain-name statement would be added, at minimum, for the entity that either owns or is associated with it: "inquirer.com" for the The Philadelphia Inquirer newspaper, "aclu.org" the American Civil Liberties Union nonprofit, "ucla.edu" for the University of California, Los Angeles. Each would have a registration date as qualifier — useful data that would be inappropriate to associate with an URL.

If this belongs in a Property Proposal, not here, I'll move it: That is, if anyone besides me thinks domain-name is an essential bit of data about entities.

Hearvox (talk) 18:04, 15 January 2024 (UTC)[reply]

RFC seems fine since this isn't yet a property proposal -- it's a bunch of things including motivation, concerns, open questions, the outline of a property proposal, notes of some existing not quite apt options, &c.

Something like "associated domain name" makes sense to me. An amorphous entity like White House might have one official website, but a number of notable associated domains. (whitehouse.gov, lawnevents.gov, &c). Likewise a website (the content or associated org) could have a number of domains over time, as new ones are acquired or old ones expire. Sj (talk) 02:50, 16 January 2024 (UTC)[reply]

Support making this a property proposal. @Hearvox: - I think this is a good idea, thanks for proposing it. ArthurPSmith (talk) 16:29, 22 January 2024 (UTC)[reply]
Support In my work at Wikidata for Web (Q99894727) I often need to work with domains, but I can only ever query for URLs, which makes things so much more complicated. Let's say I want to query for the item that has http://example.com/ as official website (P856). What I have to query for instead is:
- http://example.com/ (http, no www, trailing slash)
- http://www.example.com/ (http, www, trailing slash)
- http://www.example.com (http, www, no trailing slash)
- https://example.com/ (https, no www, trailing slash)
- https://www.example.com/ (https, www, trailing slash)
- https://www.example.com (https, www, no trailing slash)
- http://example.com/index.html (http, no www, index)
- http://www.example.com/index.html (http, www, index)
- https://example.com/index.html (https, no www, index)
- https://www.example.com/index.html (https, www, index)

and I am still missing combinations that are common but not common enough to implement, like language prefixes or non-HTML file extensions: https://www.example.com/en-us/index.php

querying for a simple domain name would be so much easier: *.example.com –Shisma (talk) 18:03, 22 January 2024 (UTC)[reply]
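The combinatorics described above can be sketched in a few lines of Python. The helper names `url_variants` and `bare_host` are hypothetical, and the set of schemes, prefixes and paths is only the subset listed in this comment, not an exhaustive one.

```python
from itertools import product
from urllib.parse import urlsplit


def url_variants(domain: str, paths=("", "/", "/index.html")) -> list[str]:
    """Enumerate scheme x www-prefix x path combinations for one domain."""
    return [
        f"{scheme}://{prefix}{domain}{path}"
        for scheme, prefix, path in product(("http", "https"), ("", "www."), paths)
    ]


def bare_host(url: str) -> str:
    """Host of a URL with a leading 'www.' removed (empty if no host)."""
    host = urlsplit(url).hostname or ""
    return host.removeprefix("www.")


variants = url_variants("example.com")
print(len(variants))                      # number of URL strings to query
print({bare_host(u) for u in variants})   # distinct hosts they reduce to
```

Every generated URL reduces to the same host, which is the asymmetry at issue: many URL strings to enumerate on the query side, one domain value on the data side.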

@Shisma: I love that you're working on Wikidata for Web! This seems right up your alley. Do you have suggestions for how to formulate a related property proposal? Sj (talk) 14:21, 25 January 2024 (UTC)[reply]

I imagine a simple generic property like domain assigned to each item that is associated with a particular domain:

Wikipedia (Q52) → *.wikipedia.org

English Wikipedia (Q328) → en.wikipedia.org

I hope this is how domains work. 😅

URL properties like official website (P856), privacy policy URL (P7101) or URL (P2699) should still remain present. Shisma (talk) 15:21, 25 January 2024 (UTC)[reply]

also organization (Q43229) → *.*.org – Shisma (talk) 15:22, 25 January 2024 (UTC)[reply]
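One way to read the wildcard patterns above is as shell-style globs. The Python sketch below makes that assumption explicit. `host_matches` is a hypothetical helper, and glob semantics are only one possible interpretation of `*.wikipedia.org`, not something this proposal defines.

```python
from fnmatch import fnmatchcase


def host_matches(host: str, pattern: str) -> bool:
    """Glob-style match of a host name against a pattern such as '*.example.org'."""
    # Host names are case-insensitive; a trailing root dot is dropped.
    return fnmatchcase(host.lower().rstrip("."), pattern.lower())


for host, pattern in [
    ("en.wikipedia.org", "*.wikipedia.org"),
    ("wikipedia.org", "*.wikipedia.org"),
    ("en.wikipedia.org", "*.*.org"),
    ("wikipedia.org", "*.*.org"),
]:
    print(host, pattern, host_matches(host, pattern))
```

Under this reading, the bare domain wikipedia.org does not match `*.wikipedia.org`, and `*.*.org` requires at least three labels, so a property definition would need to say whether the apex domain is included.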

@Shisma: I have the identical problem in matching news outlets with Wikidata QIDs (for a WikiCred project). In many cases it's difficult to programmatically locate a match. In others the problem is too many matches (and a painfully slow query). Searching for "nytimes.com" within 'official website' takes 56 seconds to return 139 results, only one of which I really want (the main NYT item). Whereas a more focused domain-name search, like an exact match of "nytimes.com" as an alias, takes 1 second to return the one correct result.

BTW, your Wikidata for Web (Q99894727) extension is proving quite helpful in this work. — Hearvox (talk) 17:18, 30 January 2024 (UTC)[reply]
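For comparison, the two kinds of search described above might look roughly like the SPARQL strings built by this Python sketch. The comment does not give the actual queries, so both are guesses. The function names are hypothetical, and the prefixes `wdt:` and `skos:` are assumed to be predefined, as they are on the Wikidata Query Service.

```python
import re

# Conservative check so a domain can be embedded in a query string safely.
_DOMAIN_RE = re.compile(r"[a-z0-9.-]+")


def _require_plain_domain(domain: str) -> str:
    if not _DOMAIN_RE.fullmatch(domain):
        raise ValueError(f"unexpected characters in domain: {domain!r}")
    return domain


def website_contains_query(domain: str) -> str:
    """Substring search over official website (P856) values."""
    domain = _require_plain_domain(domain)
    return (
        "SELECT ?item ?url WHERE {\n"
        "  ?item wdt:P856 ?url .\n"
        f'  FILTER(CONTAINS(STR(?url), "{domain}"))\n'
        "}"
    )


def alias_exact_query(domain: str, lang: str = "en") -> str:
    """Exact match on an alias in one language."""
    domain = _require_plain_domain(domain)
    return (
        "SELECT ?item WHERE {\n"
        f'  ?item skos:altLabel "{domain}"@{lang} .\n'
        "}"
    )


print(website_contains_query("nytimes.com"))
print(alias_exact_query("nytimes.com"))
```

The first form has to test every P856 value against the filter, while the second is a direct lookup of one literal. That difference is consistent with the gap in timings reported above, though the sketch does not measure it.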

a domain property might also be a useful qualifier for URL match pattern (P8966). Currently I have to query all existing patterns, which needlessly creates traffic, and I also have to run all these regexes. If I could only query for patterns for a particular domain, this would not be an issue – Shisma (talk) 11:08, 3 February 2024 (UTC)[reply]

Comment The DNS system is hierarchical: https://etcsl.orinst.ox.ac.uk can only be said to 'point' somewhere (to an IP address, specifically) because orinst.ox.ac.uk published a statement (a DNS record) saying that it does. Likewise for the relationship between orinst.ox.ac.uk and ox.ac.uk, and between ox.ac.uk and ac.uk, and even between ac.uk and .uk (the TLD). I think a rigorous way to do this would be to have items for each of these technical entities, and a property to link them in a way that models this relationship. This would mean that a domain is considered separate from the content it serves, which is unusual but not unprecedented. Arlo Barnes (talk) 22:38, 22 January 2024 (UTC)[reply]
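The chain of parent domains described above can be listed mechanically. This Python sketch shows the sequence of names that would each become an item under that model. `parent_chain` is a hypothetical helper, and it works on labels only, without consulting DNS.

```python
def parent_chain(host: str) -> list[str]:
    """Return the host followed by each enclosing domain, ending at the TLD."""
    labels = host.lower().rstrip(".").split(".")
    # Drop one leading label at a time: a.b.c -> [a.b.c, b.c, c]
    return [".".join(labels[i:]) for i in range(len(labels))]


chain = parent_chain("etcsl.orinst.ox.ac.uk")
# Each adjacent pair is one child -> parent link in the hierarchy.
for child, parent in zip(chain, chain[1:]):
    print(f"{child} -> {parent}")
```

Each printed pair corresponds to one statement of the linking property suggested in the comment.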
Related to @Arlo Barnes's comment is James Hare's (@Harej) work on the Internet Domains Wikibase. Domain names (nytimes.com), subdomains (cn.nytimes.com), and even subdomain/paths (www.nytimes.com/es) can be separate items (search results: nytimes.com). — Hearvox (talk) 19:45, 24 January 2024 (UTC)[reply]

At the risk of pulling this off the tracks: what about making items to represent domains and then connecting those to the organizations that control them? So we would have an item for "google.com" and maybe a new property that links that to Google (Q95). Without this I'm not sure how we'd handle something like Google (Q95). They operate on like... tons of domains. Do we also list about.google? BrokenSegue (talk) 00:32, 3 February 2024 (UTC)[reply]
@BrokenSegue: Would owned by (P127) work as the connection? — Hearvox (talk) 02:27, 3 February 2024 (UTC)[reply]
very possibly. though things get weird. for example "nytimes.com" is owned by (P127) The New York Times Company (Q2529982) but presumably you want to link it to The New York Times (Q9684). I think it's not entirely clear what kind of relationship you are trying to represent in the general case. BrokenSegue (talk) 02:29, 3 February 2024 (UTC)[reply]

@hearvox: are you going to create a proposal? –Shisma (talk) 11:09, 3 February 2024 (UTC)[reply]

@Shisma: Hard to decide. Some here support a new domain-name property, others suggest domain-names as items, and pretty much everyone here is way more Wikidata-experienced than me. To propose a property, I'd need time to research best practices, examine examples, etc. If we went the items route, I'd suggest importing @Harej's Internet Domains Wikibase, which already has domains linked to QIDs (but we'd still need a new 'associated with'-type prop). So, to answer: Dunno. What I do know is:

For news outlets (which I'm working on), it's ridiculously hard to add/update data from external databases into Wikidata. Matching outlets to WD items is nearly impossible to do programmatically; it takes rigorous, visual, individual inspection of possible matches. Names (labels) and URLs are too variable. Domain-names, however, would solve that (99% of the time).
URLs are derived from domain names, not the other way around. Some folk here think it's fine that in order to get the core, immutable data (domain), you have to parse it from the derived, variable data (URL). To me, that doesn't seem like a solid structured-data approach.
URLs and domain names are different data types: URL is type URL, domain names are strings.
When someone says, "I read it at CNN" (vs. "I saw it on CNN"), they mean at "cnn.com". Domain name is an increasingly common way to precisely identify an information source.
Wikidata's ability to connect to external data, and to allow simple source-identifying machine and human searches, is crippled by its lack of a well-structured way to store domain names and their associations.

Everyone above makes excellent points, but I can't see consensus. So, in sum, I guess this discussion has induced me into inaction. —Hearvox (talk) 15:18, 4 February 2024 (UTC)[reply]