Rethink Event URIs
We badly need a consistent, clean way to generate URIs for events.
Blueprint information
- Status:
- Complete
- Approver:
- Seif Lotfy
- Priority:
- Essential
- Drafter:
- Mikkel Kamstrup Erlandsen
- Direction:
- Needs approval
- Assignee:
- Seif Lotfy
- Definition:
- Superseded
- Series goal:
- Accepted for 0.3
- Implementation:
- Informational
- Milestone target:
- 0.3.0
- Started by
- Completed by
- Mikkel Kamstrup Erlandsen
Whiteboard
BLUEPRINT IS SUPERSEDED: Events will not have URIs, we use sequence numbers as event ids (uint32).
I would suggest that an event URI look like the following:
namespace / eventclass / eventtype / timestamp / subject_id / actor_id
e.g:
http://
*** Approval:
Seif: I think this is essential: +1 for approval
Markus: +1 (kamstrup inferred this from Markus' comments)
kamstrup: +1
RainCT: +1
*** Discussion:
kamstrup: I don't think we should use http for the URI scheme. I also very much like the idea of human-parseable timestamp strings. Also, I don't think we can expose the *_ids in the URI. We should not guarantee these to be stable externally. We have URIs after all. Ids are for fast internal lookups.
---
seb: Hi guys,
URIs should be suitable as a reliable identifier for a resource, ideally on a global scale. Therefore I think a URI for an event needs to contain all the information required to address an event globally (a primary key), that is: device, application, time and type (class).
However, uniquely identifying the device is tricky, since this is usually done by a URL, which is not necessarily unique, and embedding a URL inside a URI is not really elegant. I do not think that identifying the device will be a big issue in the short run, but it may be helpful, especially when thinking about Teamgeist and RDF. So I've been thinking about that issue this evening and came up with the following proposal:
A device should get a unique (numeric) identifier. Many people would think of using a MAC address; however, this is not an optimal solution, since one device may have anywhere from zero to n network cards built in and therefore may not even have a MAC address, although this is rare. For this reason I propose to just generate a sufficiently large random number (a UUID) which identifies one device, no matter which network technology it is using.
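As a rough illustration of the random-identifier idea, a device ID could be generated once and then reused on every later start. This is only a sketch in Python: the function name and the storage path are hypothetical, and nothing in this discussion specifies where such an ID would be kept.

```python
import uuid


def load_or_create_device_id(path):
    """Return the UUID stored at `path`, creating a random one on first use.

    `path` is a hypothetical location for the per-device identifier.
    """
    try:
        with open(path) as f:
            return uuid.UUID(f.read().strip())
    except FileNotFoundError:
        # uuid4() is purely random, so it does not depend on any
        # network hardware (and therefore not on a MAC address).
        device_id = uuid.uuid4()
        with open(path, "w") as f:
            f.write(str(device_id))
        return device_id
```

Calling the function twice with the same path returns the same UUID, which is what makes it usable as a stable device component of an identifier.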
Concerning the URI scheme: a URI that identifies a Zeitgeist event should have its own scheme identifier. 'http' is not really suitable for this, since its scheme-specific part is defined to address hypertext documents and not events. Events are purely virtual and will never manifest in an addressable document. Therefore, I propose to either use our own scheme, i.e. zg (Zeitgeist), or use a URN. Implementing a URN could look like this:
urn:zg:<schema version>:<device uuid>:<application id>:<timestamp>
i.e.
urn:zg:
The device UUID could easily be translated to a much more detailed description of the device, but would still be globally unique. The stupid thing about this is that the URN is quite long; however, in order to ensure that event URNs are externally stable without any further processing, this is unavoidable. Another problem with this proposal is that the class identifiers for events should also be URIs (http://
seb: After reading all this again, I came to the conclusion that an event URN does not have to contain any human-readable information at all. Basically, the only requirement is that the URN provides a _unique_ identifier, which could be provided entirely by a UUID:
urn:zg:
---
Seif: I like my proposal for 2 reasons:
1) the URI generated lets us make sure that if a data provider sends us the same event twice (such as Firefox trying to reinsert its history), we can check whether it was already there before
2) it allows us to ban events from going out or being registered
---
seb: You can ensure both:
1) by computing the UUID from the event data, i.e. take the device + application + time + whatever and make a SHA-1 hash out of it. This way, the same event will always get the same URI.
2) you should do this not by parsing a URI, but from the metadata that is associated with it. This is because if you want to implement new filtering techniques or add new metadata to an event, you do not have to change the event's URI. This is more flexible and stable in the long term.
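Point 1 can be sketched with Python's standard `uuid.uuid5`, which derives a UUID from a namespace and a name using SHA-1. The namespace value, the field list and the function name below are assumptions made for illustration; the discussion does not fix which fields go into the hash.

```python
import uuid

# Hypothetical namespace for Zeitgeist events; any fixed UUID would do,
# as long as every machine uses the same one.
ZG_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "urn:zg")


def event_uuid(device_id, application, timestamp_ms, event_type):
    """Derive a deterministic UUID from the fields that define an event."""
    # Join with a separator assumed never to occur inside the fields, so that
    # ("ab", "c") and ("a", "bc") cannot produce the same name.
    name = "\x1f".join([str(device_id), application, str(timestamp_ms), event_type])
    return uuid.uuid5(ZG_NAMESPACE, name)
```

Because the result depends only on the inputs, a data provider that resends the same event yields the same UUID, while a change in any field yields a different one.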
---
Seif: can you guys PROVIDE a written example?
what about zg:<actor_
zg:firefox.
I think this would be a good example and also very readable for the human eye for debugging purposes later :)
This event description ensures that the same event can never be inserted into the DB twice.
---
kamstrup: Guys - what is the reason to have such long, complicated URIs? I haven't seen anyone explain _why_ their proposals are good. The only reason I can see for complex URIs is debugging, which I find a rather bad reason to waste disk space and DBus bandwidth. A good URI should have the following traits:
* Be unique (doh!)
* Be as short as possible
* Facilitate debugging (which includes being human-parseable)
* Be fast to generate
What I proposed on the linked wiki spec page fulfills all of these requirements:
zg://<ISO8601 timestamp>
It is unique because the sequence number ensures we can have any number of events within the same millisecond. It is short because it only includes the timestamp. It facilitates debugging because the timestamp is human-readable *and* sorts in timeline order with standard ASCII string sorting. It is faster to generate than a UUID or a longer string.
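A minimal sketch of a timestamp-plus-sequence-number generator, in Python. The exact layout is an assumption: the `#` separator, the four-digit zero padding, the UTC `Z` suffix and the class name are all hypothetical choices made here so that plain string sorting matches timeline order. The sketch also assumes that timestamps arrive in non-decreasing order.

```python
from datetime import timezone


class EventUriGenerator:
    """Builds URIs from a millisecond timestamp plus a per-millisecond counter."""

    def __init__(self):
        self._last_stamp = None
        self._seq = 0

    def next_uri(self, when):
        """Return a URI for an event at timezone-aware datetime `when`."""
        utc = when.astimezone(timezone.utc)
        stamp = utc.strftime("%Y-%m-%dT%H:%M:%S") + ".%03dZ" % (utc.microsecond // 1000)
        if stamp == self._last_stamp:
            # Same millisecond as the previous event: bump the counter.
            self._seq += 1
        else:
            self._last_stamp = stamp
            self._seq = 0
        # Zero padding keeps ASCII order equal to numeric order (up to 9999).
        return "zg://%s#%04d" % (stamp, self._seq)
```

Without the zero padding, a counter value of 10 would sort before 2 in a plain string comparison, which would break the "sorts in timeline order" property within a single millisecond.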
---
kamstrup: @seb - It worries me a bit that you suddenly want to make the URIs globally unique in the sense of global="the entire world". I mean, file://
kamstrup: @seif: Filtering, or otherwise handling events purely by parsing their URI, seems like a bad idea to me. Given a URI, Zg has all the metadata ready at hand anyway, so why encode everything in a URI?
---
RainCT: I agree with kamstrup so far.
---
seif: @kamstrup what do you mean by <#sequence number>? How can I ensure that I don't get the same event from Firefox again (reinserting history), and that it is not inserted because it was already there before?
---
seb: @kamstrup: I think I explained the reasoning behind my proposal, and why I think it's good, in my first post. But anyway: I propose that the URIs be globally unique because you may want to work with URIs from different computers at the same time; think of Teamgeist (read my first post) and of set operations on RDF graphs. In this regard, your proposal is only unique within one machine, and would generate the same URI on different machines for different events. I ask myself why you would want to program an extra logic layer just to map local URIs to global scope if you can have this for free (efficiency?).
To recapitulate in your terms, my proposal is:
- unique (globally)
- short (constant length)
- fast to generate (thinking of modern cpu capacities, generating a hash is very efficient)
I wonder what a timestamp and some sequence _number_ will tell me when debugging a probably complex problem. In most cases you will want to copy the URI into a query just to get more information. In my opinion, providing sophisticated and comfortable debugging information is the purpose of an application / database framework and _not_ of the identifier itself. This also has the advantage that you can provide information that is as long and sophisticated as you want. What is missing in my last proposal in this regard is a version number, which provides information about how the identifier is computed:
urn:zg:
---
kamstrup: @seif - I can't and neither can you :-) The only way to do true duplicate detection is to let the application provide some unique id for the event. Consider for example two mouse clicks on the same coordinate within the same millisecond...
kamstrup: @seb - Whether or not to use completely opaque ids is really the same question as Bazaar rev. numbers vs. Git rev. hashes. I am a Bazaar guy - I take it you are a Git guy :-)
Anyways, back to the topic of globally unique URIs. I am not sure what the right context to make URIs unique in is... There are several options which all have pros and cons:
- One per person per computer (store one UUID per user per computer)
- Several per person per computer (several UUID keys per computer, just like I can have several PGP/SSH keys)
- Per person (e.g. tie the URI to an email account, OpenID, or whatever)
We should also consider cooperating with Nepomuk - there was a *lengthy* thread on the Nepomuk list about this very topic a while ago: http://<email address hidden>
My biggest concern here is that I feel a big "Here Be Dragons" sign hanging over this whole globally-unique-ids deal. There are just a lot of unresolved problems (as the Nepomuk thread also highlights very well). It is not hard to come up with a bunch of schemes that would be globally unique in some sense. The tricky part is getting it "unique in the right way". I fear that this could considerably slow down Zeitgeist development if we don't get it right off the bat.
---
Seif: @kamstrup: actually I can. I can detect duplicates if I generate the URI from some given variables of the event, such as zg:<timestamp>
This allows me to look up in the DB whether an event with this URI already exists. And there can only be one event with that URI, because only one thing can happen to a subject from the same actor at the same time instant. A double click would get through, but reinserting history won't. And for me this is the only way to filter it out.
---
Markus: If Launchpad had a veto button for blueprints, I would use it for the idea of having globally unique event URIs, so -1 from me for this proposal.
I prefer Mikkel's solution,
zeitgeist:
is unique for one user, is simple enough to generate and provides all information we need.
---
Seif: After a long talk with Mikkel we ended up with this solution. We will use Mikkel's URI suggestion, like this:
zeitgeist:
These URIs are generated on the engine side.
However, we will add a "Bouncer" unit that makes sure no URI is generated for an event whose "timestamp, subject and application" already exist. In my opinion the Bouncer should sit at the very front of the insert_event method. Although it will cost some performance (1 select per insert), in the normal case we can assume that at most 10 events happen in one second. It is rare that we try to insert 200 events, except on startup.
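The Bouncer check could look roughly like the following Python sketch. It assumes an SQLite store; the table name, column names, index and the simplified `insert_event` signature are hypothetical and only illustrate the "1 select per insert" idea, not the engine's actual schema.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open a database with a hypothetical, minimal event table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS event ("
        " timestamp INTEGER, subject TEXT, application TEXT)"
    )
    # An index on the lookup key keeps the per-insert SELECT cheap.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS event_key"
        " ON event (timestamp, subject, application)"
    )
    return conn


def insert_event(conn, timestamp, subject, application):
    """Insert the event unless an identical one exists; return True if inserted."""
    # The Bouncer: one SELECT per insert, run before anything is stored.
    row = conn.execute(
        "SELECT 1 FROM event"
        " WHERE timestamp = ? AND subject = ? AND application = ?",
        (timestamp, subject, application),
    ).fetchone()
    if row is not None:
        return False  # duplicate, e.g. a provider re-sending its history
    conn.execute(
        "INSERT INTO event (timestamp, subject, application) VALUES (?, ?, ?)",
        (timestamp, subject, application),
    )
    return True
```

With this shape, re-inserting an event with the same timestamp, subject and application is rejected, while the same subject and application at a different timestamp still goes through.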