From 4bd485d894a61b040adcd9b5589b62ec111546d4 Mon Sep 17 00:00:00 2001 From: Boris Kolpackov Date: Mon, 22 Apr 2024 11:57:51 +0200 Subject: Specify new GitHub notification semantics --- mod/mod-ci-github.cxx | 112 ++++++++++++++++++++++++++++++++++++++----------- mod/tenant-service.hxx | 22 +++++----- 2 files changed, 99 insertions(+), 35 deletions(-) diff --git a/mod/mod-ci-github.cxx b/mod/mod-ci-github.cxx index 6db5202..3ff2330 100644 --- a/mod/mod-ci-github.cxx +++ b/mod/mod-ci-github.cxx @@ -37,31 +37,6 @@ // https://en.wikipedia.org/wiki/HMAC#Definition. A suitable implementation // is provided by OpenSSL. -// @@ TODO LATEST PROBLEMS -// -// 1) GH allows following transitions: -// -// queued <--> building -// queued --> built -// building --> built -// -// I.e., GH does not allow transitions away from built. So this simplifies -// things for us. -// -// 2) Create check run does not fail if an CR with the same name already -// exists: instead it destroys the existing check run before creating -// the new check run (with a new node ID). And to a GH user this appears -// exactly like a transition from built to building/queued. -// -// So if the first notification's data has not yet been stored, before -// creating the CR we first need to check whether it already exists on GH -// and then create or update as appropriate. -// -// For build_queued() we can get the list of check runs in a check suite, -// 100 max at a time (pagination), so it can be done in N/100 -// exchanges. And using GraphQL the response would be much smaller (return -// only the CR name). -// // @@ TODO Centralize exception/error handling around calls to // github_post(). Currently it's mostly duplicated and there is quite // a lot of it. @@ -382,6 +357,93 @@ namespace brep return true; } + // Build state change notifications (see tenant-services.hxx for + // background). Mapping our state transitions to GitHub pose multiple + // problems: + // + // 1. In our model we have the building->queued (interrupted) and + // built->queued (rebuild) transitions. We are going to ignore both of + // them when notifying GitHub. The first is not important (we expect the + // state to go back to building shortly). The second should normally not + // happen and would mean that a completed check suite may go back on its + // conclusion (which would be pretty confusing for the user). + // + // So, for GitHub notifications, we only have the following linear + // transition sequence: + // + // -> queued -> building -> built + // + // Note, however, that because we ignore certain transitions, we can now + // observe "degenerate" state changes that we need to ignore: + // + // building -> [queued] -> building + // built -> [queued] -> ... + // + // 2. As mentioned in tenant-services.hxx, we may observe the notifications + // as arriving in the wrong order. Unfortunately, GitHub provides no + // mechanisms to help with that. In fact, GitHub does not even prevent + // the creation of multiple check runs with the same name (it will always + // use the last created instance, regardless of the status, timestamps, + // etc). As a result, we cannot, for example, rely on the failure to + // create a new check run in response to the queued notification as an + // indication of a subsequent notification (e.g., building) having + // already occurred. + // + // The only aid in this area that GitHub provides is that it prevents + // updating a check run in the built state to a former state (queued or + // building). But one can still create a new check run with the same name + // and a former state. + // + // (Note that we should also be careful if trying to take advantage of + // this "check run override" semantics: each created check run gets a new + // URL and while the GitHub UI will always point to the last created when + // showing the list of check runs, if the user is already on the previous + // check run's URL, nothing will automatically cause them to be + // redirected to the new URL. And so the user may sit on the abandoned + // check run waiting forever for it to completed.) + // + // As a result, we will deal with the out of order problem differently + // depending on the notification: + // + // queued Skip if there is already a check run in service data, + // otherwise create new. + // + // building Skip if there is no check run in service data or it's + // not in the queued state, otherwise update. + // + // built Update if there is check run in service data and its + // state is not built, otherwise create new. + // + // The rationale for this semantics is as follows: the building + // notification is a "nice to have" and can be skipped if things are not + // going normally. In contrast, the built notification cannot be skipped + // and we must either update the existing check run or create a new one + // (hopefully overriding the one created previously, if any). Note that + // the likelihood of the built notification being performed at the same + // time as queued/building is quite low (unlike queued and building). + // + // Note also that with this semantics it's unlikely but possible that we + // attempt to updat the service data in the wrong order. Specifically, it + // feels like this should not be possible in the ->building transition + // since we skip the building notification unless the check run in the + // service data is already in the queued state. But it is theoretically + // possible in the ->built transition. For example, we may be updating + // the service data for the queued notification after it has already been + // updated by the built notification. In such cases we should not be + // overriding the latter state (built) with the former (queued). + // + // 3. We may not be able to "conclusively" notify GitHub, for example, due + // to a transient network error. The "conclusively" part means that the + // notification may or may not have gone through (though it feels the + // common case will be the inability to send the request rather than + // receive the reply). + // + // In such cases, we record in the service data that the notification was + // no synchronized and in subsequent notifications we do the best we can: + // if we have node_id, then we update, otherwise, we create (potentially + // overriding the check run created previously). + // + function (const tenant_service&)> ci_github:: build_queued (const tenant_service& ts, const vector& builds, diff --git a/mod/tenant-service.hxx b/mod/tenant-service.hxx index 2b76b2d..8e419a4 100644 --- a/mod/tenant-service.hxx +++ b/mod/tenant-service.hxx @@ -52,16 +52,18 @@ namespace brep // While the implementation tries to make sure the notifications arrive in // the correct order, this is currently done by imposing delays (some // natural, such as building->built, and some artificial, such as - // queued->building). As result, it is unlikely but possible to be notified - // about the state transitions in the wrong order, especially if processing - // notifications can take a long time. To minimize the chance of this - // happening, the service implementation should strive to batch the queued - // state notifications (of which there could be hundreds) in a single - // request if at all possible. Also, if supported by the third-party API, it - // makes sense for the implementation to protect against overwriting later - // states with earlier. For example, if it's possible to place a condition - // on a notification, it makes sense to only set the state to queued if none - // of the later states (e.g., building) are already in effect. + // queued->building). As result, it is unlikely but possible to observe the + // state transition notifications in the wrong order, especially if + // processing notifications can take a long time. For example, while + // processing the queued notification, the building notification may arrive + // in a different thread. To minimize the chance of this happening, the + // service implementation should strive to batch the queued state + // notifications (of which there could be hundreds) in a single request if + // at all possible. Also, if supported by the third-party API, it makes + // sense for the implementation to protect against overwriting later states + // with earlier. For example, if it's possible to place a condition on a + // notification, it makes sense to only set the state to queued if none of + // the later states (e.g., building) are already in effect. // // Note also that it's possible for the build to get deleted at any stage // without any further notifications. This can happen, for example, due to -- cgit v1.1