Optimize guest-to-guest sync-to-sync management of async task related infrastructure #12311

@alexcrichton

This is a meta/tracking issue about the remaining work necessary to optimize the guest-to-guest sync-to-sync adapter generated by Wasmtime when component-model-async is enabled. Some more historical discussion of this happened at "#wasmtime > Wasmtime sync<->sync adapter optimizability" as well, and I'll try to keep this up-to-date.

What is the problem

Wasmtime will compile an "adapter" with the FACT compiler when one guest component calls another. With the advent of component-model-async this adapter has a large number of permutations, for example the caller could be sync/async lowered, the callee could be sync/async lifted, and the function type itself could be sync or async. This specific issue is about the single case of a sync lowered caller, sync lifted callee, and sync function type. This doesn't mean the other permutations should be ignored, but that's the most interesting case for now.

Additionally with the advent of component-model-async it's required, spec-wise, to manage async-task-related-infrastructure when crossing component boundaries. Task infrastructure comes into play in a number of scenarios, such as:

  • When a task calls an imported function, that creates a new task. This new task has the current task as a parent task.
  • Intrinsics such as backpressure.{inc,dec} modify the backpressure counter in the current task.
  • When a task exits/returns all of its pending subtasks are "reparented" to the task's own parent.
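The scenarios above can be sketched as a toy model. This is only an illustration of the bookkeeping involved, not Wasmtime's implementation: `TaskTable`, `Task`, `TaskId`, and all method names are hypothetical, and saturating the backpressure counter at zero is a simplification.

```rust
use std::collections::HashMap;

/// Hypothetical task identifier, for illustration only.
type TaskId = u32;

#[derive(Default)]
struct Task {
    parent: Option<TaskId>,
    subtasks: Vec<TaskId>,
    backpressure: u32,
}

/// Hypothetical table of live tasks.
#[derive(Default)]
struct TaskTable {
    tasks: HashMap<TaskId, Task>,
    next_id: TaskId,
}

impl TaskTable {
    /// Calling an imported function creates a new task whose parent is the caller.
    fn spawn(&mut self, parent: Option<TaskId>) -> TaskId {
        let id = self.next_id;
        self.next_id += 1;
        self.tasks.insert(id, Task { parent, ..Task::default() });
        if let Some(p) = parent {
            if let Some(parent_task) = self.tasks.get_mut(&p) {
                parent_task.subtasks.push(id);
            }
        }
        id
    }

    /// Models `backpressure.inc` on the given task.
    fn backpressure_inc(&mut self, id: TaskId) {
        if let Some(t) = self.tasks.get_mut(&id) {
            t.backpressure += 1;
        }
    }

    /// Models `backpressure.dec`; saturating at zero is a simplification.
    fn backpressure_dec(&mut self, id: TaskId) {
        if let Some(t) = self.tasks.get_mut(&id) {
            t.backpressure = t.backpressure.saturating_sub(1);
        }
    }

    /// When a task exits, its pending subtasks are reparented to its own parent.
    fn exit(&mut self, id: TaskId) {
        let Some(task) = self.tasks.remove(&id) else { return };
        for child in &task.subtasks {
            if let Some(c) = self.tasks.get_mut(child) {
                c.parent = task.parent;
            }
        }
        if let Some(p) = task.parent {
            if let Some(parent_task) = self.tasks.get_mut(&p) {
                parent_task.subtasks.retain(|t| *t != id);
                parent_task.subtasks.extend(task.subtasks.iter().copied());
            }
        }
    }
}
```

Every one of these operations touches shared, mutable bookkeeping, which is why crossing a component boundary is not free once this infrastructure has to be maintained.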

Effectively, there are substantial pieces of infrastructure that may be used across component boundaries, and thus Wasmtime needs to handle this. This leads us to the problem: with component-model-async disabled this task management is all ignored as it's not applicable, but with component-model-async enabled it must be performed. This means that the sync<->sync adapter will call a host function to manage task infrastructure pieces.

The cost of this hostcall is relative to the situation of the adaptation being performed, but the goal of the sync<->sync adapter is to, ideally, compile to a grand total of 0 instructions. Given that it's impossible to optimize away a call into the host, this issue is thus about solving the task infrastructure management problem without actually making a host call. This should restore the prior-to-component-model-async behavior of a sync<->sync adapter compiling to pure optimizable CLIF which mostly boils away.

History and Current Status

As of the time of this writing Wasmtime doesn't actually do any manipulation of task infrastructure on sync<->sync adapters. This is a bug and results in issues such as #12128 (plus many undocumented others we have since realized). @dicej will soon have a PR to fix this situation where task infrastructure will be maintained across these boundaries.

The plan is to have a PR which will enhance the sync<->sync adapter with task infrastructure management, conditionally. The condition will be based on whether the component-model-async wasm feature is enabled in the Config. This is intended to be a stopgap because embedders should not need to disable features for performance. For the time being though it'll retain the pre-p3 performance profile of sync adapters while retaining p3-relevant spec compliance.

Future plans for optimization

Enabling Cranelift to compile these adapters to zero instructions is going to require special care and a number of refactorings of Wasmtime's task infrastructure in addition to new Cranelift optimizations. The general rough idea for the implementation is:

  • A new VMAsyncTask type will be added. Fields this will contain are:
    • A "kind", more relevant in a moment
    • Fields for context.set {0,1}
    • A parent pointer for the parent task. Option<NonNull<VMAsyncTask>>
    • Backpressure fields (if necessary still, we've talked about removing backpressure)
    • A flag of whether this task can block or not.
  • The Rust-based "full" async task will contain a VMAsyncTask as a field, alongside any other tables and such necessary. This will be similar to VMContext vs vm::Instance, for example.
  • The current task will be stored in VMContext or VMComponentContext (maybe both? unsure?)
  • Sync<->sync adapters will allocate, on the stack, a VMAsyncTask with just these fields. This will be initialized with the current task and then the current task will be set to this.
  • Manipulations of the current task will go directly through VMAsyncTask if applicable, e.g. context.{g,s}et {0,1}
  • Manipulations of the current task that require Rust data structures, for example adding a subtask, will "promote" the task from the stack to the Rust heap. This will go back through the entire chain of tasks and promote them all to the heap most likely too.
  • Returning from a sync<->sync adapter will restore the current task to its previous value.
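As a minimal sketch of the plan above, assuming names and layouts that do not exist in Wasmtime today: `VMAsyncTaskKind`, `Ctx` (a stand-in for whichever of VMContext or VMComponentContext ends up holding the current task), `sync_to_sync_adapter`, `context_get`, and `context_set` are all hypothetical, and the initial field values are placeholders.

```rust
use std::ptr::NonNull;

/// Hypothetical "kind" discriminant: does the task live on the stack or the Rust heap?
#[allow(dead_code)]
#[repr(u8)]
#[derive(Clone, Copy, PartialEq, Debug)]
enum VMAsyncTaskKind {
    Stack,
    Heap,
}

/// Hypothetical layout following the field list above (backpressure omitted).
#[allow(dead_code)]
#[repr(C)]
struct VMAsyncTask {
    kind: VMAsyncTaskKind,
    may_block: bool,
    /// Storage for `context.{get,set} {0,1}`.
    context: [u32; 2],
    parent: Option<NonNull<VMAsyncTask>>,
}

/// Stand-in for the VMContext/VMComponentContext field holding the current task.
struct Ctx {
    current_task: Option<NonNull<VMAsyncTask>>,
}

/// Push a stack-allocated task, run the callee, then restore the previous task.
/// Note: a panic in `callee` would leave a dangling pointer; a real version
/// would need a guard or would only ever run in non-unwinding compiled code.
fn sync_to_sync_adapter<R>(ctx: &mut Ctx, callee: impl FnOnce(&mut Ctx) -> R) -> R {
    let prev = ctx.current_task;
    let mut node = VMAsyncTask {
        kind: VMAsyncTaskKind::Stack,
        may_block: false, // placeholder value for this sketch
        context: [0; 2],  // assumes a fresh task starts with zeroed context slots
        parent: prev,
    };
    ctx.current_task = Some(NonNull::from(&mut node));
    let ret = callee(ctx);
    ctx.current_task = prev;
    ret
}

/// `context.get`: reads directly from the current task, no Rust-heap state needed.
fn context_get(ctx: &Ctx, slot: usize) -> u32 {
    let task = ctx.current_task.expect("no current task");
    // SAFETY: `current_task` only ever points at a node that is live on an
    // enclosing `sync_to_sync_adapter` frame.
    unsafe { task.as_ref().context[slot] }
}

/// `context.set`: writes directly to the current task.
fn context_set(ctx: &mut Ctx, slot: usize, val: u32) {
    let mut task = ctx.current_task.expect("no current task");
    // SAFETY: as in `context_get`; no other reference to the node is live.
    unsafe { task.as_mut().context[slot] = val }
}
```

Nesting two adapter calls gives each callee its own context slots, and the outer task's values are visible again once the inner call returns.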

Effectively, at a high level, sync<->sync adapters will allocate a task on the stack that, if necessary, will get promoted to the Rust heap to perform more expensive manipulations on. In essence Rust-level tasks are lazily created only as necessary for "more complicated" things, like spawning subtasks, while low-level actions like context.get will remain efficient.
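The lazy promotion idea could look roughly like the following. This is a simplified sketch under stated assumptions: `Node`, `FullTask`, `Kind`, `promote_chain`, and `add_subtask` are hypothetical names, and promotion is modeled as attaching heap-allocated state to each node rather than moving the task itself to the heap.

```rust
use std::ptr::NonNull;

/// Hypothetical Rust-heap state that only "complicated" operations need.
#[derive(Default)]
struct FullTask {
    subtasks: Vec<u32>,
}

#[derive(Clone, Copy, PartialEq, Debug)]
enum Kind {
    Stack,
    Promoted,
}

/// Hypothetical stack node; `full` is only populated once promoted.
struct Node {
    kind: Kind,
    parent: Option<NonNull<Node>>,
    full: Option<Box<FullTask>>,
}

/// Promote `node` and every not-yet-promoted ancestor; returns how many were promoted.
///
/// Safety: `node` and everything reachable through `parent` must be live and
/// not otherwise borrowed for the duration of the call.
unsafe fn promote_chain(node: NonNull<Node>) -> usize {
    let mut promoted = 0;
    let mut cur = Some(node);
    while let Some(mut ptr) = cur {
        let n = unsafe { ptr.as_mut() };
        if n.kind == Kind::Promoted {
            // Promotion always walks upward, so all ancestors are promoted too.
            break;
        }
        n.full = Some(Box::default());
        n.kind = Kind::Promoted;
        promoted += 1;
        cur = n.parent;
    }
    promoted
}

/// An operation that needs Rust data structures: promote first, then mutate.
///
/// Safety: same requirements as `promote_chain`.
unsafe fn add_subtask(mut node: NonNull<Node>, id: u32) {
    unsafe { promote_chain(node) };
    let n = unsafe { node.as_mut() };
    n.full.as_mut().expect("promoted above").subtasks.push(id);
}
```

In this model a task that never spawns a subtask never allocates, and a second promotion of an already-promoted chain is a single comparison.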

The resulting CLIF for a sync<->sync adapter will, in pseudo-code, look like:

void adapter(vmctx *vmctx) {
    // Remember the task that was current when the adapter was entered.
    vmtask *prev_head = vmctx->current_task;
    vmtask stack_node;
    stack_node.kind = VMTASK_STACK;
    // ...
    stack_node.prev = prev_head;  // pointer to the parent task
    vmctx->current_task = &stack_node;

    the_callee_component(vmctx);

    // Restore the previous task on return.
    vmctx->current_task = prev_head;
}

If the_callee_component(vmctx) is small enough the theory here is:

  • Cranelift will see that vmctx->current_task is loaded, stored to, then stored to with the previous value. If the_callee_component(vmctx) has no obviously aliasing regions, then it can eliminate both stores as dead.
  • If the_callee_component doesn't actually do anything like call the host then Cranelift will see that all the stores to stack_node are unused, so they're all eliminated.
  • If all the previous loads/stores were eliminated, then the load from vmctx->current_task is also dead, so that's also eliminated.

I don't believe that Cranelift performs all of these optimizations today, but my understanding so far is that implementing optimizations like these is well within Cranelift's complexity budget and wheelhouse.

Expected Timeline

The current plan is to initially ship the hostcall-to-manipulate-task-infrastructure with WASIp3. Embeddings that need the highest performance on sync<->sync adapters will disable the component-model-async runtime feature (and maybe compile time feature). After WASIp3 ships and we have enough time to come back to this and design this all "for real" we'll implement this. At that point it won't matter whether engines turn the component-model-async feature on or off, the performance will be the same.

Another point to note here is that it's expected that in WASIp3 Wasmtime will need to pretty heavily optimize calls to context.{get,set}. The work in this issue, while not the same as optimizing get/set, is highly related and will likely be a prerequisite for it. That's to say that this work isn't solely motivated by sync<->sync adapters, but instead it's motivated by other routes too.

Metadata

Status: After-P3