Analyze operator-level performance across a representative set of devices, shapes, models, and thread counts.
- Document the following:
  - Performance trends and optimal thread-count tuning.
    - How the optimal thread count shifts with GEMM shape.
  - Percentage of time that cores spend blocked waiting for work.
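
As a rough illustration of how the items above could be reduced to numbers, here is a minimal Python sketch. It assumes latency has already been measured per GEMM shape and thread count, and that per-core busy time is available from a profiler. The names `best_thread_count`, `idle_percentage`, and `measurements`, along with every value in the table, are hypothetical placeholders and not results from any device. It also treats all non-busy core time as time blocked waiting for work, which is a simplifying assumption.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, int, int]  # GEMM dimensions (M, N, K)


def best_thread_count(latency_ms: Dict[int, float]) -> int:
    """Return the thread count with the lowest measured latency."""
    return min(latency_ms, key=latency_ms.get)


def idle_percentage(busy_ms: List[float], wall_ms: float) -> float:
    """Percentage of total core time in which cores were not executing work.

    busy_ms has one entry per core (time that core spent doing work);
    wall_ms is the wall-clock duration of the measured region.
    Assumption: any time a core is not busy counts as waiting for work.
    """
    total_core_time = len(busy_ms) * wall_ms
    return 100.0 * (1.0 - sum(busy_ms) / total_core_time)


# Hypothetical latencies in milliseconds: shape -> {thread count: latency}.
measurements: Dict[Shape, Dict[int, float]] = {
    (64, 64, 64): {1: 0.10, 2: 0.08, 4: 0.09, 8: 0.15},
    (1024, 1024, 1024): {1: 80.0, 2: 42.0, 4: 23.0, 8: 14.0},
}

# Optimal thread count per GEMM shape, to show how it shifts with shape.
optimal = {shape: best_thread_count(lat) for shape, lat in measurements.items()}

for shape, threads in optimal.items():
    print(f"shape={shape} optimal_threads={threads}")
```

With placeholder data like this, a small GEMM would favor fewer threads than a large one, which is the kind of shift the analysis is meant to document. The real tables would come from the benchmark runs on each device.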