Obi Official Forum

Full Version: Burst*Batch job optimisations
Hello,
I think I found a second potential perf boost for the Burst backend, measured using a profiler sample around the BurstConstraintsImpl.Project() call, like so:
Code:
        public JobHandle Project(JobHandle inputDeps, float deltaTime)
        {
            UnityEngine.Profiling.Profiler.BeginSample("Project");

            var parameters = m_Solver.abstraction.GetConstraintParameters(m_ConstraintType);

            switch(parameters.evaluationOrder)
            {
                case Oni.ConstraintParameters.EvaluationOrder.Sequential:
                    inputDeps = EvaluateSequential(inputDeps, deltaTime);
                break;

                case Oni.ConstraintParameters.EvaluationOrder.Parallel:
                    inputDeps = EvaluateParallel(inputDeps, deltaTime);
                break;
            }

            UnityEngine.Profiling.Profiler.EndSample();
           
            return inputDeps;
        }
I've replaced BurstDistanceConstraintsBatch's job generation code with:
Code:
        public void SetDistanceConstraints(ObiNativeIntList particleIndices, ObiNativeFloatList restLengths, ObiNativeVector2List stiffnesses, ObiNativeFloatList lambdas, int count)
        {
            this.particleIndices = particleIndices.AsNativeArray<int>();
            this.restLengths = restLengths.AsNativeArray<float>();
            this.stiffnesses = stiffnesses.AsNativeArray<float2>();
            this.lambdas = lambdas.AsNativeArray<float>();

            projectConstraints.particleIndices = this.particleIndices;
            projectConstraints.restLengths = this.restLengths;
            projectConstraints.stiffnesses = this.stiffnesses;
            projectConstraints.lambdas = this.lambdas;

            applyConstraints.particleIndices = this.particleIndices;
        }

        // Job struct cached as a class member, so it isn't re-created every projection.
        DistanceConstraintsBatchJob projectConstraints = new DistanceConstraintsBatchJob();

        public override JobHandle Evaluate(JobHandle inputDeps, float deltaTime)
        {
            projectConstraints.positions = solverImplementation.positions;
            projectConstraints.invMasses = solverImplementation.invMasses;
            projectConstraints.deltas = solverImplementation.positionDeltas;
            projectConstraints.counts = solverImplementation.positionConstraintCounts;
            projectConstraints.deltaTime = deltaTime;
           
            return projectConstraints.Schedule(m_ActiveConstraintCount, 32, inputDeps);
        }

        // The apply job is cached as a member as well.
        ApplyDistanceConstraintsBatchJob applyConstraints = new ApplyDistanceConstraintsBatchJob();

        public override JobHandle Apply(JobHandle inputDeps, float deltaTime)
        {
            var parameters = solverAbstraction.GetConstraintParameters(m_ConstraintType);

            applyConstraints.positions = solverImplementation.positions;
            applyConstraints.deltas = solverImplementation.positionDeltas;
            applyConstraints.counts = solverImplementation.positionConstraintCounts;
            applyConstraints.sorFactor = parameters.SORFactor;
           
            return applyConstraints.Schedule(m_ActiveConstraintCount, 64, inputDeps);
        }
which returned the following result (ignore the total time: the number of simulations is randomised in my test scene and the simulation type is one of two types in each case, but the 0.05 ms/cloth difference is consistent across multiple tests)
[Image: project-cost.png]
EDIT:
I assume this can be applied across all batch job types to improve performance further.
Hi there,

I assume the jobs are stored as members of the BurstDistanceConstraintsBatch class, right? I tried this approach on distance and bend constraints, then ran all performance tests: it doesn't make any measurable difference for me on any scene (ms/frame or FPS). I also used profiler sampling, but timings were pretty similar for both approaches. Storing the jobs as members saved around 0.005 ms/substep on average, but this is hard to say for sure since timings vary between frames. It's certainly not as large as the 0.1 ms per projection your profiler pic shows; maybe there's something else going on?

The only overhead of the current code is creating the job struct once per projection, which should be negligible in most cases, as creating a single struct is pretty cheap. Regardless, I will push this change to production, as any performance boost is welcome no matter how small. It will ship in Obi 6.2.
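
Just to make sure we're comparing the same thing, the current approach looks roughly like this (a simplified sketch with most field assignments omitted, not the verbatim source):
Code:
        // Sketch of the current approach: the job struct is a local variable,
        // re-created (and all of its fields assigned) on every projection.
        public override JobHandle Evaluate(JobHandle inputDeps, float deltaTime)
        {
            var projectConstraints = new DistanceConstraintsBatchJob();
            projectConstraints.particleIndices = particleIndices;
            // ...remaining field assignments omitted...
            projectConstraints.deltaTime = deltaTime;
            return projectConstraints.Schedule(m_ActiveConstraintCount, 32, inputDeps);
        }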

Thanks for sharing!
The average time for each projection is 0.03-0.04 ms for me, in the densest cloth sample scene ("Benchmark"). I noticed yours is around 0.2 ms, which is about 5x slower.

I could only reproduce this by enabling the jobs debugger: then I get much worse timings (0.15-0.2 ms) and an overall considerably slower simulation, as expected. The difference between reusing the same job struct and re-creating it is also much larger in that case. Could it be that you're running with the jobs debugger enabled?
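
In case it helps, the jobs debugger can also be checked and toggled from code. This uses Unity's standard JobsUtility API (nothing Obi-specific), and it only has an effect in the editor:
Code:
        using Unity.Jobs.LowLevel.Unsafe;
        using UnityEngine;

        public static class JobsDebuggerCheck
        {
            // The jobs debugger adds considerable scheduling overhead,
            // so make sure it's off when profiling.
            public static void DisableForProfiling()
            {
                Debug.Log("Jobs debugger enabled: " + JobsUtility.JobDebuggerEnabled);
                JobsUtility.JobDebuggerEnabled = false;
            }
        }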
Hi, 

Yeah, I stored both jobs above each function that used to re-create them, inside each batch.
I've just retested my setup, making sure I run with the following Burst settings:
[Image: Screenshot-2021-06-07-082259.png]
And got the same results (albeit very stable: 0.07-0.08 ms for our simplest rig, 0.17-0.18 ms for a larger one, and 0.21-0.22 ms for the largest).
I'm running on a Ryzen 1700, so there should be plenty of threads and spare performance.
I'll keep poking around, as cloth (along with skinning) is currently the biggest bottleneck in our project.

Will post when I have new findings.
Yeah, I'm unsure where exactly the performance differences came from in my case. I traced the code around BurstConstraintsImpl.cs and the Evaluate & Apply calls.
I found an interesting difference between PC and Android builds, which I think has something to do with each device's thread pool.
In the attached image, the job scheduler is fast enough to queue jobs much more densely on an Android build with 4 worker threads.
In the PC build, which runs 15 workers, jobs are scattered around a lot more. It looks like it comes down to the jobs being too small for the scheduler to queue them effectively before they finish their work.
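
For anyone who wants to check this on their own device, the worker pool can be queried (and even shrunk) at runtime using Unity's standard JobsUtility API; the MonoBehaviour wrapper here is just my own illustration:
Code:
        using Unity.Jobs.LowLevel.Unsafe;
        using UnityEngine;

        public class WorkerCountLogger : MonoBehaviour
        {
            void Start()
            {
                // JobWorkerMaximumCount is the size of the pool Unity created;
                // JobWorkerCount can be lowered at runtime, e.g. to mimic the
                // 4-worker Android device on a desktop machine.
                Debug.Log("Workers: " + JobsUtility.JobWorkerCount +
                          "/" + JobsUtility.JobWorkerMaximumCount);
            }
        }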

This aligns well with the batching ideas for 7.0 you mentioned in the other thread!
I've played with the threshold used by BurstSolverImpl.ScheduleBatchedJobsIfNeeded():
Going too high, like 128 on the PC build, causes visible blue "strips" of denser work, but an overall slower substep time.
Lower values like 4 give me a strange result where on some frames I get a faster substep time, and on others a slower one, filled with thread idle time between jobs.
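
For context, my understanding of what that helper does is roughly the following (a sketch based on my reading, not the verbatim Obi source; maxJobsPerBatch stands for the number I've been tuning):
Code:
        using Unity.Jobs;

        public class BatchedJobFlusher
        {
            int scheduledJobs = 0;
            public int maxJobsPerBatch = 16; // the value I've been changing (4...128)

            // Called after each job is scheduled: once enough jobs are queued,
            // wake the worker threads so they start executing without waiting
            // for the next sync point.
            public void ScheduleBatchedJobsIfNeeded()
            {
                if (++scheduledJobs >= maxJobsPerBatch)
                {
                    scheduledJobs = 0;
                    JobHandle.ScheduleBatchedJobs();
                }
            }
        }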

(Also note the Project samples have improved, so it may have been a bad build with some rogue settings; I've been trashing my checkout with a lot of test changes over the last week.)

[Image: job-queue.png]