Partially needed to avoid build breakages because Corrade itself
switched as well, partially because a cleanup is always good. Done
except for (STL-heavy) code that's deprecated or SceneGraph-related APIs
that are still quite full of STL as well.
What's especially nice is that the code snippets no longer need to
describe that there's "2 lights, 3 materials and 5 draws" because now
it's self-documenting.
When specular highlights are not desired, this results in a 30%
speedup on Intel and a ~25% speedup on AMD, compared to setting the
specular color to transparent black.
Testing was easy thanks to already having a ground truth image for this
case.
After several failed attempts to make UBO performance not suck on Intel
Mesa and Windows drivers, I ended up hiding the dynamic aspect under a
flag. That way it's still possible to get the proper perf in UBO
workflows that don't do light culling, and for workflows where light
culling matters the 2x slowdown might still be better than looping
through several extra lights that don't contribute anything.
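
To illustrate the trade-off, here's a minimal CPU-side sketch of the two
loop shapes. It assumes the "dynamic aspect" means per-draw loop bounds,
which is my reading of the above and not something spelled out there. All
names (Light, DrawLightRange, StaticLightCount, shade) are hypothetical
and the real code lives in the shaders, not in C++.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical types for illustration only, not the actual uniform layout
struct Light { float intensity; };
struct DrawLightRange { std::uint32_t lightOffset, lightCount; };

// Stands in for a light count that's known at shader compile time
constexpr std::size_t StaticLightCount = 4;

float shade(const std::vector<Light>& lights, const DrawLightRange& range,
            bool lightCulling) {
    // Without culling the bounds are fixed, so the loop can be unrolled.
    // With culling they come from per-draw data, which is the dynamic part
    // that was slow on Intel.
    const std::size_t begin = lightCulling ? range.lightOffset : 0;
    const std::size_t end = lightCulling ?
        begin + range.lightCount : StaticLightCount;

    float sum = 0.0f;
    for(std::size_t i = begin; i < end && i < lights.size(); ++i)
        sum += lights[i].intensity;
    return sum;
}
```

With four lights and a draw that's only affected by two of them, the
culled variant visits two lights instead of four, at the cost of the
dynamic loop bounds.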
So it all has the same workflow. This one results in even more
saved UBO slots per-draw than in the case of Flat, and the slowdown on
Intel is as bad as expected.
While it's one additional indirection (that apparently has an extra cost
on Intel GPUs, like with Phong, MeshVisualizer and DistanceFieldVector
already), with the assumption that draws usually share the material info
it allows cramming more draws into the 16/64k UBO limit as the per-draw
data are now one vec4 smaller.
For the indirection overhead I can imagine adding a new flag which makes
material mapping implicit (materialId == drawId). That seems to put the
benchmark numbers back to the original speed. Same could be done for
other shaders.
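
A rough sketch of the arithmetic and of the two lookup variants, as
plain C++. The struct layouts and names (DrawEmbedded, DrawIndirect,
Material, drawsPerBuffer, materialFor) are hypothetical and only chosen
so that the indirect variant is one vec4 smaller; they don't mirror the
actual uniform structs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vec4 { float x, y, z, w; };

// Hypothetical layouts, for illustration only
struct Material { Vec4 color; };
struct DrawEmbedded {           // material stored in each draw, 2 vec4s
    std::uint32_t objectId;
    std::uint32_t padding[3];
    Vec4 color;
};
struct DrawIndirect {           // material referenced by index, 1 vec4
    std::uint32_t objectId;
    std::uint32_t materialId;
    std::uint32_t padding[2];
};

// How many draws fit into a uniform buffer of given size limit
std::size_t drawsPerBuffer(std::size_t limit, std::size_t drawSize) {
    assert(drawSize > 0);
    return limit/drawSize;
}

// With implicit mapping the materialId field is ignored and the draw ID
// is used directly, skipping the extra indirection. That assumes there is
// one material entry per draw.
const Material& materialFor(const std::vector<Material>& materials,
                            const std::vector<DrawIndirect>& draws,
                            std::size_t drawId, bool implicitMapping) {
    const std::size_t materialId = implicitMapping ?
        drawId : draws[drawId].materialId;
    return materials[materialId];
}
```

With these made-up sizes a 16 kB buffer fits 1024 indirect draws instead
of 512 embedded ones, and a 64 kB buffer 4096 instead of 2048. The real
gain depends on how large the rest of the per-draw data is.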
Interestingly, shaders that have indirect material references are about
2x slower on Intel. Flat and Vector, which contain the full material in
the DrawUniform, are not affected. This will probably need extra
Intel-specific optimizations (like avoiding the indirection if
MATERIAL_COUNT=1).
Took me a while (several years?) to figure out a way to benchmark this
without basically duplicating the testing effort and without new
variants being too hard to add.