In cases when specular highlights are not desired, this results in a 30%
speedup on Intel and a ~25% speedup on AMD, compared to setting the
specular color to transparent black.
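A minimal sketch of why the flag can beat a transparent black specular
color, written as plain C++ rather than the actual shader code. The
phongIntensity() helper, the Vec3 type and the NoSpecular template
parameter are all hypothetical stand-ins; the point is only that with
the flag the reflection vector and pow() are skipped entirely instead
of being computed and then multiplied by zero.

```cpp
#include <algorithm>
#include <cmath>

struct Vec3 { float x, y, z; };

inline float dot(Vec3 a, Vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

/* Hypothetical scalar Phong intensity for a single light. All vectors
   are assumed to be normalized; NoSpecular stands in for a compile-time
   shader flag. */
template<bool NoSpecular>
float phongIntensity(Vec3 normal, Vec3 light, Vec3 view,
                     float specularColor, float shininess) {
    const float nl = dot(normal, light);
    const float diffuse = std::max(nl, 0.0f);

    /* With the flag set, the reflection vector and pow() below are
       never executed */
    if(NoSpecular) return diffuse;

    /* Light direction reflected about the normal */
    const Vec3 reflected{2.0f*nl*normal.x - light.x,
                         2.0f*nl*normal.y - light.y,
                         2.0f*nl*normal.z - light.z};
    const float specular =
        std::pow(std::max(dot(reflected, view), 0.0f), shininess);
    return diffuse + specularColor*specular;
}
```

Both phongIntensity<true>() and phongIntensity<false>() with a zero
specular color give the same value, the second one just does more work
to get there.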
Testing was easy thanks to already having a ground truth image for this
case.
After several failed attempts to make UBO performance not suck on Intel
Mesa and Windows drivers, I ended up hiding the dynamic aspect under a
flag. That way it's still possible to get the proper perf in UBO
workflows that don't do light culling, and for workflows where light
culling matters the 2x slowdown might still be better than looping
through several extra lights that don't contribute anything.
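Roughly what hiding the dynamic aspect under a flag could look like,
sketched in C++ instead of GLSL. The LightCulling template parameter,
sumLights() and perDrawLightCount are made-up names for illustration,
not the actual shader interface.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

/* LightCulling stands in for a hypothetical shader flag;
   perDrawLightCount mimics a value read from per-draw uniform data. */
template<bool LightCulling, std::size_t LightCount>
float sumLights(const std::array<float, LightCount>& contributions,
                std::uint32_t perDrawLightCount) {
    /* Without the flag the loop bound is a compile-time constant, so
       the compiler is free to unroll it. With the flag the bound
       depends on runtime data, clamped to the array size. */
    const std::size_t count = LightCulling ?
        std::min<std::size_t>(perDrawLightCount, LightCount) : LightCount;

    float sum = 0.0f;
    for(std::size_t i = 0; i != count; ++i) sum += contributions[i];
    return sum;
}
```

With the flag off, perDrawLightCount is ignored and all lights are
always processed, which is the fast path for workflows that don't cull.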
While (of course) having zero effect on a single-light scenario, with
five lights it saves about 10% in the classic uniform case (on Intel).
Not bad (also, FFS, what is the compiler doing if it's not able to
optimize this?!).
Hm, I wonder why I did it this way. The operation isn't heavy enough to
benefit from the hardware interpolators, and with a lot of lights we'd
hit the maximum output count in the vertex shader.
With classic uniforms this seems to be ~5% faster on both Intel and AMD
cards. With UBOs this is ~15% (!) faster on AMD (I guess because the
constant UBO access overhead is moved to just one stage instead of
both?) but slower on Intel (of course, sigh... I assume due to UBO reads
being slow and so when done for every fragment instead of every vertex it
costs more?). Since this benefits the *real* GPUs while the card that
was already awful just gets more awful, I don't think that's a big deal.
This change stays.
"Luckily", thanks to the DRAW_COUNT=1 and MATERIAL_COUNT=1 optimizations
not everything blows up, so I don't need to skip absolutely everything.
Unfortunately Phong lights are affected by this insane crapfest as well,
so basically nothing from Phong UBO support is tested there. FFS.
Unlike the drawId optimization before, there's no possibility to
check this anywhere, so the assumption is just documented.
On an Intel 630 this resulted in a further significant reduction for the
single-draw single-material case, down to 260 ms from 440 ms in the
previous commit, about a 53% reduction compared to the original 550 ms;
the multidraw case is still around 550 ms there.
This is always true in the single-draw case, since setDrawOffset()
asserts on this. In the multi-draw case this optimization doesn't make
sense, because it doesn't make sense to create a multidraw shader with
just one draw.
On an Intel 630 GPU this made single-draw single-material Phong go from
550 ms to 440 ms, which is a 20% improvement. For the simpler shaders
the difference is even higher. The multidraw numbers stayed the same as
before, obviously.
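The DRAW_COUNT=1 and MATERIAL_COUNT=1 shortcuts can be sketched in C++
like this. It's a guess at the idea, not the shader source: the Draw and
Material structs and materialForDraw() are hypothetical, and I'm
assuming the single-draw case means the draw index is taken as zero.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

struct Draw { std::uint32_t materialId; };
struct Material { float shininess; };

template<std::size_t DrawCount, std::size_t MaterialCount>
const Material& materialForDraw(
        const std::array<Draw, DrawCount>& draws,
        const std::array<Material, MaterialCount>& materials,
        std::uint32_t drawId) {
    /* With a single draw the draw index is assumed to be zero, so the
       passed value is ignored */
    const std::uint32_t draw = DrawCount == 1 ? 0 : drawId;

    /* With a single material the per-draw material ID is never read and
       is assumed to be zero. Nothing verifies that here, the assumption
       is only documented. */
    const std::uint32_t material =
        MaterialCount == 1 ? 0 : draws[draw].materialId;
    return materials[material];
}
```

In the 1/1 case both indices are compile-time constants and the lookup
collapses to materials[0], with no dependent reads left.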
So everything has the same workflow. This one results in even more
saved UBO slots per-draw than in the case of Flat, and the slowdown on
Intel is as bad as expected.
While it's one additional indirection (that apparently has an extra cost
on Intel GPUs, like with Phong, MeshVisualizer and DistanceFieldVector
already), with the assumption that draws usually share the material info
it allows cramming more draws into the 16/64 kB UBO limit, as the
per-draw data is now one vec4 smaller.
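To put the vec4 saving in numbers, a small sketch of the capacity
arithmetic. A vec4 is 16 bytes; the per-draw sizes of three and two
vec4s used below are purely hypothetical, picked only to show the
effect, and maxDraws() is a made-up helper.

```cpp
#include <cstddef>

/* Size of one vec4 (four 32-bit floats) in bytes */
constexpr std::size_t Vec4Size = 16;

/* How many per-draw entries fit into a uniform buffer of given size */
constexpr std::size_t maxDraws(std::size_t uboLimit,
                               std::size_t perDrawSize) {
    return uboLimit/perDrawSize;
}
```

Under those assumed sizes, maxDraws(16*1024, 3*Vec4Size) is 341 while
maxDraws(16*1024, 2*Vec4Size) is 512; with a 64 kB limit it's 1365
versus 2048.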
For the indirection overhead I can imagine adding a new flag which makes
material mapping implicit (materialId == drawId). That seems to put the
benchmark numbers back to the original speed. The same could be done for
other shaders.
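What such an implicit mapping flag might boil down to, again as a C++
sketch with invented names (ImplicitMaterialMapping, materialIndex()),
since no such flag exists at this point:

```cpp
#include <cstdint>
#include <vector>

/* ImplicitMaterialMapping is a hypothetical flag. When set, the
   per-draw material ID table isn't touched at all. */
template<bool ImplicitMaterialMapping>
std::uint32_t materialIndex(
        const std::vector<std::uint32_t>& perDrawMaterialIds,
        std::uint32_t drawId) {
    /* materialId == drawId, no extra read */
    if(ImplicitMaterialMapping) return drawId;

    /* One dependent read before the material itself can be fetched */
    return perDrawMaterialIds[drawId];
}
```

The tradeoff is that materials can no longer be shared among draws, so
it only makes sense when each draw has its own material anyway.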
These deliberately share the same binding (because there's very little
space), but the shader wasn't guarding that. Discovered completely by
accident when adding tests for "multidraw with all the things" -- Mesa
gives just a warning, but ANGLE straight out fails the shader
compilation, so better have an assert there.
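The kind of guard meant here, sketched with placeholder names. FeatureA
and FeatureB stand for whichever two features share the binding, and
bindingsCompatible() is a hypothetical helper whose result would be fed
to an assert in the shader constructor.

```cpp
#include <cstdint>

/* Placeholder flags; FeatureA and FeatureB are assumed to share one
   binding, Unrelated uses a different one */
enum Feature: std::uint32_t {
    FeatureA = 1 << 0,
    FeatureB = 1 << 1,
    Unrelated = 1 << 2
};

/* Returns false if both features sharing the binding are enabled */
constexpr bool bindingsCompatible(std::uint32_t flags) {
    return (flags & (FeatureA|FeatureB)) != (FeatureA|FeatureB);
}
```

Failing early on the C++ side gives a readable message instead of a
driver-dependent warning or a shader compilation error.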
Besides expanding the tested platform set and updating thresholds where
needed, it makes more sense to list what is tested than what is not,
because when I forget to update the list it looks like I tested while I
did not.
I just put this aside when I discovered the error, thinking it was a
Mesa bug. Now that ARM Mali yelled about the same, I realized it wasn't
just Mesa.
Note to self: Mesa has no bugs. Can you just finally accept that?!
That feeling when you lose three hours debugging STRANGE shader compiler
issues that happen only on ES, seeing stuff like "unexpected HASH_TOKEN
at line 140" or "unterminated ifdef" on just any compiler you try, and
then you spot THIS. FFS.
Apparently this is how I was porting shaders in 2013. Not all of them,
though: I was mostly sane, wrapping things in a nice ifdef
EXPLICIT_UNIFORM_LOCATION, except for this one case in b9a72bd3d1 where
I temporarily lost my mind. No idea why.
Interestingly, shaders that have indirect material references are about
2x slower on Intel. Flat and Vector, which contain the full material in
the DrawUniform, are not affected. Will probably need extra
Intel-specific optimizations (like avoiding the indirection if
MATERIAL_COUNT=1).
It probably didn't matter much, as the only platform without
ARB_explicit_uniform_location is Mac, which doesn't have
ARB_shading_language_420pack either.