TODO: the compiler already generates the same code as my hand-written
AVX2 version, so there is no real benefit. It could still serve as a
sandbox, though.
TODO: properly handle element counts that are not a multiple of 16
TODO: implement other variants
TODO: implement a pack version
[ci skip]