"Hey guys, looking for some help optimizing an assembly language routine in x86-64. I've got it working, but it's got some serious performance hits on larger inputs. Anyone have some tips for minimizing registers, using SIMD, or just generally squeezing out some cycles?"