"Hey guys, I've been looking to squeeze every last bit of performance out of my C++ code, especially for crypto-intensive tasks. I've been experimenting with inline assembly and intrinsics, and I'm curious to know if anyone else has had success with this method. Specifically, what are some gotchas to watch out for and any particularly effective techniques to share?"