Problem with SSE4.1 support
I have the day off from work today, so I decided to sit down and try some SSE4.1 experiments... only to discover the following:
- Intel VTune 6.1 doesn't work at all on this laptop. (Yeah, it's ancient, but it's reliable, fast, and worked great with a Pentium M, SSE2, and VC8.)
- AMD CodeAnalyst 2.76 works in timer mode, but can't disassemble past SSE4.1 instructions.
- VC8 (VS2005 SP1) can't assemble SSE4.1 instructions in inline assembly.
- MASM 8 can't assemble SSSE3 or SSE4.1 instructions.
- The VS2005 SP1 toolchain can't disassemble SSSE3 or SSE4.1 instructions.
- VC9 (VS2008) handles SSE4.1... but I only have the Express version, which doesn't have MASM, and I can't switch to VS2008 anyway for VirtualDub 1.8.x due to system requirements issues.
- Agner Fog's pentopt tome hasn't been updated for Penryn, and I can already tell that a number of SSE2 instructions are quite a bit faster than on the original Core 2 Duo.
- Intel's x86 Optimization Guide has been updated for Penryn, but the formatting is almost unusable and they don't include µop breakdowns for memory ops -- which means that I can't tell, for instance, whether LDDQU and MOVDQU are improved.
It's not a dealbreaker, since using MASM macros isn't bad and I'm used to RDTSC profiling, but sheesh... talk about missing prerequisites.
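For readers who haven't done RDTSC profiling, here is a minimal sketch of the idea: read the timestamp counter before and after the code under test, repeat, and keep the smallest delta. The helper names `read_ticks` and `min_ticks` are made up for illustration, and the `steady_clock` fallback is only there so the sketch builds on non-x86 machines; none of this is taken from VirtualDub.

```cpp
#include <cassert>
#include <cstdint>

#if defined(_M_IX86) || defined(_M_X64)
  #include <intrin.h>       // MSVC: __rdtsc()
  #define SKETCH_HAVE_TSC 1
#elif defined(__i386__) || defined(__x86_64__)
  #include <x86intrin.h>    // GCC/Clang: __rdtsc()
  #define SKETCH_HAVE_TSC 1
#else
  #include <chrono>         // assumption: non-x86 fallback so the sketch still builds
  #define SKETCH_HAVE_TSC 0
#endif

// Hypothetical helper: raw tick count (TSC on x86, steady_clock ticks elsewhere).
static inline std::uint64_t read_ticks() {
#if SKETCH_HAVE_TSC
    return __rdtsc();
#else
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}

// Hypothetical helper: run fn() `runs` times and return the smallest tick delta.
// Taking the minimum filters out runs disturbed by interrupts or cache misses.
template <typename Fn>
std::uint64_t min_ticks(Fn&& fn, int runs) {
    assert(runs > 0);
    std::uint64_t best = UINT64_MAX;
    for (int i = 0; i < runs; ++i) {
        const std::uint64_t t0 = read_ticks();
        fn();
        const std::uint64_t t1 = read_ticks();
        if (t1 - t0 < best) best = t1 - t0;
    }
    return best;
}
```

Something like `min_ticks([&] { my_kernel(buf); }, 100)` would then give a rough per-call figure, where `my_kernel` and `buf` stand in for whatever is being measured. Note that the sketch does not serialize the instruction stream around the reads, so out-of-order execution can blur very short measurements.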
1. VTune 9.0+ is available.
4. It can (SSSE3) if you use the macros from Agner Fog [http://www.agner.org/optimize/macros.zip]
6. AFAIK the latest MASM is in the Windows Server 2008 SDK.
As for missing prerequisites, hopefully it won't be the same with AVX:
Igor Levicki (link) - 22 04 08 - 09:45
I've used VTune 9.0... aside from the fact that it would cost me $700 to get a personal license for that version, I get the impression that it sucks. It kept crashing or giving profiles with half the symbols resolved that were mysteriously fixed by a re-link. For some reason you also can't set the initial columns and it keeps coming up with useless data like segment address, which is lame.
Unfortunately, Agner doesn't have SSE4.1 macros. I took a shot at doing them and came to the conclusion that they'd be difficult to implement cleanly for certain instructions, most notably PMOVZXBQ, due to the lack of similar instructions.
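To make concrete what such a macro has to produce, here is a small sketch that builds the raw bytes for the register-to-register form of PMOVZXBQ, plus a scalar reference for what the instruction computes. It is an illustration only, not Agner's macros or the attempt described above: the function names are hypothetical, only xmm0 to xmm7 are handled, and the memory-operand form (the awkward case) is deliberately left out. The opcode bytes 66 0F 38 32 /r are the documented SSE4.1 encoding.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper: bytes for "pmovzxbq xmmDst, xmmSrc" (xmm0..xmm7 only).
// Layout: 66 (prefix), 0F 38 32 (opcode), then a ModRM byte with mod=11.
std::array<std::uint8_t, 5> encode_pmovzxbq_rr(unsigned dst, unsigned src) {
    assert(dst < 8 && src < 8);
    const std::uint8_t modrm =
        static_cast<std::uint8_t>(0xC0 | (dst << 3) | src);
    return {0x66, 0x0F, 0x38, 0x32, modrm};
}

// Hypothetical helper: scalar reference for the instruction's result.
// The two lowest source bytes are zero-extended into two 64-bit lanes.
std::array<std::uint64_t, 2> pmovzxbq_reference(const std::uint8_t* src) {
    return {src[0], src[1]};
}
```

A MASM macro for an assembler that lacks the mnemonic would emit the same five bytes with `db`; for example, `encode_pmovzxbq_rr(1, 2)` yields 66 0F 38 32 CA, which is `pmovzxbq xmm1, xmm2`.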
I'd rather not require the Windows Server 2008 SDK, since you have to hack it in order to get a clean build with VS2005 SP1.
As for AVX... I'd love to play around with that, since two-address instructions suuuuuuck, but they have to release a CPU with it first and then I'd have to get one. I didn't realize the AVX reference was out, but looking at it... yeesh. I mean, the stuff they're adding is cool, but the x86 mess just got even bigger. You've now got a 256-bit load called VMOVDQA (note that the DQA part originally meant double quadword aligned, or 128-bit), and absurdly long instruction mnemonics like VBROADCASTF128. I'm just glad my disassembler is parser-based, because Intel just made the x86 instruction stream more annoying with the addition of a VEX prefix.
Phaeron - 23 04 08 - 01:22
Oops, I mistakenly posted this on a newer blog post, so here it goes again:
id Software had a similar problem (they needed SSE3 support before their compilers had it). They solved it with clever use of regular VC macros:
OfekSH - 23 04 08 - 09:43
Hey, AVX looks interesting, I could kill for 256 bit wide regs and fp madd :P
Gabest - 23 04 08 - 20:15
Me too. Soon they will release a software emulator for AVX.
Igor Levicki (link) - 27 04 08 - 14:59
"MASM 9.0 is included in Visual C++ 2008 SP1 Express Edition."
VC++ 2008 SP1 link: http://www.microsoft.com/downloads/detai..
JuanR - 31 01 09 - 12:55