§ SIMD intrinsics, revisited

In the previous entry, a commenter asked if SIMD intrinsics are worthwhile in VS2008.

Truth be told, I didn't try them, because Microsoft only has a skeleton crew (person?) on the C++ compiler for VS2008, and they're not even taking most bug fixes, much less a feature addition or optimization like improving SIMD code generation. The rest of the compiler team is busy rewriting the compiler for Orcas+N. As such, I don't really expect any change in intrinsics compared to VS2005 SP1, which in turn is just VS2005 RTM + some new kernel mode intrinsics. I do have some experience working with intrinsics in other venues, though, so I can at least tell you my experiences with VS2005.

The first problem you'll run into with intrinsics is that they require alignment. If you construct all of your SIMD code to use unaligned loads and stores, your performance will be pathetic. The heap alignment on Win32 is only 8 bytes, though, which is short of the 16 bytes that SSE wants, so you need to use _aligned_malloc(), with its associated space penalties, or switch to a custom allocator. The compiler does handle alignment of sub-objects for you, and in theory it does for stack objects as well, but my experience is that VC8 is buggy with regard to returning aligned objects and frequently gets it wrong. Fortunately, x86 gives you a clear exception when this occurs; some platforms instead helpfully align the pointer for you by zeroing LSBs of the address, which leads to some nice heap corruption bugs. If you're interoperating with .NET, you're in for some annoyance because the CLR knows jack about alignment. STL can also give you problems if its allocators aren't alignment-savvy; I think VC8's implementation might be problematic in this regard.
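To make the alignment point concrete, here is a minimal sketch of allocating a 16-byte-aligned buffer and reading it with aligned SSE loads. It uses _mm_malloc()/_mm_free(), which VC++, GCC and Intel C++ all expose through <xmmintrin.h>, rather than _aligned_malloc(); the helper names (alloc_aligned_floats, is_aligned16, sum_aligned) are made up for this illustration and are not from any library.

```cpp
#include <xmmintrin.h>  // SSE intrinsics; also brings in _mm_malloc/_mm_free
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: allocate n floats on a 16-byte boundary.
// The caller must release the buffer with _mm_free(), not free().
inline float* alloc_aligned_floats(std::size_t n) {
    return static_cast<float*>(_mm_malloc(n * sizeof(float), 16));
}

// Hypothetical helper: true if p sits on a 16-byte boundary.
inline bool is_aligned16(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 15) == 0;
}

// Sum n floats with aligned loads. Assumes p is 16-byte aligned and
// n is a multiple of 4; _mm_load_ps faults on a misaligned address.
inline float sum_aligned(const float* p, std::size_t n) {
    assert(is_aligned16(p) && n % 4 == 0);
    __m128 acc = _mm_setzero_ps();
    for (std::size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_load_ps(p + i));
    float lanes[4];
    _mm_storeu_ps(lanes, acc);  // unaligned store: no alignment assumption on the stack array
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Swapping _mm_load_ps for _mm_loadu_ps removes the alignment requirement, at the performance cost described above.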

The second problem is MMX, or more specifically, the prohibition on mixing x87 and MMX. This isn't a performance issue -- you will actually get incorrect results if you mix the two without appropriate [F]EMMS instructions, because the FPU will start spitting out NaNs when it notices its register stack is full. VC7 had some severe bugs with the optimizer rearranging floating point calculations around _mm_empty() or __asm { emms } statements and nearly made it impossible to safely use MMX intrinsics. I think these were fixed in VC8, but then you have the problem of when to do it. The last thing you want to do is call EMMS at the end of each and every function in a library, because performance will be dreadful, and trying to document which ones use MMX and forcing the client to figure out where to put the barriers is really bad too. And if you think MMX is dead, do consider that unless you have SSE2, it's really hard to efficiently handle integers, even if you just want to convert them to and from floats (well, unless you only want to do one at a time and only 32-bit integers).
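As a rough illustration of the barrier problem, the sketch below does a packed 16-bit add with MMX intrinsics and then calls _mm_empty() (EMMS) before any floating-point code runs. The function names are invented for the example. It assumes a compiler that still accepts MMX intrinsics for the target (32-bit VC++, or GCC on x86); on x64 builds the x87 hazard mostly does not arise because scalar floating point goes through SSE.

```cpp
#include <mmintrin.h>  // MMX intrinsics
#include <cassert>

// Hypothetical helper: add two pairs of packed 16-bit lanes, each pair
// packed into one 32-bit int, and return the packed sums.
inline int add_packed_words(int a, int b) {
    __m64 va  = _mm_cvtsi32_si64(a);
    __m64 vb  = _mm_cvtsi32_si64(b);
    __m64 sum = _mm_add_pi16(va, vb);   // PADDW
    int result = _mm_cvtsi64_si32(sum);
    _mm_empty();                        // EMMS: hand the register file back to x87
    return result;
}

// Floating-point work that follows the MMX code. Without the _mm_empty()
// above, a 32-bit x87 build could produce NaNs here.
inline double low_word_scaled(int a, int b, double scale) {
    int packed = add_packed_words(a, b);
    return (packed & 0xFFFF) * scale;
}

inline void mmx_self_check() {
    assert(add_packed_words(0x00010002, 0x00030004) == 0x00040006);
}
```

Putting the barrier inside every small helper like this is exactly the per-function EMMS cost complained about above; the alternative is to hoist it to the caller and document that contract.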

The third problem is the ABI. More specifically, the x86 ABI wasn't designed with SIMD in mind, so it has none of the features that would help. The stack isn't aligned, so the compiler has to generate code to create an aligned stack frame -- although I've heard that LTCG can help in this regard by eliminating this in nested calls. Perhaps more annoying is that there is no convention for preserving SSE registers or passing floats in SSE registers, so the compiler tends to bounce values out to memory and possibly through the x87 stack, even if /arch:SSE is used. This is especially distressing if you're writing a math library -- which you would think is a natural use for SSE intrinsics -- until you discover that the vector and float portions of the compiler don't talk too well to each other.

The fourth problem that I have with VC's intrinsics is that I sometimes find them harder to use -- x = _m_paddw(x, y) isn't much better than PADDW x, y, and I find the _mm_add_epi32() style particularly ugly. I've seen intrinsics code that looked like it was just translated line-by-line from assembly code, which basically just meant it was slower and uglier. They get more usable if you wrap them in operators, but then you end up with lots of function calls that impede debugging and make your debug builds suck. And isn't it supposed to be the compiler's job to wrap instructions in a higher level form??
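For reference, wrapping intrinsics in operators looks something like the sketch below. Int4, make_int4 and lane are hypothetical names invented here; only the _mm_* calls are real SSE2 intrinsics. In an optimized build the wrappers should inline away, but in a debug build each + is a real function call, which is the cost mentioned above.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>

// Hypothetical thin wrapper around four packed 32-bit integers.
struct Int4 {
    __m128i v;
};

// Build an Int4 with x0 in the lowest lane. Note that _mm_set_epi32
// takes its arguments highest lane first.
inline Int4 make_int4(int x0, int x1, int x2, int x3) {
    Int4 r = { _mm_set_epi32(x3, x2, x1, x0) };
    return r;
}

// x + y instead of _mm_add_epi32(x, y).
inline Int4 operator+(Int4 a, Int4 b) {
    Int4 r = { _mm_add_epi32(a.v, b.v) };
    return r;
}

// Extract lane i (0..3) by spilling to memory; fine for tests, slow in hot loops.
inline int lane(Int4 a, int i) {
    assert(i >= 0 && i < 4);
    int out[4];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), a.v);
    return out[i];
}
```

With this in place, `Int4 s = make_int4(1, 2, 3, 4) + make_int4(10, 20, 30, 40);` reads like scalar code.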

I should note that the x64 versions of Windows avoid a number of these issues, as the platform is guaranteed to support SSE2 and the ABI was designed with that in mind. However, with x64 being very poorly supported and Microsoft trying its best to drive it into the ground with stupidity like the signed driver requirement in Vista x64, I've almost written it off entirely.

Truth be told, I'd love to ditch assembly and use intrinsics, but I find it hard to tolerate these flaws. SIMD makes the most difference in code that is performance critical and that means it's also the code that can least tolerate flaws in the compiler's output. I also tend to run into non-SIMD issues whenever I consider the switch, because there are a lot of missing scalar intrinsics. For instance, in a lot of my scaling code I use 32:32 fixed point, where the 32-bit halves are joined by the carry flag and thus I can use the upper half directly without needing shift ops. C++ doesn't have support for the carry flag and VC++'s __int64 code generation sucks (why would you change <<32 into *2^32???). Extended precision arithmetic is also very difficult to do with the provided intrinsics, to the point that I had to write a silly three-line assembly routine in an .asm file just to do MulDiv64() on x64. It seems like any new scalar intrinsics are being added just for the NT kernel team and not really for anyone else -- the new intrinsics that were added in VS2005 SP1, for instance, are essentially useless in user mode.
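For readers unfamiliar with 32:32 fixed point, here is a minimal C++ sketch of the idea applied to nearest-neighbor resampling. It is not the actual scaling code: the function name resample_nearest is made up, and it uses a single 64-bit accumulator with an explicit >> 32, which is precisely the form that depends on the compiler's 64-bit code generation instead of two 32-bit registers chained through the carry flag.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical example: point-sample src into dst_len outputs.
// The position is 32:32 fixed point: the upper 32 bits are the integer
// source index, the lower 32 bits are the fraction.
inline std::vector<int> resample_nearest(const std::vector<int>& src,
                                         std::size_t dst_len) {
    std::vector<int> dst;
    if (src.empty() || dst_len == 0)
        return dst;
    assert(src.size() <= 0xFFFFFFFFu);  // index must fit in the upper half

    // Source step per output sample, in 32:32 fixed point (rounded down).
    const std::uint64_t step =
        (static_cast<std::uint64_t>(src.size()) << 32) / dst_len;

    dst.reserve(dst_len);
    std::uint64_t pos = 0;
    for (std::size_t i = 0; i < dst_len; ++i) {
        dst.push_back(src[static_cast<std::size_t>(pos >> 32)]);  // upper half = index
        pos += step;  // in assembly: ADD on the low half, ADC on the high half
    }
    return dst;
}
```

In hand-written assembly the upper half lives in its own register, so the index is available without any shift at all.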

As a side note, when I tried Intel C++ 6.0, it did generate very nicely optimized MMX code, but it also bloated code by about 30%. In the end, I gave up supporting it because I was tired of tracking down compiler-induced bugs like thrown exception objects being destroyed twice and misgenerated STL code. I haven't tried GCC yet... it probably would do somewhere between VC++ and Intel C++ codegen-wise and probably more stably than Intel C++. Sadly, it's also hands-down the most annoying compiler on the planet.


Comments posted:

I wish inline assembly wasn't completely different between VC and GCC, or that I could at least use standalone assembly on Windows and Linux without either requiring NASM or needing some tool to convert between the different assembly syntax.

> Sadly, it's also hands-down the most annoying compiler on the planet.

Mind commenting on why? For inline assembly, there's no argument--while the concept of letting the compiler allocate your registers in inline assembly, and telling it what you're modifying so it doesn't have to assume everything changes and turn every assembly block into an optimization brick wall--is neat and great when it works, it breaks too frequently and always in mysterious, time-consuming ways.

Other than that--and once I set up warnings the way I like, as with any compiler--it doesn't really seem any more or less annoying than VC for C++ code to me. (Well, VC brazenly lying about standard C APIs being "deprecated" would put it as more annoying, but I'll classify dumb things that I can disable under "setting up warnings". In that battle, VC loses badly--I have over a dozen dumb warnings disabled, not to mention having to fix definitions like isnan() being called _isnan, missing lrintf() and friends, and the ridiculous "100i64" syntax. Those just don't bother me in typical use where I've already worked around them.)

Glenn Maynard - 30 08 07 - 12:58

Your blog software has an allergy to dashes. That may also be more annoying than gcc ...

Glenn Maynard - 30 08 07 - 13:02

Avery, I'd love to read your comments on AMDs proposed SSE5, with 3-operand instructions, fused mul-add, etc.

eloj (link) - 30 08 07 - 16:04

A bit OT, but my brother is heavily involved in the WinCE port of SCUMMVM (the ARM part), and told me that just porting the source from VC2005 to the gcc4 toolchain gave it a 30% speed boost, so (at least for the ARM architecture) I wouldn't say that gcc is doing a bad job.

ggn - 31 08 07 - 02:31

Yeah, GCC assembly is indeed a lot more powerful than Borland/VC++ syntax, but writing nontrivial amounts of inline asm in strings is painful. I'd hate to write inlines like that.

The warnings in VC8 are indeed annoying, but I'd note that they're in the standard library and not the compiler itself. GCC has a couple of really boneheaded warnings, the most annoying of which is "no newline at end of file." There isn't a standard warnings mechanism in the compiler, and this warning appears not to have a disable. Sure, technically the standard requires a newline, but nearly all other compilers don't, and it's lame to have a non-disableable warning for it. I think Apple hacked one into their build. My experience has been that the GCC compiler team likes to break code for standards compliance reasons without paying much attention to the practicalities of breaking changes and realizing that everyone can't just change their codebases all the time, particularly when legacy or cross-compatibility concerns are involved. GCC also has a spotty history in terms of adopting existing conventions, such as #pragma once, which I think was called deprecated in GCC for a while until they fixed it, even though it was in widespread use.

As for SSE5, I haven't looked at it yet, but it looks promising at first glance.

Phaeron - 31 08 07 - 02:31

Oh, I'd agree that GCC can generate better code. Certainly the VC++ team hasn't been putting as much effort lately into improving codegen. That doesn't mean GCC isn't more annoying. :)

Phaeron - 31 08 07 - 02:49

AFAIK, GCC does work with MASM, right? While GCC can be a bit annoying, it's quite powerful. Though currently I am not having much luck porting this VS2005 solution to use C::B+GCC....

King InuYasha (link) - 01 09 07 - 00:55

Also, MinGW has released a Tech Preview of MinGW using GCC 4.2.1. This new version has a bucketload of improvements against the GCC 3.4.x versions...

King InuYasha (link) - 01 09 07 - 01:03

VC9 has support for SSE3/SSSE3/SSE4.1/SSE4.2 and SSE4A intrinsics.

GCC doesn't need #pragma once because it recognizes the idempotency dance:
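The snippet that followed this sentence did not survive in the comment. The idiom being referred to is presumably the classic include guard, which GCC's preprocessor detects so that it can skip re-reading a guarded header. A minimal sketch follows; the macro and type names are hypothetical, and the header body is pasted twice in one file to simulate a double #include.

```cpp
#include <cassert>

// ---- first "inclusion" of a hypothetical widget.h ----
#ifndef EXAMPLE_WIDGET_H
#define EXAMPLE_WIDGET_H

struct Widget { int id; };
inline int widget_id(const Widget& w) { return w.id; }

#endif // EXAMPLE_WIDGET_H

// ---- second "inclusion" of the same header ----
// EXAMPLE_WIDGET_H is already defined, so everything up to the #endif is
// skipped and Widget is not redefined.
#ifndef EXAMPLE_WIDGET_H
#define EXAMPLE_WIDGET_H

struct Widget { int id; };
inline int widget_id(const Widget& w) { return w.id; }

#endif // EXAMPLE_WIDGET_H

// Small check that the single surviving definition is usable.
inline void check_widget() {
    Widget w = { 7 };
    assert(widget_id(w) == 7);
}
```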

Stephan T. Lavavej - 01 09 07 - 01:08

Where's my damn __roundtoint() intrinsic? :)

Just because #pragma once isn't needed doesn't mean it wouldn't be helpful to support it... especially since VC++ doesn't have compiler support for accelerating header guards. It's a bit like static on declarations; just because you can use anonymous namespaces doesn't mean that static is useless. Although... I've heard that #pragma once isn't that effective with VC++ either, and that only external guards (around the #include) are really effective.

Phaeron - 01 09 07 - 02:54

The "newline at end of file" warning is supposedly for preprocessors that #include files blindly, where the lack of a newline will make the last line join with the next line in the included file, but I agree it should have a disable (all warnings should), and if that's the only reason for it, it belongs in -pedantic. (It's on by default, without even -Wall or -W? That must be an oversight.)

The only thing on upgrade that I've found annoying is being nitty about "typename", but it may actually have good parser reasons for that (I guess that let them fix some bugs, though I didn't care since I didn't hit any of them), so I blame C++ for templates being warty and not GCC. Other than that, I can't think of much; my major complaint with GCC is that it's incredibly slow compared to VC. I'd sacrifice some performance for its build times.

Glenn Maynard - 01 09 07 - 05:57

You only mention Intel C++ 6.0 when you mention Intel's compiler in your posting...
Does that mean it's the only version you have tried?

With Intel C++ being at v10.0 today I'm wondering how many of the complaints you had about Intel C++ are still valid, if any.

Kitten - 02 09 07 - 09:52

Very late comment, but I should point out that a newline at the end of file is a requirement as per the C++ standard. GCC 4.3 will actually generate a hard error for it. Also, what's so annoying about it? Just add that stupid newline.

Sebastian Redl - 03 12 07 - 07:33

Perhaps you haven't worked on a large project where GCC is only one of the compilers in use and code is external.

You are correct, requiring a newline at the end of a compilation unit is a requirement of the C++ standard. So is having all asm() code in a string literal, having static const integral members of a templated class also defined at file scope, two-phase name lookup, and export. And yet, you will find many, many compilers that do not comply with all of these rules and many, many KLOC of existing code that also doesn't. When it comes to C++ compliance, having a newline at the bottom of the file is 4.4826E+20th on the list of things I care about.

An important rule when you work on a large code base with external code is that you do not unnecessarily change external code unless either you can push the changes back or are willing to deal with the merge mess. If I were managing such a situation there is no way in hell I would ever approve changing every single file to fix a stupid problem like this. That the GCC team made this a non-disableable warning was obnoxious to begin with and to make it an error is worse. Can you imagine any commercial compiler vendor deliberately deciding to piss off all of their customers like this over such a trivial issue? Why is this even worth the effort compared to everything else that could be done?

Phaeron - 04 12 07 - 00:20

Avery, Intel C++ is at version 10.1.013. Perhaps you could find some bugs in generated code (like I do from time to time) but if you don't bother to report them over at Premier support, they won't get fixed anytime soon.

Developers keep demanding better and more reliable compilers but they are always reluctant to participate in the fixing process. It is really a shame, you can't expect them to test all *your* corner cases.

Igor (link) - 07 01 08 - 22:43

I'm afraid I don't have a currently active support license for Intel C++, and I'm not really willing to pay for one. I require that all code compiles with Visual C++ and for perf-critical code I rewrite in asm anyway.

Intel C++ 6.0 had a lot dumber bugs than the ones I mentioned above. If you requested a short branch in inline assembly and the branch target was too far away, it issued a warning and then compiled a random branch into your code instead. With all of the issues I wasn't fond of the compiler and the EH crash bug was the final straw. I have much better things to do than track down compiler bugs all of the time.

Phaeron - 07 01 08 - 22:55

On that signed driver issue on x64, note that the signing process for that does not require WHQL certification, just a certificate from one of the vendors approved by Microsoft.

Yuhong Bao - 09 02 08 - 19:41

On the matter of x64, however, the most important reason to use intrinsics is that you must use them in x64 code.

Yuhong Bao - 09 02 08 - 19:45

In the end, however, things like PatchGuard aren't as much of a barrier to x64 applications as things like compatibility with 32-bit systems.

Yuhong Bao - 09 02 08 - 19:48

To sign x64 drivers, you must have a Verisign class 3 code signing certificate or equivalent. It's been reported that Verisign will not issue such certificates to individuals, and thus Microsoft is essentially shutting out the little guy from writing drivers. Microsoft also got Verisign to revoke the certificate used by Atsiv, because it allowed circumvention -- even though all of its users understood perfectly what it did, installed it deliberately, and wanted it. So now we're also in a situation where Microsoft can have a driver revoked because it doesn't like what the driver does.

Tell me -- if this is for my security, how come I, as the administrator of my machine, don't get to choose which drivers can run? Why can't I add and remove certificates as I please? So I'm not allowed to run a tool that I want like Atsiv, but the kernel will still blindly accept something like Sony's famous rootkit driver, just because it's signed by Sony? What the hell?

Phaeron - 10 02 08 - 16:35

"It's been reported that Verisign will not issue such certificates to individuals, and thus Microsoft is essentially shutting out the little guy from writing drivers."
To be honest, you are probably right, unfortunately. But my point is that those companies that have the certificates to do WHQL testing can also use the same certificates to do x64 driver signing without WHQL testing.
"but the kernel will still blindly accept something like Sony's famous rootkit driver, just because it's signed by Sony? "
Note that digital certificates are for identification and verification only and are no guarantee of the behavior of the software. The user is supposed to decide which vendors to trust. Atsiv was a driver that defeated driver signing, which is why its certificate was revoked. As long as a driver does not defeat driver signing, the certificate used to sign it will probably not be revoked.

Yuhong Bao - 13 02 08 - 21:48

Ah, I see. Identification and verification only. Where exactly was Atsiv deceptive about where it came from and what it did? And it was in support of the user's choice of trust when Microsoft unilaterally had Atsiv's certificate revoked so that it could not load into the kernel, against the wishes of people who had intentionally installed it on their own machines for their own purposes? Remember, these people KNEW what Atsiv did and installed it on PURPOSE on their OWN machines.

Tell me, where in the UI can I find the button that says, "I recognize this driver, and I want to allow it to run." Oh wait, there isn't one. I'm sorry, but this argument is BS. If it were about user choice, then there would be a well-documented way to exclude drivers from the requirement, instead of only arcane methods that require tools from the DDK and also happen to disable some functionality in the system.

Phaeron - 13 02 08 - 23:19

>Where exactly was Atsiv deceptive about where it came from and what it did?
It wasn't; it is just that it allowed loading drivers that WERE deceptive about where they came from and what they did. That is my point, in fact.

Yuhong Bao - 20 02 08 - 12:38

Isn't there some way with the C++ preprocessor to add a newline to the end of every file? Or how about a macro that adds a newline after every include line? That would solve the problem (but not eliminate the warnings, I assume).

krick - 02 05 08 - 11:59
