§ A Visual C++ x64 code generation peculiarity

Take an SSE2 routine to do some filtering:

#include <emmintrin.h>

void filter(__m128i *dst,
            const __m128i *table,
            const unsigned char *indices,
            unsigned n)
{
    __m128i acc0 = _mm_setzero_si128();
    __m128i acc1 = _mm_setzero_si128();

    while(n--) {
        const __m128i *kernel = &table[*indices++ * 2];

        acc0 = _mm_add_epi16(acc1, kernel[0]);
        acc1 = kernel[1];

        *dst++ = acc0;
    }
}

This routine uses each index to look up a premultiplied kernel and adds it into a short output window (8 samples). The output stream runs at 4x the rate of the input stream. In a real routine the kernels would typically be a bit longer, but an example of where you might use something like this is to simultaneously upsample and convert a row of pixels or a block of audio through a non-linear curve.
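For reference, a rough scalar equivalent of what the loop computes might look like the sketch below (filter_scalar and its layout are purely illustrative and not part of the real code; it treats each __m128i as eight 16-bit samples and ignores the wrapping behavior of paddw):

void filter_scalar(short *dst,
                   const short *table,
                   const unsigned char *indices,
                   unsigned n)
{
    short window[8] = {0};   // partial sums carried into the next output block (acc1 above)

    while(n--) {
        // each kernel entry is two __m128i = 16 shorts, hence the x16
        const short *kernel = &table[*indices++ * 16];

        for(int i = 0; i < 8; ++i)
            dst[i] = (short)(window[i] + kernel[i]);   // overlap-add the first half

        for(int i = 0; i < 8; ++i)
            window[i] = kernel[8 + i];                 // carry the second half forward

        dst += 8;
    }
}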

If we look at the output of the VC++ x86 compiler, the result is decent:

0020: 0F B6 01           movzx       eax,byte ptr [ecx]
0023: C1 E0 05           shl         eax,5
0026: 66 0F 6F C1        movdqa      xmm0,xmm1
002A: 66 0F FD 04 38     paddw       xmm0,xmmword ptr [eax+edi]
002F: 66 0F 6F 4C 38 10  movdqa      xmm1,xmmword ptr [eax+edi+10h]
0035: 66 0F 7F 02        movdqa      xmmword ptr [edx],xmm0
0039: 41                 inc         ecx
003A: 83 C2 10           add         edx,10h
003D: 4E                 dec         esi
003E: 75 E0              jne         00000020

However, if we look at x64:

0010: 41 0F B6 00        movzx       eax,byte ptr [r8]
0014: 48 83 C1 10        add         rcx,10h
0018: 49 FF C0           inc         r8
001B: 03 C0              add         eax,eax
001D: 48 63 D0           movsxd      rdx,eax
0020: 48 03 D2           add         rdx,rdx
0023: 41 FF C9           dec         r9d
0026: 66 41 0F FD 0C D2  paddw       xmm1,xmmword ptr [r10+rdx*8]
002C: 66 0F 6F C1        movdqa      xmm0,xmm1
0030: 66 41 0F 6F 4C D2  movdqa      xmm1,xmmword ptr [r10+rdx*8+10h]
      10
0037: 66 0F 7F 41 F0     movdqa      xmmword ptr [rcx-10h],xmm0
003C: 75 D2              jne         0010

It turns out that there are a couple of weirdnesses involved when the x64 compiler hits this code. The x86 compiler is able to fold the x2 from the indexing expression and the x16 from the 128-bit (__m128i) element size into a single x32, which is then converted into a left shift by 5 bits (shl). The x64 compiler cannot, and ends up emitting x2 + x2 + x8 instead. Why?

The clue as to what's going on is in the MOVSXD instruction, which is a sign extension instruction. According to the C/C++ standards, integral expressions involving values smaller than int are promoted to int, which in the case of Win32/Win64 is 32-bit. Therefore, the expression (*indices++ * 2) gives a signed 32-bit integer. For the x86 compiler, pointers are also 32-bit and so it just shrugs and uses the signed value. The x64 compiler has to deal with a conversion to a 64-bit pointer offset, however, and seems unable to recognize that an unsigned char multiplied by 2 will never be negative, so it emits sign extension code.
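Written out with the implicit conversions made visible, the indexing expression effectively becomes something like this (a sketch only; idx and offset are illustrative names, not variables in the real code):

unsigned char idx = *indices++;          // loaded with movzx, already zero-extended
int offset = idx * 2;                    // idx is promoted to int, so the product is a signed int
const __m128i *kernel = &table[offset];  // the 32-bit signed offset must be widened to 64 bits
                                         // before the pointer add, hence the movsxd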

Therefore, we should change the code to remove the intermediate signed type:

#include <emmintrin.h>

void filter(__m128i *dst,
            const __m128i *table,
            const unsigned char *indices,
            unsigned n)
{
    __m128i acc0 = _mm_setzero_si128();
    __m128i acc1 = _mm_setzero_si128();

    while(n--) {
        const __m128i *kernel = &table[*indices++ * 2U];

        acc0 = _mm_add_epi16(acc1, kernel[0]);
        acc1 = kernel[1];

        *dst++ = acc0;
    }
}

Now we are multiplying by an unsigned integer, so the result must be an unsigned int. The x64 compiler now generates the following:

0090: 4C 8B D2           mov         r10,rdx
0093: 66 0F EF C9        pxor        xmm1,xmm1
0097: 45 85 C9           test        r9d,r9d
009A: 74 30              je          00CC
009C: 0F 1F 40 00        nop         dword ptr [rax]
00A0: 41 0F B6 00        movzx       eax,byte ptr [r8]
00A4: 48 83 C1 10        add         rcx,10h
00A8: 49 FF C0           inc         r8
00AB: 8D 14 00           lea         edx,[rax+rax]
00AE: 48 03 D2           add         rdx,rdx
00B1: 41 FF C9           dec         r9d
00B4: 66 41 0F FD 0C D2  paddw       xmm1,xmmword ptr [r10+rdx*8]
00BA: 66 0F 6F C1        movdqa      xmm0,xmm1
00BE: 66 41 0F 6F 4C D2  movdqa      xmm1,xmmword ptr [r10+rdx*8+10h]
      10
00C5: 66 0F 7F 41 F0     movdqa      xmmword ptr [rcx-10h],xmm0
00CA: 75 D4              jne         00A0

Better, but still not quite there. The x64 compiler no longer needs to sign extend the offset, and therefore can now take advantage of the implicit zero extension in x64 when working with 32-bit registers. (New x64 programmers are often confused by the compiler emitting MOV EAX, EAX instructions, which are not no-ops, as they zero the high dword.)

However, the compiler is still unable to fuse the additions together. A bit of experimentation with the kernel size multiplier reveals that the x64 compiler has an unusual attachment to the trick of doing an x2 add followed by an x8 scale in order to index 16-byte elements. In this particular case the two adds might be faster than a shift on some CPUs, but with larger multipliers the compiler generates a SHL followed by an ADD, which is never optimal. Therefore, let's take over the indexing entirely:

#include <emmintrin.h>

void filter(__m128i *dst,
            const __m128i *table,
            const unsigned char *indices,
            unsigned n)
{
    __m128i acc0 = _mm_setzero_si128();
    __m128i acc1 = _mm_setzero_si128();

    while(n--) {
        const __m128i *kernel = (const __m128i *)((const char *)table + (*indices++ * 32U));

        acc0 = _mm_add_epi16(acc1, kernel[0]);
        acc1 = kernel[1];

        *dst++ = acc0;
    }
}

Ugly? Definitely, but we're having to work around optimizer shortcomings here. Result:

0060: 41 0F B6 00        movzx       eax,byte ptr [r8]
0064: 48 83 C1 10        add         rcx,10h
0068: 49 FF C0           inc         r8
006B: C1 E0 05           shl         eax,5
006E: 41 FF C9           dec         r9d
0071: 66 0F FD 0C 10     paddw       xmm1,xmmword ptr [rax+rdx]
0076: 66 0F 6F C1        movdqa      xmm0,xmm1
007A: 66 0F 6F 4C 10 10  movdqa      xmm1,xmmword ptr [rax+rdx+10h]
0080: 66 0F 7F 41 F0     movdqa      xmmword ptr [rcx-10h],xmm0
0085: 75 D9              jne         0060

That's better.

Conclusion: check your critical loops after porting to x64, even if they're using intrinsics and don't require fixes. There may be changes in code generation due to both architectural and compiler differences.
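With Visual C++, one convenient way to do that check is to ask the compiler for an assembly listing (the file name below is just a placeholder):

cl /c /O2 /FAcs filter.cpp

The /FAcs switch writes a .cod file with source, assembly, and machine code interleaved, which makes the hot loop easy to find; running dumpbin /disasm on the compiled object or executable works as well.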

Comments

Comments posted:


GCC 4.6.1 x86-64 has the exact same behavior. This coincidence is peculiar: when checking school tests, if two students have the exact same mistake, the teacher suspects cheating.

Z.T. - 29 06 11 - 23:08


Well, if the compiler isn't clever enough to realize that a byte multiplied by two won't overflow into the sign bit then I wouldn't be surprised if it couldn't figure out that it can't overflow the 32-bit integer outright either. What happens if you give it a subtler hint by multiplying by 2ULL instead?

As an aside, working with (usually less-than-clever) 8-bit embedded compilers gives you something of an eye for spotting potential integer promotion inefficiencies.

doynax - 01 07 11 - 05:53


You still get screwed by type conversion rules in the above code - that's also why both GCC and VC++ agree about the way the address expression is generated. :)

To be more specific, the multiply by 2u is still done as a U32. So what the compiler sees (after the address expression for the table access has been generated) is, when written with explicit types, "U64_from_U32(U32_from_U8(*indices++) * 2u) * 16ull". For general expressions, the compiler isn't allowed to simplify this further; of course in this case we happen to know (because we start from a U8 so the first multiply can't overflow) that the expression is really the same as "U64_from_U8(*indices++) * 32ull". Your modified code (with the casting) takes a third option - it replaces the address expression with "U64_from_U32(U32_from_U8(*indices++) * 32u)" - again we know this can't overflow so it's identical. It still has one redundant conversion in it, but since x86-64 (and most other 64-bit architectures) implicitly zero-extend results of 32-bit computations by clearing the high 32 bits this conversion is free.

Note that you get a "shl eax, 5" not "shl rax, 5". That's because you still work in U32 temps. For a shift of 5 this doesn't matter, but if it's e.g. a shift of 3 you want (which could use the x86 scaled-index address modes) you really need to use the convert-U8-to-U64-then-multiply version.

The underlying problem for all of this is that the C(++) integral types are int-centric and try to do work with ints whenever possible; this was great for 16-bit processors with 16-bit ints or 32-bit processors with 32-bit ints, but now that we have 64-bit processors and use 32-bit ints it causes a lot of friction.

Fabian Giesen (link) - 01 07 11 - 13:05


You're right about a 64-bit cast also fixing the issue -- I had forgotten about that. (I'd cast to size_t instead of unsigned long long, but it's the same here.) However, looking at the C++ standard again, I'm not sure that the U64_from_U32() cast you mention actually exists. We know that it has to happen in order for the instruction addressing to occur, but we're talking about what the standard specifies directly, since the compiler is allowed to bypass any part of it that it can through the as-if rule. The indexing expression a[b] is converted to *(a + b), and since the "usual arithmetic conversions" don't include any special provisions for a pointer being involved, the offset will be promoted only up to int or unsigned int. The pointer+offset addition itself is then governed by 5.7p5, which simply describes the result of addition with a pointer without any description of specific types. I don't see any language that says that the integer value is to be promoted to pointer size prior to addition, and I'm not even sure that's possible given that there may not be such an integer type. This also fits more naturally with CPU architectures that allow indexing with a smaller data type than the base address, 6502 and 68000 coming to mind.

Again, though, there's nothing that prohibits the compiler from generating a single SHL instruction. The "as-if" rule would allow the compiler to remove all of the extraneous casts here according to detected value ranges. What we're really talking about here is just code generation quality.

Phaeron - 02 07 11 - 09:07


"I don't see any language that says that the integer value is to be promoted to pointer size prior to addition, and I'm not even sure that's possible given that there may not be such an integer type."
Correct; the cast to U64 here is not part of C(++) semantics, it's just there in the case of x86-64 (and all other 64-bit platforms that I've written code for, for that matter), where both pointers and GPRs happen to be 64 bits wide. That add has to be done *somehow*; either all values involved are promoted to a common size prior to addition (which I'm implicitly assuming in my post), or there is an add instruction that takes mixed-size operands, or there is an addressing mode that allows a smaller-than-pointer-sized index. In the case of x86-64, there are no mixed-size adds and all the address calculations are done in full 64 bits, so it must be the first option.

"Again, though, there's nothing that prohibits the compiler from generating a single SHL instruction."
Yes, in this case; but if you were passing a "const unsigned int *indices" instead, there would be; that's not meant to imply that compilers shouldn't be doing a better job at this, they should. But I do think that the way the C/C++ type promotion rules work out here (doing the multiply on int-sized values), while well-defined and internally consistent, is spectacularly non-intuitive and bound to cause numerous people to unknowingly shoot themselves in the foot; that's unfortunate.

"The "as-if" rule would allow the compiler to remove all of the extraneous casts here according to detected value ranges. What we're really talking about here is just code generation quality."
Yes - it's just that it does require an additional pass of data-flow analysis (tracking upper and lower bounds for arithmetic expressions based on the original types to check whether an overflow might've occurred or not) that wasn't necessary in 32-bit code; we'll get there eventually, but progress on these kinds of optimizations is way slower than we'd like, especially since all the major compilers have been distracted by other issues for the past few years: MS has been pouring most of its resources into .NET for years, only recently realizing that this whole "native code" thing might not just go away after all :), GCC is still caught up in a cascade of major infrastructure upgrades to drag their compiler kicking and screaming into this millennium (plus they have this unhealthy culture of language lawyering beating out serious practical concerns sometimes), and ICC's main mission still seems to be racking up good SPEC CPU/FP results. LLVM to the rescue? We'll see :)

Fabian Giesen (link) - 02 07 11 - 15:38


Any reason why writing the asm inline directly is not a better idea? It might be harder to read but you could solve that by leaving the C/pseudocode in comments.

Kentaro (link) - 03 07 11 - 12:38


Kentaro: VC++ on x86-64 doesn't support inline ASM at all, only intrinsics (they had to add a bunch of new intrinsics to make that viable). Which is totally moronic, but there you go.

Fabian Giesen (link) - 03 07 11 - 14:57


Here is what Intel Compiler 12.0.4.147 generates for your C code:

test r9d, r9d
pxor xmm0, xmm0
lea eax, DWORD PTR [-1+r9]
je .B1.5
.B1.3::
movzx r9d, BYTE PTR [r8]
movdqa xmm1, xmm0
add r9d, r9d
dec eax
inc r8
shl r9, 4
paddw xmm1, XMMWORD PTR [r9+rdx]
movdqa xmm0, XMMWORD PTR [16+r9+rdx]
movdqa XMMWORD PTR [rcx], xmm1
add rcx, 16
cmp eax, -1
jne .B1.3
.B1.5::
ret

So, is that better than MSVC/GCC or worse?

Igor Levicki (link) - 10 07 11 - 08:58
