§ MMX code generation in Visual C++ 2010 Express

After installing Visual C++ 2010 Express, I decided to try MMX code generation on a whim:

#include <emmintrin.h>

double ComputeVariance(const unsigned char *samples, int quads) {
    __m64 zero = _mm_setzero_si64();
    __m64 one = _mm_set1_pi16(1);
    __m64 sum = zero;
    __m64 sumsq = zero;

    for(int i=0; i<quads; ++i) {
        int raw = *(int *)samples;
        samples += 4;

        __m64 p = _m_punpcklbw(_m_from_int(raw), zero);
        __m64 x = _m_pmaddwd(p, one);
        __m64 x2 = _m_pmaddwd(p, p);

        sum = _m_paddd(sum, x);
        sumsq = _m_paddd(sumsq, x2);
    }

    unsigned int isum = _m_to_int(_m_paddd(_m_psrlqi(sum, 32), sum));
    unsigned int isumsq = _m_to_int(_m_paddd(_m_psrlqi(sumsq, 32), sumsq));

    _mm_empty();

    double n = (double)quads * 4;
    double fsum = (double)isum;
    double fsumsq = (double)isumsq;

    return (n*fsumsq - fsum*fsum) / (n*(n-1));
}

This routine uses MMX intrinsics to compute the variance of a series of samples, stored as unsigned bytes. SSE intrinsics got some attention in the VS2010 compiler, but MMX intrinsics have long been the neglected stepchild and I hadn't heard anything about them. Well, let's look at the disassembly:

VS2008 SP1 (VC9):

00: push ebp
01: mov ebp,esp
03: and esp,0FFFFFFF8h
06: mov edx,dword ptr [ebp+0Ch]
09: pxor mm3,mm3
0C: mov eax,1
11: movd mm0,eax
14: movq mm1,mm0
17: punpcklwd mm1,mm0
1A: movq mm0,mm1
1D: punpcklwd mm1,mm0
20: sub esp,8
23: movq mm4,mm1
26: movq mm1,mm3
29: movq mm2,mm3
2C: test edx,edx
2E: jle 0000005B
30: mov ecx,dword ptr [ebp+8]

33: mov eax,dword ptr [ecx]
35: movq mm5,mm3
38: movd mm0,eax
3B: punpcklbw mm0,mm5
3E: movq mm5,mm0
41: movq mm6,mm4
44: pmaddwd mm5,mm6
47: paddd mm1,mm5
4A: add ecx,4
4D: sub edx,1
50: movq mm5,mm0
53: pmaddwd mm5,mm0
56: paddd mm2,mm5
59: jne 00000033

5B: movq mm0,mm1
5E: psrlq mm0,20h
62: paddd mm0,mm1
65: movd eax,mm0
68: movq mm0,mm2
6B: psrlq mm0,20h
6F: paddd mm0,mm2
72: movd ecx,mm0
75: emms
77: fild dword ptr [ebp+0Ch]
7A: mov dword ptr [esp+4],eax
7E: fmul qword ptr [__real@4010000000000000]
84: fild dword ptr [esp+4]
88: test eax,eax
8A: jge 00000092
8C: fadd qword ptr [__real@41f0000000000000]
92: mov dword ptr [esp+4],ecx
96: fild dword ptr [esp+4]
9A: test ecx,ecx
9C: jge 000000A4
9E: fadd qword ptr [__real@41f0000000000000]
A4: fmul st,st(2)
A6: fld st(1)
A8: fmulp st(2),st
AA: fsubrp st(1),st
AC: fld st(1)
AE: fsub qword ptr [__real@3ff0000000000000]
B4: fmulp st(2),st
B6: fdivrp st(1),st
B8: mov esp,ebp
BA: pop ebp
BB: ret

VS2010 (VC10):

00: mov edx,dword ptr [esp+8]
04: pxor mm3,mm3
07: mov eax,1
0C: movd mm0,eax
0F: punpcklwd mm0,mm0
12: punpcklwd mm0,mm0
15: movq mm4,mm0
18: movq mm1,mm3
1B: movq mm2,mm3
1E: test edx,edx
20: jle 00000046
22: mov ecx,dword ptr [esp+4]

26: mov eax,dword ptr [ecx]
28: movd mm0,eax
2B: punpcklbw mm0,mm3
2E: movq mm5,mm0
31: pmaddwd mm5,mm4
34: paddd mm1,mm5
37: add ecx,4
3A: dec edx
3B: movq mm5,mm0
3E: pmaddwd mm5,mm0
41: paddd mm2,mm5
44: jne 00000026

46: movq mm0,mm1
49: psrlq mm0,20h
4D: paddd mm0,mm1
50: movd eax,mm0
53: movq mm0,mm2
56: psrlq mm0,20h
5A: paddd mm0,mm2
5D: movd ecx,mm0
60: emms
62: fild dword ptr [esp+8]
66: mov dword ptr [esp+8],eax
6A: fmul qword ptr [__real@4010000000000000]
70: fild dword ptr [esp+8]
74: test eax,eax
76: jns 0000007E
78: fadd qword ptr [__real@41f0000000000000]
7E: mov dword ptr [esp+8],ecx
82: fild dword ptr [esp+8]
86: test ecx,ecx
88: jns 00000090
8A: fadd qword ptr [__real@41f0000000000000]
90: fmul st,st(2)
92: fld st(1)
94: fmulp st(2),st
96: fsubrp st(1),st
98: fld st(1)
9A: fsub qword ptr [__real@3ff0000000000000]
A0: fmulp st(2),st
A2: fdivrp st(1),st
A4: ret

The inner loop runs from 33 to 59 in the VC9 output and from 26 to 44 in the VC10 output (the target and location of the jne in each). The first thing I'll point out is that the code is correct. MMX intrinsics were troublesome in VC7.1 because the compiler had a tendency to hoist floating-point operations above calls to _mm_empty(), a bug that fortunately has long been fixed.
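One way to check the routine's result independently of whatever MMX code the compiler emits is to compare it against a plain scalar version of the same formula. The following is a minimal sketch under that assumption; ComputeVarianceRef is a hypothetical name that does not appear in the original code, and it takes a sample count rather than a count of quads. It also uses 64-bit accumulators, whereas the 32-bit totals in the MMX version wrap once the sum of squares exceeds 2^32, which for all-255 input happens at roughly 66,000 samples.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar cross-check for the MMX routine: same formula, no intrinsics.
// ComputeVarianceRef is a hypothetical name, not part of the original post.
double ComputeVarianceRef(const unsigned char *samples, std::size_t count) {
    assert(count >= 2);  // n*(n-1) in the denominator must be nonzero

    // 64-bit accumulators, so the totals do not wrap the way 32-bit
    // MMX lanes would on a very large buffer.
    std::uint64_t sum = 0;
    std::uint64_t sumsq = 0;

    for (std::size_t i = 0; i < count; ++i) {
        const std::uint64_t s = samples[i];
        sum += s;
        sumsq += s * s;
    }

    const double n = static_cast<double>(count);
    const double fsum = static_cast<double>(sum);
    const double fsumsq = static_cast<double>(sumsq);

    // Unbiased sample variance: (n*sum(x^2) - sum(x)^2) / (n*(n-1)).
    return (n * fsumsq - fsum * fsum) / (n * (n - 1));
}
```

For a buffer processed as quads by the MMX routine, the matching call would be ComputeVarianceRef(samples, quads * 4).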

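I've omitted the disassembly for VS2005 SP1 because it's nearly the same as VS2008 SP1, except for a couple of very minor differences like add eax,1 vs. inc eax. One immediately noticeable difference is that the VS2010 compiler (VC10) generated smaller code than VS2008 SP1 (VC9), ~13% shorter. Digging into the details, we can see that:

- VC10 no longer sets up an aligned stack frame. The push ebp / mov ebp,esp / and esp,0FFFFFFF8h / sub esp,8 prologue and the mov esp,ebp / pop ebp epilogue are gone, and the integer-to-float temporary is stored over the argument slot at [esp+8] instead.
- Redundant MMX register copies are gone. The (1, 1, 1, 1) setup uses punpcklwd mm0,mm0 directly, and the inner loop drops from 14 instructions to 12 because punpcklbw and pmaddwd now use mm3 and mm4 without copying them to mm5 and mm6 first.
- The loop counter uses dec edx instead of sub edx,1, and the sign checks use jns instead of jge.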

This is a bit of a nice surprise, given that I hadn't expected any improvement in MMX code generation at all. The reduction in code size is also accompanied by a slight increase in execution speed, which I measure at 2412 clocks (VC10) vs. 2537 clocks (VC9) for a 2K block on my 45nm Core 2. 5% isn't much, but I'll take it.

Unfortunately, although the SSE set intrinsics have been improved, the MMX ones haven't: the compiler still emits a bunch of code to build the (1, 1, 1, 1) vector at runtime instead of computing the final value at compile time. The compiler is also still unable to emit a direct 32-bit load into an MMX register, always preferring to bounce through GPRs (mov eax,dword ptr [ecx] followed by movd mm0,eax). That is the main problem I've had with the VC++ implementation of MMX/SSE2 intrinsics, as I work a lot with 32-bit pixels.

I did check SSE2 code generation as well, and the differences there are fewer. VC8/9 already had improvements in SSE copy propagation, so no advantage there. However, VC10 still pulls ahead due to omitting the aligned stack frame and much better code generation for the set intrinsics. This means that entry/initialization code will tend to benefit a lot more than inner loops. (I can no longer use _mm_set_epi8() as my poster child for bad code generation; it was my favorite as it generated 18 instructions in VC8 and 74 instructions in VC9. VC10 generates a single instruction with constant input.)
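For readers who want something concrete to compare against, here is a rough SSE2 translation of the MMX routine. It is a sketch only, not the SSE2 code that was tested for this post: the name ComputeVarianceSSE2 and the HAVE_SSE2 macro are made up for illustration, the 32-bit load goes through memcpy rather than a pointer cast, and a scalar fallback is included so the block still builds on targets without SSE2. What a given compiler emits for the set intrinsic and the 32-bit load is exactly the kind of thing worth checking in the disassembly.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1  // hypothetical helper macro for this sketch
#endif

// Hypothetical SSE2 counterpart of ComputeVariance; same 4-bytes-per-step loop.
double ComputeVarianceSSE2(const unsigned char *samples, int quads) {
    assert(quads >= 1);
    std::uint32_t isum = 0;
    std::uint32_t isumsq = 0;

#ifdef HAVE_SSE2
    const __m128i zero = _mm_setzero_si128();
    const __m128i one = _mm_set1_epi16(1);
    __m128i sum = zero;
    __m128i sumsq = zero;

    for (int i = 0; i < quads; ++i) {
        int raw;
        std::memcpy(&raw, samples + 4 * i, sizeof raw);  // 32-bit load

        // Widen 4 bytes to 4 words in the low half; the high half stays zero.
        __m128i p = _mm_unpacklo_epi8(_mm_cvtsi32_si128(raw), zero);

        // pmaddwd against ones sums adjacent words; against p sums squares.
        sum = _mm_add_epi32(sum, _mm_madd_epi16(p, one));
        sumsq = _mm_add_epi32(sumsq, _mm_madd_epi16(p, p));
    }

    // Only dword lanes 0 and 1 are nonzero; fold lane 1 into lane 0.
    isum = static_cast<std::uint32_t>(
        _mm_cvtsi128_si32(_mm_add_epi32(_mm_srli_epi64(sum, 32), sum)));
    isumsq = static_cast<std::uint32_t>(
        _mm_cvtsi128_si32(_mm_add_epi32(_mm_srli_epi64(sumsq, 32), sumsq)));
#else
    // Scalar fallback with the same 32-bit wraparound behavior.
    for (int i = 0; i < quads * 4; ++i) {
        const std::uint32_t s = samples[i];
        isum += s;
        isumsq += s * s;
    }
#endif

    const double n = static_cast<double>(quads) * 4;
    const double fsum = static_cast<double>(isum);
    const double fsumsq = static_cast<double>(isumsq);
    return (n * fsumsq - fsum * fsum) / (n * (n - 1));
}
```

Unlike the MMX version, this needs no _mm_empty() call before the floating-point math, since SSE2 integer operations on XMM registers do not share state with the x87 stack.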

It's nice to see improvement in intrinsics support, but after all this time, I still don't like intrinsics that much. I've warmed up to them a bit, though, since my tolerance for fiddling with manual register allocation is not quite what it used to be and they're handy for prototyping. My wish list:

Comments


It is better to avoid pointers:

mov eax,dword ptr [ecx]
...
add ecx,4
dec edx

_m_from_int(raw) ==> _m_from_int(((int*)samples)[i])

mov esi, DWORD PTR [edx+eax*4]
...
inc eax

And __restrict is also a huge improvement for intrinsics, if there are both input and output buffers to work with.

Gabest - 16 04 10 - 00:49


So are you going to use SSE4.2 or later extensions in VirtualDub? What about SSE4a, are any of the few instructions there useful in VirtualDub?

Yuhong Bao - 16 04 10 - 12:49


@Gabest:
I've found indexing to be hit or miss; sometimes the additional register references and longer instructions are actually a minus. In this case, though, your version is 2% faster on my machine (2362 clocks vs. 2412 clocks). Of course, unrolling to use dual accumulators would help even more.

@Yuhong Bao:
Not really. The only instruction that looks mildly interesting is POPCNT, and I don't generally have a need for that in inner loops. Besides, I don't have a CPU that supports those instructions.

Phaeron - 16 04 10 - 15:44
