§ Compiler intrinsics... again

You know that episode of The Simpsons where Bart reaches for the electrified cookie jar and goes "ow," and then just keeps doing it again and again? Yeah, I'm like that with compiler intrinsics.

Let's take a simple routine:

__m128i fold1(__m128i x) {
    __m128i mask = _mm_set1_epi16(0x5555);
    return _mm_add_epi16(_mm_and_si128(mask, _mm_srli_epi16(x, 1)), _mm_and_si128(mask, x));
}

This is one step of a population count routine, which folds pairs of bits together into two-bit counts. (Yeah, I know this can be done better with subtraction, but popcount isn't the subject here.) Run this through VC10, and you get this:

movdqa      xmm1,xmmword ptr [__xmm@0]
movdqa      xmm2,xmm0
movdqa      xmm0,xmm1
movdqa      xmm3,xmm2
psrlw       xmm3,1
pand        xmm0,xmm3
pand        xmm1,xmm2
paddw       xmm0,xmm1
ret

Unnecessary moves blah blah blah... you've heard it here before. Then again, let's take a closer look. Why did the compiler emit the MOVDQA XMM3, XMM2 instruction? Hmm, it's because it did the shift next, but it still needed to keep "x" around for the second operation. And how about that PAND that follows? Well, it couldn't modify "mask," so it copied that too. Waaaiit a minute, it's just doing everything exactly the way I told it. That might be OK if x86 used three-argument form instructions, but since x86 is two-argument, that kinda sucks. What about if we rewrote the routine this way:

__m128i fold2(__m128i x) {
    __m128i mask = _mm_set1_epi16(0x5555);
    return _mm_add_epi16(_mm_and_si128(_mm_srli_epi16(x, 1), mask), _mm_and_si128(mask, x));
}

movdqa      xmm1,xmmword ptr [__xmm@0]
movdqa      xmm2,xmm0
psrlw       xmm0,1
pand        xmm0,xmm1
pand        xmm1,xmm2
paddw       xmm0,xmm1
ret

Well, that looks a bit better. It appears that Visual C++ is unable to take advantage of the fact that the binary operations used here are commutative, which means that the efficiency of the code generated can differ significantly based on the order of the arguments even though the result is the same. The upside is that you can swap around arguments to get better code; the downside is that you're doing what the code generator should be doing. Interestingly, based on some experiments it looks like the code generator can do this for scalar operations, so something didn't get hooked up or extended to the intrinsics portion.

Anyway, if you've got extra moves showing up in the disassembly when using intrinsics, try shaking the expression tree a bit and see if some of the moves fall out.
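Before shuffling arguments around, it's worth confirming that the reordered routine really does return the same bits. Here's a minimal sketch of such a check, assuming an x86 compiler with SSE2 enabled. fold1() and fold2() are the routines from above; fold_sub() is my guess at the subtraction form mentioned earlier, and same() and folds_agree() are hypothetical helper names made up for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Original argument order from the post.
static __m128i fold1(__m128i x) {
    __m128i mask = _mm_set1_epi16(0x5555);
    return _mm_add_epi16(_mm_and_si128(mask, _mm_srli_epi16(x, 1)), _mm_and_si128(mask, x));
}

// Same expression with the first AND's arguments swapped.
static __m128i fold2(__m128i x) {
    __m128i mask = _mm_set1_epi16(0x5555);
    return _mm_add_epi16(_mm_and_si128(_mm_srli_epi16(x, 1), mask), _mm_and_si128(mask, x));
}

// Assumed subtraction variant: each bit pair (2a + b) minus a leaves a + b.
static __m128i fold_sub(__m128i x) {
    __m128i mask = _mm_set1_epi16(0x5555);
    return _mm_sub_epi16(x, _mm_and_si128(_mm_srli_epi16(x, 1), mask));
}

// Hypothetical helper: true if the two vectors are bitwise identical.
static bool same(__m128i a, __m128i b) {
    return _mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) == 0xFFFF;
}

// Hypothetical helper: runs every possible 16-bit lane value through all
// three variants and reports whether they always agree.
static bool folds_agree() {
    for (uint32_t v = 0; v < 0x10000; ++v) {
        __m128i x = _mm_set1_epi16((short)v);
        __m128i expected = fold1(x);
        if (!same(expected, fold2(x)) || !same(expected, fold_sub(x)))
            return false;
    }
    return true;
}
```

Since every 16-bit lane is processed independently, sweeping all 65536 lane values covers the whole input space. This only checks results, of course; whether a given ordering actually drops any moves is something you still have to read out of the disassembly.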


§ Your video player is not awesome

I'm pissed.

I needed to install a third-party video player recently to diagnose a problem with paletted video, only to discover that it was really, fatally broken in that regard. Okay, I can't give too much crap for that, because I've broken paletted video plenty of times in VirtualDub. However, this is the first time that I've seen a decoder so broken that it not only uses the wrong stride to decode the video, but the stride it uses depends on the size of the window. At that point I decided that getting a paletted video stream to work in this player was useless, and decided to uninstall it.

That's when I found out how much damage this player had done to my system.

You see, this player is so awesome that it automatically decided to silently register itself as the default player for ALL video types, including AVI, MPEG, and ASF. Hey, it plays Flash video too, so why not take SWF? People store MPEG video in DAT files, so let's take that too, since nobody would ever use .DAT for anything else, right? And while we're at it, we'll take the .AVS Avisynth extension, because obviously if you're using an Avisynth script it's because you just want to play the result. The File menu in Explorer is a bit lonely too, so we'll add half a dozen menu entries just for whatever you'd want to do with this wonderful player.

Okay, I've been through this before... just reassociate the files with the One True Player(tm) (i.e. Media Player Classic) and go on. Or not. You see, this player also decided to register all new file types in Explorer, changing every single multimedia file type to use its own icon and label, so that instead of "Video file" for .AVI, it would show up as "FOO - Video File", even if the type was changed back to use a different player than FOO. Which made me very unhappy as I then had to use Registry Editor to manually fix each and every single file type that had been farked up by this stupid player application, thus ensuring that this player stays permanently on my Do Not Install shiatlist.

Don't encourage programmers who do selfish things like this.


§ DirectShow gone awry

A few days ago I discovered that some prototype DirectShow-based code I had was suddenly taking a lot longer to open files. By a lot longer, I mean up to a minute -- at full CPU. As you might imagine, this was pretty irritating, especially since not only was it running at full CPU, but it was doing something that made the entire system performance especially suck during that time. Great.

A bit of digging with the mighty F12 profiler -- actually, I guess it was Ctrl+Break, since I was using CDB -- revealed it to be the DirectShow filter graph "intelligent connect" code. Specifically, it was taking an abnormally long time to connect the audio sample grabber. "Intelligent connect" in DirectShow refers to the way in which the filter graph manager will automatically find a sequence of intermediate filters to connect two filters together whenever a direct connection isn't possible. For instance, trying to connect a renderer that wants uncompressed video to a compressed video source will result in a video decoder being stuck in between. As you might imagine, this is both handy and hazardous, the latter coming into play when the filter graph comes up with some horror like MJPEG Compressor + MJPEG Decompressor to do a color conversion. I had suspected that at first, but inspection of the resulting filter graph via GraphEdit's remote connect function didn't show anything unusual.

Some more investigation with the debugger revealed that a lot of time was being spent in creating and destroying DirectDraw surfaces, which some filter was using as part of its media type check -- not a great idea, considering how expensive it is and how often media type queries happen. For a moment, I had thought maybe some application I had installed recently had added a ton of slow or broken filters, which I'd have to hunt down and then uninstall. The situation was pretty bad too, because the filter graph manager was recursing a lot and trying some pretty deep chains of filters. Then it dawned on me... why was the filter graph manager trying so many video filters to connect an audio filter? Shouldn't it know that it already had an audio stream, and that only audio filters should be checked? Unless....

I checked the connection code again, and it turned out that I wasn't trying to connect an audio pin, but rather a source type pin. That meant that the intelligent connect code had to figure out both the demultiplex and decoder filters for the intermediate connections. Then, after checking the sample grabber code, I had a light bulb moment. It turns out that I hadn't reimplemented the EnumMediaTypes() code on the sample grabber's input pin, so it was returning no media type structures. That meant that the filter graph manager was trying to establish a connection with the following media type information:

The sample grabber did check the media type in the query function, so it only accepted audio connections. However, the filter graph manager had no way to know this since EnumMediaTypes() returned nothing, so the only way it found a connection was to do a brute force search through all possible combinations of filters that would make Dijkstra proud. And when you have M filters that can be combined up to a chain N long, the result unsurprisingly is a whole lot of CPU time spent trying connections. So I reimplemented EnumMediaTypes() to return a single entry with the media type set properly, and suddenly load time dropped to sub-second range.
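To get a feel for why the brute force search hurts so much, here's a toy model of the search space, nothing more. It assumes any of M filters can go in any slot of a chain (repeats allowed) and counts the chains of length 1 through N. The function name candidate_chains() is made up for this sketch, and the real filter graph manager prunes candidates in ways this model ignores, so treat the numbers as a rough upper bound on that assumption.

```cpp
#include <cassert>
#include <cstdint>

// Toy model (hypothetical, not the filter graph manager's actual algorithm):
// counts the chains of length 1..max_len that can be built from num_filters
// filters when any filter may appear in any position.
// That is M + M^2 + ... + M^N. Overflow is not checked.
static uint64_t candidate_chains(uint64_t num_filters, unsigned max_len) {
    uint64_t total = 0;
    uint64_t chains_of_len = 1;
    for (unsigned len = 1; len <= max_len; ++len) {
        chains_of_len *= num_filters;  // M^len chains of exactly this length
        total += chains_of_len;
    }
    return total;
}
```

With 50 filters and chains up to 4 deep, candidate_chains(50, 4) comes out to 6,377,550 candidates. If each attempt involves something as slow as creating a DirectDraw surface, it doesn't take many of those to burn a minute.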

Moral of the story? Make sure your filter isn't being too ambiguous with its reported connection requirements.


§ Optimizing a FIR filter routine

Recently I had to implement a low-pass audio filter in software. A low-pass filter is so named because it passes low frequencies while muting high ones, similar to what you'd get by turning treble all the way down on a stereo. Low-pass filters have a number of uses, the particular use in this case being to prevent aliasing in a subsequent resampling pass.

There are many ways to implement a low-pass filter, but the method that I used was a finite impulse response (FIR) filter. FIR filters have a few advantages, such as simplicity of implementation in software and ease of making linear-phase filters. The cutoff frequency was fairly high, so the FIR filter kernel didn't need that many taps -- a 15-tap symmetric filter was enough.

To retell the tale, let's start with this routine:

void filter(float *dst, const float *src, size_t n, const float *kernel) {
    const float k0 = kernel[0];
    const float k1 = kernel[1];
    const float k2 = kernel[2];
    const float k3 = kernel[3];
    const float k4 = kernel[4];
    const float k5 = kernel[5];
    const float k6 = kernel[6];
    const float k7 = kernel[7];
    do {
        float v = src[7] * k0
                + (src[ 6] + src[ 8]) * k1
                + (src[ 5] + src[ 9]) * k2
                + (src[ 4] + src[10]) * k3
                + (src[ 3] + src[11]) * k4
                + (src[ 2] + src[12]) * k5
                + (src[ 1] + src[13]) * k6
                + (src[ 0] + src[14]) * k7;
        ++src;
        *dst++ = v;
    } while(--n);
}
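The routine above folds the symmetric kernel in half: kernel[0] is the center tap and kernel[7] the outermost pair, so each output needs 15 consecutive input samples and n outputs need n + 14 inputs. A minimal sketch for checking that against a plain 15-tap convolution might look like this. filter() here is a loop-form restatement of the routine above, while filter_reference() and max_difference() are hypothetical names, and the kernel values are arbitrary test numbers rather than a real low-pass design.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Loop form of the folded routine above: symmetric taps share one multiply.
static void filter(float *dst, const float *src, size_t n, const float *kernel) {
    for (size_t i = 0; i < n; ++i) {
        float v = src[i + 7] * kernel[0];
        for (int k = 1; k <= 7; ++k)
            v += (src[i + 7 - k] + src[i + 7 + k]) * kernel[k];
        dst[i] = v;
    }
}

// Hypothetical reference: expands the half kernel to all 15 taps and
// convolves directly, one multiply per tap.
static void filter_reference(float *dst, const float *src, size_t n, const float *kernel) {
    float full[15];
    for (int j = 0; j < 15; ++j)
        full[j] = kernel[j < 7 ? 7 - j : j - 7];
    for (size_t i = 0; i < n; ++i) {
        float v = 0.0f;
        for (int j = 0; j < 15; ++j)
            v += src[i + j] * full[j];
        dst[i] = v;
    }
}

// Hypothetical helper: largest absolute difference between the two versions
// over n output samples of a sine wave test signal.
static float max_difference(size_t n) {
    // Arbitrary test coefficients, center tap first (an assumption for testing).
    const float kernel[8] = {0.25f, 0.2f, 0.1f, 0.05f, 0.02f, 0.01f, 0.005f, -0.01f};
    std::vector<float> src(n + 14), folded(n), direct(n);
    for (size_t i = 0; i < src.size(); ++i)
        src[i] = std::sin(0.1f * (float)i);
    filter(folded.data(), src.data(), n, kernel);
    filter_reference(direct.data(), src.data(), n, kernel);
    float worst = 0.0f;
    for (size_t i = 0; i < n; ++i)
        worst = std::max(worst, std::fabs(folded[i] - direct[i]));
    return worst;
}
```

The two versions add terms in a different order, so the results can differ by a few units of float rounding error; compare against a small tolerance rather than for exact equality.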

§ Hardware overlays in Windows 7 RTM

A while back I wrote about Direct3D9Ex overlays in Windows 7, based on some testing I'd done in Windows 7 RC. Well, I have Windows 7 x64 RTM installed now, so I thought I'd rerun my tests.

Same results. Can't create YUV overlays, LIMITEDRGB flag does nothing, no stretching supported.

Another API broken from the beginning. :(
