§ ¶Weird optimizer behavior
Take this simple function:
int foo(int p0, int p1) {
    return p1 + ((p0 - p1) & ((p0 - p1) << 31));
}
This should generate four operations: a subtract, a shift, a bitwise AND, and an add. Well, with VS2005 it generates something a bit different:
00000000: 8B 54 24 08 mov edx,dword ptr [esp+8]
00000004: 8B 4C 24 04 mov ecx,dword ptr [esp+4]
00000008: 2B CA sub ecx,edx
0000000A: 8B C1 mov eax,ecx
0000000C: 69 C0 00 00 00 80 imul eax,eax,80000000h ???
00000012: 23 C1 and eax,ecx
00000014: 03 C2 add eax,edx
00000016: C3 ret
Somehow the optimizer changed the shift to a multiply, which is a serious pessimization and thus results in a rare case where the code is actually faster with the optimizer turned off!
Oddly enough, manually hoisting out the common subexpression (p0 - p1) fixes the problem. I've seen this behavior before in VC++ with 64-bit expressions of the form (a*b+c). My guess is that the compiler normally converts left shifts to multiplications and then converts them back later, but somehow the CSE optimization breaks this. Yet another reason that being lazy and repeating common subexpressions all over the place while relying on the optimizer to clean up your mess isn't the greatest idea.
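For reference, the hoisted version looks something like this sketch (the temporary name is mine, and the broken left shift is kept exactly as in the original):
int foo_hoisted(int p0, int p1) {
    // Pull out the common subexpression by hand; as noted above, with
    // VS2005 this gets rid of the imul and restores the expected shift.
    int d = p0 - p1;
    return p1 + (d & (d << 31));
}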
The reason for the repeated subexpression, by the way, is that this is an expanded version of a min() macro. I called the function foo above instead of min because it's actually broken -- the left shift should be a right shift. As long as you can put up with the portability and range quirks, this strange formulation has the advantages of (a) being branchless, and (b) sharing a lot of code with a max() on the same arguments.
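For the curious, the corrected formulation with the right shift looks like the following sketch. It leans on an arithmetic right shift of a negative int and on (a - b) not overflowing, which are exactly the portability and range quirks mentioned above:
int branchless_min(int a, int b) {
    int d = a - b;
    return b + (d & (d >> 31));    // mask is all ones when a < b, so this yields a; else b
}
int branchless_max(int a, int b) {
    int d = a - b;
    return a - (d & (d >> 31));    // subtracts d when a < b, yielding b; else a
}
Both share the (d & (d >> 31)) term, which is where the code sharing with max() comes from.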
(Read more....)§ ¶VirtualDub 1.8.7 and 1.9.0 released
I haven't had as much time as I'd like to work on VirtualDub, which is unfortunately why it's been three months since the last release. Time to rectify that.
Both 1.8.7 and 1.9.0 are now up on SourceForge. 1.8.7 is a bugfix-only release, with the one major fix being to the distributed job system. It turns out that the distributed job code wasn't that stable and would often attempt to run the same job on multiple machines, due essentially to a race condition in the filesystem. The new version now has logic to detect job start conflicts and retry with an exponential delay, which should be more reliable. I also rewrote the conflict resolution logic, which is now more similar to the two-way and three-way merges that a revision control system has to deal with.
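This isn't the actual VirtualDub code, but the detect-and-retry idea looks roughly like the sketch below; TryClaimJob() and RunJob() are hypothetical stand-ins for the real filesystem-based job logic:
#include <windows.h>
bool TryClaimJob(int jobId);    // hypothetical: atomically mark the job as started
void RunJob(int jobId);         // hypothetical: actually run the job on this machine
bool RunJobWithRetry(int jobId) {
    DWORD delayMs = 100;                    // initial retry delay
    for (int attempt = 0; attempt < 8; ++attempt) {
        if (TryClaimJob(jobId)) {           // detect a job start conflict
            RunJob(jobId);
            return true;
        }
        Sleep(delayMs);                     // back off before retrying
        delayMs *= 2;                       // exponential delay
    }
    return false;                           // another machine got it; move on
}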
1.9.0 is of course the new experimental build and contains a number of new features and changes. First, I spent some time closing the gap in functionality between the x86 and AMD64 builds, so although the AMD64 build may still not be as well optimized, several features that were previously absent from the AMD64 build are now implemented. I've also thrown in a built-in AMD64-capable Huffyuv decoder that handles some of the popular post-2.1.1 extensions.
Second, the internal display and blitter libraries got overhauled quite a bit. The uberblit system that backs the resampler in the 1.8.x series has been cleaned up and expanded, and now handles many of the complex blit scenarios that were previously handled by custom code or multi-stage blits. As a result, VirtualDub 1.9.0 can now handle several new image formats, including the 10-bit per channel v210 format and the interleaved NV12 format. The display library has also been upgraded to handle the new formats, and in particular the Direct3D module can now accelerate display of 10 bit/channel v210 video with dithering. The new formats are not yet exposed to video filters -- mainly because the thought of trying to work directly in v210 scares me -- although I'm not ruling out the possibility of a 14-bit fixed point linear color format in the future.
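To give a feel for why working directly in v210 is scary: each little-endian 32-bit word packs three 10-bit components, so even touching a single value means shifting and masking, and that's before the 4:2:2 component ordering across words gets involved. A minimal sketch of just the per-word unpacking (component ordering deliberately left out):
void UnpackV210Word(unsigned w, unsigned &c0, unsigned &c1, unsigned &c2) {
    c0 = w & 0x3FF;             // bits 0-9
    c1 = (w >> 10) & 0x3FF;     // bits 10-19
    c2 = (w >> 20) & 0x3FF;     // bits 20-29 (bits 30-31 unused)
}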
Changelists are after the jump.
(Read more....)§ ¶"10 is the new 6" my #*&
I'm trying to give the Visual Studio team the benefit of the doubt with their "10 is the new 6" push, but I just tested something in the VS2010 CTP and nearly blew my top. Therefore, it's rant time.
A long time ago, I filed a bug on Visual Studio 2005, or really, VS2003:
https://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=100052
This bug's really simple: if you have a stack of overlapping controls in the Visual Studio dialog editor and click on them, the editor selects the one on the bottom. It's a pain in the butt, because when you've got a bunch of overlapping controls that you're trying to fix -- such as from a copy-and-paste -- you try clicking on one to drag it and you end up messing up the positioning of some other control, so you've got to undo that and then try to find some way to select the one you actually wanted. You can't marquee the errant control, because the marquee is inclusive and invariably also picks the group box that surrounds it. Shift-clicking out the unwanted controls is dangerous, because it sometimes registers as a double-click and then you get a random event handler added to your code or something.
I checked VS2010. It still picks the control on the bottom. It's still broken. It's been broken since Visual Studio .NET 2002, I've been waiting more than eight years for them to fix this bug, and it is STILL BROKEN. Does anyone actually use this anymore??
I really do want to give the Visual Studio team the benefit of the doubt. I do like a lot of the improvements in the compiler. Yet, I'm afraid that if I ever met a member of the IDE team, I would wrap my hands around his neck and strangle him for the sheer amount of pain his team has inflicted upon me. I mean, c'mon. What visual editor with draggable components selects the one on the bottom?? There are so many other pet peeves of mine that still aren't fixed. All I have to do is look at the nearly unchanged project settings dialog to get the sinking feeling that the team still doesn't really get what they need to do to achieve "10 is the new 6." Then I look at the new MSBuild-based C++ project system in progress, which takes more than 30 seconds to load the converted VirtualDub.sln and prints out a 400+ column command line by default for every file group that it builds, and I get really depressed. And when I look over the fence at other stuff like the XAML editor, things don't look much rosier over there, either.
Please, make VS2010 better. I tried Eclipse once and I hated it. I'd have to wear a bag over my head if I had to resort to EMACS. I don't want to succumb to the temptation of writing my own IDE. I don't need new features. I just need what's there now to work well.
(Read more....)§ ¶Good approximation, bad approximation
Numerical approximations are a bit of an art. There are frequently tradeoffs available between speed and accuracy, and knowing how much you can skimp on one to improve the other takes a lot of careful study.
Take the humble reciprocal operation, for instance.
The reciprocal, y = 1/x, is a fairly basic operation, but it's also an expensive one. It can easily be implemented in terms of a divide, but division is itself a very expensive operation — it can't be parallelized as easily as addition or multiplication and typically takes on the order of 10-20 times longer. There are algorithms to compute the reciprocal directly to varying levels of precision, and some CPUs provide acceleration for that. x86 CPUs are among them, providing the RCPSS and RCPPS opcodes to compute scalar and vectorized reciprocal approximations, respectively.
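In C or C++ the scalar version is reachable through the SSE intrinsics; here's a minimal sketch, with a true divide alongside for comparison:
#include <xmmintrin.h>
#include <stdio.h>
int main() {
    float x = 3.0f;
    float approx;
    _mm_store_ss(&approx, _mm_rcp_ss(_mm_set_ss(x)));    // RCPSS: ~12-bit approximation
    float exact = 1.0f / x;                              // full-precision divide
    printf("approx=%.9g  exact=%.9g  relerr=%g\n", approx, exact, (approx - exact) / exact);
    return 0;
}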
However, there is a gotcha.
For any approximation, the first question you should ask is how accurate it is. RCPSS and RCPPS are documented as having a relative error no greater than 1.5*2^-12, or approximately 12 bits of precision. That's fine, and good enough for a lot of purposes, especially with refinement. The second question you should ask is whether there are any special values involved that deserve special attention. I can think of five that ideally should be exact:
- 0
- +/- 1.0
- +/- infinity
RCPSS/RCPPS do handle zero and infinity correctly, which is good. Sadly, 1.0 is handled less gracefully, as it comes out as 0.999756 (3F7FF000), at least on Core 2 CPUs. Even worse, if you attempt to refine the result using Newton-Raphson iterations:
x' = x * (2 - x*c)
...the result converges to 0.99999994039535522 (3F7FFFFF), a value just barely off from 1.0 that in many cases will be printed as 1.0, such as in the VC++ debugger. This leads to lots of fun tracking down why calculations are slewing away from unity when they shouldn't, only to discover that the innocuous little 1.0 in the corner is actually an impostor, and then having to slow down an otherwise speedy routine to handle this special case. Argh!
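In intrinsics, the approximation plus one refinement step looks something like this sketch (one Newton-Raphson step roughly doubles the precision, but as described above, an input of exactly 1.0 still comes out as 3F7FFFFF):
#include <xmmintrin.h>
float rcp_refined(float c) {
    __m128 vc = _mm_set_ss(c);
    __m128 x  = _mm_rcp_ss(vc);                                 // ~12-bit approximation
    x = _mm_mul_ss(x, _mm_sub_ss(_mm_set_ss(2.0f),
                                 _mm_mul_ss(x, vc)));           // x' = x * (2 - x*c)
    float r;
    _mm_store_ss(&r, x);
    return r;
}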
If I had to take a guess as to why Intel did this, it's probably to avoid the need to propagate a carry from the mantissa into the exponent: that way, the top 12 bits of the mantissa can go straight through a lookup table while the exponent path independently produces the result exponent and the top bit of the new mantissa. It's still really annoying, though. I have to go fix the rcp instruction in my shader engine, for instance, because the DirectX docs explicitly require that rcp of 1.0 stays 1.0. Curiously, they don't mention -1.0. I guess it just goes to show how hard it is to specify a good approximation.
(Read more....)