They called _what_ in the inner loop??
AMD just open sourced the AMD Performance Library as Framewave, which at least from my perspective seems like a good thing. Not that I'm going to attempt to use it, but I perused the source out of curiosity, and it looks like there are some useful goodies in there.
And then there's some... marginal stuff.
One thing that I wanted to look at was their 8x8 2D-IDCT source. The 8x8 2D inverse discrete cosine transform (IDCT) is popular and used in a number of video compression formats. There are a million ways to implement it quickly, and although everyone's seen Intel's AP-922 SSE2 algorithm for it by now, I hadn't seen one by AMD before. So I grab the source and dig around in the JPEG module, and I see this:
int IdctQuant_LS_SSE2(const Fw16s *pSrc, Fw8u *pDst, int dstStp, const Fw16u *pQuantInvTable) {
... pedx = (Fw16s *) fwMalloc(128); //64 array of Fw16s type
Who the #*@&*( calls malloc() in an optimized IDCT routine???
It looks like there are indeed a number of well-optimized SSE2 routines in the Framewave library, but after seeing things like the above a few times I was left scratching my head a bit....
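For a fixed 128-byte scratch block like the one in that snippet, the usual alternative is to put it on the stack. Here is a minimal sketch, assuming the buffer is only needed for the duration of the call. The function name `dequantize_and_sum` and its body are made up for illustration and are not Framewave code.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical routine, not from Framewave: dequantizes an 8x8 block into a
// stack-allocated scratch buffer and returns the sum of the results.
int dequantize_and_sum(const int16_t* src, const uint16_t* quant) {
    // 64 x int16_t = 128 bytes, the same size as the fwMalloc(128) call.
    // alignas(16) keeps the buffer usable with aligned SSE2 loads and stores.
    alignas(16) int16_t scratch[64];

    for (std::size_t i = 0; i < 64; ++i) {
        scratch[i] = static_cast<int16_t>(src[i] * quant[i]);
    }

    int sum = 0;
    for (std::size_t i = 0; i < 64; ++i) {
        sum += scratch[i];
    }
    return sum;
}
```

No allocator call, no failure path to check, and nothing to free on the way out.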
Another ugliness I saw, which unfortunately isn't restricted to Framewave, is assembly language routines that have been translated to intrinsics. The result is a nasty C++ routine that keeps register-style variable names like "pedx" and "pesi," but has its instruction names translated, so that what used to be an understandable "paddw" is now "_mm_add_epi16." I know this was a hack job for portability, but the result sure is unreadable.
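Intrinsics don't have to read that way. As a rough sketch, assuming an x86 target with SSE2, here is a tiny routine written with intrinsics from the start, where the variables are named for what they hold and a comment notes the instruction each call maps to. The function `add_row_bias` and its parameters are invented for this example.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Hypothetical helper: adds eight 16-bit bias values to eight 16-bit
// coefficients, wrapping on overflow just like paddw does.
void add_row_bias(const int16_t* row, const int16_t* bias, int16_t* out) {
    // Unaligned loads, so callers don't need 16-byte aligned arrays.
    const __m128i coeffs  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(row));
    const __m128i offsets = _mm_loadu_si128(reinterpret_cast<const __m128i*>(bias));

    const __m128i biased = _mm_add_epi16(coeffs, offsets);  // paddw

    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), biased);
}
```

The intrinsic name is still a mouthful, but with "coeffs" and "biased" instead of "pedx" and "pesi" you can at least tell what the code is trying to do.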