Introduction to MMX
emms | Empties the MMX state so that the MMX registers can be used by floating point operations.
movd | Moves a double word (32 bits) between an MMX register and a general-purpose register or memory.
movq | Moves a quad word (64 bits) between MMX registers or to/from memory.
packsswb, packssdw, packuswb | Packs 128 bits of words or double words into 64 bits by halving the size of each unit, with signed or unsigned saturation.
paddb, paddw, paddd | Adds two groups of bytes, words, or double words, with wraparound.
paddq | Adds two quad words (this instruction arrived later, with SSE2).
paddsb, paddsw, paddusb, paddusw | Adds bytes or words with signed or unsigned saturation.
pand, pandn | Bitwise AND and AND NOT.
pcmpeqb, pcmpeqw, pcmpeqd | Compares each unit in the groups for equality.
pcmpgtb, pcmpgtw, pcmpgtd | Compares each signed unit in the groups for greater than.
pmaddwd | Multiplies signed words and adds adjacent pairs of products into double words.
pmulhw, pmullw | Multiplies signed words and returns the high or low word of each product.
por | Bitwise OR.
psllw, pslld, psllq | Logical left shift per unit.
psraw, psrad | Arithmetic right shift per unit.
psrlw, psrld, psrlq | Logical right shift per unit.
psubb, psubw, psubd | Subtracts bytes, words, or double words, with wraparound.
psubsb, psubsw, psubusb, psubusw | Subtracts bytes or words with signed or unsigned saturation.
punpckhbw, punpckhwd, punpckhdq, punpcklbw, punpcklwd, punpckldq | Doubles the unit size by interleaving units from the high or low halves of two sources.
pxor | Bitwise XOR.
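The difference between the plain additions (such as paddb) and the saturating ones (such as paddusb and paddsb) is easiest to see on a single byte. The following is only a scalar C++ sketch of what happens in one byte lane, not MMX code; the function names are made up for this illustration.

```cpp
#include <cstdint>

// One byte lane of paddb: the sum wraps around modulo 256.
constexpr std::uint8_t add_wrap(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::uint8_t>((a + b) & 0xFF);
}

// One byte lane of paddusb: the sum is clamped to 255.
constexpr std::uint8_t add_sat_unsigned(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::uint8_t>(a + b > 255 ? 255 : a + b);
}

// One byte lane of paddsb: the sum is clamped to [-128, 127].
constexpr std::int8_t add_sat_signed(std::int8_t a, std::int8_t b) {
    return static_cast<std::int8_t>(
        a + b > 127 ? 127 : (a + b < -128 ? -128 : a + b));
}

// Hypothetical helper names; the checks run at compile time.
static_assert(add_wrap(200, 100) == 44, "300 wraps around to 44");
static_assert(add_sat_unsigned(200, 100) == 255, "300 saturates to 255");
static_assert(add_sat_signed(100, 100) == 127, "200 saturates to 127");
static_assert(add_sat_signed(-100, -100) == -128, "-200 saturates to -128");
```

The MMX instructions do the same thing to every byte or word in the register at once, which is why saturation is so useful for pixel data: an overflowing channel stays at full brightness instead of wrapping to a dark value.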
With the Pentium III, the MMX instruction set was extended with the following instructions.
maskmovq | Writes bytes to memory from a register, using a mask to select which bytes to write.
movntq | Moves a quad word to memory, bypassing the cache.
pavgb, pavgw | Computes the average of unsigned bytes or words.
pextrw | Extracts a specified word from a group.
pinsrw | Inserts a word at a specified location in a group.
pmaxsw | Compares signed words and stores the largest.
pmaxub | Compares unsigned bytes and stores the largest.
pminsw | Compares signed words and stores the smallest.
pminub | Compares unsigned bytes and stores the smallest.
pmovmskb | Creates an 8-bit integer from the most significant bit in each byte.
pmulhuw | Multiplies unsigned words and returns the high word.
psadbw | Computes the absolute pairwise differences of 8 unsigned bytes and sums them into 1 word.
pshufw | Shuffles words.
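A few of these are easier to picture with a scalar model. The sketch below mimics pavgb, psadbw, and pmovmskb in plain C++; it is not MMX code. The function names are invented for this example, and element 0 of each array is assumed to be the lowest byte of the register.

```cpp
#include <cstdint>
#include <cstdlib>

// One byte lane of pavgb: unsigned average, rounded up.
constexpr std::uint8_t avg_u8(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::uint8_t>((a + b + 1) >> 1);
}

// psadbw on two groups of 8 unsigned bytes: sum of absolute differences.
std::uint16_t sad_u8x8(const std::uint8_t a[8], const std::uint8_t b[8]) {
    unsigned sum = 0;
    for (int i = 0; i < 8; ++i) {
        sum += static_cast<unsigned>(
            std::abs(static_cast<int>(a[i]) - static_cast<int>(b[i])));
    }
    return static_cast<std::uint16_t>(sum);  // at most 8 * 255 = 2040
}

// pmovmskb: collect the most significant bit of each of the 8 bytes.
// Assumption: v[0] is the lowest byte and maps to bit 0 of the result.
std::uint8_t movemask_u8x8(const std::uint8_t v[8]) {
    unsigned mask = 0;
    for (int i = 0; i < 8; ++i) {
        mask |= static_cast<unsigned>(v[i] >> 7) << i;
    }
    return static_cast<std::uint8_t>(mask);
}
```

For example, avg_u8(1, 2) gives 2 because the average is rounded up, and the sum of absolute differences can never overflow the result word, since 8 differences of at most 255 add up to at most 2040.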
All of the instructions take 1 clock cycle to execute, except for the multiplications, which take 3. Most of the instructions can also be paired for execution in parallel. For a more detailed description of the instructions, read the NASM manual or the Intel Reference Manual, which is available from intel.com.
To give you a better understanding of what can be done with MMX, I've written a small function that blends two 32-bit ARGB pixels using four 8-bit factors, one for each channel. To do this in C++ you would have to do the blending channel by channel, but with MMX we can blend all channels at once.
The blending factor is a one-byte value between 0 and 255, as are the channel components. Each channel is blended using the following formula.
res = (a*fa + b*(255-fa))/255
Writing this in C++ is as easy as it looks, and even writing it in assembler is quite straightforward. With MMX, however, we have a problem: there is no packed division operation available. We can divide by shifting the bits to the right by 8, but that divides by 256 rather than 255. This small difference might not matter much if we are doing only one blending pass, but as the number of passes increases, the artifacts increase as well. The solution is to increase the range of the factor to between 0 and 256, which is simple to do by adding 1 if the factor is above 127.
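To make the problem with the shift concrete, here is a scalar C++ sketch of one channel blended three ways: with the exact formula, with a plain shift by 8, and with the factor first widened to the 0 to 256 range. The function names are made up for this illustration.

```cpp
#include <cstdint>

// Exact blend from the formula: res = (a*fa + b*(255-fa))/255.
constexpr std::uint8_t blend_exact(std::uint8_t a, std::uint8_t b, std::uint8_t fa) {
    return static_cast<std::uint8_t>((a * fa + b * (255 - fa)) / 255);
}

// Plain shift: divides by 256 while the two factors still sum to 255.
constexpr std::uint8_t blend_shift_naive(std::uint8_t a, std::uint8_t b, std::uint8_t fa) {
    return static_cast<std::uint8_t>((a * fa + b * (255 - fa)) >> 8);
}

// Adjusted shift: fa + (fa >> 7) adds 1 when fa is above 127,
// so the two factors sum to 256 and the shift divides correctly.
constexpr std::uint8_t blend_shift_adjusted(std::uint8_t a, std::uint8_t b, std::uint8_t fa) {
    return static_cast<std::uint8_t>(
        (a * (fa + (fa >> 7)) + b * (256 - (fa + (fa >> 7)))) >> 8);
}

// Blending white with white should stay white.
static_assert(blend_exact(255, 255, 128) == 255, "exact result");
static_assert(blend_shift_naive(255, 255, 128) == 254, "plain shift loses one step");
static_assert(blend_shift_adjusted(255, 255, 128) == 255, "adjusted factor keeps white");
```

With the plain shift, every pass darkens a white pixel by one step, which is where the accumulating artifacts come from. The adjusted version is still an approximation of the exact formula for intermediate factors, but it returns a unchanged at factor 255 and b unchanged at factor 0.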
Without further comments here is the assembler function for blending two ARGB pixels.
; DWORD LerpARGB(DWORD a, DWORD b, DWORD f);
global _LerpARGB
_LerpARGB:
; load the pixels and expand to 4 words
movd mm1, [esp+4] ; mm1 = 0 0 0 0 aA aR aG aB
movd mm2, [esp+8] ; mm2 = 0 0 0 0 bA bR bG bB
pxor mm5, mm5 ; mm5 = 0 0 0 0 0 0 0 0
punpcklbw mm1, mm5 ; mm1 = 0 aA 0 aR 0 aG 0 aB
punpcklbw mm2, mm5 ; mm2 = 0 bA 0 bR 0 bG 0 bB
; load the factor and increase range to [0-256]
movd mm3, [esp+12] ; mm3 = 0 0 0 0 faA faR faG faB
punpcklbw mm3, mm5 ; mm3 = 0 faA 0 faR 0 faG 0 faB
movq mm6, mm3 ; mm6 = faA faR faG faB [0 - 255]
psrlw mm6, 7 ; mm6 = faA faR faG faB [0 - 1]
paddw mm3, mm6 ; mm3 = faA faR faG faB [0 - 256]
; fb = 256 - fa
pcmpeqw mm4, mm4 ; mm4 = 0xFFFF 0xFFFF 0xFFFF 0xFFFF
psrlw mm4, 15 ; mm4 = 1 1 1 1
psllw mm4, 8 ; mm4 = 256 256 256 256
psubw mm4, mm3 ; mm4 = fbA fbR fbG fbB
; res = (a*fa + b*fb)/256
pmullw mm1, mm3 ; mm1 = aA*faA aR*faR aG*faG aB*faB
pmullw mm2, mm4 ; mm2 = bA*fbA bR*fbR bG*fbG bB*fbB
paddw mm1, mm2 ; mm1 = rA rR rG rB
psrlw mm1, 8 ; mm1 = 0 rA 0 rR 0 rG 0 rB
; pack into eax
packuswb mm1, mm1 ; mm1 = 0 0 0 0 rA rR rG rB
movd eax, mm1 ; eax = rA rR rG rB
emms ; empty the MMX state so the caller can use the FPU again
ret
You should note that I've written this function for clarity and not for speed. I have not tried to optimize it for speed by pairing instructions that can be executed in parallel as described by the Intel Optimization Manual, available from intel.com.
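If you want to check the output of the assembler routine, a scalar reference is handy. The following C++ sketch does the same arithmetic one channel at a time; the name LerpARGB_ref is my own, and std::uint32_t stands in for DWORD.

```cpp
#include <cstdint>

// Scalar reference for LerpARGB (hypothetical name): the same arithmetic
// as the MMX routine, applied to one 8-bit channel at a time.
std::uint32_t LerpARGB_ref(std::uint32_t a, std::uint32_t b, std::uint32_t f) {
    std::uint32_t result = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        std::uint32_t ca = (a >> shift) & 0xFF;  // channel of pixel a
        std::uint32_t cb = (b >> shift) & 0xFF;  // channel of pixel b
        std::uint32_t fa = (f >> shift) & 0xFF;  // factor for this channel
        fa += fa >> 7;                           // widen range to [0, 256]
        std::uint32_t fb = 256 - fa;             // fb = 256 - fa
        std::uint32_t r = (ca * fa + cb * fb) >> 8;  // always fits in 8 bits
        result |= r << shift;
    }
    return result;
}
```

A factor of 0xFFFFFFFF returns pixel a unchanged and a factor of 0 returns pixel b, so those two cases make a quick sanity check for both versions. The loop also shows what the MMX version saves: four separate passes through the same arithmetic.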
With this tutorial I have hopefully been able to inspire some interest in low-level optimizations using assembler and MMX. Now, go ahead and play around with the MMX instructions and see what you can do. Once you feel comfortable using MMX, you shouldn't forget that most of the time it just isn't worth it. But for those few times when it is worth it, it is going to show that you know your business.
Questions, comments, and suggestions are as usual more than welcome. After all, I'm writing this in the hope that I will learn something from you too.
Thanks to Axel Gneiting and Graham Reeds for telling me about some errors in the article. I also thank them for giving me some extra information on AMD and Cyrix processors, even though I ended up not including it in the article.
PC Assembly Tutorial, by Paul Carter
Intel Architecture Software Developer's Manual Volume 1: Basic Architecture, by Intel
Intel Architecture Software Developer's Manual Volume 2: Instruction Set Reference, by Intel
Intel Architecture Optimization Reference Manual, by Intel
AMD Processor Recognition Application Note, by AMD
AMD Athlon Processor x86 Code Optimization Guide, by AMD