[Mep-dev] NEON test code explorations

Michelle w5nyv at yahoo.com
Wed Jul 14 11:57:54 PDT 2010


Hello everyone! Here's a bit of what's going on this week. 

We were looking again at dotproduct.c from the neon_test archive. This time, we 
were referring to it as a template for creating a version of dotproduct that 
takes 16-bit signed integer inputs and returns a 32-bit signed integer output. 
This is in order to port a demodulator Phil Karn is working on to ARM/NEON. This 
forced us to think more carefully about register allocation, and we ran into a 
subtlety we didn't fully understand.
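
For reference, the arithmetic we are aiming for can be written in plain C. This is only a sketch of the intended result, not the NEON code and not anything taken from Phil's demodulator. The name dotprod_s16 is made up for illustration, and the sketch assumes the running sum fits in 32 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference: 16-bit signed inputs, 32-bit signed result.
   Each product is widened to 32 bits before it is accumulated.
   Assumption: the running sum never overflows int32_t. */
int32_t dotprod_s16(const int16_t *a, const int16_t *b, size_t n)
{
    assert(a != NULL && b != NULL);

    int32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += (int32_t)a[i] * (int32_t)b[i];
    }
    return sum;
}
```

A scalar version like this is mainly useful as a known-good answer to compare a NEON version against.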

In the original dotproduct.c, half of the elements are multiply-accumulated into 
q8, and half into q9. Then at the end, q9 is added into q8. Like this:


    ...
    vld1.32 {d0,d1,d2,d3}, [%1]!
    vld1.32 {d4,d5,d6,d7}, [%2]!
    vmla.f32 q8, q0, q2
    vmla.f32 q9, q1, q3
    bgt 1b
    vadd.f32 q8, q8, q9
    ...

We thought, why not change this to accumulate directly into q8? This would be 
logically equivalent:

    ...
    vld1.32 {d0,d1,d2,d3}, [%1]!
    vld1.32 {d4,d5,d6,d7}, [%2]!
    vmla.f32 q8, q0, q2
    vmla.f32 q8, q1, q3
    bgt 1b
    # no vadd.f32 required
    ...

We ran this on the Beagleboard, and it does indeed get the same answer. But it's 
about 30% slower.

The two vmla instructions in the original code are independent in both inputs 
and outputs, whereas the two vmla instructions in our new version share the 
destination register q8. It would appear that the execution hardware in the 
processor on the Beagleboard is smart enough to do those two operations largely 
in parallel if the inputs and outputs are independent. That's very cool.
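
The difference between the two loops can be sketched in plain C. This only illustrates the dependency pattern we think is involved; it is not a model of the NEON pipeline and it does not answer the questions below. The function names are made up, and the second function assumes n is a multiple of 2.

```c
#include <assert.h>
#include <stddef.h>

/* One accumulator: each multiply-accumulate needs the result of the
   previous one, so the operations form a single dependent chain. */
float dot_one_acc(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        acc += a[i] * b[i];
    }
    return acc;
}

/* Two accumulators: the acc0 chain and the acc1 chain never read each
   other's results inside the loop, so they are independent.
   Assumption: n is a multiple of 2. */
float dot_two_acc(const float *a, const float *b, size_t n)
{
    assert(n % 2 == 0);

    float acc0 = 0.0f;
    float acc1 = 0.0f;
    for (size_t i = 0; i < n; i += 2) {
        acc0 += a[i] * b[i];
        acc1 += a[i + 1] * b[i + 1];
    }
    return acc0 + acc1; /* plays the role of the final vadd.f32 */
}
```

In this sketch the two functions add the products in a different order, so with general floating-point data they can differ in the last bits even though they are mathematically the same sum.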

So, the questions are:

1. Is that the correct explanation of what's happening?

2. How were we supposed to know that such parallel execution was possible? Put 
another way, how can we find out whether the same is true of any other two 
operations, short of coding up multiple versions and trying them out?

3. There must be a limit on how many operations (such as vmla's) can proceed in 
parallel. I think there are enough registers available to do more in the 
dotproduct.c code, but the original code stopped at two. So I'm guessing the 
author was able to know that two was the maximum for parallel execution. Is this 
in the documentation somewhere or did the author have to run experiments?

Apologies to Philip Balister, who will get a slightly longer version of this 
email! 

We're making progress learning ARM and NEON code and enjoying ourselves in the 
process. 

-Michelle W5NYV, Paul KB5MU