[Mep-dev] NEON test code explorations
Michelle
w5nyv at yahoo.com
Wed Jul 14 11:57:54 PDT 2010
Hello everyone! Here's a bit of what's going on this week.
We were looking again at dotproduct.c from the neon_test archive. This time, we
were referring to it as a template for creating a version of dotproduct that
takes 16-bit signed integer inputs and returns a 32-bit signed integer output.
This is in order to port a demodulator Phil Karn is working on to ARM/NEON. This
forced us to think more carefully about register allocation, and we ran into a
subtlety we didn't fully understand.
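For reference, the computation being ported can be sketched in plain scalar C. This is only an illustration of the 16-bit-in, 32-bit-out arithmetic, not code from the neon_test archive or from Phil Karn's demodulator; the function name dotprod_s16 is hypothetical, and the sketch assumes the running sum fits in 32 bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference: 16-bit signed inputs, 32-bit signed result.
 * Assumes the running sum fits in 32 bits; no saturation is applied. */
int32_t dotprod_s16(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* Widen before multiplying so the product is formed in 32 bits. */
        sum += (int32_t)a[i] * (int32_t)b[i];
    }
    return sum;
}
```

A scalar version like this is mainly useful as a known-good result to compare a NEON version against.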
In the original dotproduct.c, half of the elements are multiply-accumulated into
q8, and half into q9. Then at the end, q9 is added into q8. Like this:
...
vld1.32 {d0,d1,d2,d3}, [%1]!
vld1.32 {d4,d5,d6,d7}, [%2]!
vmla.f32 q8, q0, q2
vmla.f32 q9, q1, q3
bgt 1b
vadd.f32 q8, q8, q9
...
We thought, why not change this to accumulate directly into q8? This would be
logically equivalent:
...
vld1.32 {d0,d1,d2,d3}, [%1]!
vld1.32 {d4,d5,d6,d7}, [%2]!
vmla.f32 q8, q0, q2
vmla.f32 q8, q1, q3
bgt 1b
# no vadd.f32 required
...
We ran this on the Beagleboard, and it does indeed get the same answer. But it's
about 30% slower.
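The two accumulation orders can be modelled lane by lane in scalar C, which makes it easier to see why they give the same answer. This is a sketch only: the function names are hypothetical, n is assumed to be a multiple of 8, and the final reduction of the four lanes is an assumption, since the listings above elide that part. With floating point the two orders add the products in a different sequence, so in general they could differ in the last bits; for inputs whose products and sums are exactly representable they agree exactly.

```c
#include <stddef.h>

/* Scalar model of the original loop: per 8-element block, elements 0-3
 * accumulate into acc8 (q8) and elements 4-7 into acc9 (q9).
 * n is assumed to be a multiple of 8. */
float dot_two_acc(const float *a, const float *b, size_t n)
{
    float acc8[4] = {0.0f}, acc9[4] = {0.0f};
    for (size_t i = 0; i < n; i += 8) {
        for (int l = 0; l < 4; l++) {
            acc8[l] += a[i + l] * b[i + l];         /* vmla.f32 q8, q0, q2 */
            acc9[l] += a[i + 4 + l] * b[i + 4 + l]; /* vmla.f32 q9, q1, q3 */
        }
    }
    for (int l = 0; l < 4; l++)
        acc8[l] += acc9[l];                         /* vadd.f32 q8, q8, q9 */
    /* Lane reduction: assumed, not shown in the original listing. */
    return (acc8[0] + acc8[1]) + (acc8[2] + acc8[3]);
}

/* Scalar model of the modified loop: both halves accumulate into acc8,
 * so the second update must wait for the first. */
float dot_one_acc(const float *a, const float *b, size_t n)
{
    float acc8[4] = {0.0f};
    for (size_t i = 0; i < n; i += 8) {
        for (int l = 0; l < 4; l++)
            acc8[l] += a[i + l] * b[i + l];         /* vmla.f32 q8, q0, q2 */
        for (int l = 0; l < 4; l++)
            acc8[l] += a[i + 4 + l] * b[i + 4 + l]; /* vmla.f32 q8, q1, q3 */
    }
    /* Lane reduction: assumed, not shown in the original listing. */
    return (acc8[0] + acc8[1]) + (acc8[2] + acc8[3]);
}
```

The model says nothing about speed; it only shows that the two versions compute the same sum with a different grouping of the additions.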
The two vmla instructions in the original code are independent in both inputs
and outputs, whereas the two vmla instructions in our new version share the
destination register q8. It would appear that the execution hardware in the
processor on the Beagleboard is smart enough to do those two operations largely
in parallel if the inputs and outputs are independent. That's very cool.
So, the questions are:
1. Is that the correct explanation of what's happening?
2. How were we supposed to know that such parallel execution was possible? Put
another way, how can we find out whether the same is true of any other two
operations, short of coding up multiple versions and trying them out?
3. There must be a limit on how many operations (such as vmla's) can proceed in
parallel. I think there are enough registers available to do more in the
dotproduct.c code, but the original code stopped at two. So I'm guessing the
author was able to know that two was the maximum for parallel execution. Is this
in the documentation somewhere or did the author have to run experiments?
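One way to reason about questions 2 and 3 is a toy throughput model, sketched below in C. It is a general rule of thumb, not a description of the Beagleboard's processor: the latency and issue-interval values are parameters the caller has to supply (for example from the processor's instruction timing documentation, or from measurement), and both function names are hypothetical. The model ignores the loads, the branch, and everything else in the loop, so it cannot predict a figure like the 30% we measured.

```c
/* Toy model, all inputs are assumptions supplied by the caller:
 *   ops            multiply-accumulates per loop iteration
 *   accumulators   independent destination registers used (1..ops)
 *   latency        cycles before a result can feed the next op on
 *                  the same accumulator
 *   issue_interval cycles between issuing two independent ops
 * Returns a lower bound on cycles per iteration in steady state. */
unsigned cycles_per_iteration(unsigned ops, unsigned accumulators,
                              unsigned latency, unsigned issue_interval)
{
    /* Dependent ops per accumulator per iteration, rounded up. */
    unsigned chain_len = (ops + accumulators - 1) / accumulators;
    unsigned dependency_bound = chain_len * latency;
    unsigned issue_bound = ops * issue_interval;
    return dependency_bound > issue_bound ? dependency_bound : issue_bound;
}

/* Accumulators needed before the issue rate, rather than the dependency
 * chain, becomes the limit in this model: ceil(latency / issue_interval). */
unsigned accumulators_to_hide_latency(unsigned latency,
                                      unsigned issue_interval)
{
    return (latency + issue_interval - 1) / issue_interval;
}
```

With made-up numbers (latency 8, issue interval 2), two ops sharing one accumulator cost at least 16 cycles per iteration in this model, while two ops on two accumulators cost at least 8. Whether two accumulators is really the useful maximum on the actual hardware depends on the real figures, which this sketch does not know.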
Apologies to Philip Balister, who will get a slightly longer version of this
email!
We're making progress learning ARM and NEON code and enjoying ourselves in the
process.
-Michelle W5NYV, Paul KB5MU