How can we write optimized ARM code for C++ program using loop unrolling and code pre-loading?

Question

Welcome To Ask or Share your Answers For Others

How can we write optimized ARM code for C++ program using loop unrolling and code pre-loading?

posted Jan 31, 2022 in Technique[技术] by 深蓝 (71.8m points)

How can we write optimized ARM code for C++ program using loop unrolling and code pre-loading?

Below are the given c++ and ARM code for same program. Can you tell me this ARM code is optimized or not and how many does the loop requires(The size of the array n is large, and is a multiple of 64 elements and exclusive-OR bit-wise operation with the 8-bit mask and produces an output array outArr.)? What should I do to optimize the code using loop unrolling (process 4 elements at a time)?

c++ code:

// Gray scale image pixel inversion
void invert(unsigned char *outArr, unsigned char *inArr, 
 unsigned char k, int n)
{
for(int i=0; i<n; i++)
 *outArr++ = *inArr++ ^ k; // ^ is bitwise xor
}

ARM CODE:

invert:
        cmp     r3, #0
        bxle    lr
        add     ip, r0, r3
.L3:
        ldrb    r3, [r1], #1    @ zero_extendqisi2
        eor     r3, r3, r2
        strb    r3, [r0], #1
        cmp     ip, r0
        bne     .L3
        bx      lr

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2022-01-31T07:23:25+0000

I have no idea what 'code preload' means. There is data preloading with the pld instruction. It would make sense in the context of the sample code.

Here is the basic 'C' version given the assumptions,

The size of the array n is large, and is a multiple of 64 elements and exclusive-OR bit-wise operation with the 8-bit mask and produces an output array outArr.

The code is probably not perfect, but meant to illustrate.

// Gray scale image pixel inversion
void invert(unsigned char *outArr, unsigned char *inArr, 
 unsigned char k, int n)
{
  unsigned int *out = (void*)outArr;
  unsigned int *in  = (void*)inArr;
  unsigned int mask = k<<24|k<<16|k<<8|k;

  /* Check arguments */
  if( n % 64 != 0) return;
  if((int)outArr & 3) return;
  if((int)inArr & 3) return;
  assert(sizeof(int)==4);

  for(int i=0; i<n/sizeof(int); i+=64/(sizeof(int)) {
   /* 16 transfers per loop 64/4 */
   *out++ = *in++ ^ k; // 1
   *out++ = *in++ ^ k;
   *out++ = *in++ ^ k;
   *out++ = *in++ ^ k;
   *out++ = *in++ ^ k; // 5
   *out++ = *in++ ^ k;
   *out++ = *in++ ^ k;
   *out++ = *in++ ^ k;
   *out++ = *in++ ^ k; // 9
   *out++ = *in++ ^ k;
   *out++ = *in++ ^ k;
   *out++ = *in++ ^ k;
   *out++ = *in++ ^ k; // 13
   *out++ = *in++ ^ k;
   *out++ = *in++ ^ k;
   *out++ = *in++ ^ k;
  }
}

You can view the output on godbolt.

The ldm and stm instructions can be used to load consecutive memory addresses to registers. We can not use all 16 ARM registers, so the core of the loop in assembler would look like this,

  ldmia  [r1], {r4,r5,r6,r7,r8,r9,r10,r11}  # r1 is inArr
  eor    r4,r4,r2   # r2 is expanded k
  eor    r5,r5,r2
  eor    r6,r6,r2
  eor    r7,r7,r2
  eor    r8,r8,r2
  eor    r9,r9,r2
  eor    r10,r10,r2
  eor    r11,r11,r2
  stmia  [r0], {r4,r5,r6,r7,r8,r9,r10,r11}  # r0 is outArr

This is repeated twice and the R0 or R1 can be checked against the array limits stored in R3. You need to save all of the callee saved registers if you want to be EABI compliant. The register set r4-r11 can generally be used, but it will depend on the system. You can also use lr, fp, etc if you save them and are not exception safe.

From the comments,

I am trying to find that how many cycles does this subroutine take per array element when it is optimized and when it isn't.

Cycle counts are extremely difficult on modern CPUs. However you have five instructions in the core with a simple loop,

.L3:
        ldrb    r3, [r1], #1    @ zero_extendqisi2
        eor     r3, r3, r2
        strb    r3, [r0], #1
        cmp     ip, r0
        bne     .L3

To do 32 bytes, this is 32 * 5 (160) instructions. With 32 * 2 memory accesses.

The expanded options is just one 32byte memory read and write. These will complete, with the lowest value available first. Then just a single EOR instruction. So it is just 10 instructions versus 160. On modern processors the memory will be the limiting factor. Because of memory stalls, it maybe better to only process four words at a time such as,

  ldmia  [r1], {r4,r5,r6,r7}  # r1 is inArr
  eor    r4,r4,r2   # r2 is expanded k
  eor    r5,r5,r2
  eor    r6,r6,r2
  eor    r7,r7,r2
  ldmia  [r1], {r8,r9,r10,r11}  # r1 is inArr
  stmia  [r0], {r4,r5,r6,r7}    # r0 is outArr
  ...

This (or some permutation) will allow the load/store unit and the 'eor' to not block each other, but this will depend on the particular CPU type. This topic is called instruction scheduling; it is more powerful than pld or data preloading. As well, you can use NEON or ARM64 instructions so that the body of the loop can do more eor operations before a load/store.

Categories

How can we write optimized ARM code for C++ program using loop unrolling and code pre-loading?

How can we write optimized ARM code for C++ program using loop unrolling and code pre-loading?

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags