Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

architecture - Differentiate data from instructions in ARM

In (32-bit) ARM Linux kernels, how can data embedded in the code section be differentiated from instructions?

A lightweight approach that is easy to implement, such as bit masks, would be preferable. Embedding a disassembler in the kernel does not seem wise.


Fight with dragons for too long and you become a dragon yourself; gaze into the abyss for too long and the abyss gazes back…

1 Reply


In general, what you're asking for is impossible.

Consider this function which happens to use a data value too big to encode as an immediate:

@ void patch_nop(void *code_addr);
patch_nop:
    ldr r1, =0xe1a00000
    str r1, [r0]
    bx lr

which, by the time it's been through an assembler and back, looks like this:

$ arm-none-eabi-objdump -d a.out

a.out:     file format elf32-littlearm


Disassembly of section .text:

00000000 <patch_nop>:
   0:   e59f1004        ldr     r1, [pc, #4]    ; c <patch_nop+0xc>
   4:   e5801000        str     r1, [r0]
   8:   e12fff1e        bx      lr
   c:   e1a00000        .word   0xe1a00000

Thanks to the ELF data, we can still ascertain where the function ends and the literal pool begins, but the work objdump is doing to dig through the sections and symbols is hardly 'lightweight', and who says you have those anyway? What if you have just the code?

$ arm-none-eabi-objcopy -Obinary a.out bin
$ arm-none-eabi-objdump -D -marm -bbinary bin

bin:     file format binary


Disassembly of section .data:

00000000 <.data>:
   0:   e59f1004        ldr     r1, [pc, #4]    ; 0xc
   4:   e5801000        str     r1, [r0]
   8:   e12fff1e        bx      lr
   c:   e1a00000        nop                     ; (mov r0, r0)

There. Embedded in your instruction stream, you have data, which is an instruction. Not even data which accidentally happens to look like an instruction. There is literally nothing you can take from those 32 bits alone to infer that they are not going to be executed (well, not from that location at least).

There are a few heuristics which might help make an educated guess, particularly if any additional prior knowledge can be assumed to narrow it down:

  • Anything which can be encoded as an immediate is almost certainly an instruction, because a compiler/assembler wouldn't have emitted it as a literal in the first place. However, you'd ideally want to know at least whether the preceding code is ARM or Thumb in order to know what the appropriate immediate range is*.

  • Anything which is an undefined instruction is usually going to be data, unless it happens to be code which intentionally raises an undef exception. However, you essentially need most of a disassembler to check that something doesn't match any defined encoding, and that is on top of the ARM/Thumb problem.

  • Anything immediately following an unconditional branch might be literal data, particularly if you have symbols and can tell it's very close to the start of the following function, or if you have some knowledge of the data you're looking for and it looks like data. The latter point is certainly relevant if you're just eyeballing disassembly - in practice literal data tends to be stuff like addresses, which generally stand out like a sore thumb† once you look at the code as a whole.

  • The most reliable way to check if something is a literal is to look through the preceding code (up to 1025 instructions away) checking for a PC-relative load targeting that address. You'd only need to check against literal load encodings (there's your simple bitmasking operation), then decode the relative offset if you find one. Ideally you'd want to solve the ARM/Thumb thing to avoid false positives from checking against inappropriate encodings, and in the most absolutely pathological case you could still run into some data in a preceding literal pool which happens to look like a literal load targeting your address; never say never.
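
As a rough illustration of the first and last heuristics above, here is a minimal Python sketch (Python only for readability; the masks carry over directly to C). It assumes ARM state (not Thumb), words already read into integers, and only the plain word-sized `ldr Rt, [pc, #+/-imm12]` encoding. The names `is_arm_immediate`, `unlikely_to_be_literal`, `ldr_literal_target` and `find_literal_targets` are made up for this example, and it ignores MOVW/MOVT, which newer cores can use instead of a literal.

```python
MASK32 = 0xFFFFFFFF

# LDR (literal), ARM state: cond != 1111, bits 27..24 = 0101 (P=1),
# B=0, W=0, L=1, Rn = 1111 (pc). Bit 23 (U, add/subtract) is not in the mask.
LDR_LITERAL_MASK = 0x0F7F0000
LDR_LITERAL_BITS = 0x051F0000


def is_arm_immediate(value):
    """True if value is an 8-bit constant rotated right by an even amount."""
    value &= MASK32
    for rot in range(0, 32, 2):
        # Undo a rotate-right by `rot` with a rotate-left by `rot`.
        rotated = ((value << rot) | (value >> (32 - rot))) & MASK32
        if rotated <= 0xFF:
            return True
    return False


def unlikely_to_be_literal(value):
    """Heuristic: a value MOV or MVN can encode would not need a pool entry."""
    return is_arm_immediate(value) or is_arm_immediate(~value)


def ldr_literal_target(word, address):
    """Address read by a word-sized pc-relative LDR at `address`, else None."""
    if (word >> 28) == 0xF:           # unconditional space, not an LDR
        return None
    if (word & LDR_LITERAL_MASK) != LDR_LITERAL_BITS:
        return None
    offset = word & 0xFFF
    if not word & (1 << 23):          # U bit clear: subtract the offset
        offset = -offset
    return address + 8 + offset       # pc reads as instruction address + 8


def find_literal_targets(words, base=0):
    """Scan consecutive 32-bit words and collect every address they load from."""
    targets = set()
    for index, word in enumerate(words):
        target = ldr_literal_target(word, base + 4 * index)
        if target is not None:
            targets.add(target)
    return targets


# The patch_nop example from the objdump listing above.
words = [0xE59F1004, 0xE5801000, 0xE12FFF1E, 0xE1A00000]
print([hex(t) for t in sorted(find_literal_targets(words))])  # ['0xc']
print(unlikely_to_be_literal(words[3]))                       # False
```

Note that the scan treats every word as if it were an instruction, so a preceding pool entry that happens to match the mask would still produce a false positive, exactly the pathological case described in the last bullet.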

And of course, that's still all assuming literal pools automatically emitted by a compiler/assembler; when it comes to entirely handwritten assembly code, all bets are off:

patch_nop2:
    ldr r1, [pc, #-4]
    mov r0, r0
    str r1, [r0]
    bx lr

Is it code? Yes. Is it data? Yes.
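
To see why both answers are "yes", it helps to work through the address arithmetic. The sketch below is Python and assumes ARM state with `patch_nop2` placed at address 0. The encoding 0xe51f1004 for the `ldr` is hand-assembled for this illustration (an assumption, not toolchain output); the other three words are the ones shown in the objdump listing earlier. The helper name `pc_relative_load_address` is likewise made up.

```python
# patch_nop2, assumed to be loaded at address 0 (ARM state, 4 bytes per word).
PATCH_NOP2 = [
    0xE51F1004,  # ldr r1, [pc, #-4]   (hand-assembled, an assumption)
    0xE1A00000,  # mov r0, r0
    0xE5801000,  # str r1, [r0]
    0xE12FFF1E,  # bx  lr
]


def pc_relative_load_address(word, address):
    """Address read by `ldr Rt, [pc, #+/-imm12]` located at `address`."""
    offset = word & 0xFFF
    if not word & (1 << 23):      # U bit clear: the offset is subtracted
        offset = -offset
    return address + 8 + offset   # pc reads as instruction address + 8


target = pc_relative_load_address(PATCH_NOP2[0], 0)
loaded = PATCH_NOP2[target // 4]
print(hex(target), hex(loaded))   # 0x4 0xe1a00000
```

The load reads the word at address 4, which is the very `mov r0, r0` that executes next, so a scan for literal loads would flag that word as data while a walk along the control flow would flag it as code.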

* Incidentally, discerning between ARM and Thumb code boils down to essentially the same problem as this one - "what does this bit pattern mean?" - and is equally non-trivial without external help.

† No pun intended

