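Environment
Hardware: Hygon (x86 architecture). Operating system: Uos (4.19.0-amd64-desktop).
Problem
During a glmark2-es2 long-run stability test, the process printed a segmentation fault message and exited. User space reported nothing else useful, but the kernel log contained the following: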
[180478.017641] glmark2-es2[3473]: segfault at 1f2e8 ip 00007f24741bfca0 sp 00007ffdbd47aa98 error 4 in libGLESv2_XXgpu.so.1.1.213621[7f247414e000+180000]
[180478.017646] Code: 00 44 8b 83 00 30 00 00 4c 8d 0d 9e 14 13 00 89 e9 ba 41 00 00 00 be 19 00 00 00 48 8b 38 31 c0 e8 e5 c1 0b 00 e9 bc fd ff ff <0f> b6 06 80 3f 00 89 c2 74 25 84 d2 b8 ff ff ff ff 74 1c 8b 4f 08
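Kernel log analysis
On x86 this message is printed by the functions below. The code shown is from kernel 5.4.191; apart from a few minor differences, its output is essentially the same as what the 4.19 kernel printed: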
linux-5.4.191\arch\x86\mm\fault.c
static inline void
show_signal_msg(struct pt_regs *regs, unsigned long error_code,
                unsigned long address, struct task_struct *tsk)
{
    const char *loglvl = task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG;

    if (!unhandled_signal(tsk, SIGSEGV))
        return;

    if (!printk_ratelimit())
        return;

    printk("%s%s[%d]: segfault at %lx ip %px sp %px error %lx",
           loglvl, tsk->comm, task_pid_nr(tsk), address,
           (void *)regs->ip, (void *)regs->sp, error_code);

    print_vma_addr(KERN_CONT " in ", regs->ip);

    printk(KERN_CONT "\n");

    show_opcodes(regs, loglvl); /* prints the "Code:" line */
}

linux-5.4.191\mm\memory.c
void print_vma_addr(char *prefix, unsigned long ip)
{
    struct mm_struct *mm = current->mm;
    struct vm_area_struct *vma;

    /*
     * we might be running from an atomic context so we cannot sleep
     */
    if (!down_read_trylock(&mm->mmap_sem))
        return;

    vma = find_vma(mm, ip);
    if (vma && vma->vm_file) {
        struct file *f = vma->vm_file;
        char *buf = (char *)__get_free_page(GFP_NOWAIT);
        if (buf) {
            char *p;

            p = file_path(f, buf, PAGE_SIZE);
            if (IS_ERR(p))
                p = "?";
            printk("%s%s[%lx+%lx]", prefix, kbasename(p),
                   vma->vm_start,
                   vma->vm_end - vma->vm_start);
            free_page((unsigned long)buf);
        }
    }
    up_read(&mm->mmap_sem);
}

linux-5.4.191\arch\x86\kernel\dumpstack.c
void show_opcodes(struct pt_regs *regs, const char *loglvl)
{
#define PROLOGUE_SIZE 42
#define EPILOGUE_SIZE 21
#define OPCODE_BUFSIZE (PROLOGUE_SIZE + 1 + EPILOGUE_SIZE)
    u8 opcodes[OPCODE_BUFSIZE];
    unsigned long prologue = regs->ip - PROLOGUE_SIZE;
    bool bad_ip;

    /*
     * Make sure userspace isn't trying to trick us into dumping kernel
     * memory by pointing the userspace instruction pointer at it.
     */
    bad_ip = user_mode(regs) &&
        __chk_range_not_ok(prologue, OPCODE_BUFSIZE, TASK_SIZE_MAX);

    if (bad_ip || probe_kernel_read(opcodes, (u8 *)prologue,
                                    OPCODE_BUFSIZE)) {
        printk("%sCode: Bad RIP value.\n", loglvl);
    } else {
        printk("%sCode: %" __stringify(PROLOGUE_SIZE) "ph <%02x> %"
               __stringify(EPILOGUE_SIZE) "ph\n", loglvl, opcodes,
               opcodes[PROLOGUE_SIZE], opcodes + PROLOGUE_SIZE + 1);
    }
}
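The function show_opcodes shows that, on a segmentation fault, the kernel dumps the 42 bytes (PROLOGUE_SIZE) immediately before the faulting instruction pointer, followed by 22 bytes (1 + EPILOGUE_SIZE) starting at the instruction pointer itself, 64 bytes in total. The byte at the instruction pointer, which is the first byte of the faulting instruction, is wrapped in <>.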
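The layout of the Code: line can be checked against the log from this report. The sketch below is only an illustration (the helper name split_code_line is made up here); it splits the dump at the <> marker and compares the lengths with the two constants from show_opcodes.

```python
PROLOGUE_SIZE = 42  # bytes dumped before regs->ip (from show_opcodes)
EPILOGUE_SIZE = 21  # bytes dumped after the byte at regs->ip

# The "Code:" line from the kernel log in this report.
CODE_LINE = (
    "00 44 8b 83 00 30 00 00 4c 8d 0d 9e 14 13 00 89 e9 ba 41 00 00 00 "
    "be 19 00 00 00 48 8b 38 31 c0 e8 e5 c1 0b 00 e9 bc fd ff ff "
    "<0f> b6 06 80 3f 00 89 c2 74 25 84 d2 b8 ff ff ff ff 74 1c 8b 4f 08"
)


def split_code_line(code):
    """Split a 'Code:' dump into (prologue, byte_at_ip, epilogue) as bytes."""
    tokens = code.split()
    marked = [i for i, tok in enumerate(tokens) if tok.startswith("<")]
    if len(marked) != 1:
        raise ValueError("expected exactly one <..> marker")
    idx = marked[0]

    def to_bytes(toks):
        return bytes(int(tok.strip("<>"), 16) for tok in toks)

    return to_bytes(tokens[:idx]), to_bytes(tokens[idx:idx + 1]), to_bytes(tokens[idx + 1:])


prologue, at_ip, epilogue = split_code_line(CODE_LINE)
print(len(prologue), at_ip.hex(), len(epilogue))
```

The three pieces come out as 42 bytes, the single byte 0f, and 21 bytes, which is what the constants predict.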
According to "How do you read a segfault kernel log message", quoted below, the fields of the message have the following meaning:
How do you read a segfault kernel log message
This can be a very simple question, I'm am attempting to debug an application which generates the following segfault error in the kern.log
kernel: myapp[15514]: segfault at 794ef0 ip 080513b sp 794ef0 error 6 in myapp[8048000+24000]
When the report points to a program, not a shared library
Run addr2line -e myapp 080513b (and repeat for the other instruction pointer values given) to see where the error is happening. Better, get a debug-instrumented build, and reproduce the problem under a debugger such as gdb.
If it's a shared library
In the libfoo.so[NNNNNN+YYYY] part, the NNNNNN is where the library was loaded. (Note: this is inaccurate. NNNNNN is the start virtual address of the VMA that maps the segment containing the faulting instruction, and YYYY is the size of that VMA's virtual address range; see print_vma_addr in the code above.) Subtract this from the instruction pointer (ip) and you'll get the offset into the .so of the offending instruction. Then you can use objdump -DCgl libfoo.so and search for the instruction at that offset. You should easily be able to figure out which function it is from the asm labels. If the .so doesn't have optimizations you can also try using addr2line -e libfoo.so <offset>.
What the error means
Here's the breakdown of the fields:
address 794ef0 - the location in memory the code is trying to access (it's likely that 10 and 11 are offsets from a pointer we expect to be set to a valid value but which is instead pointing to 0). In other words, the fault occurred while accessing the data at this address.
ip 080513b - instruction pointer, ie. where the code which is trying to do this lives
sp 794ef0 - stack pointer
error - Architecture-specific flags; see arch/*/mm/fault.c for your platform.
libfoo.so - the library that contains the faulting instruction
NNNNNN - the start virtual address of the VMA that maps the segment containing the faulting instruction; see print_vma_addr
YYYY - the size of that VMA's virtual address range; see print_vma_addr
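As an illustration of the field breakdown above, the following sketch pulls the fields out of the kernel line from this report with a regular expression. The pattern and the function name parse_segfault_line are assumptions written for this example; the pattern only follows the printk format strings in show_signal_msg and print_vma_addr shown earlier.

```python
import re

# Pattern derived from the printk formats in show_signal_msg / print_vma_addr.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<address>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<error>[0-9a-f]+)"
    r"(?: in (?P<object>.+)\[(?P<vma_start>[0-9a-f]+)\+(?P<vma_size>[0-9a-f]+)\])?"
)

HEX_FIELDS = ("address", "ip", "sp", "error", "vma_start", "vma_size")


def parse_segfault_line(line):
    """Return the fields of a kernel segfault line as a dict, or None."""
    match = SEGFAULT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["pid"] = int(fields["pid"])
    for name in HEX_FIELDS:
        if fields[name] is not None:
            fields[name] = int(fields[name], 16)  # the kernel prints these in hex
    return fields


LOG_LINE = (
    "[180478.017641] glmark2-es2[3473]: segfault at 1f2e8 ip 00007f24741bfca0 "
    "sp 00007ffdbd47aa98 error 4 in "
    "libGLESv2_XXgpu.so.1.1.213621[7f247414e000+180000]"
)

info = parse_segfault_line(LOG_LINE)
for key, value in info.items():
    print(key, hex(value) if key in HEX_FIELDS else value)
```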
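Applying this to our log: the instruction at 0x00007f24741bfca0 accessed the data at address 0x1f2e8 and triggered the segmentation fault. The VMA that maps the segment containing the faulting instruction starts at virtual address 0x7f247414e000 and spans 0x180000 bytes, and the error code is 4. The kernel defines the x86 error code bits as shown below. A value of 4 means that a user-mode read hit an address with no page present (that is, the page tables have no mapping for the address), and that the fault was not caused by an instruction fetch: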
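enum x86_pf_error_code {
    X86_PF_PROT  = 1 << 0, /* 0: no page found, 1: protection fault */
    X86_PF_WRITE = 1 << 1, /* 0: read access, 1: write access */
    X86_PF_USER  = 1 << 2, /* 0: kernel-mode access, 1: user-mode access */
    X86_PF_RSVD  = 1 << 3, /* 1: use of reserved bit detected */
    X86_PF_INSTR = 1 << 4, /* 1: fault was an instruction fetch */
    X86_PF_PK    = 1 << 5, /* 1: protection keys block access */
};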
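The bit test can be written out as a small decoder. This is a minimal sketch: the constants mirror the enum above, while the function name decode_pf_error and the wording of the descriptions are chosen for this example only.

```python
# Bit values copied from enum x86_pf_error_code.
X86_PF_PROT = 1 << 0
X86_PF_WRITE = 1 << 1
X86_PF_USER = 1 << 2
X86_PF_RSVD = 1 << 3
X86_PF_INSTR = 1 << 4
X86_PF_PK = 1 << 5


def decode_pf_error(code):
    """Describe an x86 page-fault error code as a list of short phrases."""
    parts = [
        "protection fault" if code & X86_PF_PROT else "page not present",
        "write access" if code & X86_PF_WRITE else "read access",
        "user mode" if code & X86_PF_USER else "kernel mode",
    ]
    if code & X86_PF_RSVD:
        parts.append("reserved bit set")
    if code & X86_PF_INSTR:
        parts.append("instruction fetch")
    if code & X86_PF_PK:
        parts.append("blocked by protection key")
    return parts


print(decode_pf_error(4))  # the error code from this report
print(decode_pf_error(6))  # the error code from the quoted example
```

For error 4 this yields "page not present", "read access" and "user mode", matching the interpretation given above.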
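Finding the faulting instruction
The problem is extremely hard to reproduce and no coredump was generated, so the error message printed by the kernel is the only evidence to work from. The first step is to find out which instruction caused the fault. The kernel message shows that the fault occurred inside the shared library libGLESv2_XXgpu.so.1.1.213621.
Run the following command to dump the program headers: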
root@test-System-Product-Name:/home/segment# readelf -l libGLESv2_XXgpu.so.1.1.213621
Elf file type is DYN (Shared object file)
Entry point 0x11c90
There are 10 program headers, starting at offset 64

Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
LOAD 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000010a78 0x0000000000010a78 R 0x1000
LOAD 0x0000000000011000 0x0000000000011000 0x0000000000011000
0x000000000017f9e1 0x000000000017f9e1 R E 0x1000
LOAD 0x0000000000191000 0x0000000000191000 0x0000000000191000
0x0000000000052a14 0x0000000000052a14 R 0x1000
LOAD 0x00000000001e4540 0x00000000001e5540 0x00000000001e5540
0x0000000000004119 0x00000000000063c0 RW 0x1000
DYNAMIC 0x00000000001e7d30 0x00000000001e8d30 0x00000000001e8d30
0x0000000000000280 0x0000000000000280 RW 0x8
NOTE 0x0000000000000270 0x0000000000000270 0x0000000000000270
0x0000000000000024 0x0000000000000024 R 0x4
TLS 0x00000000001e4540 0x00000000001e5540 0x00000000001e5540
0x0000000000000000 0x0000000000000010 R 0x8
GNU_EH_FRAME 0x00000000001c5490 0x00000000001c5490 0x00000000001c5490
0x000000000000378c 0x000000000000378c R 0x4
GNU_STACK 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 RW 0x10
GNU_RELRO 0x00000000001e4540 0x00000000001e5540 0x00000000001e5540
0x0000000000003ac0 0x0000000000003ac0 R 0x1

Section to Segment mapping:
Segment Sections...
00 .note.gnu.build-id .hash .gnu.hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt
01 .init .plt .plt.got .text .fini
02 .rodata .eh_frame_hdr .eh_frame
03 .init_array .fini_array .data.rel.ro .dynamic .got .got.plt .data .bss
04 .dynamic
05 .note.gnu.build-id
06 .tbss
07 .eh_frame_hdr
08
09 .init_array .fini_array .data.rel.ro .dynamic .got
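There are 10 segments in total. The entries 00, 01, ..., 09 under "Section to Segment mapping" correspond one-to-one, in order, to the rows under Program Headers: 00 is the first program header, and so on. Segment 01 contains five sections: .init, .plt, .plt.got, .text and .fini. Its type is LOAD. Segments of type LOAD are mapped into the process's virtual address space, and the kernel creates a VMA for each mapped LOAD segment. The flags of segment 01 are R E (readable and executable), so it holds code.
Align is the alignment requirement for the segment, here 0x1000. The segment's offset in the library file, 0x0000000000011000, is 0x1000-aligned, and the start virtual address of the corresponding VMA must be 0x1000-aligned as well. The kernel message gives the start of the VMA containing the faulting instruction as 0x7f247414e000, which is 0x1000-aligned, and its size as 0x180000. readelf -l reports the segment size as 0x000000000017f9e1; a VMA always spans a whole number of pages, so this is rounded up to 0x180000.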
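The rounding can be reproduced with a few lines. This sketch assumes 4 KiB pages (consistent with the Align value 0x1000); the helper names are made up for illustration.

```python
PAGE_SIZE = 0x1000  # assumption: 4 KiB pages, matching Align = 0x1000


def page_align_down(value):
    """Round value down to a page boundary."""
    return value & ~(PAGE_SIZE - 1)


def page_align_up(value):
    """Round value up to a page boundary."""
    return (value + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1)


def mapped_size(vaddr, memsz):
    """Size of the page-granular mapping that covers [vaddr, vaddr + memsz)."""
    return page_align_up(vaddr + memsz) - page_align_down(vaddr)


# Segment 01 from readelf -l: VirtAddr 0x11000, MemSiz 0x17f9e1.
print(hex(mapped_size(0x11000, 0x17f9e1)))  # 0x180000
```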
All sections in this segment carry the same flags. The following command shows that they are A (alloc) and X (execute):
root@test-System-Product-Name:/home/segment# readelf -S libGLESv2_XXgpu.so.1.1.213621
There are 30 section headers, starting at offset 0x1e87a8:

Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
[ 0] NULL 0000000000000000 00000000
0000000000000000 0000000000000000 0 0 0
[ 1] .note.gnu.build-i NOTE 0000000000000270 00000270
0000000000000024 0000000000000000 A 0 0 4
[ 2] .hash HASH 0000000000000298 00000298
0000000000001134 0000000000000004 A 4 0 8
[ 3] .gnu.hash GNU_HASH 00000000000013d0 000013d0
0000000000000b04 0000000000000000 A 4 0 8
[ 4] .dynsym DYNSYM 0000000000001ed8 00001ed8
0000000000003630 0000000000000018 A 5 1 8
[ 5] .dynstr STRTAB 0000000000005508 00005508
0000000000002b9f 0000000000000000 A 0 0 1
[ 6] .gnu.version VERSYM 00000000000080a8 000080a8
0000000000000484 0000000000000002 A 4 0 2
[ 7] .gnu.version_r VERNEED 0000000000008530 00008530
00000000000000a0 0000000000000000 A 5 4 8
[ 8] .rela.dyn RELA 00000000000085d0 000085d0
0000000000007230 0000000000000018 A 4 0 8
[ 9] .rela.plt RELA 000000000000f800 0000f800
0000000000001278 0000000000000018 AI 4 24 8
[10] .init PROGBITS 0000000000011000 00011000
0000000000000017 0000000000000000 AX 0 0 4
[11] .plt PROGBITS 0000000000011020 00011020
0000000000000c60 0000000000000010 AX 0 0 16
[12] .plt.got PROGBITS 0000000000011c80 00011c80
0000000000000010 0000000000000008 AX 0 0 8
[13] .text PROGBITS 0000000000011c90 00011c90
000000000017ed45 0000000000000000 AX 0 0 16
[14] .fini PROGBITS 00000000001909d8 001909d8
0000000000000009 0000000000000000 AX 0 0 4
[15] .rodata PROGBITS 0000000000191000 00191000
0000000000034490 0000000000000000 A 0 0 32
[16] .eh_frame_hdr PROGBITS 00000000001c5490 001c5490
000000000000378c 0000000000000000 A 0 0 4
[17] .eh_frame PROGBITS 00000000001c8c20 001c8c20
000000000001adf4 0000000000000000 A 0 0 8
[18] .tbss NOBITS 00000000001e5540 001e4540
0000000000000010 0000000000000000 WAT 0 0 8
[19] .init_array INIT_ARRAY 00000000001e5540 001e4540
0000000000000010 0000000000000008 WA 0 0 8
[20] .fini_array FINI_ARRAY 00000000001e5550 001e4550
0000000000000010 0000000000000008 WA 0 0 8
[21] .data.rel.ro PROGBITS 00000000001e5560 001e4560
00000000000037d0 0000000000000000 WA 0 0 32
[22] .dynamic DYNAMIC 00000000001e8d30 001e7d30
0000000000000280 0000000000000010 WA 5 0 8
[23] .got PROGBITS 00000000001e8fb0 001e7fb0
0000000000000040 0000000000000008 WA 0 0 8
[24] .got.plt PROGBITS 00000000001e9000 001e8000
0000000000000640 0000000000000008 WA 0 0 8
[25] .data PROGBITS 00000000001e9640 001e8640
0000000000000019 0000000000000000 WA 0 0 8
[26] .bss NOBITS 00000000001e9660 001e8659
00000000000022a0 0000000000000000 WA 0 0 32
[27] .comment PROGBITS 0000000000000000 001e8659
0000000000000029 0000000000000001 MS 0 0 1
[28] .gnu_debuglink PROGBITS 0000000000000000 001e8684
000000000000001c 0000000000000000 0 0 4
[29] .shstrtab STRTAB 0000000000000000 001e86a0
0000000000000103 0000000000000000 0 0 1
Key to Flags:
W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
L (link order), O (extra OS processing required), G (group), T (TLS),
C (compressed), x (unknown), o (OS specific), E (exclude),
l (large), p (processor specific)
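Disassembly analysis
From the analysis above, the offset of the faulting instruction within the VMA is 0x00007f24741bfca0 - 0x7f247414e000 = 0x71ca0, which is also its offset within segment 01. Segment 01 starts at offset 0x0000000000011000 in the library file, so the faulting instruction is at file offset 0x71ca0 + 0x0000000000011000 = 0x82ca0. For this segment Offset and VirtAddr are identical, so 0x82ca0 is also the address that objdump prints for the instruction. Run the following command to disassemble the library: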
root@test-System-Product-Name:/home/segment# objdump -DCgl libGLESv2_XXgpu.so.1.1.213621 | more
libGLESv2_XXgpu.so.1.1.213621: file format elf64-x86-64
Disassembly of section .note.gnu.build-id:

0000000000000270 <.note.gnu.build-id>:
270: 04 00 add $0x0,%al
272: 00 00 add %al,(%rax)
274: 14 00 adc $0x0,%al
......

Disassembly of section .init:
0000000000011000 <_init@@Base>:
11000: 48 83 ec 08 sub $0x8,%rsp
11004: 48 8b 05 b5 7f 1d 00 mov 0x1d7fb5(%rip),%rax # 1e8fc0 <__gmon_start__>
1100b: 48 85 c0 test %rax,%rax
1100e: 74 02 je 11012 <_init@@Base+0x12>
11010: ff d0 callq *%rax
11012: 48 83 c4 08 add $0x8,%rsp
11016: c3 retq

Disassembly of section .plt:
0000000000011020 <PVRSRVReleaseDeviceMapping@plt-0x10>:
11020: ff 35 e2 7f 1d 00 pushq 0x1d7fe2(%rip) # 1e9008 <_fini@@Base+0x58630>
11026: ff 25 e4 7f 1d 00 jmpq *0x1d7fe4(%rip) # 1e9010 <_fini@@Base+0x58638>
1102c: 0f 1f 40 00 nopl 0x0(%rax)
......

Disassembly of section .plt.got:
0000000000011c80 <__cxa_finalize@plt>:
11c80: ff 25 52 73 1d 00 jmpq *0x1d7352(%rip) # 1e8fd8 <__cxa_finalize@GLIBC_2.2.5>
11c86: 66 90 xchg %ax,%ax
......

Disassembly of section .text:
0000000000011c90 <glGetPointerv@@Base-0x42a60>:
11c90: 53 push %rbx
11c91: 8b 9f 10 ad 00 00 mov 0xad10(%rdi),%ebx
11c97: 41 89 d3 mov %edx,%r11d
11c9a: 31 c9 xor %ecx,%ecx
11c9c: 45 31 c9 xor %r9d,%r9d
11c9f: 89 f6 mov %esi,%esi
11ca1: 39 cb cmp %ecx,%ebx
......
0000000000082a10 <glFinish@@Base>:
82a10: 41 54 push %r12
82a12: 55 push %rbp
82a13: 53 push %rbx
82a14: 48 8d 3d 95 65 16 00 lea 0x166595(%rip),%rdi # 1e8fb0 <_fini@@Base+0x585d8>
......
82c6a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
82c70: 48 8b 83 10 30 00 00 mov 0x3010(%rbx),%rax
82c77: 44 8b 83 00 30 00 00 mov 0x3000(%rbx),%r8d
82c7e: 4c 8d 0d 9e 14 13 00 lea 0x13149e(%rip),%r9 # 1b4123 <_fini@@Base+0x2374b>
82c85: 89 e9 mov %ebp,%ecx
82c87: ba 41 00 00 00 mov $0x41,%edx
82c8c: be 19 00 00 00 mov $0x19,%esi
82c91: 48 8b 38 mov (%rax),%rdi
82c94: 31 c0 xor %eax,%eax
82c96: e8 e5 c1 0b 00 callq 13ee80 <glTexStorage2D@@Base+0x36c0>
82c9b: e9 bc fd ff ff jmpq 82a5c <glFinish@@Base+0x4c>
82ca0: 0f b6 06 movzbl (%rsi),%eax
82ca3: 80 3f 00 cmpb $0x0,(%rdi)
82ca6: 89 c2 mov %eax,%edx
82ca8: 74 25 je 82ccf <glFinish@@Base+0x2bf>
82caa: 84 d2 test %dl,%dl
82cac: b8 ff ff ff ff mov $0xffffffff,%eax
82cb1: 74 1c je 82ccf <glFinish@@Base+0x2bf>
82cb3: 8b 4f 08 mov 0x8(%rdi),%ecx
82cb6: 8b 56 08 mov 0x8(%rsi),%edx
......

Disassembly of section .fini:
00000000001909d8 <_fini@@Base>:
1909d8: 48 83 ec 08 sub $0x8,%rsp
1909dc: 48 83 c4 08 add $0x8,%rsp
1909e0: c3 retq
......
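The bytes from 0x82c76 through 0x82cb5 in this listing match the Code: line printed by the kernel exactly. The faulting instruction is movzbl (%rsi),%eax at 0x82ca0, which loads one byte from the memory that rsi points to and zero-extends it into eax. The value in rsi was evidently wrong and pointed to invalid memory.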
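The address arithmetic used to locate the instruction in the listing can be sketched as follows. The function name is hypothetical, and the sketch assumes that the VMA reported by the kernel begins at the first page of the LOAD segment (true here, because the reported VMA size equals the page-rounded segment size).

```python
PAGE_SIZE = 0x1000  # assumption: 4 KiB pages


def runtime_ip_to_objdump_address(ip, vma_start, seg_offset, seg_vaddr):
    """Translate a runtime instruction pointer into the address objdump prints.

    Assumes the VMA starts at the first page of the LOAD segment whose
    program header has file offset seg_offset and virtual address seg_vaddr.
    """
    offset_in_vma = ip - vma_start
    file_offset = offset_in_vma + (seg_offset & ~(PAGE_SIZE - 1))
    return file_offset - seg_offset + seg_vaddr


# Values from the kernel log and from segment 01 in the readelf -l output.
fault_address = runtime_ip_to_objdump_address(
    ip=0x7F24741BFCA0,
    vma_start=0x7F247414E000,
    seg_offset=0x11000,
    seg_vaddr=0x11000,
)
print(hex(fault_address))       # 0x82ca0
print(hex(fault_address - 42))  # 0x82c76, first byte of the kernel Code: dump
print(hex(fault_address + 21))  # 0x82cb5, last byte of the kernel Code: dump
```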
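Finding the source code for the faulting instruction
According to objdump the faulting instruction belongs to glFinish. However, comparing the glFinish source with the disassembly shows that glFinish should not compile to this many instructions. The likely explanation is that the library is a release build with an incomplete symbol table. When the library is loaded, external callers need the addresses of the interfaces they call, so only the symbols of exported interfaces are required; the symbols of functions called only inside the library are not. Code belonging to such internal functions is then presumably attributed to the nearest preceding exported symbol.
To analyze the library further, load it into gdb. The session looks like this: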
root@test-System-Product-Name:/home/shen/segment# gdb ./libGLESv2_XXgpu.so.1.1.213621
GNU gdb (Ubuntu 9.1-0ubuntu1) 9.1
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./libGLESv2_XXgpu.so.1.1.213621...
(No debugging symbols found in ./libGLESv2_XXgpu.so.1.1.213621)
(gdb) info symbol glFinish
glFinish in section .text
(gdb) info symbol CompareVariables
No symbol table is loaded. Use the "file" command.
(gdb) list *(0x0000000000082a10)
No symbol table is loaded. Use the "file" command.
(gdb) list *(0x0000000000082ca0)
No symbol table is loaded. Use the "file" command.
(gdb) disassemble glFinish
Dump of assembler code for function glFinish:
0x0000000000082a10 <+0>: push %r12
0x0000000000082a12 <+2>: push %rbp
0x0000000000082a13 <+3>: push %rbx
0x0000000000082a14 <+4>: lea 0x166595(%rip),%rdi # 0x1e8fb0
0x0000000000082a1b <+11>: callq 0x11560 <__tls_get_addr@plt>
0x0000000000082a20 <+16>: mov 0x8(%rax),%rbx
0x0000000000082a27 <+23>: test %rbx,%rbx
.......
0x0000000000082c77 <+615>: mov 0x3000(%rbx),%r8d
0x0000000000082c7e <+622>: lea 0x13149e(%rip),%r9 # 0x1b4123
0x0000000000082c85 <+629>: mov %ebp,%ecx
0x0000000000082c87 <+631>: mov $0x41,%edx
0x0000000000082c8c <+636>: mov $0x19,%esi
0x0000000000082c91 <+641>: mov (%rax),%rdi
0x0000000000082c94 <+644>: xor %eax,%eax
0x0000000000082c96 <+646>: callq 0x13ee80
0x0000000000082c9b <+651>: jmpq 0x82a5c <glFinish+76> //as the objdump output above shows, this jmpq is exactly 5 bytes long, so the next address is 0x82ca0
End of assembler dump.
(gdb) disassemble 0x0000000000082a10
Dump of assembler code for function glFinish:
0x0000000000082a10 <+0>: push %r12
0x0000000000082a12 <+2>: push %rbp
0x0000000000082a13 <+3>: push %rbx
0x0000000000082a14 <+4>: lea 0x166595(%rip),%rdi # 0x1e8fb0
0x0000000000082a1b <+11>: callq 0x11560 <__tls_get_addr@plt>
0x0000000000082a20 <+16>: mov 0x8(%rax),%rbx
0x0000000000082a27 <+23>: test %rbx,%rbx
.....
0x0000000000082c77 <+615>: mov 0x3000(%rbx),%r8d
0x0000000000082c7e <+622>: lea 0x13149e(%rip),%r9 # 0x1b4123
0x0000000000082c85 <+629>: mov %ebp,%ecx
0x0000000000082c87 <+631>: mov $0x41,%edx
0x0000000000082c8c <+636>: mov $0x19,%esi
0x0000000000082c91 <+641>: mov (%rax),%rdi
0x0000000000082c94 <+644>: xor %eax,%eax
0x0000000000082c96 <+646>: callq 0x13ee80
0x0000000000082c9b <+651>: jmpq 0x82a5c <glFinish+76>
End of assembler dump.
(gdb) disassemble 0x0000000000082ca0
No function contains specified address.
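The disassembly of glFinish ends with the jmpq at 0x0000000000082c9b. Adding the 5 bytes of that instruction gives 0x0000000000082ca0, so glFinish ends exactly where the faulting instruction begins. Running disassemble 0x0000000000082ca0 reports "No function contains specified address.".
Run add-symbol-file libGLESv2_XXgpu.dbg to load the debug symbol file. The session then looks like this: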
(gdb) add-symbol-file libGLESv2_XXgpu.dbg
add symbol table from file "libGLESv2_XXgpu.dbg"
(y or n) y
Reading symbols from libGLESv2_XXgpu.dbg...
(gdb) disassemble 0x0000000000082ca0
Dump of assembler code for function CompareVariables:
0x0000000000082ca0 <+0>: movzbl (%rsi),%eax
0x0000000000082ca3 <+3>: cmpb $0x0,(%rdi)
0x0000000000082ca6 <+6>: mov %eax,%edx
0x0000000000082ca8 <+8>: je 0x82ccf <CompareVariables+47>
0x0000000000082caa <+10>: test %dl,%dl
0x0000000000082cac <+12>: mov $0xffffffff,%eax
0x0000000000082cb1 <+17>: je 0x82ccf <CompareVariables+47>
0x0000000000082cb3 <+19>: mov 0x8(%rdi),%ecx
0x0000000000082cb6 <+22>: mov 0x8(%rsi),%edx
0x0000000000082cb9 <+25>: add $0x1,%ecx
0x0000000000082cbc <+28>: add $0x1,%edx
0x0000000000082cbf <+31>: sub 0x4(%rdi),%ecx
0x0000000000082cc2 <+34>: sub 0x4(%rsi),%edx
0x0000000000082cc5 <+37>: cmp %edx,%ecx
0x0000000000082cc7 <+39>: ja 0x82ccf <CompareVariables+47>
0x0000000000082cc9 <+41>: setb %al
0x0000000000082ccc <+44>: movzbl %al,%eax
0x0000000000082ccf <+47>: repz retq
End of assembler dump.
(gdb) list *(0x0000000000082ca0)
0x82ca0 is in CompareVariables (compiler/psc/inst.c:607).
602 compiler/psc/inst.c: No such file or directory.
(gdb) info symbol CompareVariables
CompareVariables in section .text of /home/segment/libGLESv2_XXgpu.dbg
(gdb) info symbol CompareVariables
CompareVariables in section .text of /home/segment/libGLESv2_XXgpu.dbg
(gdb) list CompareVariables
602 compiler/psc/inst.c: No such file or directory.
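0x0000000000082ca0 is exactly the first instruction of the function CompareVariables. The debugging environment has no source code, so list *(0x0000000000082ca0) reports "compiler/psc/inst.c: No such file or directory.".
The source code of CompareVariables is as follows: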
typedef struct tagPSC_VARIABLE {
    IMG_BOOL bInUse;
    IMG_UINT32 ui32FirstID;
    IMG_UINT32 ui32LastID;
    IMG_UINT32 ui32AlignmentInDW;
    IMG_UINT32 ui32LifetimeStart;
    IMG_UINT32 ui32LifetimeEnd;
    IMG_UINT32 ui32FirstHwReg;
    IMG_UINT32 ui32LastHwReg;
} PSC_VARIABLE;

typedef struct tagPSC_CONTEXT
{
    ......
    PSC_VARIABLE *psVariables;
    ......
    LABEL_LOCATION *psLabels;
    LABEL_REQUEST *psLabelRequests;
    ......
} PSC_CONTEXT;

static int C_CALLCONV CompareVariables(void const *pvLhs, void const *pvRhs)
{
    PSC_VARIABLE const *psLhs = pvLhs;
    PSC_VARIABLE const *psRhs = pvRhs;

    if (psLhs->bInUse && psRhs->bInUse)
    {
        IMG_UINT32 ui32LhsSize = psLhs->ui32LastID - psLhs->ui32FirstID + 1;
        IMG_UINT32 ui32RhsSize = psRhs->ui32LastID - psRhs->ui32FirstID + 1;

        if (ui32LhsSize > ui32RhsSize)
        {
            return -1;
        }
        else if (ui32LhsSize < ui32RhsSize)
        {
            return 1;
        }
        else
        {
            return 0;
        }
    }
    else if (psLhs->bInUse)
    {
        return -1;
    }
    else if (psRhs->bInUse)
    {
        return 1;
    }
    return 0;
}

/* CompilePreAmble is the only function that calls CompareVariables */
XX_INTERNAL void CompilePreAmble(PPSC_CONTEXT psContext)
{
    ......
    /*
     * Sort the variables from largest to smallest and count how many there are.
     */
    if (psContext->ui32VariablesCapacity > 0)
    {
        qsort(psContext->psVariables, psContext->ui32VariablesCapacity, sizeof(PSC_VARIABLE), CompareVariables);
    }
    ......
}
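To make the ordering that this comparator produces concrete, here is a small model of it. It is only a sketch: the Variable tuple and the sample data are invented for illustration, sizes are masked to 32 bits to imitate IMG_UINT32 arithmetic, and Python's sorted is stable whereas qsort gives no such guarantee.

```python
from functools import cmp_to_key
from typing import NamedTuple


class Variable(NamedTuple):
    """Illustrative stand-in for the three PSC_VARIABLE fields the comparator reads."""
    in_use: bool
    first_id: int
    last_id: int


def size_u32(var):
    """ui32LastID - ui32FirstID + 1, wrapped to 32 bits like IMG_UINT32."""
    return (var.last_id - var.first_id + 1) & 0xFFFFFFFF


def compare_variables(lhs, rhs):
    """Same decision tree as CompareVariables in the source above."""
    if lhs.in_use and rhs.in_use:
        lhs_size, rhs_size = size_u32(lhs), size_u32(rhs)
        if lhs_size > rhs_size:
            return -1
        if lhs_size < rhs_size:
            return 1
        return 0
    if lhs.in_use:
        return -1
    if rhs.in_use:
        return 1
    return 0


variables = [
    Variable(False, 0, 0),
    Variable(True, 10, 11),  # size 2
    Variable(True, 0, 7),    # size 8
    Variable(False, 0, 0),
    Variable(True, 20, 23),  # size 4
]
ordered = sorted(variables, key=cmp_to_key(compare_variables))
sizes = [size_u32(v) if v.in_use else None for v in ordered]
print(sizes)  # [8, 4, 2, None, None]
```

In-use entries come first, largest first, and unused entries end up at the tail of the array.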
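Reading the disassembly of CompareVariables against its source gives the annotations below.
AT&T syntax differs from Intel syntax; for example, the order of source and destination operands is reversed. In Intel syntax the first operand is the destination and the second is the source. The listing below uses AT&T syntax, where the source comes first and the destination second: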
(gdb) disassemble 0x0000000000082ca0
Dump of assembler code for function CompareVariables:
0x0000000000082ca0 <+0>: movzbl (%rsi),%eax // (%rsi) is psRhs->bInUse
0x0000000000082ca3 <+3>: cmpb $0x0,(%rdi) // (%rdi) is psLhs->bInUse
0x0000000000082ca6 <+6>: mov %eax,%edx // edx = psRhs->bInUse
0x0000000000082ca8 <+8>: je 0x82ccf <CompareVariables+47> // jump if psLhs->bInUse == 0; eax (psRhs->bInUse) is then the return value
0x0000000000082caa <+10>: test %dl,%dl // AND psRhs->bInUse with itself: ZF=1 if it is 0, otherwise ZF=0
0x0000000000082cac <+12>: mov $0xffffffff,%eax // eax = -1
0x0000000000082cb1 <+17>: je 0x82ccf <CompareVariables+47> // jump if psRhs->bInUse == 0 (ZF==1); the return value eax was set to -1 by the previous instruction
0x0000000000082cb3 <+19>: mov 0x8(%rdi),%ecx // 0x8(%rdi) is psLhs->ui32LastID
0x0000000000082cb6 <+22>: mov 0x8(%rsi),%edx // 0x8(%rsi) is psRhs->ui32LastID
0x0000000000082cb9 <+25>: add $0x1,%ecx
0x0000000000082cbc <+28>: add $0x1,%edx
0x0000000000082cbf <+31>: sub 0x4(%rdi),%ecx // 0x4(%rdi) is psLhs->ui32FirstID; ecx = ui32LhsSize = psLhs->ui32LastID - psLhs->ui32FirstID + 1
0x0000000000082cc2 <+34>: sub 0x4(%rsi),%edx // 0x4(%rsi) is psRhs->ui32FirstID; edx = ui32RhsSize = psRhs->ui32LastID - psRhs->ui32FirstID + 1
0x0000000000082cc5 <+37>: cmp %edx,%ecx // edx is the source operand, ecx the destination: compares ecx with edx
0x0000000000082cc7 <+39>: ja 0x82ccf <CompareVariables+47> // jump if ecx > edx (unsigned); eax is still -1
0x0000000000082cc9 <+41>: setb %al // uses the flags from cmp: al = 1 if ecx < edx, otherwise al = 0
0x0000000000082ccc <+44>: movzbl %al,%eax // eax = al; eax holds the return value
0x0000000000082ccf <+47>: repz retq
End of assembler dump.
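The displacements 0x4 and 0x8 in the annotated listing can be cross-checked against the struct layout. The sketch below models PSC_VARIABLE with ctypes; treating IMG_BOOL as a 1-byte bool is an assumption (suggested by the byte-sized movzbl and cmpb accesses), although a 4-byte IMG_BOOL would yield the same member offsets.

```python
import ctypes


class PSC_VARIABLE(ctypes.Structure):
    """ctypes model of PSC_VARIABLE using the platform's native alignment."""
    _fields_ = [
        ("bInUse", ctypes.c_bool),  # assumption: IMG_BOOL occupies one byte
        ("ui32FirstID", ctypes.c_uint32),
        ("ui32LastID", ctypes.c_uint32),
        ("ui32AlignmentInDW", ctypes.c_uint32),
        ("ui32LifetimeStart", ctypes.c_uint32),
        ("ui32LifetimeEnd", ctypes.c_uint32),
        ("ui32FirstHwReg", ctypes.c_uint32),
        ("ui32LastHwReg", ctypes.c_uint32),
    ]


# Print each member's byte offset, then the total size (the qsort element size).
for name, _ctype in PSC_VARIABLE._fields_:
    print(f"{name:20s} offset {getattr(PSC_VARIABLE, name).offset:#x}")
print("sizeof(PSC_VARIABLE) =", ctypes.sizeof(PSC_VARIABLE))
```

bInUse sits at offset 0, ui32FirstID at 0x4 and ui32LastID at 0x8, which agrees with the (%rsi), 0x4(%rsi) and 0x8(%rsi) operands above.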
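The faulting instruction loads psRhs->bInUse into eax, so the fault occurred while reading the memory that psRhs points to. The pointer value itself is presumably wrong; according to the kernel message, psRhs was 0x1f2e8.
No stack or register contents were captured when the fault occurred, so the disassembly cannot be analyzed any further. The only remaining option was to review the code related to calling CompareVariables. A review of every path that calls CompareVariables found no problem, so memory corruption is suspected.
The test team has been asked to enable coredump generation so that the analysis can continue once the problem reproduces. CompilePreAmble is the only function that calls CompareVariables, so pvRhs should point to an element of the array that psContext->psVariables points to. With a coredump, the register values at the time of the fault can be used to work back to the value of psContext, and the other members of psContext can then be inspected. If they look wrong as well, memory corruption is the likely cause.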
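One way a test harness might make sure core dumps are not suppressed by the resource limit is sketched below. This is an assumption about how the test could be launched, not a description of the actual test setup; the function name is made up, and the resource module is available on Unix only.

```python
import resource


def raise_core_limit():
    """Raise the soft RLIMIT_CORE to the hard limit and return the new pair."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    # An unprivileged process may raise its soft limit up to the hard limit.
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


soft, hard = raise_core_limit()
print("core size limit:", "unlimited" if soft == resource.RLIM_INFINITY else soft)
```

Child processes inherit the limit, so a wrapper could call this before starting glmark2-es2. Where the core file is written is controlled separately by /proc/sys/kernel/core_pattern.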
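Supplementary notes
Viewing the addresses of a process's shared libraries
The command cat /proc/[process id]/maps shows the virtual address ranges of the shared libraries mapped into a process. Each line corresponds to one distinct VMA in the kernel: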
root@test-System-Product-Name:/home# cat /proc/50859/maps
55e0c4b51000-55e0c4b5c000 r--p 00000000 08:02 21759090 /usr/sbin/sshd
55e0c4b5c000-55e0c4bdb000 r-xp 0000b000 08:02 21759090 /usr/sbin/sshd
55e0c4bdb000-55e0c4c23000 r--p 0008a000 08:02 21759090 /usr/sbin/sshd
55e0c4c23000-55e0c4c27000 r--p 000d1000 08:02 21759090 /usr/sbin/sshd
55e0c4c27000-55e0c4c28000 rw-p 000d5000 08:02 21759090 /usr/sbin/sshd
55e0c4c28000-55e0c4c2d000 rw-p 00000000 00:00 0
55e0c609b000-55e0c6149000 rw-p 00000000 00:00 0 [heap]
7f28e424c000-7f28e4252000 r--p 00000000 08:02 21760411 /usr/lib/x86_64-linux-gnu/libnss_systemd.so.2
7f28e4252000-7f28e4278000 r-xp 00006000 08:02 21760411 /usr/lib/x86_64-linux-gnu/libnss_systemd.so.2
7f28e4278000-7f28e4284000 r--p 0002c000 08:02 21760411 /usr/lib/x86_64-linux-gnu/libnss_systemd.so.2
7f28e4284000-7f28e4287000 r--p 00037000 08:02 21760411 /usr/lib/x86_64-linux-gnu/libnss_systemd.so.2
7f28e4287000-7f28e4288000 rw-p 0003a000 08:02 21760411 /usr/lib/x86_64-linux-gnu/libnss_systemd.so.2
7f28e42e5000-7f28e42e7000 r--p 00000000 08:02 22417503 /usr/lib/x86_64-linux-gnu/security/pam_gnome_keyring.so
7f28e42e7000-7f28e42ed000 r-xp 00002000 08:02 22417503 /usr/lib/x86_64-linux-gnu/security/pam_gnome_keyring.so
7f28e42ed000-7f28e42f0000 r--p 00008000 08:02 22417503 /usr/lib/x86_64-linux-gnu/security/pam_gnome_keyring.so
7f28e42f0000-7f28e42f1000 r--p 0000a000 08:02 22417503 /usr/lib/x86_64-linux-gnu/security/pam_gnome_keyring.so
7f28e42f1000-7f28e42f2000 rw-p 0000b000 08:02 22417503 /usr/lib/x86_64-linux-gnu/security/pam_gnome_keyring.so
......
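The same lookup that print_vma_addr performs in the kernel can be approximated from user space with the maps file. The following sketch uses three lines of the output above as sample data; the function name find_mapping and the returned dictionary layout are invented for this example.

```python
# Three lines taken from the /proc/50859/maps output above.
MAPS_SAMPLE = """\
55e0c4b51000-55e0c4b5c000 r--p 00000000 08:02 21759090 /usr/sbin/sshd
55e0c4b5c000-55e0c4bdb000 r-xp 0000b000 08:02 21759090 /usr/sbin/sshd
55e0c4c28000-55e0c4c2d000 rw-p 00000000 00:00 0
"""


def find_mapping(maps_text, address):
    """Return the mapping that contains address, or None if there is none."""
    for line in maps_text.splitlines():
        fields = line.split(maxsplit=5)
        if len(fields) < 5:
            continue  # not a mapping line
        start_text, end_text = fields[0].split("-")
        start, end = int(start_text, 16), int(end_text, 16)
        if start <= address < end:
            mapping_offset = int(fields[2], 16)  # file offset of the mapping start
            return {
                "start": start,
                "size": end - start,
                "perms": fields[1],
                "path": fields[5] if len(fields) > 5 else "",
                "file_offset": address - start + mapping_offset,
            }
    return None


hit = find_mapping(MAPS_SAMPLE, 0x55E0C4B5C010)
print(hit["path"], hit["perms"], hex(hit["file_offset"]))
```

For a live process the text would come from reading /proc/<pid>/maps instead of the sample string.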
Loading a debug symbol file at a specific address in gdb
When debugging with gdb against a release build of a shared library, the debug symbol file can be loaded as follows:
(gdb) info sharedlibrary
From To Syms Read Shared Object Library
0x0000fffff7fcd0c0 0x0000fffff7fe5468 Yes (*) /lib/ld-linux-aarch64.so.1
0x0000fffff7f9f890 0x0000fffff7fb65c0 Yes (*) /usr/local/lib/libtest.so.1
0x0000fffff7e4bbc0 0x0000fffff7f3b190 Yes /lib/aarch64-linux-gnu/libc.so.6
0x0000fffff7dfea50 0x0000fffff7e0ddec Yes /lib/aarch64-linux-gnu/libpthread.so.0
Load the symbol file at the address reported for the shared library:
(gdb) add-symbol-file ./libsrc/libtest.dbg 0x0000fffff7f9f890
add symbol table from file "./libsrc/libtest.dbg" at
.text_addr = 0xfffff7f9f890
(y or n) y
Reading symbols from ./libsrc/libtest.dbg...
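Here 0x0000fffff7f9f890 is the address at which the library (more precisely, its .text section, as the .text_addr line shows) is loaded. It can be looked up in the From column of info sharedlibrary.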