+++
title = "Dynamic linking madness: solving a bug in go-nvml"
date = "2025-02-15"
author = "Braydon Kains"
+++
|
|
|
|
|
|
|
|
|
|
I work on open source observability software, primarily the [Google Cloud Ops Agent](https://github.com/GoogleCloudPlatform/ops-agent), [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/), and [Fluent Bit](https://github.com/fluent/fluent-bit).
|
|
|
|
|
Over the past few years, I have developed an affinity for the kind of deep issues that take me as far into the weeds as I can go. In this post I'll walk through one of them, partly to document everything I learned, and partly because I think the journey was interesting enough to be worth writing down.
|
|
|
|
|
|
|
|
|
|
## The Issue: go-nvml crashes our OpenTelemetry Collector
|
|
|
|
|
|
|
|
|
|
One of the features of the Ops Agent is GPU Monitoring; if you install the Ops Agent on a GCE VM with a GPU, you will automatically get metrics for it through the [NVIDIA Management Library (NVML)](https://developer.nvidia.com/nvidia-management-library-nvml), and optionally through [DCGM](https://developer.nvidia.com/dcgm). To achieve this, we built specific instrumentation using the [Go bindings for NVML](https://github.com/NVIDIA/go-nvml) and for DCGM.
|
|
|
|
|
|
|
|
|
|
When we attempted to upgrade our build of the Collector to Go 1.21, we learned that the Collector would crash on startup if a GPU was present on the machine. It produced the kind of crash you don't usually see in a Go program:
|
|
|
|
|
```
SIGSEGV: segmentation violation
PC=0x0 m=0 sigcode=1
signal arrived during cgo execution
```
|
|
|
|
|
Seeing `PC=0x0` was very surprising to me. I had no idea how this sort of thing could occur in a Go program, even with CGO. Even stranger, the crash only happened on certain systems. How could something like a segfault be system dependent?
|
|
|
|
|
I was absolutely hooked. I would not rest until I understood why this could possibly be happening.
|
|
|
|
|
|
|
|
|
|
You can read [the original issue in go-nvml](https://github.com/NVIDIA/go-nvml/issues/36) and [the issue I opened in golang/go][golang github issue] to see the real discussions, or read on for my direct retelling.
|
|
|
|
|
|
|
|
|
|
## Intro to dynamic libraries
|
|
|
|
|
|
|
|
|
|
This section covers background that I think is important for understanding the underlying issue. If you are already familiar with how dynamic libraries are loaded, you can skip to [How go-nvml works](#how-go-nvml-works).
|
|
|
|
|
|
|
|
|
|
### Dynamic vs Static Linking
|
|
|
|
|
|
|
|
|
|
In C and adjacent languages, there are two ways to link a library into your application: statically and dynamically. Static linking is pretty straightforward: the library is compiled into object code, and the linker copies that code directly into the resulting binary at build time. When the compiled program runs and something from the library is referenced, the implementation is already present within the binary. With dynamic linking, the libraries are not built into the binary; the binary only records references to them, and they are loaded at runtime. These libraries are `.so` files on Linux or `.dll` files on Windows. When the application is run, the system's dynamic loader looks for the referenced libraries on the system; if they are found they are loaded for the program to use, and if not the program fails to start.
|
|
|
|
|
|
|
|
|
|
Static linking sure does sound great, right? There's not much to think about: the code is just included in the binary, and you don't need to worry about having specific dynamic libraries on the system. Why wouldn't you always do that? Go agrees with you: binaries built from pure Go code (without CGO) are statically linked by default. This is actually a selling point of the language, and as an avid user of it I can feel the benefits. It is so nice to build a giant Go program and have one clean binary at the end with everything it needs. As someone working on a [tool written in Go](https://github.com/google/yamlfmt), I love that building and distributing it is so dead simple because it's one statically linked binary. No separate instructions that certain libraries have to be `apt install`ed onto the system, or being forced to distribute a container image for the tool to be usable.
|
|
|
|
|
|
|
|
|
|
Dynamic linking does have a purpose though, especially when writing lower level applications. One of the most commonly dynamically linked libraries is the C runtime, an implementation of which is available on any Linux distribution, and which can be installed on Windows through the `Visual C++ Redistributable` (something I'm sure many gamers have installed and not really known why). Most compilers can link the C runtime statically, but it often doesn't make much sense to statically link something that is available on almost any system the application will run on. One of the biggest reasons is binary size. I've seen people online be quite confused that a simple Go Hello World program exceeds a megabyte (at least at the time); the reason is that Go statically links its runtime into the binary, which balloons its size.
|
|
|
|
|
|
|
|
|
|
Large binaries with lots of statically linked libraries have other complications as well, such as the amount of memory the program takes to run. I'd like to write a separate blog post about this at some point, but in short, large statically linked binaries can take more memory to run because loading the binary's instructions and data in the first place takes up more space in RAM. The difference with dynamically loaded libraries is that the memory the library occupies can be shared by every process using that library. Take dynamically linking `libc` as an example: there are probably tons of other applications on the system also dynamically loading `libc`, all sharing that memory in RAM. If all those same binaries had statically linked `libc`, each would carry a private copy of `libc`, with all the memory that takes up, and could not share it with any other process on the system.
|
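To make the static versus dynamic distinction concrete, here is a minimal sketch that inspects an ELF binary with Go's standard `debug/elf` package. It reports whether the file asks for a dynamic loader (a `PT_INTERP` segment) and which shared libraries it lists as needed. The `linkInfo` type and `inspect` function are names made up for this illustration, and the `/proc/self/exe` default assumes Linux.

```go
package main

import (
	"debug/elf"
	"fmt"
	"os"
)

// linkInfo summarizes how an ELF binary expects to be loaded.
// (Hypothetical helper type for this example.)
type linkInfo struct {
	Dynamic bool     // true if the file has a PT_INTERP segment, i.e. requests a dynamic loader
	Needed  []string // shared libraries the file lists as required at startup
}

// inspect opens the ELF file at path and reports its linking information.
func inspect(path string) (linkInfo, error) {
	f, err := elf.Open(path)
	if err != nil {
		return linkInfo{}, err
	}
	defer f.Close()

	var info linkInfo
	for _, p := range f.Progs {
		if p.Type == elf.PT_INTERP {
			info.Dynamic = true
		}
	}
	// For a binary without a dynamic section this yields an empty list.
	info.Needed, err = f.ImportedLibraries()
	return info, err
}

func main() {
	path := "/proc/self/exe" // assumption: Linux, inspect the running binary itself
	if len(os.Args) > 1 {
		path = os.Args[1]
	}
	info, err := inspect(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, "inspect:", err)
		os.Exit(1)
	}
	fmt.Printf("dynamic=%v needed=%v\n", info.Dynamic, info.Needed)
}
```

Pointing this at a typical C program built with `gcc` should list something like `libc.so.6`, while a pure Go binary built without CGO would usually report no needed libraries at all.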
|
|
|
|
|
|
|
|
|
### Dynamic Loading
|
|
|
|
|
|
|
|
|
|
The other way to interact with dynamic libraries is by loading them explicitly. With dynamic linking, references to the required libraries are recorded in the binary for the system to discover when the program is loaded. However, sometimes the exact library to be used can't be known at compile time. There may be multiple versions of the library that the program is built to work with, and some logic has to run at runtime to determine exactly which library is loaded. This is common with versioned APIs, where a dynamic library may add `v2` versions of functions alongside the originals rather than changing the existing functions, so that backwards compatibility is maintained (which is really important for dynamic libraries).
|
|
|
|
|
So the alternative method is loading the libraries at runtime using `dlopen` on Linux, or `LoadLibrary` on Windows. This gives you a handle to the library loaded into program memory, and you can look up symbols in it using `dlsym` on Linux or `GetProcAddress` on Windows.
|
|
|
|
|
|
|
|
|
|
### Exporting Dynamic Symbols (Linux ELF binaries)
|
|
|
|
|
|
|
|
|
|
We have now exceeded my knowledge of how this might work in Windows, so this section is specific to ELF binaries on Linux.
|
|
|
|
|
|
|
|
|
|
What typically happens in the linking step is that the linker routes all external references to dynamic symbols through two sections of the binary called the PLT (Procedure Linkage Table) and the GOT (Global Offset Table). The PLT holds a small stub for each dynamic function the program calls, while the GOT holds the actual addresses of dynamic symbols once they are known. When code calls a dynamic function, the call is directed to that function's PLT stub. At link time, the linker creates a GOT slot for each of these symbols, and the dynamic loader fills the slots in at runtime. When a PLT stub is called, it jumps to the address stored in its GOT slot; if the symbol has not been resolved yet, that address leads to the dynamic loader's resolver, which looks the symbol up, stores its address in the GOT, and continues on to the function.
|
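The PLT and GOT interplay can be easier to follow as a toy model. The sketch below is not how a real loader is implemented: it uses Go function values in place of machine addresses, and every name in it (`got`, `loader`, `bindLazily`, `callViaPLT`) is invented for the illustration. It only shows the shape of lazy binding, where the first call through a slot triggers resolution and later calls go straight to the target.

```go
package main

import "fmt"

// got is a toy Global Offset Table: one slot per external function.
// Real GOT slots hold raw addresses; function values stand in for them here.
type got map[string]func(string)

// loader plays the role of the dynamic loader's symbol resolver.
type loader struct {
	libs     map[string]func(string) // symbols exported by the "loaded libraries"
	resolved int                     // how many times resolution actually ran
}

// bindLazily fills the GOT slot for name with a stub that resolves the symbol
// on first use, patches the slot, and then forwards the call.
func (l *loader) bindLazily(g got, name string) {
	g[name] = func(arg string) {
		target, ok := l.libs[name]
		if !ok {
			panic("undefined symbol: " + name)
		}
		l.resolved++
		g[name] = target // patch the slot so later calls skip the resolver
		target(arg)
	}
}

// callViaPLT mimics a PLT stub: an indirect jump through the GOT slot.
func callViaPLT(g got, name, arg string) {
	g[name](arg)
}

func main() {
	l := &loader{libs: map[string]func(string){
		"puts": func(s string) { fmt.Println(s) },
	}}
	g := got{}
	l.bindLazily(g, "puts")

	callViaPLT(g, "puts", "hi")       // first call goes through the resolver
	callViaPLT(g, "puts", "hi again") // second call jumps straight to puts
	fmt.Println("resolutions:", l.resolved)
}
```

Both calls print their argument, but the resolver only runs once, which is the point of patching the slot.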
|
|
|
|
|
|
|
|
|
Let's see this in action with a very simple C program:
|
|
|
|
|
```c
#include <stdio.h>

int main() {
    printf("hi\n");
    return 0;
}
```
|
|
|
|
|
I'll compile the binary with `gcc` and immediately disassemble it:
|
|
|
|
|
```
$ make
gcc -o hello -g -Wall main.c
$ objdump -d hello > hello.s
```
|
|
|
|
|
Let's navigate the dump to the `main` subroutine:
|
|
|
|
|
```
0000000000001149 <main>:
    1149:   f3 0f 1e fa             endbr64
    114d:   55                      push   %rbp
    114e:   48 89 e5                mov    %rsp,%rbp
    1151:   48 8d 3d ac 0e 00 00    lea    0xeac(%rip),%rdi        # 2004 <_IO_stdin_used+0x4>
    1158:   e8 f3 fe ff ff          call   1050 <puts@plt>
    115d:   b8 00 00 00 00          mov    $0x0,%eax
    1162:   5d                      pop    %rbp
    1163:   c3                      ret
    1164:   66 2e 0f 1f 84 00 00    cs nopw 0x0(%rax,%rax,1)
    116b:   00 00 00
    116e:   66 90                   xchg   %ax,%ax
```
|
|
|
|
|
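What we care about here is the instruction at address `1158`, the call to `puts@plt`. This is a reference to the PLT entry for the symbol `puts`. We called `printf` from `stdio.h` in our program, but because the string is a constant ending in a newline with no format specifiers, `gcc` replaced the call with the equivalent `puts`.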
|
|
|
|
|
|
|
|
|
|
In the dump we can also analyze the disassembly of the `plt`:
|
|
|
|
|
```
Disassembly of section .plt:

0000000000001020 <.plt>:
    1020:   ff 35 9a 2f 00 00       push   0x2f9a(%rip)        # 3fc0 <_GLOBAL_OFFSET_TABLE_+0x8>
    1026:   ff 25 9c 2f 00 00       jmp    *0x2f9c(%rip)        # 3fc8 <_GLOBAL_OFFSET_TABLE_+0x10>
    102c:   0f 1f 40 00             nopl   0x0(%rax)
    1030:   f3 0f 1e fa             endbr64
    1034:   68 00 00 00 00          push   $0x0
    1039:   e9 e2 ff ff ff          jmp    1020 <_init+0x20>
    103e:   66 90                   xchg   %ax,%ax

Disassembly of section .plt.got:

0000000000001040 <__cxa_finalize@plt>:
    1040:   f3 0f 1e fa             endbr64
    1044:   ff 25 ae 2f 00 00       jmp    *0x2fae(%rip)        # 3ff8 <__cxa_finalize@GLIBC_2.2.5>
    104a:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)

Disassembly of section .plt.sec:

0000000000001050 <puts@plt>:
    1050:   f3 0f 1e fa             endbr64
    1054:   ff 25 76 2f 00 00       jmp    *0x2f76(%rip)        # 3fd0 <puts@GLIBC_2.2.5>
    105a:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
```
|
|
|
|
|
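We can see that `puts@plt` does an indirect jump through the GOT slot at address `0x3fd0` (`0x2f76` is that slot's offset from the next instruction at `0x105a`). At runtime, that slot holds the address of `puts` from `GLIBC_2.2.5`.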
|
|
|
|
|
|
|
|
|
|
All of this will be important when we get to the bug itself, so I hope you stayed awake!
|
|
|
|
|
|
|
|
|
|
## How go-nvml works
|
|
|
|
|
|
|
|
|
|
The Go NVML bindings are an interesting challenge. NVML is a closed source library, and the intended usage is to link to the shared object on the system using a public header. So the way the Go NVML bindings work is as follows:
|
|
|
|
|
|
|
|
|
|
1. Provide a copy of the [NVML header](https://github.com/NVIDIA/go-nvml/blob/main/pkg/nvml/nvml.h)
2. Using a 3rd party tool called [c-for-go](https://c.for-go.com/), generate a set of Go bindings
3. Wrap the Go bindings in a light API layer for user friendliness
|
|
|
|
|
|
|
|
|
|
The function that was segfaulting was actually the first function, `nvmlInit`. So let's look at the process of loading this function:
|
|
|
|
|
|
|
|
|
|
1. The library `libnvidia-ml.so.1` is loaded using `dlopen` with the flags `RTLD_LAZY | RTLD_GLOBAL`.
2. Much of the API is versioned in the library, so each of the versioned APIs is looked up in the loaded library using `dlsym`. If the v2 version of a symbol is present, then the bindings are told to use the v2 version of the symbol. In our case, we are using an NVML library that's new enough to have `nvmlInit_v2`, so we will end up using that symbol.
3. Each of these symbols is wrapped with an exported Go function that loads the library and checks for errors before calling into the generated bindings. So we would call `nvml.Init()` in our Go code.
4. This leads to the generated bindings, which are what actually call into CGO using `import "C"` and call `C.nvmlInit_v2()`.
|
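The version selection in step 2 can be sketched in pure Go. This is a toy model and not go-nvml's actual code: `symbolTable` is a hypothetical stand-in for a `dlopen` handle, its `lookup` method stands in for `dlsym`, and `resolveVersioned` is a made-up helper that prefers a suffixed symbol and falls back to the unversioned one.

```go
package main

import (
	"errors"
	"fmt"
)

// symbolTable stands in for a dlopen handle: it maps symbol names to
// callable functions. (Hypothetical type, not part of go-nvml.)
type symbolTable map[string]func() int

var errNotFound = errors.New("symbol not found")

// lookup mimics dlsym: it returns the function for name, or an error.
func (t symbolTable) lookup(name string) (func() int, error) {
	fn, ok := t[name]
	if !ok {
		return nil, fmt.Errorf("%w: %s", errNotFound, name)
	}
	return fn, nil
}

// resolveVersioned tries base+suffix for each suffix in order and falls
// back to the unversioned name if none of them are present.
func resolveVersioned(t symbolTable, base string, suffixes ...string) (string, func() int, error) {
	for _, s := range suffixes {
		if fn, err := t.lookup(base + s); err == nil {
			return base + s, fn, nil
		}
	}
	fn, err := t.lookup(base)
	return base, fn, err
}

func main() {
	newLib := symbolTable{
		"nvmlInit":    func() int { return 1 },
		"nvmlInit_v2": func() int { return 2 },
	}
	oldLib := symbolTable{
		"nvmlInit": func() int { return 1 },
	}

	for _, lib := range []symbolTable{newLib, oldLib} {
		name, _, err := resolveVersioned(lib, "nvmlInit", "_v2")
		fmt.Println(name, err)
	}
}
```

With the newer table the helper picks `nvmlInit_v2`, and with the older one it falls back to `nvmlInit`, which mirrors the behaviour described in the list above.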
|
|
|
|
|
|
|
|
|
## The Bug
|
|
|
|
|
|
|
|
|
|
A considerable amount of time has passed since this investigation took place, so I am writing with a ton of hindsight here. This explanation will gloss over a ton of straw-grasping, which you can look through in the [Go GitHub issue I opened][golang github issue]. For the sake of this post though, I'm going to skip to the part where it all came together and the issue and solution became clear.
|
|
|
|
|
|
|
|
|
|
Ignoring the deep inner workings of the NVML Go bindings, I will focus on the most important core of it. This project generates Go bindings based on an [input header file](https://github.com/NVIDIA/go-nvml/blob/v0.12.0-1/gen/nvml/nvml.h). This header file represents the accessible API for `libnvidia-ml.so.1`, a proprietary binary that is expected to be installed on the user's machine and loaded at runtime. It is not provided as part of the binding package, and will not be linked as a part of the build. To deal with this, the linker flag `--unresolved-symbols=ignore-in-object-files` is [passed to the linker as part of the bindings](https://github.com/NVIDIA/go-nvml/blob/v0.12.0-1/pkg/nvml/nvml.go#L21). With this flag, the symbols from `nvml.h`, which cannot be resolved during the build because the shared object is missing, are ignored by the linker rather than treated as errors.
|
|
|
|
|
|
|
|
|
|
Our initial knowledge was that the bug occurred under the following circumstances:
|
|
|
|
|
1. Using Go 1.21
2. Building on Ubuntu Jammy or newer, but not on earlier distros like Debian 10 Buster
|
|
|
|
|
|
|
|
|
|
At this point in the investigation a lot of these concepts were somewhat new to me. Still, given that the issue was with a dynamic library loaded through CGO, I had a feeling it had something to do with linking. I suspected that the version of `ld` on the system was the culprit, and that something in the CGO layer of Go had changed in a way that conflicted with newer versions of `ld`. It took me a non-trivial amount of time to understand why, but this ended up mostly correct.
|
|
|
|
|
|
|
|
|
|
### Standalone Repro
|
|
|
|
|
|
|
|
|
|
To a) determine whether this was `go-nvml` specific or something inherent to Go, and b) avoid needing NVIDIA libraries installed while developing, I created a [standalone reproduction][cgo_dl_repro]. This confirmed that a small CGO program set up under the same circumstances (providing a header but no object, and passing `--unresolved-symbols=ignore-in-object-files` to `ld`) panicked in the exact same way. We can work with this from here on out.
|
|
|
|
|
|
|
|
|
|
### Comparing Go 1.20 to 1.21
|
|
|
|
|
|
|
|
|
|
Using the reproduction, I will build 2 binaries, one with Go 1.20 and one with Go 1.21.
|
|
|
|
|
|
|
|
|
|
The repro program includes a header that declares a function `get42`, and the program makes a call to it. This symbol should be unresolved in the build, and should show up as such in our binary. If we use `nm` on the Go 1.20 binary, we can find `get42` present as an unresolved symbol, as expected:
|
|
|
|
|
```
$ nm cgo_dl_repro_go120 | grep get42
0000000000483760 T _cgo_49665a31f432_Cfunc_get42
                 U get42
0000000000483580 t main._Cfunc_get42.abi0
000000000051b1c8 d main._cgo_49665a31f432_Cfunc_get42
```
|
|
|
|
|
However, checking out the Go 1.21 binary shows an important difference, which is that this symbol is missing!
|
|
|
|
|
```
$ nm cgo_dl_repro_go121 | grep get42
000000000047ce70 T _cgo_49665a31f432_Cfunc_get42
000000000047cca0 t main._Cfunc_get42.abi0
000000000051b1a8 d main._cgo_49665a31f432_Cfunc_get42
```
|
|
|
|
|
The only `get42` symbols are the CGO calls we make in the Go code and the symbol from the C code that CGO generates.
|
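If you want to automate this comparison, for example in a CI check, a small helper can scan `nm` output for the undefined entry. The sketch below assumes the plain `nm` text format shown above, where an undefined symbol appears as `U name` with no address; `hasUndefined` is a made-up helper name, and it does not handle names that carry a version suffix such as `puts@GLIBC_2.2.5`.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// hasUndefined reports whether nm-style output lists name as an
// undefined symbol, i.e. a line of the form "U name" with no address.
func hasUndefined(nmOutput, name string) bool {
	sc := bufio.NewScanner(strings.NewReader(nmOutput))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// Undefined symbols have two fields (type, name); defined ones have three.
		if len(fields) == 2 && fields[0] == "U" && fields[1] == name {
			return true
		}
	}
	return false
}

func main() {
	// The two nm listings from this post.
	go120 := `0000000000483760 T _cgo_49665a31f432_Cfunc_get42
                 U get42
0000000000483580 t main._Cfunc_get42.abi0
000000000051b1c8 d main._cgo_49665a31f432_Cfunc_get42`

	go121 := `000000000047ce70 T _cgo_49665a31f432_Cfunc_get42
000000000047cca0 t main._Cfunc_get42.abi0
000000000051b1a8 d main._cgo_49665a31f432_Cfunc_get42`

	fmt.Println("go1.20 has undefined get42:", hasUndefined(go120, "get42"))
	fmt.Println("go1.21 has undefined get42:", hasUndefined(go121, "get42"))
}
```

Run against the two listings, this reports `true` for the Go 1.20 binary and `false` for the Go 1.21 one.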
|
|
|
|
|
|
|
|
|
I did not fully grasp what I was looking at when I found this, but this turned out to be the important difference. The unresolved `get42` symbol being missing actually meant that `get42` **did not have an entry in the PLT**. The result is a Go 1.21 binary whose code for this call looks like this (disassembled by `go tool objdump`):
|
|
|
|
|
```
TEXT _cgo_49665a31f432_Cfunc_get42(SB)
  :0    0x47ce70    4154          PUSHQ R12
  :0    0x47ce72    55            PUSHQ BP
  :0    0x47ce73    53            PUSHQ BX
  :0    0x47ce74    4889fb        MOVQ DI, BX
  :0    0x47ce77    e88416feff    CALL _cgo_topofstack(SB)
  :0    0x47ce7c    4989c4        MOVQ AX, R12
  :0    0x47ce7f    31c0          XORL AX, AX
  :0    0x47ce81    e87a31b8ff    CALL 0x0 <-- EVIL!!!!
  :0    0x47ce86    89c5          MOVL AX, BP
  :0    0x47ce88    e87316feff    CALL _cgo_topofstack(SB)
  :0    0x47ce8d    4c29e0        SUBQ R12, AX
  :0    0x47ce90    892c03        MOVL BP, 0(BX)(AX*1)
  :0    0x47ce93    5b            POPQ BX
  :0    0x47ce94    5d            POPQ BP
  :0    0x47ce95    415c          POPQ R12
  :0    0x47ce97    c3            RET
```
|
|
|
|
|
And a reminder of what that panic looks like:
|
|
|
|
|
```
SIGSEGV: segmentation violation
PC=0x0 m=0 sigcode=1
signal arrived during cgo execution
```
|
|
|
|
|
That explains how we're getting program counter `0x0`!
|
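The `CALL 0x0` in the disassembly can be cross-checked by decoding the instruction bytes by hand. On x86-64, opcode `E8` is a near call whose operand is a signed 32-bit displacement, stored little-endian and relative to the address of the next instruction. The sketch below applies that rule to the bytes from the listing; `callTarget` is a helper written for this post, not part of any Go tooling.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// callTarget decodes an x86-64 near call (opcode E8 followed by a
// little-endian signed 32-bit displacement) located at addr and returns
// the address it transfers control to.
func callTarget(addr uint64, code []byte) (uint64, error) {
	if len(code) != 5 || code[0] != 0xE8 {
		return 0, errors.New("not a 5-byte E8 call instruction")
	}
	disp := int32(binary.LittleEndian.Uint32(code[1:]))
	next := addr + 5 // the displacement is relative to the next instruction
	// Sign-extend the displacement; unsigned addition wraps as intended.
	return next + uint64(int64(disp)), nil
}

func main() {
	// Bytes and addresses taken from the disassembly in this post.
	calls := []struct {
		addr uint64
		code []byte
	}{
		{0x47ce77, []byte{0xe8, 0x84, 0x16, 0xfe, 0xff}}, // CALL _cgo_topofstack(SB)
		{0x47ce81, []byte{0xe8, 0x7a, 0x31, 0xb8, 0xff}}, // the suspicious call
	}
	for _, c := range calls {
		target, err := callTarget(c.addr, c.code)
		if err != nil {
			fmt.Println("decode error:", err)
			continue
		}
		fmt.Printf("call at %#x -> %#x\n", c.addr, target)
	}
}
```

The first call decodes to `0x45e500`, a normal address inside the binary, while the second decodes to exactly `0x0`: the displacement is the negated address of the next instruction, which is what a call resolved against a symbol at address zero looks like.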
|
|
|
|
|
|
|
|
|
### The Solution
|
|
|
|
|
|
|
|
|
|
I spent a considerable amount of time experimenting and reading through the `cmd/link` and `cgo` source code to try and understand what was going on, and I did learn a lot, but I ended up finding the problem with a good old fashioned `git bisect`. It landed on commit [1f29f39](https://github.com/golang/go/commit/1f29f39795e736238200840c368c4e0c6edbfbae).
|
|
|
|
|
The message of that commit: `cmd/link: don't export all symbols for ELF external linking`
|
|
|
|
|
The problematic code change was from this:
|
|
|
|
|
```go
// Force global symbols to be exported for dlopen, etc.
if ctxt.IsELF {
	argv = append(argv, "-rdynamic")
}
```
|
|
|
|
|
To this:
|
|
|
|
|
```go
// Force global symbols to be exported for dlopen, etc.
if ctxt.IsELF {
	if ctxt.DynlinkingGo() || ctxt.BuildMode == BuildModeCShared || !linkerFlagSupported(ctxt.Arch, argv[0], altLinker, "-Wl,--export-dynamic-symbol=main") {
		argv = append(argv, "-rdynamic")
	} else {
		ctxt.loader.ForAllCgoExportDynamic(func(s loader.Sym) {
			argv = append(argv, "-Wl,--export-dynamic-symbol="+ctxt.loader.SymExtname(s))
		})
	}
}
```
|
|
|
|
|
What does this mean? The code used to always pass the `-rdynamic` flag to `gcc`, which passes `--export-dynamic` to `ld` under the hood. After the change, `-rdynamic` is only passed when `ctxt.DynlinkingGo()` is true, when the build mode is `c-shared`, or when the linker does not support the `--export-dynamic-symbol` flag; otherwise only the symbols explicitly exported through CGO are exported. The justification for this is in [this issue](https://github.com/golang/go/issues/53579) (TL;DR: exporting everything is unnecessary in most cases and wastes space in the majority of binaries). It's hard to know exactly when the `--export-dynamic-symbol` flag was added to `ld`, but its presence seems like the only plausible reason that this issue only occurs when the `ld` version is new enough.
|
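To see how that decision plays out on different systems, here is a simplified model of the two versions of the snippet. It is a paraphrase for illustration only: `linkConfig` and both functions are invented names, and the real linker probes the external linker for flag support rather than reading a boolean.

```go
package main

import "fmt"

// linkConfig captures the inputs the linker snippet branches on.
// Field names are illustrative, not the Go linker's real types.
type linkConfig struct {
	IsELF                bool
	DynlinkingGo         bool
	CShared              bool
	HasExportDynamicSym  bool     // external linker accepts --export-dynamic-symbol
	CgoExportDynamicSyms []string // symbols explicitly exported through CGO
}

// exportFlagsBefore models the old behaviour: always export everything on ELF.
func exportFlagsBefore(c linkConfig) []string {
	if c.IsELF {
		return []string{"-rdynamic"}
	}
	return nil
}

// exportFlagsAfter models the new behaviour: fall back to -rdynamic only
// when required, otherwise export just the CGO-exported symbols.
func exportFlagsAfter(c linkConfig) []string {
	if !c.IsELF {
		return nil
	}
	if c.DynlinkingGo || c.CShared || !c.HasExportDynamicSym {
		return []string{"-rdynamic"}
	}
	var flags []string
	for _, s := range c.CgoExportDynamicSyms {
		flags = append(flags, "-Wl,--export-dynamic-symbol="+s)
	}
	return flags
}

func main() {
	olderLd := linkConfig{IsELF: true, HasExportDynamicSym: false}
	newerLd := linkConfig{IsELF: true, HasExportDynamicSym: true}

	fmt.Println("before, any ld: ", exportFlagsBefore(newerLd))
	fmt.Println("after, older ld:", exportFlagsAfter(olderLd))
	fmt.Println("after, newer ld:", exportFlagsAfter(newerLd))
}
```

In this model, a program with no CGO-exported symbols still gets `-rdynamic` on a system with an older `ld`, but gets no export flags at all on a newer one, which matches the pattern of only newer distros being affected.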
|
|
|
|
|
|
|
|
|
Since `-rdynamic` is no longer always passed in the CGO build process, the fix I landed on was to modify the binding generation in `go-nvml` to [always pass the `--export-dynamic` linker flag](https://github.com/NVIDIA/go-nvml/pull/79). This doesn't break anything if `-rdynamic` is also passed, and it ensures that the required `ld` flag is still present with newer versions of Go.
|
|
|
|
|
|
|
|
|
|
## Conclusion
|
|
|
|
|
|
|
|
|
|
This was a very hard issue to figure out, and was around a week's worth of effort. The solution was 16 characters. This is why it's hard to measure coding productivity by raw output! :)
|
|
|
|
|
|
|
|
|
|
I'm still glad I went through all of it, and glad I went through the process of re-documenting it by writing up this post. Hopefully you got some enjoyment out of my adventure!
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
[cgo_dl_repro]: https://github.com/braydonk/cgo_dl_repro
|
|
|
|
|
[golang github issue]: https://github.com/golang/go/issues/63264
|
|
|
|
|
|