NERRF
Tracker

Implementation Guide

Deep dive into Tracker's architecture, design patterns, and internal implementation

Code Organization

tracker/
├── bpf/
│   └── tracepoints.c          # 81 lines of kernel code
├── cmd/tracker/
│   └── main.go                # 331 lines of Go userspace
├── pkg/
│   ├── bpf/
│   │   └── loader.go          # libbpf Go wrapper
│   └── pb/
│       ├── trace.pb.go        # Generated protobuf code
│       └── trace_grpc.pb.go   # Generated gRPC code
├── proto/
│   └── trace.proto            # 50 lines of service definition
├── scripts/
│   └── install-deps.sh        # One-liner dependency setup
├── Dockerfile.minimal         # 50 MB image
├── Makefile                   # Build orchestration
└── README.md                  # Quick-start

eBPF Layer (tracepoints.c)

Event Struct Definition

struct event {
    __u64 ts;              // Kernel monotonic timestamp (nanoseconds)
    __u32 pid;             // Process ID (from kernel context)
    __u32 tid;             // Thread ID (from kernel context)
    char comm[16];         // Executable name (kernel-provided)
    __u32 syscall_id;      // 1=openat, 2=write, 3=rename
    __s64 ret_val;         // System call return value
    __u64 bytes;           // Bytes written (write syscall only)
    char path[256];        // File path (user-provided arg)
    char new_path[256];    // Destination path (rename only)
};

Memory Layout: 568 bytes per event (the compiler inserts 4 bytes of padding before ret_val so the __s64 is 8-byte aligned); each record fits comfortably in a ring-buffer slot

Ring Buffer Map

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);  // 256 KB
} events SEC(".maps");

Why BPF_MAP_TYPE_RINGBUF?

  • Single reader model (perfect for streaming to one userspace daemon)
  • Atomic reservation without spinlocks
  • Automatic wraparound (no manual buffer management)
  • Memory-efficient: each writer claims a slot atomically

Tracepoint Attachment Pattern

SEC("tracepoint/syscalls/sys_enter_openat")
int trace_openat(struct trace_event_raw_sys_enter *ctx) {
    // 1. Reserve event slot atomically
    struct event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e) return 0;  // Buffer full, drop silently (safe)

    // 2. Fill common fields (PID, TID, timestamp, comm)
    fill_common(e, ctx);
    e->syscall_id = 1;

    // 3. Extract syscall arguments
    const char *filename = (const char *)ctx->args[1];  // arg 1 is filename

    // 4. Copy from user memory (safe boundary check)
    bpf_probe_read_user_str(&e->path, sizeof(e->path), filename);

    // 5. Atomically submit to ring buffer
    bpf_ringbuf_submit(e, 0);
    return 0;
}

Key Design Decisions

1. Tracepoints vs. Kprobes vs. Uprobes

Type        Attach Point          Stability  Overhead  Use Case
Tracepoint  Syscall entry/exit    Stable     Very low  File syscalls
Kprobe      Any kernel function   Unstable   Low       Kernel internals
Uprobe      Userspace function    Unstable   Medium    App instrumentation

We use tracepoints for stability across kernel versions.

2. Event Loss Handling

struct event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
if (!e) return 0;  // Ring buffer full: drop the event
  • Ring buffer full? Drop event (no backpressure to kernel)
  • No blocking: kernel continues unaffected
  • Drops are not counted automatically; add a per-CPU counter map if a drop metric is needed
  • Trade-off: rare drops vs. guaranteed no kernel impact

3. String Extraction Safety

// BPF verifier ensures this is safe
const char *filename = (const char *)ctx->args[1];
bpf_probe_read_user_str(&e->path, sizeof(e->path), filename);
  • ctx->args[1] is a user-space pointer; dereferencing it directly in BPF would fail verification
  • bpf_probe_read_user_str() is a kernel helper that safely reads user memory
  • Null-termination and bounds checking included

4. Timestamp Conversion

e->ts = bpf_ktime_get_ns();  // CLOCK_MONOTONIC in nanoseconds
  • Monotonic: never decreases (important for ordering)
  • Nanosecond precision: sufficient for latency analysis
  • Conversion to wall-clock happens in userspace (via boot time offset)

Userspace Layer (main.go)

Startup & Initialization

func main() {
    // 1. Locate BPF object relative to executable
    execPath, _ := os.Executable()
    objPath := filepath.Join(filepath.Dir(execPath), "../bpf/tracepoints.o")

    // 2. Raise RLIMIT_MEMLOCK (BPF map memory is charged against it on older kernels)
    unix.Setrlimit(unix.RLIMIT_MEMLOCK, &unix.Rlimit{
        Cur: unix.RLIM_INFINITY,
        Max: unix.RLIM_INFINITY,
    })

    // 3. Load and attach eBPF programs (the links keep attachments alive)
    ringBufMap, links, _ := bpf.LoadTracepoints(objPath)
    defer func() {
        for _, l := range links {
            l.Close()
        }
    }()

    // 4. Create ring buffer reader (mmap'd)
    rd, _ := ringbuf.NewReader(ringBufMap)

    // 5. Record boot time (for timestamp conversion)
    var ts unix.Timespec
    unix.ClockGettime(unix.CLOCK_MONOTONIC, &ts)
    bootTime := time.Now().Add(-time.Duration(ts.Sec*1e9+ts.Nsec) * time.Nanosecond)

    // 6. Start gRPC server
    lis, _ := net.Listen("tcp", ":50051")
    s := grpc.NewServer()
    srv := &server{
        rd:       rd,
        bootTime: bootTime,
        clients:  make(map[chan *pb.EventBatch]struct{}),
    }
    pb.RegisterTrackerServer(s, srv)

    // 7. Begin event processing
    go srv.broadcastEvents()
    s.Serve(lis)
}

gRPC Server Implementation

type server struct {
    pb.UnimplementedTrackerServer
    rd       *ringbuf.Reader        // Ring buffer reader (kernel ↔ userspace)
    mu       sync.Mutex             // Protects clients map
    clients  map[chan *pb.EventBatch]struct{}  // Per-client channels
    bootTime time.Time              // For timestamp correction
}

// StreamEvents implements TrackerServer
func (s *server) StreamEvents(
    req *pb.Empty,
    stream pb.Tracker_StreamEventsServer,
) error {
    // Create per-client buffered channel
    clientChan := make(chan *pb.EventBatch, 100)

    s.mu.Lock()
    s.clients[clientChan] = struct{}{}
    s.mu.Unlock()

    defer func() {
        s.mu.Lock()
        delete(s.clients, clientChan)
        s.mu.Unlock()
        close(clientChan)
    }()

    // Stream events until client disconnects
    for {
        select {
        case batch := <-clientChan:
            if err := stream.Send(batch); err != nil {
                return err
            }
        case <-stream.Context().Done():
            return stream.Context().Err()
        }
    }
}

Event Broadcasting Loop

func (s *server) broadcastEvents() {
    for {
        // 1. Read the next record from the mmap'd ring buffer
        record, err := s.rd.Read()
        if err != nil {
            return // reader closed during shutdown
        }

        // 2. Parse binary data → Go struct
        var e event
        binary.Read(bytes.NewReader(record.RawSample), binary.LittleEndian, &e)

        // 3. Convert timestamp: monotonic → wall-clock
        eventTime := s.bootTime.Add(time.Duration(e.Ts) * time.Nanosecond)

        // 4. Create protobuf event
        pbEvent := &pb.Event{
            Ts:      timestamppb.New(eventTime),
            Pid:     e.Pid,
            Tid:     e.Tid,
            Comm:    sanitizeString(e.Comm[:]),
            Syscall: syscallName(e.SyscallId),
            Path:    sanitizeString(e.Path[:]),
            NewPath: sanitizeString(e.NewPath[:]),
            RetVal:  e.RetVal,
            Bytes:   e.Bytes,
        }

        batch := &pb.EventBatch{Events: []*pb.Event{pbEvent}}

        // 5. Broadcast to all connected clients
        s.mu.Lock()
        for ch := range s.clients {
            select {
            case ch <- batch:
                // Sent successfully
            default:
                // Channel full; drop to prevent blocking
                // Slow clients don't block fast clients
            }
        }
        s.mu.Unlock()
    }
}

Concurrent Client Handling

Problem: If one client is slow (network lag), should it block all others?

Solution: Non-blocking channel send with overflow drop

select {
case ch <- batch:
    // Success
default:
    // Client buffer (100 events) is full
    // Drop the event rather than blocking entire broadcast loop
}

Trade-off:

  • Clients with 100 ms+ latency may miss events
  • Fast clients always get events
  • Tracker loop never blocks (no cascading slowdown)
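The pattern above is easy to verify in isolation. A minimal sketch, with a capacity-1 channel standing in for a slow client's buffer:

```go
package main

import "fmt"

// trySend mirrors the broadcast loop's non-blocking send:
// it returns false instead of blocking when the buffer is full.
func trySend(ch chan int, v int) bool {
	select {
	case ch <- v:
		return true
	default:
		return false
	}
}

func main() {
	client := make(chan int, 1) // slow client with a tiny buffer

	fmt.Println(trySend(client, 1)) // → true  (buffer had room)
	fmt.Println(trySend(client, 2)) // → false (buffer full, event dropped)
}
```

Because the default case fires immediately, the broadcast loop's worst case per client is a failed send, never a stall.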

String Sanitization

func sanitizeString(b []byte) string {
    // Remove null terminators (C strings)
    s := strings.TrimRight(string(b), "\x00")

    // Validate UTF-8 (eBPF may have garbage data)
    if !utf8.ValidString(s) {
        s = strings.ToValidUTF8(s, "?")
    }

    return s
}

Why?

  • eBPF reads user-provided pointers (may be invalid)
  • Protobuf requires valid UTF-8
  • Invalid sequences replaced with "?" for debugging
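A quick check of the behavior, reusing the sanitizeString shown above on a zero-padded comm field and on a byte sequence that is not valid UTF-8:

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// sanitizeString as shown above: strip C-string padding, force valid UTF-8.
func sanitizeString(b []byte) string {
	s := strings.TrimRight(string(b), "\x00")
	if !utf8.ValidString(s) {
		s = strings.ToValidUTF8(s, "?")
	}
	return s
}

func main() {
	comm := [8]byte{'b', 'a', 's', 'h'} // zero-padded like comm[16]
	fmt.Println(sanitizeString(comm[:]))          // → bash
	fmt.Println(sanitizeString([]byte("a\xffb"))) // → a?b
}
```

Note that strings.ToValidUTF8 replaces each run of invalid bytes with a single "?", so a burst of garbage collapses to one marker.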

Timestamp Conversion Math

Kernel monotonic:  T = 12345000000 ns  (12.345 sec since boot)
Now (wall clock):  2025-01-15 10:30:45.123456789 UTC
Boot time (calculated):
  = now - T
  = 2025-01-15 10:30:45.123456789 - 12.345 sec
  = 2025-01-15 10:30:32.778456789 UTC

Event timestamp (wall clock):
  = boot_time + event_monotonic
  = 2025-01-15 10:30:32.778 + 0.001 sec
  = 2025-01-15 10:30:32.779 UTC ✓

Protocol & Serialization

Protobuf Message Layout

message Event {
  google.protobuf.Timestamp ts = 1;      // length-delimited (wire type 2)
  uint32 pid = 2;                        // varint
  uint32 tid = 3;                        // varint
  string comm = 4;                       // length-delimited (wire type 2)
  string syscall = 5;                    // length-delimited
  string path = 6;                       // length-delimited
  string new_path = 7;                   // length-delimited
  int64 ret_val = 8;                     // varint (two's complement; sint64 would use zigzag)
  uint64 bytes = 9;                      // varint
  OpenFlags flags = 10;                  // varint (enum)
}

Wire Format Benefits:

  • Variable-length encoding (small numbers → small bytes)
  • Tag-based backward compatibility (new fields don't break old clients)
  • Language-agnostic (C++, Python, Rust, etc.)
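The variable-length property is easy to see with the standard library, whose encoding/binary package uses the same base-128 varint format as protobuf:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// varintLen returns how many bytes a uint64 occupies as a varint.
func varintLen(v uint64) int {
	buf := make([]byte, binary.MaxVarintLen64)
	return binary.PutUvarint(buf, v)
}

func main() {
	fmt.Println(varintLen(5))       // → 1 (a small pid fits in one byte)
	fmt.Println(varintLen(300))     // → 2
	fmt.Println(varintLen(1 << 40)) // → 6
}
```

Since PIDs, TIDs, and byte counts are usually small, most Event fields end up far below their fixed-width size on the wire.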

EventBatch Rationale

message EventBatch {
  repeated Event events = 1;  // 1-100 events per batch
}
  • Streaming overhead: each gRPC message carries its own HTTP/2 framing
  • CPU efficiency: batching 10-100 events amortizes that framing cost
  • Memory: protobuf encodes repeated fields compactly
  • Latency trade-off: ~1-10 ms of batching delay buys ~10x throughput
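A size-based batcher sketch illustrating the trade-off (names are illustrative; a production loop would also flush on a timer so a quiet period never strands a partial batch):

```go
package main

import "fmt"

// batch groups incoming events into slices of at most maxBatch,
// flushing any partial batch when the input channel closes.
func batch(in <-chan int, maxBatch int) [][]int {
	var out [][]int
	cur := make([]int, 0, maxBatch)
	for ev := range in {
		cur = append(cur, ev)
		if len(cur) == maxBatch {
			out = append(out, cur)
			cur = make([]int, 0, maxBatch)
		}
	}
	if len(cur) > 0 {
		out = append(out, cur)
	}
	return out
}

func main() {
	in := make(chan int, 25)
	for i := 0; i < 25; i++ {
		in <- i
	}
	close(in)

	for _, b := range batch(in, 10) {
		fmt.Println(len(b)) // batch sizes: 10, 10, 5
	}
}
```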

Build & Compile

eBPF Compilation

clang -O2 -target bpf -c tracepoints.c -o tracepoints.o

Flags:

  • -O2: Optimize (unroll loops, inline)
  • -target bpf: Compile for eBPF bytecode (not x86)
  • -c: Compile only (no linking)

Output: tracepoints.o (8-12 KB, relocatable)

Go Build

cd tracker
CGO_ENABLED=1 go build -o bin/tracker ./cmd/tracker

Flags:

  • CGO_ENABLED=1: Allow C/libbpf linking
  • libbpf provides ringbuf and bpf_program__attach_tracepoint()

Full Build Command

.PHONY: tracker
tracker: bpf
	mkdir -p ./bin
	CGO_ENABLED=1 go build -ldflags="-s -w" \
		-o ./bin/tracker ./cmd/tracker

.PHONY: bpf
bpf:
	clang -O2 -target bpf -c ./bpf/tracepoints.c \
		-o ./bpf/tracepoints.o

Performance Optimization Techniques

1. Zero-Copy Design

Traditional approach:          eBPF ring buffer approach:
┌─────────────┐                ┌──────────────────┐
│ Kernel mem  │  memcpy()      │ Shared mmap'd    │
├─────────────┤  ──────────→   │ memory region    │
│ User mem    │                ├──────────────────┤
└─────────────┘                │ User reads pages │
                               │ in place         │
                               └──────────────────┘

The ring buffer is memory-mapped into userspace: kernel and consumer share the same pages, so events are read in place rather than paying an extra kernel→user copy.

2. Atomic Reservation Pattern

// Instead of:
// 1. Lock mutex
// 2. Check buffer space
// 3. Write event
// 4. Unlock mutex
// This is slow!

// eBPF does:
e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
// Atomic operation: allocates space OR returns NULL
// No spinlock, no busy-waiting

3. Batch Processing

// Instead of sending each event immediately:
// 1. Read event
// 2. Create protobuf
// 3. Send to 10 clients
// Repeat: high gRPC frame overhead

// eBPF/userspace do:
// 1. Read 10 events
// 2. Create batch protobuf
// 3. Send ONE frame to all clients
// Result: 10x throughput with same CPU

4. Non-Blocking Slowpath

select {
case ch <- batch:
    // Fast path: client ready
    // ~100 ns
default:
    // Slow path: client buffer full
    // Drop event, continue
    // ~100 ns (no blocking!)
}

vs. synchronous sending (would block until client consumes all buffered events):

ch <- batch  // BLOCKS if buffer is full
// Slow clients now block fast processing loop

Testing & Validation

Unit Tests

make test

Tests cover:

  • Protobuf serialization round-trips
  • Syscall name mapping (1→"openat", etc.)
  • String sanitization (null terminators, UTF-8)
  • Timestamp conversion (boot time calculation)
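The syscall-name mapping is trivial to unit-test. A sketch matching the IDs assigned in tracepoints.c (1=openat, 2=write, 3=rename); the function name mirrors the one used in broadcastEvents, but the exact implementation here is an assumption:

```go
package main

import "fmt"

// syscallName maps the numeric IDs written by the BPF programs to
// human-readable names; unknown IDs degrade gracefully instead of panicking.
func syscallName(id uint32) string {
	switch id {
	case 1:
		return "openat"
	case 2:
		return "write"
	case 3:
		return "rename"
	default:
		return fmt.Sprintf("unknown(%d)", id)
	}
}

func main() {
	fmt.Println(syscallName(1)) // → openat
	fmt.Println(syscallName(9)) // → unknown(9)
}
```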

Integration Tests

make e2e
  1. Start tracker with real eBPF
  2. Generate syscalls in a test pod
  3. Verify gRPC events received
  4. Validate syscall counts against strace on the same workload

Load Testing

# Generate 1k write syscalls/sec
stress-ng --iomix 1 --iomix-bytes 1M --timeout 60s &

# Query tracker
grpcurl -plaintext -d '{}' localhost:50051 nerrf.trace.Tracker/StreamEvents \
  | jq '.events | length'  # Should see batches of 10-100 events

Known Limitations & Future Work

Limitation 1: Write Syscall Path Resolution

SEC("tracepoint/syscalls/sys_enter_write")
int trace_write(struct trace_event_raw_sys_enter *ctx) {
    // args[0] = file descriptor (integer)
    // args[1] = buffer pointer (data, not path!)
    // args[2] = count (bytes)

    // Problem: We have FD, but need to resolve FD → inode → path
    // Solution: attach at vfs_write() and use the bpf_d_path() helper
}

Fix in M2:

  • Attach to vfs_write() (kprobe/fentry) instead of the syscall tracepoint
  • There the kernel's struct file for the fd is directly in hand
  • Resolve fd → struct file → path safely (e.g. via the bpf_d_path() helper)

Limitation 2: Context Aggregation

Problem: Related syscalls are scattered

T=1ms   openat("/app/file.dat", O_WRONLY)     → fd=5
T=2ms   write(fd=5, buffer, 1024)              → ???
T=3ms   close(fd=5)                            → ???

Solution in M2:

  • Per-thread state in a BPF_MAP_TYPE_HASH map keyed by tid
  • Track fd → path mapping per-thread
  • Annotate write events with file paths

Limitation 3: Performance at 10k+ evt/sec

Constraint: CPU scaling hits 100% at ~8k evt/sec (4-core VM)

Solutions for scale-out:

  • Multiple tracker instances per node
  • Hash-based client shard (consistent hashing)
  • Event filtering in BPF (drop known-safe patterns)

Debugging Tips

View Ring Buffer Maps

# As root:
bpftool map list
bpftool map show name events   # ringbuf contents aren't dumpable; this shows the map's metadata

Check Attached eBPF Programs

bpftool prog list
# Shows all attached BPF programs, IDs, and hooks

Monitor Ring Buffer Backpressure

# bpftool cannot report ringbuf drops directly; confirm the map exists:
bpftool map show name events
# To count drops, increment a per-CPU counter map whenever
# bpf_ringbuf_reserve() returns NULL, and dump that map here instead.

gRPC Reflection Query

# List all services
grpcurl -plaintext localhost:50051 list

# Show Tracker service methods
grpcurl -plaintext localhost:50051 list nerrf.trace.Tracker

# Get descriptor for Event message
grpcurl -plaintext localhost:50051 describe nerrf.trace.Event

Enable Verbose Logging

# The log-level variable depends on the logging library wired into main.go
TRACKER_LISTEN_ADDR=0.0.0.0:50051 ./bin/tracker

See Also