Implementation Guide
Deep dive into Tracker's architecture, design patterns, and internal implementation
Code Organization
tracker/
├── bpf/
│ └── tracepoints.c # 81 lines of kernel code
├── cmd/tracker/
│ └── main.go # 331 lines of Go userspace
├── pkg/
│ ├── bpf/
│ │ └── loader.go # libbpf Go wrapper
│ └── pb/
│ ├── trace.pb.go # Generated protobuf code
│ └── trace_grpc.pb.go # Generated gRPC code
├── proto/
│ └── trace.proto # 50 lines of service definition
├── scripts/
│ └── install-deps.sh # One-liner dependency setup
├── Dockerfile.minimal # 50 MB image
├── Makefile # Build orchestration
└── README.md # Quick-start guide
eBPF Layer (tracepoints.c)
Event Struct Definition
struct event {
__u64 ts; // Kernel monotonic timestamp (nanoseconds)
__u32 pid; // Process ID (from kernel context)
__u32 tid; // Thread ID (from kernel context)
char comm[16]; // Executable name (kernel-provided)
__u32 syscall_id; // 1=openat, 2=write, 3=rename
__s64 ret_val; // System call return value
__u64 bytes; // Bytes written (write syscall only)
char path[256]; // File path (user-provided arg)
char new_path[256]; // Destination path (rename only)
};
Memory Layout: 568 bytes (including 4 bytes of alignment padding; fits in a ring buffer slot)
Ring Buffer Map
struct {
__uint(type, BPF_MAP_TYPE_RINGBUF);
__uint(max_entries, 256 * 1024); // 256 KB
} events SEC(".maps");
Why BPF_MAP_TYPE_RINGBUF?
- Single reader model (perfect for streaming to one userspace daemon)
- Atomic reservation without spinlocks
- Automatic wraparound (no manual buffer management)
- Memory-efficient: each writer claims a slot atomically
Tracepoint Attachment Pattern
SEC("tracepoint/syscalls/sys_enter_openat")
int trace_openat(struct trace_event_raw_sys_enter *ctx) {
// 1. Reserve event slot atomically
struct event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
if (!e) return 0; // Buffer full, drop silently (safe)
// 2. Fill common fields (PID, TID, timestamp, comm)
fill_common(e, ctx);
e->syscall_id = 1;
// 3. Extract syscall arguments
const char *filename = (const char *)ctx->args[1]; // arg 1 is filename
// 4. Copy from user memory (safe boundary check)
bpf_probe_read_user_str(&e->path, sizeof(e->path), filename);
// 5. Atomically submit to ring buffer
bpf_ringbuf_submit(e, 0);
return 0;
}
Key Design Decisions
1. Tracepoints vs. Kprobes vs. Uprobes
| Type | Attach Point | Stability | Overhead | Use Case |
|---|---|---|---|---|
| Tracepoint | Syscall entry/exit | Stable | Very low | File syscalls |
| Kprobe | Any kernel function | Unstable | Low | Internal kernel functions |
| Uprobe | Userspace function | Unstable | Medium | App instrumentation |
We use tracepoints for stability across kernel versions.
2. Event Loss Handling
struct event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
if (!e) return 0; // Drop on ring buffer full → drop counter available
- Ring buffer full? Drop event (no backpressure to kernel)
- No blocking: kernel continues unaffected
- Metrics tracked by libbpf (observable in /proc)
- Trade-off: rare drops vs. guaranteed no kernel impact
3. String Extraction Safety
// BPF verifier ensures this is safe
const char *filename = (const char *)ctx->args[1];
bpf_probe_read_user_str(&e->path, sizeof(e->path), filename);
- ctx->args[1] is a user-space pointer (dereferencing it directly is unsafe!)
- bpf_probe_read_user_str() is a kernel helper that safely reads user memory
- Null-termination and bounds checking included
4. Timestamp Conversion
e->ts = bpf_ktime_get_ns(); // CLOCK_MONOTONIC in nanoseconds
- Monotonic: never decreases (important for ordering)
- Nanosecond precision: sufficient for latency analysis
- Conversion to wall-clock happens in userspace (via boot time offset)
Userspace Layer (main.go)
Startup & Initialization
func main() {
// 1. Locate BPF object relative to executable
execPath, _ := os.Executable()
objPath := filepath.Join(filepath.Dir(execPath), "../bpf/tracepoints.o")
// 2. Set rlimit for BPF (required for memory locking)
unix.Setrlimit(unix.RLIMIT_MEMLOCK, &unix.Rlimit{
Cur: unix.RLIM_INFINITY,
Max: unix.RLIM_INFINITY,
})
// 3. Load and attach eBPF programs
ringBufMap, links, _ := bpf.LoadTracepoints(objPath)
// 4. Create ring buffer reader (mmap'd)
rd, _ := ringbuf.NewReader(ringBufMap)
// 5. Record boot time (for timestamp conversion)
var ts unix.Timespec
unix.ClockGettime(unix.CLOCK_MONOTONIC, &ts)
bootTime := time.Now().Add(-time.Duration(ts.Sec*1e9 + ts.Nsec) * time.Nanosecond)
// 6. Start gRPC server
s := grpc.NewServer()
srv := &server{
rd: rd,
bootTime: bootTime,
clients: make(map[chan *pb.EventBatch]struct{}),
}
pb.RegisterTrackerServer(s, srv)
// 7. Begin event processing
go srv.broadcastEvents()
s.Serve(lis)
}
gRPC Server Implementation
type server struct {
pb.UnimplementedTrackerServer
rd *ringbuf.Reader // Ring buffer reader (kernel ↔ userspace)
mu sync.Mutex // Protects clients map
clients map[chan *pb.EventBatch]struct{} // Per-client channels
bootTime time.Time // For timestamp correction
}
// StreamEvents implements TrackerServer
func (s *server) StreamEvents(
req *pb.Empty,
stream pb.Tracker_StreamEventsServer,
) error {
// Create per-client buffered channel
clientChan := make(chan *pb.EventBatch, 100)
s.mu.Lock()
s.clients[clientChan] = struct{}{}
s.mu.Unlock()
defer func() {
s.mu.Lock()
delete(s.clients, clientChan)
s.mu.Unlock()
close(clientChan)
}()
// Stream events until client disconnects
for {
select {
case batch := <-clientChan:
stream.Send(batch)
case <-stream.Context().Done():
return stream.Context().Err()
}
}
}
Event Broadcasting Loop
func (s *server) broadcastEvents() {
for {
// 1. Read raw event from ring buffer (mmap'd, zero-copy)
record, _ := s.rd.Read()
// 2. Parse binary data → Go struct
var e event
binary.Read(bytes.NewReader(record.RawSample), binary.LittleEndian, &e)
// 3. Convert timestamp: monotonic → wall-clock
eventTime := s.bootTime.Add(time.Duration(e.Ts) * time.Nanosecond)
// 4. Create protobuf event
pbEvent := &pb.Event{
Ts: timestamppb.New(eventTime),
Pid: e.Pid,
Tid: e.Tid,
Comm: sanitizeString(e.Comm[:]),
Syscall: syscallName(e.SyscallId),
Path: sanitizeString(e.Path[:]),
NewPath: sanitizeString(e.NewPath[:]),
RetVal: e.RetVal,
Bytes: e.Bytes,
}
batch := &pb.EventBatch{Events: []*pb.Event{pbEvent}}
// 5. Broadcast to all connected clients
s.mu.Lock()
for ch := range s.clients {
select {
case ch <- batch:
// Sent successfully
default:
// Channel full; drop to prevent blocking
// Slow clients don't block fast clients
}
}
s.mu.Unlock()
}
}
Concurrent Client Handling
Problem: If one client is slow (network lag), should it block all others?
Solution: Non-blocking channel send with overflow drop
select {
case ch <- batch:
// Success
default:
// Client buffer (100 events) is full
// Drop the event rather than blocking entire broadcast loop
}
Trade-off:
- Clients with 100 ms+ latency may miss events
- Fast clients always get events
- Tracker loop never blocks (no cascading slowdown)
String Sanitization
func sanitizeString(b []byte) string {
// Remove null terminators (C strings)
s := strings.TrimRight(string(b), "\x00")
// Validate UTF-8 (eBPF may have garbage data)
if !utf8.ValidString(s) {
s = strings.ToValidUTF8(s, "?")
}
return s
}
Why?
- eBPF reads user-provided pointers (may be invalid)
- Protobuf requires valid UTF-8
- Invalid sequences replaced with "?" for debugging
Timestamp Conversion Math
Kernel monotonic: T = 12345000000 ns (12.345 sec since boot)
Now (wall clock): 2025-01-15 10:30:45.123456789 UTC
Boot time (calculated):
= now - T
= 2025-01-15 10:30:45.123456789 - 12.345 sec
= 2025-01-15 10:30:32.778456789 UTC
Event timestamp (wall clock):
= boot_time + event_monotonic
= 2025-01-15 10:30:32.778 + 0.001 sec
= 2025-01-15 10:30:32.779 UTC ✓
Protocol & Serialization
Protobuf Message Layout
message Event {
google.protobuf.Timestamp ts = 1; // [0]: length-delimited message (wire type 2)
uint32 pid = 2; // [1]: varint
uint32 tid = 3; // [2]: varint
string comm = 4; // [3]: string (wire type 2)
string syscall = 5; // [4]: string
string path = 6; // [5]: string
string new_path = 7; // [6]: string
int64 ret_val = 8; // [7]: varint (plain int64; sint64 would use zigzag)
uint64 bytes = 9; // [8]: varint
OpenFlags flags = 10; // [9]: enum (varint)
}
Wire Format Benefits:
- Variable-length encoding (small numbers → small bytes)
- Tag-based backward compatibility (new fields don't break old clients)
- Language-agnostic (C++, Python, Rust, etc.)
EventBatch Rationale
message EventBatch {
repeated Event events = 1; // 1-100 events per batch
}
- Streaming overhead: gRPC frames per message
- CPU efficiency: Batch 10-100 events per frame
- Memory: Protobuf optimizes repeated fields
- Latency trade-off: ~1-10 ms batching overhead for ~10x throughput
Build & Compile
eBPF Compilation
clang -O2 -target bpf -c tracepoints.c -o tracepoints.o
Flags:
- -O2: Optimize (unroll loops, inline)
- -target bpf: Compile for eBPF bytecode (not x86)
- -c: Compile only (no linking)
Output: tracepoints.o (8-12 KB, relocatable)
Go Build
cd tracker
CGO_ENABLED=1 go build -o bin/tracker ./cmd/tracker
Flags:
- CGO_ENABLED=1: Allow C/libbpf linking
- libbpf provides ringbuf and bpf_program__attach_tracepoint()
Full Build Command
.PHONY: tracker
tracker: bpf
CGO_ENABLED=1 go build -ldflags="-s -w" \
-o ./bin/tracker ./cmd/tracker/main.go
.PHONY: bpf
bpf:
mkdir -p ./bpf
clang -O2 -target bpf -c ./bpf/tracepoints.c \
-o ./bpf/tracepoints.o
Performance Optimization Techniques
1. Zero-Copy Design
Traditional approach: eBPF ring buffer approach:
┌─────────────┐ ┌──────────────────┐
│ Kernel mem │ memcpy() │ Shared mmap'd │
├─────────────┤ ──────────→ │ memory region │
│ User mem │ ├──────────────────┤
└─────────────┘ │ User reads mmap'd │
│ (no copy!) │
└──────────────────┘
Ring buffer uses memory-mapped I/O: kernel and user share the same virtual memory pages. No data copy needed.
2. Atomic Reservation Pattern
// Instead of:
// 1. Lock mutex
// 2. Check buffer space
// 3. Write event
// 4. Unlock mutex
// This is slow!
// eBPF does:
e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
// Atomic operation: allocates space OR returns NULL
// No spinlock, no busy-waiting
3. Batch Processing
// Instead of sending each event immediately:
// 1. Read event
// 2. Create protobuf
// 3. Send to 10 clients
// Repeat: high gRPC frame overhead
// eBPF/userspace do:
// 1. Read 10 events
// 2. Create batch protobuf
// 3. Send ONE frame to all clients
// Result: 10x throughput with same CPU
4. Non-Blocking Slowpath
select {
case ch <- batch:
// Fast path: client ready
// ~100 ns
default:
// Slow path: client buffer full
// Drop event, continue
// ~100 ns (no blocking!)
}
vs. synchronous sending (would block until the client consumes its buffered events):
ch <- batch // BLOCKS if buffer is full
// Slow clients now block fast processing loop
Testing & Validation
Unit Tests
make test
Tests cover:
- Protobuf serialization round-trips
- Syscall name mapping (1→"openat", etc.)
- String sanitization (null terminators, UTF-8)
- Timestamp conversion (boot time calculation)
Integration Tests
make e2e
- Start tracker with real eBPF
- Generate syscalls in a test pod
- Verify gRPC events received
- Validate syscall count matches iptrace
Load Testing
# Generate 1k write syscalls/sec
stress-ng --iomix 1 --iomix-bytes 1M --timeout 60s &
# Query tracker
grpcurl -plaintext -d '{}' localhost:50051 nerrf.trace.Tracker/StreamEvents \
| jq '.events | length' # Should see batches of 10-100 events
Known Limitations & Future Work
Limitation 1: Write Syscall Path Resolution
SEC("tracepoint/syscalls/sys_enter_write")
int trace_write(struct trace_event_raw_sys_enter *ctx) {
// args[0] = file descriptor (integer)
// args[1] = buffer pointer (data, not path!)
// args[2] = count (bytes)
// Problem: We have FD, but need to resolve FD → inode → path
// Solution: Use kprobe(vfs_write) + BPF helper bpf_get_file_path()
}
Fix in M2:
- Use kprobes on vfs_write() instead of the tracepoint
- Kprobes have access to kernel structures (file descriptor table)
- Resolve fd → inode → path safely
Limitation 2: Context Aggregation
Problem: Related syscalls are scattered
T=1ms openat("/app/file.dat", O_WRONLY) → fd=5
T=2ms write(fd=5, buffer, 1024) → ???
T=3ms close(fd=5) → ???
Solution in M2:
- Thread-local storage (BPF_MAP_TYPE_HASH)
- Track fd → path mapping per-thread
- Annotate write events with file paths
Limitation 3: Performance at 10k+ evt/sec
Constraint: CPU scaling hits 100% at ~8k evt/sec (4-core VM)
Solutions for scale-out:
- Multiple tracker instances per node
- Hash-based client shard (consistent hashing)
- Event filtering in BPF (drop known-safe patterns)
Debugging Tips
View Ring Buffer Maps
# As root:
bpftool map list
bpftool map dump name events
Check Attached eBPF Programs
bpftool prog list
# Shows all attached BPF programs, IDs, and hooks
Monitor Ring Buffer Backpressure
# Events dropped due to buffer full:
bpftool map dump name events | grep lost
gRPC Reflection Query
# List all services
grpcurl -plaintext localhost:50051 list
# Show TrackerServer methods
grpcurl -plaintext localhost:50051 describe nerrf.trace.Tracker
# Get descriptor for Event message
grpcurl -plaintext localhost:50051 describe nerrf.trace.Event
Enable Verbose Logging
RUST_LOG=debug TRACKER_LISTEN_ADDR=0.0.0.0:50051 ./bin/tracker