Tracker Node — Person Tracking with DeepSORT + ReID

Source file: vision/packages/vision_general/scripts/tracker_node.py Base algorithm: Deep SORT — Simple Online and Realtime Tracking with a Deep Association Metric

Overview

tracker_node.py is a ROS2 node that tracks a single target person across frames using a combination of:

YOLOv8n — person detection (class 0)
DeepSORT — multi-person tracking with persistent IDs via Kalman filtering + cosine-distance association
ReID model (Swin Transformer) — appearance-based re-identification when the tracked person's ID is lost

The node publishes the tracked person's 3D world coordinates and a normalized 2D centroid for downstream use (e.g., robot navigation/following).

Architecture

Camera (RGB + Depth)
        │
        ▼
   YOLOv8n (TRT)          ← detects all people (class 0)
        │
        ▼
   DeepSORT Tracker        ← assigns consistent track IDs across frames
   (Kalman + cosine dist)
        │
        ▼
   ReID Feature Extractor  ← Swin Transformer, FP16 on Orin AGX
   (per-person embeddings)
        │
        ├─ [Track ID matches] → publish 3D point + centroid
        │
        └─ [Track ID lost]   → Re-ID: compare new detections against
                                stored embeddings (angle-aware)

How It Works

1. Target Selection (`set_target`)

When a tracking service is called, the node runs YOLO + DeepSORT for N_INIT + 1 frames to let tracks confirm, then selects a target by one of four modes:

Mode	Service	Description
`largest_person`	`set_tracking_target` (SetBool)	Picks the person with the largest bounding box area
`gestures`	`set_tracking_target_by` (TrackBy)	Uses `PoseDetection.detectGesture()` (e.g., `waving`)
`poses`	`set_tracking_target_by` (TrackBy)	Queries Moondream VLM: standing / sitting / lying down
`color`	`set_tracking_target_by` (TrackBy)	Queries Moondream VLM: clothing color match

Moondream queries are sent as crop-bounded prompts via the CropQuery service.

2. Tracking Loop (`run`, 10 Hz)

Each tick: - Runs YOLO + DeepSORT on the latest frame - Looks for the target's track_id among confirmed tracks - If found: updates coordinates, extracts a new embedding at REID_EXTRACT_FREQ = 0.3 Hz, stores angle-specific embeddings (forward/backward/left/right) - Publishes 3D position and normalized 2D centroid

3. Re-Identification

When the target's track_id is no longer in any confirmed track: - Extracts ReID embeddings for all currently visible people (batch) - Determines each person's viewing angle via PoseDetection.personAngle() - If angle is known: compares against the angle-specific stored embedding (cosine threshold = 0.7) - If angle is unknown: compares against the full embedding bank (up to MAX_EMBEDDINGS = 128) - On match: resets person_data and assigns the new track_id as the target

Utility Modules Used

`vision_general.utils.deep_sort` (from nwojke/deep_sort)

Module	Used for
`deep_sort.tracker.Tracker`	Multi-target tracker (Kalman filter + Hungarian assignment)
`deep_sort.detection.Detection`	Wraps a bounding box + confidence + appearance feature
`deep_sort.nn_matching.NearestNeighborDistanceMetric`	Cosine-distance metric for track-to-detection association
`deep_sort.kalman_filter`	(internal) State estimation for track positions
`deep_sort.linear_assignment`	(internal) Hungarian algorithm for assignment
`deep_sort.iou_matching`	(internal) IoU-based gating
`deep_sort.track`	(internal) Track state machine (Tentative → Confirmed → Deleted)

`vision_general.utils.reid_model`

Function	Used for
`get_structure()`	Returns the ReID model architecture (Swin Transformer by default)
`load_network(structure)`	Loads pretrained weights from `models/swin/`
`extract_feature_from_img(pil_img, model)`	Extracts a single appearance embedding
`extract_feature_from_img_batch(img_list, model)`	Batch embedding extraction (used during re-ID)
`compare_images(emb1, emb2, threshold)`	Cosine similarity check between two embeddings
`compare_images_batch(emb, bank, threshold)`	Checks if an embedding exists in the stored bank

The ReID model is a Swin Transformer (ft_net_swin) trained on Market-1501 (751 person IDs). It runs in FP16 on NVIDIA Orin AGX (capability ≥ 7.0) and falls back to FP32 elsewhere. The classifier head is removed (nn.Sequential()) so the output is the raw 512-dim feature vector.

`vision_general.utils.calculations`

Function	Used for
`get2DCentroid(bbox, depth_img)`	Gets the 2D centroid pixel of the tracked person
`get_depth(depth_img, point2D)`	Reads the depth value at a pixel
`deproject_pixel_to_point(camera_info, point2D, depth)`	Back-projects pixel + depth → 3D camera-frame point

`vision_general.utils.trt_utils`

Function	Used for
`load_yolo_trt(model_name)`	Loads YOLOv8n with optional TensorRT acceleration (Orin AGX)

`pose_detection.PoseDetection`

Method	Used for
`detectGesture(img)`	Returns gesture enum (e.g., waving)
`personAngle(img)`	Returns viewing angle: `"forward"`, `"backward"`, `"left"`, `"right"`, or `None`
`_get_keypoints(img)`	Raw keypoint extraction
`is_waving_from_keypoints(pts, kpc)`	Special-case waving detection for `wavingCustomer`

ROS2 Interface

Subscriptions

Topic (constant)	Type	Purpose
`CAMERA_TOPIC`	`sensor_msgs/Image`	RGB camera feed
`DEPTH_IMAGE_TOPIC`	`sensor_msgs/Image`	Depth image (32FC1)
`CAMERA_INFO_TOPIC`	`sensor_msgs/CameraInfo`	Camera intrinsics for deprojection

Publishers

Topic (constant)	Type	Purpose
`RESULTS_TOPIC`	`geometry_msgs/PointStamped`	3D world coordinates of tracked person
`TRACKER_IMAGE_TOPIC`	`sensor_msgs/Image`	Annotated debug image
`CENTROID_TOIC`	`geometry_msgs/Point`	Normalized x-centroid (range −1 to +1)

Services (server)

Topic (constant)	Type	Purpose
`SET_TARGET_TOPIC`	`std_srvs/SetBool`	Enable/disable tracking (largest person default)
`SET_TARGET_BY_TOPIC`	`frida_interfaces/TrackBy`	Enable tracking by gesture, pose, or color
`IS_TRACKING_TOPIC`	`std_srvs/Trigger`	Query current tracking status

Service Clients

Topic (constant)	Type	Purpose
`CROP_QUERY_TOPIC`	`frida_interfaces/CropQuery`	VLM crop query via Moondream node

Key Configuration Parameters

Parameter	Value	Description
`CONF_THRESHOLD`	`0.6`	Minimum YOLO confidence to create a detection
`DEEPSORT_MAX_COSINE_DISTANCE`	`0.3`	Max cosine distance for DeepSORT association
`DEEPSORT_NN_BUDGET`	`100`	Max stored appearance samples per track
`DEEPSORT_MAX_AGE`	`100`	Frames before an unmatched track is deleted
`DEEPSORT_N_INIT`	`3`	Frames of consecutive detections to confirm a track
`REID_EXTRACT_FREQ`	`0.3` Hz	How often to extract and store new embeddings
`MAX_EMBEDDINGS`	`128`	Max unique embeddings stored per target
`DEPTH_THRESHOLD`	`100` ns	Timestamp tolerance to sync depth + RGB frames

Running the Node

# Terminal 1 — tracker
ros2 run vision_general tracker_node

# Terminal 2 — set target (largest person)
ros2 service call /vision/set_tracking_target std_srvs/srv/SetBool "{data: true}"

# Terminal 2 — set target by gesture
ros2 service call /vision/set_tracking_target_by frida_interfaces/srv/TrackBy \
    "{track_enabled: true, track_by: 'gestures', value: 'waving'}"

# Terminal 2 — set target by clothing color (requires Moondream)
ros2 service call /vision/set_tracking_target_by frida_interfaces/srv/TrackBy \
    "{track_enabled: true, track_by: 'color', value: 'red shirt'}"

# Terminal 3 (optional, required for poses/color) — Moondream VLM
ros2 run vision_general moondream_node

# Terminal 3 (Orin only) — ZED camera
./run.sh zed

Notes

The node uses a MultiThreadedExecutor with 4 threads; image subscription and moondream calls run in separate ReentrantCallbackGroups to avoid blocking.
Depth synchronization uses nanosecond timestamps; if depth and RGB are more than DEPTH_THRESHOLD ns apart, the 3D publish is skipped with a warning.
The frame reference is hardcoded to "zed_left_camera_optical_frame".
During re-ID, all visible people are processed in a single batch inference pass for efficiency.