Moondream
Moondream is an open-source visual language model that understands images using simple text prompts.
We integrated Moondream along with YOLOv8 and gRPC to power robot vision capabilities like posture analysis, object detection, and natural language querying.
Architecture
Features
- Detect person and crop the region with largest area (via YOLOv8)
- Classify person posture (standing, sitting, lying)
- Locate beverages (left, center, right)
- Natural language query over full image or crop
- Serve MoonDream model via gRPC for scalable vision requests
ROS Node
Node Name: MoondreamNode
Topic Subscribed:
- CAMERA_TOPIC (sensor_msgs/msg/Image)
Services Provided:
Service Name | Description |
---|---|
/beverage_location | Locate a beverage (returns LEFT, CENTER, RIGHT, or NOT_FOUND) |
/person_posture | Detects posture of a person in the frame |
/query | Runs a natural language query on the image |
/crop_query | Describes a cropped region based on bounding box |
Core Methods
- image_callback: Updates internal image on new frame
- query_callback: Runs general-purpose query
- crop_query_callback: Describes cropped regions
- beverage_location_callback: Finds beverage position
- person_posture_callback: Uses MoonDream to describe posture
- detect_and_crop_person: Extracts largest detected person via YOLOv8
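The "largest detected person" step can be illustrated without YOLOv8 itself. The following is a minimal sketch, assuming the detector yields boxes as `(x1, y1, x2, y2)` pixel tuples and that the image is a row-major list of rows; the names `box_area`, `largest_box`, and `crop` are hypothetical and are not taken from the node's source.

```python
from typing import List, Optional, Sequence, Tuple

# Assumed box layout: (x1, y1, x2, y2) in pixels, top-left and bottom-right corners.
Box = Tuple[int, int, int, int]


def box_area(box: Box) -> int:
    """Area of an axis-aligned box; degenerate boxes count as zero."""
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)


def largest_box(boxes: Sequence[Box]) -> Optional[Box]:
    """Return the box with the largest area, or None when nothing was detected."""
    if not boxes:
        return None
    return max(boxes, key=box_area)


def crop(image: List[List[int]], box: Box) -> List[List[int]]:
    """Crop a row-major image (list of rows) to the box, clamped to the image bounds."""
    height = len(image)
    width = len(image[0]) if height else 0
    x1, y1, x2, y2 = box
    x1, x2 = max(0, x1), min(width, x2)
    y1, y2 = max(0, y1), min(height, y2)
    return [row[x1:x2] for row in image[y1:y2]]
```

In a real node the image would more likely be a NumPy array, where the same crop is a single slice (`image[y1:y2, x1:x2]`); plain lists are used here only to keep the sketch free of dependencies.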
gRPC Server
Entry Point: serve()
Starts a gRPC server on port 50052, providing:
- EncodeImage
- FindBeverage
- Query
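import grpc
from concurrent import futures

import moondream_proto_pb2_grpc

def serve(**kwargs):
    # Thread pool that handles incoming RPCs concurrently.
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    # MoonDreamModel and MoonDreamServicer are defined elsewhere in the project.
    md_model = MoonDreamModel(**kwargs)
    moondream_proto_pb2_grpc.add_MoonDreamServiceServicer_to_server(
        MoonDreamServicer(md_model), server
    )
    server.add_insecure_port("[::]:50052")
    server.start()
    # Block so the process keeps serving until it is interrupted.
    server.wait_for_termination()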
Command-line Arguments:
- --model_name: Model name on Hugging Face
- --revision: Revision tag for model version
- --device_map: e.g., "jetson", "cpu", "cuda"
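One way these flags could be turned into the keyword arguments passed to `serve(**kwargs)` is sketched below with `argparse`. The flag names come from the list above; the default values, the help strings, and the helper names `build_parser` and `parse_serve_kwargs` are assumptions made for illustration, not the project's actual entry point.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the three documented flags (defaults are assumed)."""
    parser = argparse.ArgumentParser(description="Moondream gRPC server (sketch)")
    parser.add_argument(
        "--model_name",
        default="vikhyatk/moondream2",  # assumed default: the model named in this document
        help="Model name on Hugging Face",
    )
    parser.add_argument(
        "--revision",
        default=None,  # assumed: no pinned revision unless one is given
        help="Revision tag for model version",
    )
    parser.add_argument(
        "--device_map",
        default="cpu",  # assumed default; the document lists jetson, cpu, cuda as examples
        help='Target device, e.g. "jetson", "cpu", "cuda"',
    )
    return parser


def parse_serve_kwargs(argv: Optional[List[str]] = None) -> dict:
    """Parse flags into a dict suitable for unpacking as serve(**kwargs)."""
    return vars(build_parser().parse_args(argv))
```

With this sketch, `serve(**parse_serve_kwargs())` would forward the parsed values to `MoonDreamModel`, provided its constructor accepts keyword arguments with these names.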
MoonDreamModel
This is the interface class to the underlying multimodal model (vikhyatk/moondream2).
Methods
- Compresses and encodes an image to be sent via gRPC.
- Returns the beverage position: "left", "center", "right", or "not found".
- Executes a natural language query on the visual input.
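The model reports the beverage position in lowercase ("left", "center", "right", "not found"), while the /beverage_location service returns LEFT, CENTER, RIGHT, or NOT_FOUND. How the node converts between the two is not shown here; the following is a minimal sketch of one possible mapping, with the hypothetical helper name `to_service_label`, in which anything unrecognized is treated as NOT_FOUND.

```python
def to_service_label(model_answer: str) -> str:
    """Map a model position string to LEFT, CENTER, RIGHT, or NOT_FOUND."""
    labels = {"left": "LEFT", "center": "CENTER", "right": "RIGHT"}
    # Tolerate stray whitespace and capitalization in the model's answer.
    cleaned = model_answer.strip().lower()
    # "not found" and any unexpected text both fall through to NOT_FOUND.
    return labels.get(cleaned, "NOT_FOUND")
```

Falling back to NOT_FOUND keeps the service response within its four documented values even when the model answers with free-form text.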
Protobuf API
Defined in moondream_proto.proto
syntax = "proto3";

service MoonDreamService {
  rpc EncodeImage (ImageRequest) returns (EncodedImageResponse);
  rpc FindBeverage (FindBeverageRequest) returns (BeveragePositionResponse);
  rpc Query (QueryRequest) returns (QueryResponse);
}

message ImageRequest {
  bytes image_data = 1;
}

message EncodedImageResponse {
  bytes encoded_image = 1;
}

message FindBeverageRequest {
  bytes encoded_image = 1;
  string subject = 2;
}

message BeveragePositionResponse {
  string position = 1;
}

message QueryRequest {
  bytes encoded_image = 1;
  string query = 2;
}

message QueryResponse {
  string answer = 1;
}
Testing Example
Run the gRPC test client:
Expected output: