An End-to-End Video-Based System for Street Food Safety Inspection Using Vision-Language Models and Multi-Source Truth Discovery

SSA ACT branch talk

Patrick Li

RSFAS, ANU

About me

Hi! I am Patrick Li.

I completed my PhD in Statistics at the Department of Econometrics and Business Statistics, Monash University.
I am a postdoctoral researcher at ANU contributing to the Analytics for the Australian Grains Industry (AAGI) project, where my work centres on machine learning, image analytics, and plant phenotyping.

Acknowledgement

Co-authors

A/Prof Klaus Ackermann

Department of Econometrics and Business Statistics, Monash University, Melbourne, Australia

A/Prof Denni Tommasi

Department of Economics, University of Bologna, Bologna, Italy

Prior study contributors

Dr Alexey Chernikov
A/Prof Caitlin Brown

Prior study publications

Chernikov, A., Ackermann, K., Brown, C., & Tommasi, D. (2025, April). Leveraging computer vision and visual LLMs for cost-effective and consistent street food safety assessment in Kolkata India. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 39, No. 27, pp. 27914-27921).

Brown, C., & Tommasi, D. (2025). Quality Upgrading in the Street Food Market: Is Better Equipment and Training Sufficient?

Outline of the talk

Problem setup: street food safety inspection
Representative frame selection
Visual large language model (VLLM) assessment
Multi-source annotation + truth discovery

Street food safety inspection

There are over 600M cases of foodborne illnesses and 420K deaths annually from contaminated food (World Health Organization 2020).

Street food is a major source of employment and an important food supply in many developing countries, but it is also often associated with significant food safety concerns.

Street food safety inspection

Street food safety inspections are primarily conducted through human surveys, which are often:

Resource-intensive

requiring substantial time, travel effort, and personnel training
involving extensive field visits across large and dispersed vendor populations

Limited in coverage

remote or informal vendors may be excluded from routine monitoring
repeated inspections are often necessary to achieve adequate reliability and spatial coverage

Prone to bias

influenced by prior familiarity with vendors
shaped by previous purchasing or inspection experiences
subject to subjective interpretation of hygiene and safety conditions

Field study in Kolkata, India

In 2024–2025, an FAO-funded field study was conducted in Kolkata, India, following the survey protocol of Brown and Tommasi (2025).

244 street food vendors surveyed (out of 328 approached)
- Each vendor was visited 1–7 times on different dates
~900 videos collected
~6,000 images extracted across the different facility classes (e.g. hand washing stations)
Data gathered by 3 highly trained local surveyors/officers using standardized food safety inspection procedures
Surveyors assessed hygiene and infrastructure conditions using structured yes / no / unknown questionnaires

Sample video

A video collected on 22 July 2024

Sample images extracted from videos (6 classes)

Questionnaires (25 questions)

Category	Description
HW	Is the hand washing facility at least 1 meter above ground level?
HW	Does the hand washing facility have a lid?
HW	Is there soap available and maximum an arm’s length away from the tap?
HW	Is water after washing hands collected in some container? That is, it does NOT go on the ground and leaves a puddle.
WP	Is the water storage tank covered by a lid?
WP	Is the water storage tank cracked or does the tank have holes?
DW	Is the dish-washing station at least 1 meter above ground level? This is a station for washing plates, pots and cutlery.
DW	Are the dirty plates, pots, cutlery waiting to be washed on the ground or floor?
DW	Are the dirty plates, pots, cutlery waiting to be washed in a container protected from the ground?
DW	Do the buckets containing water have smooth surfaces?
DW	Is the ground around the dish-washing station free of debris (rest of food, other waste)?
DW	Is there more than one water bucket around the dish-washing area (e.g. one for cleaning and one for rinsing)?
DW	Is there soap or detergent available and maximum an arm’s length away from the station?
GB	Are the garbage bins made of hard material? E.g. metal, hard plastic
GB	Do the garbage bins have a smooth top area?
GB	Are there animals/insects in or around the garbage bins?
GB	Does the area around the garbage bins have standing water?
GB	Are birds, insects, rodents or other animals present at the stall?
CT	Is there a food preparation area? That is, a dedicated space for preparing food?
CT	Is the top area (e.g. counter top) for food preparation waterproof?
CT	Is the top area for food preparation (e.g. a counter top) cracked or does it have holes?
CT	Can the whole top area for food preparation (e.g. a counter top) be easily accessed for cleaning and drying?
DS	Is the prepared food displayed protected from direct exposure to sun/rain?
DS	Is the stall where food is prepared and sold positioned under a roof?
DS	Is the cooked or raw food out of the sun?

Representative frame selection

A video can be represented as a sequence of frames (images) F_t, for t = 1, \dots, T.

While humans can answer questions by watching the full video, computer vision models often benefit from selecting a smaller set of representative frames for efficient question answering.

The pipeline consists of six stages:

Candidate frame selection
Sharpness filtering
Deduplication
Object detection
Feature extraction
K-means clustering

Step 1: Candidate frame selection

For each video frame F_t, compute the average optical flow magnitude

M(F_t)=\frac{1}{WH}\sum_{x=1}^{W}\sum_{y=1}^{H} \sqrt{u(x,y)^2+v(x,y)^2},

where W and H are the frame width and height, and (u,v) is the optical flow estimated between consecutive frames F_{t-1} and F_t using Gunnar Farnebäck's dense optical flow algorithm.

A frame is selected if M(F_t) < \tau_{\text{movement}}.

Interpretation

Optical flow is a 2D vector field that describes the apparent motion of pixels between consecutive frames.
M(F_t) summarizes the overall amount of motion in the frame.
Frames with low motion (more stable scenes and less blur) are retained as candidate frames.
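
The selection rule can be sketched in a few lines. This is only an illustration under assumptions: the flow fields are toy 2 × 2 grids of (u, v) pairs written by hand (in practice they might come from a dense optical flow routine such as OpenCV's `cv2.calcOpticalFlowFarneback`), and the threshold value 1.0 is made up.

```python
import math

def mean_flow_magnitude(flow):
    """Average optical-flow magnitude M(F_t) over an H x W grid of (u, v) pairs."""
    height = len(flow)
    width = len(flow[0])
    total = sum(math.hypot(u, v) for row in flow for (u, v) in row)
    return total / (width * height)

def select_candidates(flows, tau_movement):
    """Return indices of frames whose mean motion is strictly below the threshold."""
    return [t for t, flow in enumerate(flows)
            if mean_flow_magnitude(flow) < tau_movement]

# Toy flow fields (hypothetical values): one nearly static, one with strong motion.
static_flow = [[(0.0, 0.0), (0.3, 0.4)], [(0.0, 0.0), (0.0, 0.0)]]
moving_flow = [[(3.0, 4.0), (3.0, 4.0)], [(3.0, 4.0), (3.0, 4.0)]]

candidates = select_candidates([static_flow, moving_flow], tau_movement=1.0)
print(candidates)
```

Only the first toy frame passes, since its mean magnitude is well under the illustrative threshold while the second frame moves by 5 pixels everywhere.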

Step 2: Sharpness filtering

We compute sharpness using the variance of the Laplacian response:

S(F_t) = \text{Var}(\nabla^2 F_t) = \frac{1}{WH}\sum_{x=1}^{W}\sum_{y=1}^{H}(L(x,y)-\bar{L})^2,

where L(x,y) = \frac{\partial^2 F_t}{\partial x^2} + \frac{\partial^2 F_t}{\partial y^2} is the Laplacian response at pixel (x,y), and \bar{L} is its mean over the frame.

A frame is selected if S(F_t) \ge \tau_{\text{sharp}}.

Interpretation

Sharpness is measured via the variance of the Laplacian, which captures how strongly intensity changes across the image.
Higher variance indicates stronger edges and more texture.
Low variance corresponds to blurred frames with weak edge structure.
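
A rough sketch of the sharpness score, assuming a grayscale image stored as nested lists and a 4-neighbour discrete Laplacian. As a simplification, the variance is taken over interior pixels only, so border handling is avoided; the formula above averages over all WH pixels. With OpenCV, a common equivalent is `cv2.Laplacian(gray, cv2.CV_64F).var()`.

```python
def laplacian_variance(image):
    """Variance of a 4-neighbour discrete Laplacian over interior pixels."""
    height, width = len(image), len(image[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            lap = (image[y - 1][x] + image[y + 1][x]
                   + image[y][x - 1] + image[y][x + 1]
                   - 4 * image[y][x])
            responses.append(lap)
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)

# Toy images (hypothetical): a featureless patch and a high-contrast checkerboard.
flat = [[10] * 5 for _ in range(5)]
checker = [[255 * ((x + y) % 2) for x in range(5)] for y in range(5)]

sharp_flat = laplacian_variance(flat)
sharp_checker = laplacian_variance(checker)
print(sharp_flat, sharp_checker)
```

The flat patch scores zero, while the checkerboard, which is all edges, scores far higher; a blurred frame sits closer to the flat case.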

Step 3: Deduplication

Each frame F_t is encoded using perceptual hashing:

h_t = \text{pHash}(F_t) \in \{0,1\}^{64}

The similarity between frame i and j is measured via Hamming distance:

d_{ij} = \sum_{k=1}^{64} \mathbf{1}(h_{ik} \ne h_{jk})

A frame F_j is removed if there exists a previous frame F_{i} such that d_{ij} \le \delta_{\text{dup}}.

Interpretation

pHash maps each frame into a compact binary fingerprint that preserves perceptual content.
Hamming distance counts the number of differing bits between two hashes.
Small distance means two frames are visually very similar and near-duplicates.
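
A minimal sketch of the deduplication rule, assuming the 64-bit hashes are already available as Python integers (the pHash computation itself is not shown). One assumption is made explicit in the code: each frame is compared against previously kept frames only.

```python
def hamming_distance(hash_a, hash_b):
    """Number of differing bits between two 64-bit hashes stored as ints."""
    return bin(hash_a ^ hash_b).count("1")

def deduplicate(hashes, delta_dup):
    """Keep a frame only if every previously kept frame differs by more than delta_dup bits."""
    kept = []
    for index, current in enumerate(hashes):
        # Assumption: comparison is against kept frames, not all earlier frames.
        if all(hamming_distance(current, hashes[k]) > delta_dup for k in kept):
            kept.append(index)
    return kept

# Hypothetical hashes: frames 0/1 and frames 2/3 are near-duplicates of each other.
hashes = [0x0, 0x1, 0xFFFF, 0xFFFE]
kept_frames = deduplicate(hashes, delta_dup=2)
print(kept_frames)
```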

Example frames after Steps 1–3

Step 4: Object detection

For each selected frame F_t, a pre-trained YOLOv10 model is applied to detect and localize objects:

\{(c_k, p_k, \mathbf{b}_k)\}_{k=1}^{K},

where c_k is the class label, p_k \in [0,1] is the confidence score, and \mathbf{b}_k = [x, y, w, h] denotes the bounding box (center coordinates and size). Here, K is the total number of detected objects in the frame.

An object is selected if p_k \ge \tau_{\text{od}}.

For each retained detection, an object crop C_k is extracted from the frame for further processing.
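
The filtering and cropping step might look like the sketch below. It assumes boxes are given in pixel units in the centre format [x, y, w, h] described above, and that crops are clipped to the frame; the detections, threshold, and frame size are hypothetical.

```python
def crop_box(box, frame_width, frame_height):
    """Convert a centre-format box [x, y, w, h] to clipped integer corners."""
    x, y, w, h = box
    left = max(0, int(round(x - w / 2)))
    top = max(0, int(round(y - h / 2)))
    right = min(frame_width, int(round(x + w / 2)))
    bottom = min(frame_height, int(round(y + h / 2)))
    return left, top, right, bottom

def retained_crops(detections, tau_od, frame_width, frame_height):
    """Keep detections with confidence >= tau_od and return (class, crop corners)."""
    return [(label, crop_box(box, frame_width, frame_height))
            for (label, score, box) in detections
            if score >= tau_od]

# Hypothetical detections as (class label, confidence, [x, y, w, h]).
detections = [
    ("GB", 0.91, [100, 80, 40, 60]),
    ("HW", 0.30, [20, 20, 10, 10]),    # below threshold, dropped
    ("CT", 0.75, [630, 470, 40, 40]),  # partly outside the frame, clipped
]
crops = retained_crops(detections, tau_od=0.5, frame_width=640, frame_height=480)
print(crops)
```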

YOLOv10 training

After Steps 1–3, we extracted frames from videos
Surveyors manually annotated bounding boxes
- 6 object classes: CT, DS, GB, DW, WP, HW
Dataset split:
- ~5,700 training images
- ~570 test images

Step 5: Feature extraction

For each object crop C_k, a pre-trained MobileNetV3 is used to extract a compact feature representation:

\mathbf{f}_k = \phi(C_k) \in \mathbb{R}^{64},

where \phi(\cdot) denotes the feature extractor and the output is taken from the global pooling layer.

MobileNetV3 training

The dataset is the same one used for YOLOv10 training.
A pre-trained MobileNetV3 encodes each object crop into a low-dimensional feature vector.
The resulting feature \mathbf{f}_k captures the visual semantics of the object (e.g., shape, texture, appearance).

Step 6: K-means clustering (per class)

For each class c, we take all extracted feature vectors

\{\mathbf{f}_k\}_{k=1}^{N_c}.

We apply K-means clustering using Euclidean distance to group similar object instances. The number of clusters is set adaptively as:

K_c = \min(5, N_c)

Each cluster is represented by a centroid (the mean feature vector of that group).

For each cluster, we select the single feature vector that is closest to its centroid, and keep the corresponding crop as the representative sample.

Interpretation

Clustering is done within each class separately, so we capture diversity inside the same category.
Each class is summarized by up to 5 representative visual patterns.
This produces a compact but diverse set of representative object crops per class.
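
The per-class selection can be sketched with a small, dependency-free K-means. This is an illustration only: centroids are initialised at the first K points to keep the example deterministic (a real pipeline would more likely use a library implementation with k-means++ initialisation), and the 2-D feature vectors are toy values.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm with Euclidean distance; centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        assignment = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # leave an empty cluster's centroid unchanged
                centroids[j] = [sum(v) / len(members) for v in zip(*members)]
    return centroids, assignment

def representatives(features, max_clusters=5):
    """Indices of the feature vectors closest to each cluster centroid."""
    k = min(max_clusters, len(features))  # K_c = min(5, N_c) by default
    centroids, assignment = kmeans(features, k)
    chosen = []
    for j in range(k):
        members = [i for i, a in enumerate(assignment) if a == j]
        if members:
            chosen.append(min(members, key=lambda i: math.dist(features[i], centroids[j])))
    return sorted(chosen)

# Toy features (hypothetical): two well-separated groups of three crops each.
features = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
picked = representatives(features, max_clusters=2)
print(picked)
```

With fewer crops than clusters, every crop ends up as its own representative, which matches the adaptive rule K_c = min(5, N_c).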

Example crops after Steps 1–6

Counter top (CT)

Food displayed area (DS)

Dish-washing area (DW)

Garbage bin (GB)

Hand washing facility (HW)

Water storage tank (WP)

Visual large language model (VLLM) assessment

Class-specific questionnaires + corresponding images → ChatGPT-4o

Text prompt:

(1) Are the garbage bins made of hard material? E.g. metal, hard plastic? Answer type- Yes/No/Unknown
(2) Do the garbage bins have a smooth top area? Answer type- Yes/No/Unknown
(3) Are there animals/insects in or around the garbage bins? Answer type- Yes/No/Unknown
(4) Does the area around the garbage bins have standing water? Answer type- Yes/No/Unknown
(5) Are birds, insects, rodents or other animals present at the stall? Answer type- Yes/No/Unknown
Answer the questions above and return **JSON Lines**. Each line must contain the keys [qnum, short_ans, explanation].

Answers:

{"qnum":1,"short_ans":"yes","explanation":"The garbage bin appears to be a hard-material crate, likely plastic, rather than a flexible plastic bag or similar material."}
{"qnum":2,"short_ans":"no","explanation":"There are no visible cracks or holes in the garbage bin; the crate structure appears intact."}
{"qnum":3,"short_ans":"no","explanation":"There are no visible animals or insects in or around the garbage bin."}
{"qnum":4,"short_ans":"no","explanation":"There is no visible standing water around the garbage bin; the surrounding area appears dry."}
{"qnum":5,"short_ans":"no","explanation":"There are no visible animals or insects at the stall."}
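
Replies in this format can be parsed line by line. The sketch below is one possible way to do it, not the pipeline's actual code; the fallback of mapping unexpected answers to "unknown" is an assumption, and the sample reply reuses two of the answers shown above.

```python
import json

ALLOWED = {"yes", "no", "unknown"}

def parse_answers(raw_text):
    """Parse a JSON Lines reply into {qnum: (short_ans, explanation)}."""
    answers = {}
    for line in raw_text.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        short = str(record["short_ans"]).strip().lower()
        if short not in ALLOWED:
            short = "unknown"  # assumption: unexpected answers fall back to "unknown"
        answers[int(record["qnum"])] = (short, record.get("explanation", ""))
    return answers

reply = """
{"qnum":3,"short_ans":"no","explanation":"There are no visible animals or insects in or around the garbage bin."}
{"qnum":5,"short_ans":"no","explanation":"There are no visible animals or insects at the stall."}
"""
parsed = parse_answers(reply)
print(parsed[3][0], parsed[5][0])
```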

Image:

Oversimplified VLLM Explanation

A visual large language model (VLLM) converts text and images into tokens, maps them into a shared embedding space, and processes them using a transformer to generate outputs.

Text input

Tokenization
- Splits text into subwords or word pieces
- Example: “What is in the image?” → ["what", "is", "in", "the", "image", "?"]
- Or BPE-style: ["Wh", "at", ...]
Embedding lookup
- Token IDs (e.g., "image" → 4312) are mapped to vectors
- Sequence becomes: (sequence length, d_model)
- Example: ["what", "is", "in", "the", "image", "?"] → (6, d_model)

Image input

Patchification
- Image is split into fixed-size patches (e.g., 16 × 16)
Patch embedding
- Each patch is linearly projected into a vector
- Output shape: (N_patches, d_model)
- Alternatively: a CNN feature extractor can be used to produce patch embeddings

Patchification (ViT) — Dosovitskiy et al. (2020), An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale.
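
Patchification itself is a simple reshaping step. The sketch below assumes a single-channel image stored as nested lists and non-overlapping square patches; the 32 × 32 toy image is hypothetical, and the linear projection to d_model is not shown.

```python
def patchify(image, patch_size):
    """Split an H x W image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[y][x]
                     for y in range(top, top + patch_size)
                     for x in range(left, left + patch_size)]
            patches.append(patch)
    return patches

# Toy 32 x 32 image whose pixel value encodes its position.
image = [[y * 32 + x for x in range(32)] for y in range(32)]
patches = patchify(image, patch_size=16)
print(len(patches), len(patches[0]))
```

By the same arithmetic, a 224 × 224 image with 16 × 16 patches gives (224 / 16)^2 = 196 patch tokens.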

Token fusion

Two common design patterns for combining vision and text tokens:

Concatenation (early fusion)
- Image tokens and text tokens are concatenated into a single sequence
- The transformer processes them jointly: [image tokens] + [text tokens]
- Interaction happens implicitly through self-attention
Cross-attention (modular fusion)
- Text and image streams are kept separate
- Fusion happens via cross-attention layers:
- Queries (Q): text tokens
- Keys/Values (K, V): image tokens
- Text tokens explicitly attend to visual tokens

Attention

For a token embedding matrix X,

Q = XW_Q,\qquad K = XW_K,\qquad V = XW_V

Attention matrix:

A = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)

Output:

\mathrm{Attention}(Q,K,V) = AV

Q: queries (“what am I looking for?”)
K: keys (“what information is available?”)
V: values (“the information to retrieve”)
Each token attends to other relevant tokens.

An example of attention matrix visualization — Vaswani, Ashish, et al. (2017), Attention Is All You Need.
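
The attention formulas can be traced on a tiny example. This sketch uses plain Python lists, three toy token embeddings, and identity projection matrices W_Q = W_K = W_V, all of which are assumptions made purely for illustration.

```python
import math

def matmul(a, b):
    """Matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention: returns (A, AV)."""
    d_k = len(k[0])
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return weights, matmul(weights, v)

# Three toy token embeddings with d_model = 2 and identity projections.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
q, k, v = matmul(x, identity), matmul(x, identity), matmul(x, identity)

weights, output = attention(q, k, v)
print(weights[0])
```

Each row of the attention matrix sums to one, and the first token puts more weight on itself than on the second token, whose key is orthogonal to its query.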

Transformer block

A transformer is built by stacking many identical blocks, each of which typically contains

Multi-head self-attention
Feed-forward network (MLP)
Layer normalization
Residual connections

The output of one block becomes the input to the next block.

An example of transformer block — Dosovitskiy et al. (2020), An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale.

Text generation

After all transformer blocks, the model produces an updated embedding for each token.

To predict the next token, we use the embedding at the last position h_t = \mathrm{Transformer}(X)_t because:

it summarizes the current context
it has attended to all previous tokens

A linear layer maps the final embedding into a probability distribution over the vocabulary, and the next token is selected/sampled from this distribution.

p(y_t) = \mathrm{softmax}(h_t W_{\text{vocab}} + b)

The predicted token is appended to the sequence and fed back into the model.

[\text{image tokens}, \text{text tokens}] \rightarrow y_t \rightarrow [\text{image tokens}, \text{text tokens}, y_t]

This process repeats until an end-of-sequence token is generated.

A VLLM does not generate an entire sentence at once. It predicts one token at a time, conditioned on all previous image and text tokens.
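
The generation loop can be sketched with greedy decoding. Everything here is hypothetical: `toy_model` stands in for the transformer and simply returns a hand-written next-token distribution, and the token strings are placeholders.

```python
def greedy_decode(next_token_probs, prompt_tokens, eos_token, max_new_tokens=20):
    """Repeatedly pick the most probable next token and append it to the sequence."""
    sequence = list(prompt_tokens)
    generated = []
    for _ in range(max_new_tokens):
        probs = next_token_probs(sequence)  # dict: token -> probability
        token = max(probs, key=probs.get)
        if token == eos_token:
            break
        generated.append(token)
        sequence.append(token)  # feed the prediction back into the model
    return generated

def toy_model(sequence):
    """Hypothetical stand-in for a VLLM: the distribution depends on the last token only."""
    script = {"<question>": "yes", "yes": "<eos>"}
    target = script.get(sequence[-1], "<eos>")
    return {tok: (0.9 if tok == target else 0.05) for tok in ("yes", "no", "<eos>")}

answer = greedy_decode(toy_model, ["<image>", "<question>"], eos_token="<eos>")
print(answer)
```

Sampling from the distribution instead of taking the maximum would only change the line that selects `token`.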

Limitations of using ChatGPT / cloud-based VLLMs

Cost

Not suitable for large-scale deployment due to high usage costs. Local VLLMs are significantly cheaper, especially when deployed on edge devices such as mobile phones.
Privacy and security

Recorded scenes may contain sensitive or personally identifiable information. On-device or local-server processing provides stronger privacy guarantees and better compliance with ethical requirements.
Latency

Cloud-based VLLMs (e.g. ChatGPT) can introduce non-trivial inference delays, limiting real-time applicability.
Task-specific performance trade-off

While general-purpose VLLMs are highly capable, lightweight models can achieve comparable or even superior performance for specific tasks when properly fine-tuned with clean, domain-specific data.

Lightweight open VLLM

PaliGemma 2 (3B, ~6GB):

Team: Google DeepMind
Architecture: Vision encoder + decoder-only LLM
Visual tokens are flattened and prepended to text tokens
Fusion via causal self-attention (no explicit cross-attention layers)

Florence-2 (0.23B, ~1GB):

Team: Azure AI, Microsoft
Architecture: Unified encoder–decoder Transformer
Text decoder uses cross-attention over image embeddings
Supports task-prompted multitask learning (“<CAPTION>”, “<OD>”, “<OCR>”, etc.)

Hard distillation

Teacher output (hard labels)

The teacher model generates a target token sequence

\mathbf{y}^{*} = (y_1^{*}, y_2^{*}, \ldots, y_T^{*}), \qquad y_t^{*} \in \{1,\ldots,V\},

where V is the vocabulary size.

Student prediction

For each position t, the student predicts a probability distribution over the vocabulary

\mathbf{p}_t = (p_t^{(1)}, p_t^{(2)}, \ldots, p_t^{(V)}), \qquad \sum_{v=1}^{V} p_t^{(v)} = 1.

Training objective

The teacher-selected token is treated as the ground-truth label and the student is trained using cross-entropy loss

\mathcal{L}_{\text{hard}} = -\frac{1}{T} \sum_{t=1}^{T} \log p_t^{(y_t^{*})}.
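
A small numerical sketch of this loss, assuming the student's per-position distributions are given as lists. Token indices are 0-based in the code, whereas the formula uses 1, ..., V, and the probabilities are toy values.

```python
import math

def hard_distillation_loss(student_probs, teacher_tokens):
    """Mean negative log-probability the student assigns to the teacher's tokens."""
    total = 0.0
    for probs, target in zip(student_probs, teacher_tokens):
        total += -math.log(probs[target])
    return total / len(teacher_tokens)

# Toy example: sequence length T = 2, vocabulary size V = 3.
student_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
teacher_tokens = [0, 1]  # 0-based indices of the teacher's chosen tokens

loss = hard_distillation_loss(student_probs, teacher_tokens)
print(round(loss, 4))
```

The loss shrinks towards zero as the student places more probability on the teacher's tokens.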

Results - YOLOv10

Results - VLLM

Officer + field expert data as ground truth (780 cases)

Metric	ChatGPT-4o	PaliGemma 2	Florence-2
Accuracy	0.77	0.74	0.74
Precision	0.77	0.70	0.71
Recall	0.68	0.73	0.69
F1-score	0.72	0.71	0.70

250 disagreement cases (officer ≠ ChatGPT-4o)

Metric	ChatGPT-4o	PaliGemma 2	Florence-2	Officers
Accuracy	0.69	0.71	0.73	0.60
Precision	0.68	0.66	0.71	0.56
Recall	0.59	0.73	0.70	0.52
F1-score	0.64	0.69	0.70	0.54

In cases of disagreement between ChatGPT and officers, field experts are consulted to provide authoritative labels.
ChatGPT-4o achieves the highest accuracy, precision, and F1-score among the VLLMs, although PaliGemma 2 has the highest recall; both distilled models reach 74% accuracy.
For disagreement cases, Florence-2 achieves the highest accuracy and F1-score, followed by PaliGemma 2 and ChatGPT-4o, with officers scoring lowest on every metric.
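
As a quick consistency check on the first table, and assuming the F1-score is the harmonic mean of the reported precision and recall, the F1 values can be recomputed from the rounded figures. Because the inputs are already rounded to two decimals, such a recomputation can in general differ from a reported value by 0.01.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) from the 780-case table.
reported = {
    "ChatGPT-4o": (0.77, 0.68, 0.72),
    "PaliGemma 2": (0.70, 0.73, 0.71),
    "Florence-2": (0.71, 0.69, 0.70),
}
recomputed = {name: round(f1_score(p, r), 2) for name, (p, r, _) in reported.items()}
print(recomputed)
```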

MTurk experiments

Neither ChatGPT nor officer annotations can be treated as absolute ground truth, as both contain inevitable errors. One effective way to reduce this noise is to collect additional annotations from diverse sources.

We conduct two human-subject experiments on Amazon Mechanical Turk (MTurk): one based on images and another based on short video clips, covering different classes of facilities.

Image study
- Participants: 248
- Images: 4,726
- Replications: 5
Video study
- Participants: 182
- Video clips: 4,721
- Replications: 5
Worker qualification criteria
- Location: India
- HIT approval rate (across all requesters): > 90%
- Number of approved HITs: > 50

MTurk experiments

We take the mode across five independent annotations for each case (with tie-breaking preference: yes > no > unknown) to obtain an initial consensus label.
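
The consensus rule can be written as a short function. This is a sketch of the stated rule (mode with tie-breaking yes > no > unknown), with hypothetical annotation lists.

```python
from collections import Counter

PREFERENCE = ["yes", "no", "unknown"]  # tie-breaking order

def consensus_label(annotations):
    """Most frequent label; ties are resolved in the order yes > no > unknown."""
    counts = Counter(annotations)
    best = max(counts.values())
    for label in PREFERENCE:
        if counts.get(label, 0) == best:
            return label
    raise ValueError("no recognised label among the annotations")

tied = consensus_label(["yes", "no", "no", "yes", "unknown"])
clear = consensus_label(["unknown", "unknown", "unknown", "yes", "no"])
print(tied, clear)
```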

We find that even MTurk workers exhibit relatively low agreement between image-based and video-based sources on certain questions.

Category	Description	Agreement Rate
HW	Is the hand washing facility at least 1 meter above ground level?	0.971
HW	Does the hand washing facility have a lid?	0.494
HW	Is there soap available and maximum an arm’s length away from the tap?	0.579
HW	Is water after washing hands collected in some container? That is, it does NOT go on the ground and leaves a puddle.	0.437
WP	Is the water storage tank covered by a lid?	0.994
WP	Is the water storage tank cracked or does the tank have holes?	0.806
DW	Is the dish-washing station at least 1 meter above ground level? This is a station for washing plates, pots and cutlery.	0.899
DW	Are the dirty plates, pots, cutlery waiting to be washed on the ground or floor?	0.849
DW	Are the dirty plates, pots, cutlery waiting to be washed in a container protected from the ground?	0.855
DW	Do the buckets containing water have smooth surfaces?	0.856
DW	Is the ground around the dish-washing station free of debris (rest of food, other waste)?	0.834
DW	Is there more than one water bucket around the dish-washing area (e.g. one for cleaning and one for rinsing)?	0.859
DW	Is there soap or detergent available and maximum an arm’s length away from the station?	0.881
GB	Are the garbage bins made of hard material? E.g. metal, hard plastic	0.981
GB	Do the garbage bins have a smooth top area?	0.715
GB	Are there animals/insects in or around the garbage bins?	0.320
GB	Does the area around the garbage bins have standing water?	0.509
GB	Are birds, insects, rodents or other animals present at the stall?	0.345
CT	Is there a food preparation area? That is, a dedicated space for preparing food?	0.982
CT	Is the top area (e.g. counter top) for food preparation waterproof?	0.535
CT	Is the top area for food preparation (e.g. a counter top) cracked or does it have holes?	0.562
CT	Can the whole top area for food preparation (e.g. a counter top) be easily accessed for cleaning and drying?	0.678
DS	Is the prepared food displayed protected from direct exposure to sun/rain?	0.969
DS	Is the stall where food is prepared and sold positioned under a roof?	0.653
DS	Is the cooked or raw food out of the sun?	0.711

MTurk experiments

For the animal/insect task, video-based annotations show a relatively high agreement rate with officers. In contrast, for the water collection task, the agreement between video-based annotations and officers is much lower, while image-based annotations achieve a comparatively higher agreement rate.

Truth discovery

Goal

Given multiple sources providing conflicting claims, estimate:

Source trustworthiness T(s)
Claim belief B(c)

Classical truth discovery methods are iterative, deterministic, score-based algorithms. Most are heuristics inspired by graph and hyperlink ranking algorithms, and they follow the same loop:

Initialize claim beliefs B^{(0)}(c)
Update source trustworthiness T^{(i)}(s)
Update claim belief B^{(i)}(c)
Repeat until convergence

Representative truth discovery algorithms

Sums (Kleinberg, 1999)

T^{(i)}(s)=\sum_{c\in C_s} B^{(i-1)}(c),\quad B^{(i)}(c)=\sum_{s\in S_c} T^{(i)}(s)

Normalize after each iteration:

T^{(i)}(s)\leftarrow \frac{T^{(i)}(s)} {\max_s T^{(i)}(s)}, \qquad B^{(i)}(c)\leftarrow \frac{B^{(i)}(c)} {\max_c B^{(i)}(c)}

AverageLog (Pasternack & Roth, 2010)

T^{(i)}(s) = \log|C_s| \cdot \frac{\sum_{c\in C_s} B^{(i-1)}(c)} {|C_s|}, \quad B^{(i)}(c) = \sum_{s\in S_c} T^{(i)}(s)

TruthFinder (Yin et al., 2008): assumes that each source’s trustworthiness is the probability of it being correct

T^{(i)}(s) = \frac{\sum_{c\in C_s} B^{(i-1)}(c)} {|C_s|},\quad B^{(i)}(c) = 1-\prod_{s\in S_c} \left(1-T^{(i)}(s)\right)
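
The Sums update can be traced on a toy example. In this sketch, claims are (question, answer) pairs, initial beliefs are set to 1 (an assumption, since the initialisation is not specified above), and the three sources and their claims are hypothetical.

```python
def sums(source_claims, iterations=20):
    """Sums: alternate trust and belief updates, normalising by the maximum each time."""
    claims = sorted({claim for made in source_claims.values() for claim in made})
    belief = {claim: 1.0 for claim in claims}  # assumed initialisation B^(0)(c) = 1
    trust = {}
    for _ in range(iterations):
        trust = {s: sum(belief[c] for c in made) for s, made in source_claims.items()}
        top = max(trust.values())
        trust = {s: t / top for s, t in trust.items()}
        belief = {c: sum(trust[s] for s, made in source_claims.items() if c in made)
                  for c in claims}
        top = max(belief.values())
        belief = {c: b / top for c, b in belief.items()}
    return trust, belief

# Hypothetical sources: all agree on q1, but one disagrees on q2.
source_claims = {
    "source_a": {("q1", "yes"), ("q2", "no")},
    "source_b": {("q1", "yes"), ("q2", "yes")},
    "source_c": {("q1", "yes"), ("q2", "no")},
}
trust, belief = sums(source_claims)
print(trust)
print(belief)
```

The dissenting source ends up with lower trust, and its claim on q2 with lower belief, because trust and belief reinforce each other across iterations.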

Truth discovery as statistical modelling

Rather than using heuristic trust scores, the Dawid–Skene model (1979) treats the true label of each item as an unobserved (latent) variable and estimates it statistically.

For rater r the probability of reporting category k given the true category j is:

\pi^{(r)}_{jk} = P(y_{ir}=k \mid z_i=j) ,

where z_i is the latent true category of item i, and y_{ir} is the category reported by rater r for item i.

Assuming the raters are conditionally independent given the true category,

P(y_i \mid z_i=j) = \prod_{r=1}^{R} P(y_{ir}\mid z_i=j)

The posterior probability of the true category is then

P(z_i=j \mid y_i) \propto P(z_i=j) \prod_{r=1}^{R} P(y_{ir}\mid z_i=j) ,

where P(z_i=j) is the prior class prevalence.
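
For known parameters, the posterior is a short computation. The sketch below uses two categories, two hypothetical raters "A" and "B", and made-up confusion matrices; it only illustrates the formula and is not the fitted model.

```python
def posterior(prior, confusion, responses):
    """P(z_i = j | y_i) for one item; confusion[r][j][k] = P(rater r reports k | truth j)."""
    unnormalised = []
    for j, prior_j in enumerate(prior):
        likelihood = prior_j
        for rater, reported in responses.items():
            likelihood *= confusion[rater][j][reported]
        unnormalised.append(likelihood)
    total = sum(unnormalised)
    return [value / total for value in unnormalised]

# Toy parameters (hypothetical): rater A is accurate, rater B is close to random.
prior = [0.5, 0.5]
confusion = {
    "A": [[0.9, 0.1], [0.2, 0.8]],
    "B": [[0.6, 0.4], [0.4, 0.6]],
}
post = posterior(prior, confusion, {"A": 0, "B": 1})
print(post)
```

Although the two raters disagree, the posterior favours category 0 because the more reliable rater's report carries more weight.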

Estimation of the Dawid–Skene model

Bayesian formulation

The rater R package (Pullin, Vukcevic & Saxhaug, 2025) implements a Bayesian Dawid–Skene model for repeated categorical annotations:

p \sim \text{Dirichlet}(\alpha), \qquad \pi^{(r)}_{j,\cdot} \sim \text{Dirichlet}(\beta^{(r)}_{j})

Latent true labels are treated as hidden variables and integrated out analytically: P(y_i \mid \pi, p) = \sum_{j=1}^{K} p_j \prod_{r=1}^{R} \pi^{(r)}_{j,y_{ir}}

Full Bayesian inference is performed via Stan, yielding posterior distributions over:
- class prevalence p
- rater-specific confusion matrices \pi^{(r)}

Results - Dawid–Skene model

We define each item as a combination of vendor ID + video recording date + question.

We treat MTurk (image mode), MTurk (video mode), officer and ChatGPT as raters (distilled models are excluded).

The response space is \{\text{yes}, \text{no}, \text{unknown}\}, with estimated class prevalence \hat{p} = (0.323, 0.606, 0.071).

Rater	Agreement rate with estimated truth
ChatGPT mode	0.852
MTurk (image)	0.824
Officer	0.737
MTurk (video)	0.709
Florence-2 mode	0.672
PaliGemma mode	0.663

Confidence score for each response

We compute the probability that the true class of item i is j, given a single rater response y_{ir}=k, by combining the Dawid–Skene posterior with the rater’s confusion model:

P(z_i = j \mid y_{ir} = k) = \frac{ \hat{\gamma}_{ij}\hat{\pi}^{(r)}_{jk} }{ \sum_{j'} \hat{\gamma}_{ij'}\hat{\pi}^{(r)}_{j'k} }, where

\hat{\gamma}_{ij} = P(z_i = j \mid \mathbf{y}_i): posterior class probability from the Dawid–Skene model
\hat{\pi}^{(r)}_{jk} = P(y_{ir}=k \mid z_i=j): rater-specific confusion probability (how likely rater r outputs label k when truth is j)

Vendor	Date	Question	Rater	Response	Estimated truth	Response confidence
1001	2024-05-18	Is there soap or detergent available and maximum an arm’s length away from the station?	mturk_image_mode	yes	yes	0.962
1001	2024-05-18	Is there soap or detergent available and maximum an arm’s length away from the station?	mturk_video_mode	yes	yes	0.951
1001	2024-05-18	Are the garbage bins made of hard material? E.g. metal, hard plastic	mturk_image_mode	yes	yes	0.999
1001	2024-05-18	Are the garbage bins made of hard material? E.g. metal, hard plastic	mturk_video_mode	yes	yes	0.999
1001	2024-05-18	Do the garbage bins have a smooth top area?	mturk_image_mode	yes	no	0.547
1001	2024-05-18	Do the garbage bins have a smooth top area?	mturk_video_mode	yes	no	0.471
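
The confidence formula is a reweighting of the item posterior by one column of the rater's confusion matrix. The sketch below uses made-up numbers for a two-category case; none of the values come from the fitted model.

```python
def response_confidence(gamma_i, confusion_r, k):
    """P(z_i = j | y_ir = k) for every j, from the item posterior and rater r's confusion matrix."""
    weights = [gamma_ij * row[k] for gamma_ij, row in zip(gamma_i, confusion_r)]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical inputs: item posterior and one rater's confusion matrix.
gamma_i = [0.75, 0.25]
confusion_r = [[0.9, 0.1], [0.2, 0.8]]  # rows: true category j, columns: reported k

conf = response_confidence(gamma_i, confusion_r, k=0)
print([round(c, 3) for c in conf])
```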

Model updating

After estimating latent truth labels, we fine-tune PaliGemma 2 and Florence-2 on the inferred supervision signals. No training from scratch is needed, as both models already have strong pre-trained representations.

The updated results are currently under review.

A closed, self-improving loop

Data collection → truth estimation → model updating can run as a continuous cycle:

Field officers periodically record inspection videos
They answer questions derived from images extracted from those videos
The ground truth is re-estimated from the newly appended data
The entire pipeline can run on a local government server

Thanks! Any questions?

tengmcing
