SSA ACT branch talk
RSFAS, ANU
Hi! I am Patrick Li.
I completed my PhD in Statistics at the Department of Econometrics and Business Statistics, Monash University.
I am a postdoctoral researcher at ANU contributing to the Analytics for the Australian Grains Industry (AAGI) project, where my work centres on machine learning, image analytics, and plant phenotyping.

A/Prof Klaus Ackermann
Department of Econometrics and Business Statistics, Melbourne, Monash University, Australia

A/Prof Denni Tommasi
Department of Economics, Bologna, University of Bologna, Italy
Chernikov, A., Ackermann, K., Brown, C., & Tommasi, D. (2025, April). Leveraging computer vision and visual LLMs for cost-effective and consistent street food safety assessment in Kolkata India. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 39, No. 27, pp. 27914-27921).
Brown, C., & Tommasi, D. (2025). Quality Upgrading in the Street Food Market: Is Better Equipment and Training Sufficient?
There are over 600M cases of foodborne illnesses and 420K deaths annually from contaminated food (World Health Organization 2020).
Street food is a major source of employment and an important food supply in many developing countries, but it is also often associated with significant food safety concerns.


Street food safety inspections are primarily conducted through human surveys, which are often:
Resource-intensive
Limited in coverage
Prone to bias
In 2024–2025, an FAO-funded field study was conducted in Kolkata, India, following the survey protocol of Brown and Tommasi (2025).
A video collected on 22 July 2024
| Category | Description |
|---|---|
| HW | Is the hand washing facility at least 1 meter above ground level? |
| HW | Does the hand washing facility have a lid? |
| HW | Is there soap available and maximum an arm’s length away from the tap? |
| HW | Is water after washing hands collected in some container? That is, it does NOT go on the ground and leaves a puddle. |
| WP | Is the water storage tank covered by a lid? |
| WP | Is the water storage tank cracked or does the tank have holes? |
| DW | Is the dish-washing station at least 1 meter above ground level? This is a station for washing plates, pots and cutlery. |
| DW | Are the dirty plates, pots, cutlery waiting to be washed on the ground or floor? |
| DW | Are the dirty plates, pots, cutlery waiting to be washed in a container protected from the ground? |
| DW | Do the buckets containing water have smooth surfaces? |
| DW | Is the ground around the dish-washing station free of debris (rest of food, other waste)? |
| DW | Is there more than one water bucket around the dish-washing area (e.g. one for cleaning and one for rinsing)? |
| DW | Is there soap or detergent available and maximum an arm’s length away from the station? |
| GB | Are the garbage bins made of hard material? E.g. metal, hard plastic |
| GB | Do the garbage bins have a smooth top area? |
| GB | Are there animals/insects in or around the garbage bins? |
| GB | Does the area around the garbage bins have standing water? |
| GB | Are birds, insects, rodents or other animals present at the stall? |
| CT | Is there a food preparation area? That is, a dedicated space for preparing food? |
| CT | Is the top area (e.g. counter top) for food preparation waterproof? |
| CT | Is the top area for food preparation (e.g. a counter top) cracked or does it have holes? |
| CT | Can the whole top area for food preparation (e.g. a counter top) be easily accessed for cleaning and drying? |
| DS | Is the prepared food displayed protected from direct exposure to sun/rain? |
| DS | Is the stall where food is prepared and sold positioned under a roof? |
| DS | Is the cooked or raw food out of the sun? |
A video can be represented as a sequence of frames (images) F_t, for t = 1, \dots, T.
While humans can answer questions by watching the full video, computer vision models often benefit from selecting a smaller set of representative frames for efficient question answering.
The pipeline consists of six stages:
For each video frame F_t, compute the average optical flow magnitude
M(F_t)=\frac{1}{WH}\sum_{x=1}^{W}\sum_{y=1}^{H} \sqrt{u(x,y)^2+v(x,y)^2},
where W and H are the frame width and height, and (u,v) is the optical flow estimated between consecutive frames F_{t-1} and F_t using the Farnebäck algorithm.
A frame is selected if M(F_t) < \tau_{\text{movement}}.
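A minimal sketch of the movement filter, assuming the flow field (u, v) has already been estimated (for example with OpenCV's `cv2.calcOpticalFlowFarneback`) and is passed in as plain nested lists. The function names and the threshold values are illustrative, not taken from the study.

```python
import math

def mean_flow_magnitude(u, v):
    """Average optical flow magnitude M(F_t) over an H x W flow field.

    u and v are nested lists (rows of floats) holding the horizontal and
    vertical flow components for every pixel.
    """
    height, width = len(u), len(u[0])
    total = sum(
        math.hypot(u[y][x], v[y][x])
        for y in range(height)
        for x in range(width)
    )
    return total / (width * height)

def is_still(u, v, tau_movement):
    """Select the frame only if its average motion is below the threshold."""
    return mean_flow_magnitude(u, v) < tau_movement

# Toy 2 x 2 flow field: every pixel moves by (3, 4), so each magnitude is 5.
u = [[3.0, 3.0], [3.0, 3.0]]
v = [[4.0, 4.0], [4.0, 4.0]]
m = mean_flow_magnitude(u, v)
```

With a hypothetical threshold of 2.0 this toy frame would be rejected as too shaky; with 6.0 it would be kept.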
Interpretation
We compute sharpness using the variance of the Laplacian response:
S(F_t) = \text{Var}(\nabla^2 F_t) = \frac{1}{WH}\sum_{x=1}^{W}\sum_{y=1}^{H}(L(x,y)-\bar{L})^2,
where L(x,y) = \frac{\partial^2 F_t}{\partial x^2} + \frac{\partial^2 F_t}{\partial y^2} is the Laplacian response at pixel (x,y), and \bar{L} is its mean over the frame.
A frame is selected if S(F_t) \ge \tau_{\text{sharp}}.
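The sharpness score can be sketched in a few lines. This version assumes a grayscale image stored as nested lists, uses the standard 4-neighbour discrete Laplacian, and (as a simplification) averages over interior pixels only rather than all WH pixels.

```python
def laplacian_variance(img):
    """Variance of the 4-neighbour discrete Laplacian over interior pixels."""
    h, w = len(img), len(img[0])
    responses = [
        img[y - 1][x] + img[y + 1][x] + img[y][x - 1] + img[y][x + 1]
        - 4 * img[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)

def is_sharp(img, tau_sharp):
    """Select the frame if its sharpness reaches the threshold."""
    return laplacian_variance(img) >= tau_sharp

# A constant image has no edges; a checkerboard has strong edges everywhere.
flat = [[10] * 4 for _ in range(4)]
checker = [[((x + y) % 2) * 255 for x in range(4)] for y in range(4)]
```

The flat image scores zero and the checkerboard scores very high, which matches the intuition that blurred frames have a weak, uniform Laplacian response.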
Interpretation
Each frame F_t is encoded using perceptual hashing:
h_t = \text{pHash}(F_t) \in \{0,1\}^{64}
The similarity between frame i and j is measured via Hamming distance:
d_{ij} = \sum_{k=1}^{64} \mathbf{1}(h_{ik} \ne h_{jk})
A frame F_j is removed if there exists a previous frame F_{i} such that d_{ij} \le \delta_{\text{dup}}.
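A small sketch of the duplicate filter, assuming each hash is already available as an integer whose bits are the pHash bits (64 bits in the pipeline, 4 bits in the toy data). It compares each frame against previously retained frames, which is one reading of "previous frame"; the function names are illustrative.

```python
def hamming(a, b):
    """Number of differing bits between two integer-encoded hashes."""
    return bin(a ^ b).count("1")

def drop_near_duplicates(hashes, delta_dup):
    """Return the indices of frames kept after near-duplicate removal."""
    kept = []
    for j, h in enumerate(hashes):
        # Keep frame j only if it is far enough from every retained frame.
        if all(hamming(hashes[i], h) > delta_dup for i in kept):
            kept.append(j)
    return kept

# Toy 4-bit hashes: frames 1 and 3 are one bit away from frames 0 and 2.
toy_hashes = [0b0000, 0b0001, 0b1111, 0b1110]
kept = drop_near_duplicates(toy_hashes, delta_dup=1)
```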
Interpretation
A pre-trained YOLOv10 model is applied to each selected frame F_t to detect and localize objects:
\{(c_k, p_k, \mathbf{b}_k)\}_{k=1}^{K},
where c_k is the class label, p_k \in [0,1] is the confidence score, and \mathbf{b}_k = [x, y, w, h] denotes the bounding box (center coordinates and size). Here, K is the total number of detected objects in the frame.
An object is selected if p_k \ge \tau_{\text{od}}.
For each retained detection, an object crop C_k is extracted from the frame for further processing.
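The confidence filter and cropping step can be sketched as follows, assuming detections arrive as (class label, confidence, [x, y, w, h]) tuples with center-format boxes in pixel units, and the frame is a nested list of pixel rows. The helper names and the threshold are hypothetical.

```python
def to_corners(box):
    """Convert [x_center, y_center, w, h] to integer (left, top, right, bottom)."""
    x, y, w, h = box
    return (round(x - w / 2), round(y - h / 2),
            round(x + w / 2), round(y + h / 2))

def crop_detections(frame, detections, tau_od):
    """Keep detections with confidence >= tau_od and return (label, crop) pairs."""
    height, width = len(frame), len(frame[0])
    crops = []
    for label, score, box in detections:
        if score < tau_od:
            continue
        left, top, right, bottom = to_corners(box)
        # Clip the box to the frame so slicing never runs out of bounds.
        left, top = max(left, 0), max(top, 0)
        right, bottom = min(right, width), min(bottom, height)
        crops.append((label, [row[left:right] for row in frame[top:bottom]]))
    return crops

# Toy 6 x 6 frame whose pixel value encodes its position (10 * row + column).
frame = [[10 * y + x for x in range(6)] for y in range(6)]
detections = [("GB", 0.9, [3, 3, 2, 2]), ("HW", 0.2, [1, 1, 2, 2])]
crops = crop_detections(frame, detections, tau_od=0.5)
```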
YOLOv10 training
After Steps 1–3, we extracted frames from videos
Surveyors manually annotated bounding boxes
Dataset split:

For each object crop C_k, a pre-trained MobileNetV3 is used to extract a compact feature representation:
\mathbf{f}_k = \phi(C_k) \in \mathbb{R}^{64},
where \phi(\cdot) denotes the feature extractor and the output is taken from the global pooling layer.
MobileNetV3 training
The same dataset used for YOLOv10 training is used.
A pre-trained MobileNetV3 encodes each object crop into a low-dimensional feature vector.
The resulting feature \mathbf{f}_k captures the visual semantics of the object (e.g., shape, texture, appearance).
For each class c, we take all extracted feature vectors
\{\mathbf{f}_k\}_{k=1}^{N_c}.
We apply K-means clustering using Euclidean distance to group similar object instances. The number of clusters is set adaptively as:
K_c = \min(5, N_c)
Each cluster is represented by a centroid (the mean feature vector of that group).
For each cluster, we select the single feature vector that is closest to its centroid, and keep the corresponding crop as the representative sample.
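A compact sketch of the representative selection for one class, using plain Lloyd iterations for K-means. It assumes deterministic initialization from the first K_c feature vectors and a fixed iteration count; both are simplifications chosen for reproducibility, not details from the talk.

```python
import math

def kmeans_representatives(features, max_clusters=5, iters=20):
    """Return indices of the feature vectors closest to each cluster centroid."""
    n = len(features)
    k = min(max_clusters, n)                      # K_c = min(5, N_c)
    centroids = [list(f) for f in features[:k]]   # assumed initialization
    assign = [0] * n
    for _ in range(iters):
        # Assignment step: nearest centroid in Euclidean distance.
        assign = [
            min(range(k), key=lambda c: math.dist(f, centroids[c]))
            for f in features
        ]
        # Update step: centroid becomes the mean of its members.
        for c in range(k):
            members = [features[i] for i in range(n) if assign[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    reps = []
    for c in range(k):
        members = [i for i in range(n) if assign[i] == c]
        if members:
            reps.append(
                min(members, key=lambda i: math.dist(features[i], centroids[c]))
            )
    return sorted(reps)

# Two well-separated groups of 2-D features (real features would be 64-D).
toy = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
reps = kmeans_representatives(toy, max_clusters=2)
```

When a class has fewer than five crops, every crop ends up as its own cluster and is kept.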
Interpretation
Counter top (CT)
Food displayed area (DS)
Dish-washing area (DW)
Garbage bin (GB)
Hand washing facility (HW)
Water storage tank (WP)
Class-specific questionnaires + corresponding images → ChatGPT-4o
Text prompt:
(1) Are the garbage bins made of hard material? E.g. metal, hard plastic? Answer type- Yes/No/Unknown
(2) Do the garbage bins have a smooth top area? Answer type- Yes/No/Unknown
(3) Are there animals/insects in or around the garbage bins? Answer type- Yes/No/Unknown
(4) Does the area around the garbage bins have standing water? Answer type- Yes/No/Unknown
(5) Are birds, insects, rodents or other animals present at the stall? Answer type- Yes/No/Unknown
Answer the questions above and return **JSON Lines**. Each line must contain the keys [qnum, short_ans, explanation].
Answers:
{"qnum":1,"short_ans":"yes","explanation":"The garbage bin appears to be a hard-material crate, likely plastic, rather than a flexible plastic bag or similar material."}
{"qnum":2,"short_ans":"no","explanation":"There are no visible cracks or holes in the garbage bin; the crate structure appears intact."}
{"qnum":3,"short_ans":"no","explanation":"There are no visible animals or insects in or around the garbage bin."}
{"qnum":4,"short_ans":"no","explanation":"There is no visible standing water around the garbage bin; the surrounding area appears dry."}
{"qnum":5,"short_ans":"no","explanation":"There are no visible animals or insects at the stall."}
Image:
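A response in this JSON Lines format can be parsed with the standard library. The sketch below assumes each non-empty line is one JSON object with the keys requested in the prompt, and maps any unexpected `short_ans` value to "unknown", which is an assumption rather than the study's documented behaviour.

```python
import json

ALLOWED = {"yes", "no", "unknown"}

def parse_answers(raw):
    """Map question number -> (short answer, explanation) from JSON Lines text."""
    answers = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        short = str(record["short_ans"]).strip().lower()
        if short not in ALLOWED:
            short = "unknown"  # assumed fallback for off-format answers
        answers[int(record["qnum"])] = (short, record.get("explanation", ""))
    return answers

raw = """{"qnum":1,"short_ans":"yes","explanation":"Hard plastic crate."}
{"qnum":2,"short_ans":"No","explanation":"Open top."}
{"qnum":3,"short_ans":"maybe","explanation":"Unclear."}"""
parsed = parse_answers(raw)
```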

A visual large language model (VLLM) converts text and images into tokens, maps them into a shared embedding space, and processes them using a transformer to generate outputs.
Text input
Tokenization
["what", "is", "in", "the", "image", "?"]
["Wh", "at", ...]
Embedding lookup
Token IDs (e.g. "image" → 4312) are mapped to vectors
Output shape: (sequence length, d_model)
["what", "is", "in", "the", "image", "?"] → (6, d_model)
Image input
Patchification
The image is split into fixed-size patches (e.g. 16 × 16)
Patch embedding
Output shape: (N_patches, d_model)
Figure: Patchification (ViT). Dosovitskiy et al. (2020), An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale.
Token fusion
Two common design patterns for combining vision and text tokens:
[image tokens] + [text tokens]
Attention
For a token embedding matrix X,
Q = XW_Q,\qquad K = XW_K,\qquad V = XW_V
Attention matrix:
A = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)
Output:
\mathrm{Attention}(Q,K,V) = AV
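The three formulas above can be checked numerically with a tiny single-head example. This is a sketch on nested lists with identity projection matrices chosen purely for illustration; real models use learned weights and multiple heads.

```python
import math

def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(x, w_q, w_k, w_v):
    """Return the attention matrix A and the output AV for token matrix X."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]          # K transposed
    scores = matmul(q, k_t)                        # Q K^T
    a = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return a, matmul(a, v)

identity = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 0.0], [0.0, 1.0]]                       # two tokens, d_model = 2
a, out = attention(x, identity, identity, identity)
```

Each row of A sums to one, and each token attends most strongly to itself because its query matches its own key best.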
An example of attention matrix visualization — Vaswani, Ashish, et al. (2017), Attention Is All You Need.
Transformer block
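A transformer is built by stacking many identical blocks, each typically containing a multi-head self-attention layer and a feed-forward (MLP) layer, with residual connections and layer normalization.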
The output of one block becomes the input to the next block.
An example of transformer block — Dosovitskiy et al. (2020), An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale.
Text generation
After all transformer blocks, the model produces an updated embedding for each token.
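To predict the next token, we use the embedding at the last position, h_t = \mathrm{Transformer}(X)_t, because under causal attention it is the only position that has attended to every preceding image and text token.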
A linear layer maps the final embedding into a probability distribution over the vocabulary, and the next token is selected/sampled from this distribution.
p(y_t) = \mathrm{softmax}(h_t W_{\text{vocab}} + b)
The predicted token is appended to the sequence and fed back into the model.
[\text{image tokens}, \text{text tokens}] \rightarrow y_t \rightarrow [\text{image tokens}, \text{text tokens}, y_t]
This process repeats until an end-of-sequence token is generated.
A VLLM does not generate an entire sentence at once. It predicts one token at a time, conditioned on all previous image and text tokens.
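The token-by-token loop can be sketched with greedy selection. Here `next_token_scores` is a hypothetical stand-in for a full model forward pass, and `toy_model` is a made-up scorer used only to make the example runnable; real systems often sample rather than take the maximum.

```python
def generate(next_token_scores, prompt_tokens, eos_token, max_new_tokens=20):
    """Greedy autoregressive decoding: append one predicted token per step."""
    sequence = list(prompt_tokens)
    for _ in range(max_new_tokens):
        scores = next_token_scores(sequence)       # one score per vocabulary entry
        token = max(range(len(scores)), key=scores.__getitem__)
        sequence.append(token)                     # feed the prediction back in
        if token == eos_token:
            break
    return sequence

def toy_model(sequence):
    """Hypothetical scorer over a 5-token vocabulary: prefers last token + 1."""
    target = (sequence[-1] + 1) % 5
    return [1.0 if t == target else 0.0 for t in range(5)]

out = generate(toy_model, [0, 1], eos_token=4)
```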
Cost
Cloud-hosted VLLMs such as ChatGPT are not suitable for large-scale deployment due to high usage costs. Local VLLMs are significantly cheaper, especially when deployed on edge devices such as mobile phones.
Privacy and security
Recorded scenes may contain sensitive or personally identifiable information. On-device or local-server processing provides stronger privacy guarantees and better compliance with ethical requirements.
Latency
Cloud-based VLLMs (e.g. ChatGPT) can introduce non-trivial inference delays, limiting real-time applicability.
Task-specific performance trade-off
While general-purpose VLLMs are highly capable, lightweight models can achieve comparable or even superior performance for specific tasks when properly fine-tuned with clean, domain-specific data.
PaliGemma 2 (3B, ~6GB):
Florence-2 (0.23B, ~1GB):

Teacher output (hard labels)
The teacher model generates a target token sequence
\mathbf{y}^{*} = (y_1^{*}, y_2^{*}, \ldots, y_T^{*}), \qquad y_t^{*} \in \{1,\ldots,V\},
where V is the vocabulary size.
Student prediction
For each position t, the student predicts a probability distribution over the vocabulary
\mathbf{p}_t = (p_t^{(1)}, p_t^{(2)}, \ldots, p_t^{(V)}), \qquad \sum_{v=1}^{V} p_t^{(v)} = 1.
Training objective
The teacher-selected token is treated as the ground-truth label and the student is trained using cross-entropy loss
\mathcal{L}_{\text{hard}} = -\frac{1}{T} \sum_{t=1}^{T} \log p_t^{(y_t^{*})}.
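As a worked instance of the loss, the sketch below evaluates \mathcal{L}_{\text{hard}} for a two-token sequence. Token indices are 0-based here (the formula uses 1, ..., V), and the probabilities are invented for illustration.

```python
import math

def hard_label_loss(student_probs, teacher_tokens):
    """Mean negative log-probability the student assigns to the teacher's tokens."""
    steps = len(teacher_tokens)
    return -sum(
        math.log(p[y]) for p, y in zip(student_probs, teacher_tokens)
    ) / steps

# Two positions, vocabulary of size 2.
student_probs = [[0.5, 0.5], [0.25, 0.75]]
teacher_tokens = [0, 1]
loss = hard_label_loss(student_probs, teacher_tokens)
```

The loss is zero only when the student puts probability one on every teacher-selected token.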


Officer + field expert data as ground truth (780 cases)
| Metric | ChatGPT-4o | PaliGemma 2 | Florence-2 |
|---|---|---|---|
| Accuracy | 0.77 | 0.74 | 0.74 |
| Precision | 0.77 | 0.70 | 0.71 |
| Recall | 0.68 | 0.73 | 0.69 |
| F1-score | 0.72 | 0.71 | 0.70 |
250 Disagreement cases (officer ≠ ChatGPT-4o)
| Metric | ChatGPT-4o | PaliGemma 2 | Florence-2 | Officers |
|---|---|---|---|---|
| Accuracy | 0.69 | 0.71 | 0.73 | 0.60 |
| Precision | 0.68 | 0.66 | 0.71 | 0.56 |
| Recall | 0.59 | 0.73 | 0.70 | 0.52 |
| F1-score | 0.64 | 0.69 | 0.70 | 0.54 |
Neither ChatGPT nor officer annotations can be treated as absolute ground truth, as both contain inevitable errors. One effective way to reduce this noise is to collect additional annotations from diverse sources.
We conduct two human-subject experiments on Amazon Mechanical Turk (MTurk): one based on images and another based on short video clips, covering different classes of facilities.
Image study
Video study
Worker qualification criteria

We take the mode across five independent annotations for each case (with tie-breaking preference: yes > no > unknown) to obtain an initial consensus label.
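The consensus rule is short enough to write out. A minimal sketch, assuming labels are already normalized to the three lowercase strings:

```python
from collections import Counter

PREFERENCE = ["yes", "no", "unknown"]  # tie-breaking order: yes > no > unknown

def consensus(labels):
    """Most frequent label; ties are resolved by the preference order."""
    counts = Counter(labels)
    return max(PREFERENCE, key=lambda lab: (counts[lab], -PREFERENCE.index(lab)))

tie = consensus(["yes", "no", "yes", "no", "unknown"])
clear = consensus(["unknown", "unknown", "unknown", "yes", "no"])
```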
We find that even MTurk workers exhibit relatively low agreement between image-based and video-based sources on certain questions.
| Category | Description | Agreement Rate |
|---|---|---|
| HW | Is the hand washing facility at least 1 meter above ground level? | 0.971 |
| HW | Does the hand washing facility have a lid? | 0.494 |
| HW | Is there soap available and maximum an arm’s length away from the tap? | 0.579 |
| HW | Is water after washing hands collected in some container? That is, it does NOT go on the ground and leaves a puddle. | 0.437 |
| WP | Is the water storage tank covered by a lid? | 0.994 |
| WP | Is the water storage tank cracked or does the tank have holes? | 0.806 |
| DW | Is the dish-washing station at least 1 meter above ground level? This is a station for washing plates, pots and cutlery. | 0.899 |
| DW | Are the dirty plates, pots, cutlery waiting to be washed on the ground or floor? | 0.849 |
| DW | Are the dirty plates, pots, cutlery waiting to be washed in a container protected from the ground? | 0.855 |
| DW | Do the buckets containing water have smooth surfaces? | 0.856 |
| DW | Is the ground around the dish-washing station free of debris (rest of food, other waste)? | 0.834 |
| DW | Is there more than one water bucket around the dish-washing area (e.g. one for cleaning and one for rinsing)? | 0.859 |
| DW | Is there soap or detergent available and maximum an arm’s length away from the station? | 0.881 |
| GB | Are the garbage bins made of hard material? E.g. metal, hard plastic | 0.981 |
| GB | Do the garbage bins have a smooth top area? | 0.715 |
| GB | Are there animals/insects in or around the garbage bins? | 0.320 |
| GB | Does the area around the garbage bins have standing water? | 0.509 |
| GB | Are birds, insects, rodents or other animals present at the stall? | 0.345 |
| CT | Is there a food preparation area? That is, a dedicated space for preparing food? | 0.982 |
| CT | Is the top area (e.g. counter top) for food preparation waterproof? | 0.535 |
| CT | Is the top area for food preparation (e.g. a counter top) cracked or does it have holes? | 0.562 |
| CT | Can the whole top area for food preparation (e.g. a counter top) be easily accessed for cleaning and drying? | 0.678 |
| DS | Is the prepared food displayed protected from direct exposure to sun/rain? | 0.969 |
| DS | Is the stall where food is prepared and sold positioned under a roof? | 0.653 |
| DS | Is the cooked or raw food out of the sun? | 0.711 |
For the animal/insect task, video-based annotations show a relatively high agreement rate with officers. In contrast, for the water collection task, the agreement between video-based annotations and officers is much lower, while image-based annotations achieve a comparatively higher agreement rate.
Goal
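Given multiple sources providing conflicting claims, estimate the trustworthiness T(s) of each source s and the belief B(c) in each claim c. Here C_s denotes the set of claims made by source s, and S_c the set of sources asserting claim c.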
These methods are iterative, deterministic, and score-based algorithms. Most are heuristic approaches inspired by graph and hyperlink ranking algorithms.
T^{(i)}(s)=\sum_{c\in C_s} B^{(i-1)}(c),\quad B^{(i)}(c)=\sum_{s\in S_c} T^{(i)}(s)
Normalize after each iteration:
T^{(i)}(s)\leftarrow \frac{T^{(i)}(s)} {\max_s T^{(i)}(s)}, \qquad B^{(i)}(c)\leftarrow \frac{B^{(i)}(c)} {\max_c B^{(i)}(c)}
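The sum-and-normalize update above can be sketched directly. This assumes every claim starts with belief 1 and runs a fixed number of iterations; the source and claim names are made up for illustration.

```python
def sum_updates(claims_by_source, iterations=20):
    """Alternate trust and belief updates with max-normalization each round."""
    sources = list(claims_by_source)
    claims = sorted({c for cs in claims_by_source.values() for c in cs})
    belief = {c: 1.0 for c in claims}              # assumed initial beliefs
    trust = {}
    for _ in range(iterations):
        # T(s): sum of beliefs in the claims made by source s.
        trust = {s: sum(belief[c] for c in claims_by_source[s]) for s in sources}
        top = max(trust.values())
        trust = {s: t / top for s, t in trust.items()}
        # B(c): sum of trust over the sources asserting claim c.
        belief = {
            c: sum(trust[s] for s in sources if c in claims_by_source[s])
            for c in claims
        }
        top = max(belief.values())
        belief = {c: b / top for c, b in belief.items()}
    return trust, belief

toy = {
    "officer": {"q1=yes"},
    "mturk": {"q1=yes"},
    "chatgpt": {"q1=no"},
}
trust, belief = sum_updates(toy)
```

In this toy case the majority claim keeps belief 1 while the minority claim and its only supporter decay towards 0, which illustrates how these scores reinforce each other.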
T^{(i)}(s) = \log|C_s| \cdot \frac{\sum_{c\in C_s} B^{(i-1)}(c)} {|C_s|}, \quad B^{(i)}(c) = \sum_{s\in S_c} T^{(i)}(s)
T^{(i)}(s) = \frac{\sum_{c\in C_s} B^{(i-1)}(c)} {|C_s|},\quad B^{(i)}(c) = 1-\prod_{s\in S_c} \left(1-T^{(i)}(s)\right)
Rather than using heuristic trust scores, the Dawid–Skene model (1979) treats the true label of each item as an unobserved (latent) variable and estimates it statistically.
For rater r, the probability of reporting category k given the true category j is:
\pi^{(r)}_{jk} = P(y_{ir}=k \mid z_i=j) ,
where z_i is the latent true category of item i, and y_{ir} is the category reported by rater r for item i.
Assuming the raters are conditionally independent given the true category,
P(y_i \mid z_i=j) = \prod_{r=1}^{R} P(y_{ir}\mid z_i=j)
The posterior probability of the true category is then
P(z_i=j \mid y_i) \propto P(z_i=j) \prod_{r=1}^{R} P(y_{ir}\mid z_i=j) ,
where P(z_i=j) is the prior class prevalence.
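A small numeric sketch of this posterior, assuming the prior and the confusion matrices are known (in practice they are estimated). Categories are 0-indexed and all numbers are invented.

```python
def posterior(prior, confusions, responses):
    """P(z = j | y) from prior[j], confusions[r][j][k] and responses[r] = k."""
    unnormalized = []
    for j, p_j in enumerate(prior):
        likelihood = p_j
        for r, k in enumerate(responses):
            likelihood *= confusions[r][j][k]      # conditional independence
        unnormalized.append(likelihood)
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

prior = [0.5, 0.5]
confusions = [
    [[0.9, 0.1], [0.2, 0.8]],   # a reliable rater
    [[0.6, 0.4], [0.4, 0.6]],   # a noisy rater
]
# The reliable rater reports category 0, the noisy rater reports category 1.
post = posterior(prior, confusions, responses=[0, 1])
```

The posterior favours category 0 (0.75 versus 0.25) because the reliable rater's report carries more weight than the noisy rater's.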
The rater R package (Pullin, Vukcevic & Saxhaug, 2025) implements a Bayesian Dawid–Skene model for repeated categorical annotations:
p \sim \text{Dirichlet}(\alpha), \qquad \pi^{(r)}_{j,\cdot} \sim \text{Dirichlet}(\beta^{(r)}_{j})
Latent true labels are treated as hidden variables and integrated out analytically: P(y_i \mid \pi, p) = \sum_{j=1}^{K} p_j \prod_{r=1}^{R} \pi^{(r)}_{j,y_{ir}}
Full Bayesian inference is performed via Stan, yielding posterior distributions over:
We define each item as a combination of vendor ID + video recording date + question.
We treat MTurk (image mode), MTurk (video mode), officer and ChatGPT as raters (distilled models are excluded).
The response space is \{\text{yes}, \text{no}, \text{unknown}\}, with estimated class prevalence \hat{\mathbf{p}} = (0.323, 0.606, 0.071).
| Rater | Agreement rate with estimated truth |
|---|---|
| ChatGPT mode | 0.852 |
| MTurk (image) | 0.824 |
| Officer | 0.737 |
| MTurk (video) | 0.709 |
| Florence-2 mode | 0.672 |
| PaliGemma mode | 0.663 |

We compute the probability that the true class of item i is j, given a single rater response y_{ir}=k, by combining the Dawid–Skene posterior with the rater’s confusion model:
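P(z_i = j \mid y_{ir} = k) = \frac{ \hat{\gamma}_{ij}\hat{\pi}^{(r)}_{jk} }{ \sum_{j'} \hat{\gamma}_{ij'}\hat{\pi}^{(r)}_{j'k} },
where \hat{\gamma}_{ij} is the estimated Dawid–Skene posterior probability that item i belongs to class j, and \hat{\pi}^{(r)}_{jk} is the estimated probability that rater r reports k when the true class is j.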
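As a numeric illustration of this ratio, the sketch below reweights an item-level posterior by one rater's confusion probabilities for the response actually given. The inputs are invented and the function name is hypothetical.

```python
def response_confidence(gamma_i, confusion_r, k):
    """P(z_i = j | y_ir = k) for every class j, given gamma_i[j] and confusion_r[j][k]."""
    weights = [g * row[k] for g, row in zip(gamma_i, confusion_r)]
    total = sum(weights)
    return [w / total for w in weights]

gamma_i = [0.75, 0.25]                        # item-level posterior over classes
confusion_r = [[0.9, 0.1], [0.2, 0.8]]        # rater r's confusion matrix
conf = response_confidence(gamma_i, confusion_r, k=0)
```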
| Vendor | Date | Question | Rater | Response | Estimated truth | Response confidence |
|---|---|---|---|---|---|---|
| 1001 | 2024-05-18 | Is there soap or detergent available and maximum an arm’s length away from the station? | mturk_image_mode | yes | yes | 0.962 |
| 1001 | 2024-05-18 | Is there soap or detergent available and maximum an arm’s length away from the station? | mturk_video_mode | yes | yes | 0.951 |
| 1001 | 2024-05-18 | Are the garbage bins made of hard material? E.g. metal, hard plastic | mturk_image_mode | yes | yes | 0.999 |
| 1001 | 2024-05-18 | Are the garbage bins made of hard material? E.g. metal, hard plastic | mturk_video_mode | yes | yes | 0.999 |
| 1001 | 2024-05-18 | Do the garbage bins have a smooth top area? | mturk_image_mode | yes | no | 0.547 |
| 1001 | 2024-05-18 | Do the garbage bins have a smooth top area? | mturk_video_mode | yes | no | 0.471 |
After estimating latent truth labels, we fine-tune PaliGemma 2 and Florence-2 on the inferred supervision signals. No training from scratch is needed, as both models already have strong pre-trained representations.
The updated results are currently under review.
A closed, self-improving loop
Data collection → truth estimation → model updating can run as a continuous cycle:

Slides URL: https://food-safety.patrickli.org/