Colon-Bench: An Agentic Workflow for Scalable Dense Lesion Annotation in Full-Procedure Colonoscopy Videos

0

Videos

0

Lesion Categories

0

Bounding Boxes

0

Segmentation Masks

0

Words of Text

0

Models Evaluated

Abstract

Early screening via colonoscopy is critical for colon cancer prevention, yet developing robust AI systems for this domain is hindered by the lack of densely annotated, long-sequence video datasets. Existing datasets predominantly focus on single-class polyp detection and lack the rich spatial, temporal, and linguistic annotations required to evaluate modern Multimodal Large Language Models (MLLMs). To address this critical gap, we introduce Colon-Bench, generated via a novel multi-stage agentic workflow. Our pipeline seamlessly integrates temporal proposals, bounding-box tracking, AI-driven visual confirmation, and human-in-the-loop review to scalably annotate full-procedure videos. The resulting verified benchmark is unprecedented in scope, encompassing 528 videos, 14 distinct lesion categories (including polyps, ulcers, and bleeding), over 300,000 bounding boxes, 213,000 segmentation masks, and 133,000 words of clinical descriptions. We utilize Colon-Bench to rigorously evaluate state-of-the-art MLLMs across lesion classification, Open-Vocabulary Video Object Segmentation (OV-VOS), and Visual Question Answering (VQA). The MLLM results demonstrate surprisingly high localization performance in medical domains compared to SAM-3. Finally, we analyze common VQA errors from MLLMs to introduce a novel "colon-skill" prompting strategy, improving zero-shot MLLM performance by up to 9.7% across most MLLMs.

Colonoscopy Lesion Detection Polyps MLLMs Video Segmentation VQA Agentic Workflow

Dataset Comparison

Table 1. Colon-Bench provides a broader lesion taxonomy and richer supervision compared to prior colonoscopy datasets.

Attribute	Kvasir-SEG	SUN	PolypGen	REAL-Colon	CAS-Colon	Colon-Bench (Ours)
Year	2020	2021	2023	2024	2025	2026
Focus	Segmentation	Detection	Segmentation	Detection	Anatomy	Lesion, Video Segmentation, & Text
Number of Videos	N/A	N/A	N/A	60	78	528
Total Frames/Images	1,000	158,690	6,282	2,757,723	1,961,100	464,035
Lesion Classes	1 (Polyp)	1 (Polyp)	1 (Polyp)	1 (Polyp)	N/A	14
Bounding Boxes	N/A	158,690	6,282	351,264	N/A	300,132
Segmentation Masks	1,000	N/A	6,282	N/A	Yes	213,067
Language Descriptions	N/A	N/A	N/A	N/A	N/A	Yes (133k Words)

Lesion Categories

The dataset spans 14 lesion categories identified through multi-label keyword matching over clinician-verified text fields: Sessile Polyps (411), Bleeding (252), Ulcers (160), Erythematous (112), Tumors (86), LST/Flat Polyps (85), Pedunculated Polyps (72), Angiectasia (55), Diverticulum (51), Mucosal Abnormalities (51), Crohn's (7), Hemorrhoids (5), Parasites (4), and Other (1).

Lesion Distribution and Qualitative Results

Colon-Bench Leaderboard

Summary of model performance across the four benchmark tasks: visual prompted and unprompted video Visual Question Answering (VQA) accuracy (with visual box prompts and without them), binary lesion classification precision/recall/F1, and open-vocabulary video segmentation IoU/Dice. The video segmentation is based on 3 box detections from each MLLM prompting the same EdgeTAM tracker. Best scores per metric are shown in bold green.

Model	VQA Accuracy		Lesion Classification				Segmentation
Model	Prompted	Unprompted	Accuracy	Precision	Recall	F1	IoU	Dice

Detailed VQA results for prompted and unprompted splits.

Comparisons

Detection
Segmentation

Colon-Bench Qualitative Detection Comparisons. Each row shows a different colonoscopy video frame: (1) erythematous region, (2) bleeding, (3) tumor, (4) ulcer, (5) pedunculated polyp, (6) sessile polyp. Columns show the input, ground-truth bounding box, and predictions from each model. Correct detections are in green; false positives are in red. These detections are used in the downstream box-level detection or prompted VQA via the FlipTAM tracker.

Annotation Pipeline Overview

The Colon-Bench annotation pipeline operates in three phases: (1) Temporal Proposals: A VLM detection agent scans full-procedure colonoscopy videos to identify candidate lesion windows. (2) Spatial Annotation: Bounding-box tracking via EdgeTAM and AI-driven confirmation with visual cues add dense spatial annotations while progressively filtering false positives. (3) Human-in-the-Loop Review: A reviewing physician validates the pre-rendered clips with spatial overlays, rejecting only 11.6% of presented windows, demonstrating strong agreement with the AI filters. The accepted annotations are then used to create a comprehensive Colon-Bench for MLLMs evaluations, covering multiple video tasks: Visual Question Answering (VQA), binary lesion classification, and Open-Vocabulary Video Object Segmentation (OV-VOS).

Colon-Bench Video Examples

Annotation Tool Demo

VQA Examples

Prompted VQA

Unprompted VQA

Segmentation Examples

Drag the slider to compare segmentation masks with the original video.

Without Masks

With Masks

Without Masks

With Masks

Without Masks

With Masks

Human Annotation Examples

Accepted

A small, sessile polyp is immediately visible on the right side of the colon wall at the start of the video. It is diminutive in size (approximately 3-5mm), pale pink to whitish in color, with a smooth surface and round shape. The polyp is later examined under narrow-band imaging (NBI) and then snared for removal.

Accepted

At 13.0s, a classic angiodysplasia (vascular ectasia) becomes clearly visible on a mucosal fold. The lesion presents as a flat, bright red, well-demarcated patch characterized by a cluster of dilated, tortuous, fern-like submucosal blood vessels.

Rejected

A small sessile polyp is first clearly visible at 4.0s on the crest of a haustral fold in the upper central part of the view. It appears pale, measures approximately 3-5mm in diameter, and has a slightly granular surface. A second, smaller polypoid lesion is also visible adjacent to it on the left side of the same fold as the camera moves closer.

Accepted

A pale, flat, sessile polyp becomes visible at 8.0s on the inferior wall (6 o'clock position) of the colon. The lesion is located on a haustral fold, appearing slightly elevated with a whitish/yellowish surface that contrasts with the surrounding pink vascular mucosa, features often associated with a sessile serrated lesion.

Rejected

A localized mucosal abnormality is visible on the left wall of the colon (approximately 8 o'clock position) from the very beginning of the clip. It appears as a flat or slightly depressed lesion with a yellowish-white fibrinous base or mucus cap, surrounded by a faint erythematous rim, resembling a superficial ulcer, biopsy site, or a sessile serrated lesion.

Rejected

A small, sessile, elevated lesion is immediately visible at 0.0s on the right colon wall, appearing as a smooth, dome-shaped mound with a distinct central depression (umbilication) and intact overlying vascular pattern, suggestive of a subepithelial lesion or neuroendocrine tumor.

Colon-Skill

We analyze common VQA errors from MLLMs to construct a novel colon-skill prompting strategy — a structured SKILL.md context file extracted by analyzing error patterns across lesion categories and failure modes. This skill is augmented to the MLLMs during VQA benchmarking, improving zero-shot performance by up to 9.7%.

SKILL.md — Colon-Skill

Universal Anti-Error Rules

Do not hallucinate stalks: The most common morphological error is calling a sessile (broad-based) polyp "pedunculated." If the lesion sits flatly on a haustral fold without a distinct, narrowed fibrous tether, it is a sessile or flat-elevated lesion.
Correctly interpret NBI (Narrow-Band Imaging): Under NBI, healthy mucosa appears greenish/cyan. Adenomas typically appear brownish with regular pit patterns, while hyperplastic or SSLs appear pale or whitish. Do not confuse white-light colors with NBI colors.
Differentiate holes from masses: Models consistently mistake diverticula (dark, hollow outpouchings) for depressed flat lesions or dark polyps. If a finding is a perfectly circular dark shadow with smooth margins, it is a hole/pocket, not a lesion.
Be conservative with size estimates: Models routinely overestimate diminutive lesions. Without tool reference, subtle, dome-shaped nodules on folds are usually 3–5mm (diminutive). Lesions occupying a large portion of the lumen are >15mm.
Accurately identify interventions:
- Water jet: Used for irrigation/clearing mucus or blood (not for coagulation).
- Needle catheter: Injects fluid to create a blue submucosal cushion (lifts flat lesions).
- Cold/Hot Snare: Wire loop used for Endoscopic Mucosal Resection (EMR) or polypectomy.

Lesion Morphology Cues by Category

Sessile Polyps (Paris Is / IIa): Broad base, typically on the crest of a haustral fold. Often pale, isochromatic, or slightly yellowish. Surface is smooth or subtly granular.
Sessile Serrated Lesions (SSL): Flat, subtle, pale/isochromatic. Key visual signatures include a "cloud-like" surface texture, indistinct/blurred borders, and a distinct "mucus cap."
Angioectasia / Angiodysplasia: Completely flush with mucosa. Bright, cherry-red, "fern-like" or stellate pattern of tortuous, dilated submucosal blood vessels.
Pedunculated Polyps (Paris Ip): Distinct stalk attached to the colonic wall. Head is bulbous, multi-lobulated, and noticeably redder than the stalk.
Ulcers: Deep or shallow punched-out depressions. White/yellow necrotic slough or fibrin exudate in the center, surrounded by raised, rolled, erythematous margins.
Diverticula: Distinct, dark, circular pockets with smooth, well-defined rims. Often seen in clusters.

Common Confusion Traps

Angioectasia vs. Suction Artifact vs. Dieulafoy: Suction artifacts are tiny, non-specific red spots. Dieulafoy/ulcers involve tissue defects. Angioectasia is a flat network of fern-like, branching dilated vessels without a mucosal defect.
Diverticulum vs. Depressed Lesion (Paris IIc): A diverticulum is a true anatomical hole (deep black center). A Paris IIc lesion is a mucosal depression with a visible base and irregular margins.
Mischaracterizing pale flat lesions: If a flat lesion is pale with a "cloud-like" surface and lacks rolled margins or central necrosis, it is an SSL, not an ulcer or tumor.
"Lobulated" vs. "Smooth" surface: Smooth dome-shaped pale nodules are typically hyperplastic or small sessile polyps. Lobulated/cerebriform surfaces with redder hues indicate adenomatous polyps.
Anatomical folds mistaken for masses: Prominent haustral folds curve predictably. Do not label a normal fold as a "sessile mass" unless there is a localized change in color, vascularity, or elevation.

Fast VQA Decision Checklist

Assess the geometry: Mass (protruding), defect (depressed/ulcerated), hole (diverticulum), or flat vascular anomaly?
Evaluate the attachment: Broadly attached to a fold (sessile) or hanging by a tether (pedunculated)?
Check the lighting mode: White light (pink/red hues) or NBI/BLI (green/brown/cyan hues)?
Examine the surface & margins: Smooth, cloud-like, granular, or lobulated? Margins sharp, rolled, or indistinct?
Identify active tools or residue: Mucus/stool, active bleeding, wire loop (snare), or blue fluid (submucosal injection)?

Colon-Skill. The SKILL.md file used as context augmented to the MLLMs in the VQA benchmarks. The skill is extracted by analysing error patterns under lesion categories and examples of failure modes.

Colon-Skill Results

Colon-Skill improves prompted VQA accuracy by up to 9.7%.

License

Colon-Bench is intended strictly for academic research, and any form of commercial use is prohibited. Copyright for all videos is retained by their owners. The dataset as a whole is licensed under the Creative Commons Attribution (CC BY) license, consistent with the licensing of the original REAL-COLON dataset.

BibTeX

@InProceedings{ HamdiAbd_ColonBench_MICCAI2026,
   author = { Hamdi, Abdullah AND Yang, Changchun AND Gao, Xin },
   title = { { Colon-Bench: An Agentic Workflow for Scalable Dense Lesion Annotation in Full-Procedure Colonoscopy Videos } },
   booktitle = {Medical Image Computing and Computer Assisted Intervention -- MICCAI 2026},
   year = {2026},
   publisher = {Springer Nature Switzerland},
}

🔬 Colon-Bench: An Agentic Workflow for Scalable Dense Lesion Annotation in Full-Procedure Colonoscopy Videos

MICCAI 2026

Abstract

Dataset Comparison

Lesion Categories

Colon-Bench Leaderboard

Results Visualization

Comparisons

Annotation Pipeline Overview

Colon-Bench Video Examples

Annotation Tool Demo

VQA Examples

Segmentation Examples

Human Annotation Examples

Colon-Skill

Universal Anti-Error Rules

Lesion Morphology Cues by Category

Common Confusion Traps

Fast VQA Decision Checklist

Colon-Skill Results

Related Links

License

BibTeX