🔬 Colon-Bench: An Agentic Workflow for Scalable Dense Lesion Annotation in Full-Procedure Colonoscopy Videos

MICCAI 2026

Abdullah Hamdi, Changchun Yang, Xin Gao
King Abdullah University of Science and Technology (KAUST)

Colon-Bench is a comprehensive multi-task video benchmark for colonoscopy understanding, spanning 14 lesion categories with over 300k bounding boxes, 213k segmentation masks, and 133k words of clinical descriptions. Built via a novel agentic annotation workflow, it enables rigorous evaluation of state-of-the-art MLLMs on lesion classification, open-vocabulary video object segmentation, and video visual question answering.


Abstract

Early screening via colonoscopy is critical for colon cancer prevention, yet developing robust AI systems for this domain is hindered by the lack of densely annotated, long-sequence video datasets. Existing datasets predominantly focus on single-class polyp detection and lack the rich spatial, temporal, and linguistic annotations required to evaluate modern Multimodal Large Language Models (MLLMs). To address this critical gap, we introduce Colon-Bench, generated via a novel multi-stage agentic workflow. Our pipeline seamlessly integrates temporal proposals, bounding-box tracking, AI-driven visual confirmation, and human-in-the-loop review to scalably annotate full-procedure videos. The resulting verified benchmark is unprecedented in scope, encompassing 528 videos, 14 distinct lesion categories (including polyps, ulcers, and bleeding), over 300,000 bounding boxes, 213,000 segmentation masks, and 133,000 words of clinical descriptions. We utilize Colon-Bench to rigorously evaluate state-of-the-art MLLMs across lesion classification, Open-Vocabulary Video Object Segmentation (OV-VOS), and Visual Question Answering (VQA). The MLLM results demonstrate surprisingly high localization performance in medical domains compared to SAM-3. Finally, we analyze common VQA errors from MLLMs to introduce a novel "colon-skill" prompting strategy, improving zero-shot MLLM performance by up to 9.7% across most MLLMs.

Keywords: Colonoscopy, Lesion Detection, Polyps, MLLMs, Video Segmentation, VQA, Agentic Workflow

Dataset Comparison

Table 1. Colon-Bench provides a broader lesion taxonomy and richer supervision compared to prior colonoscopy datasets.

Attribute | Kvasir-SEG | SUN | PolypGen | REAL-Colon | CAS-Colon | Colon-Bench (Ours)
Year | 2020 | 2021 | 2023 | 2024 | 2025 | 2026
Focus | Segmentation | Detection | Segmentation | Detection | Anatomy | Lesion, Video Segmentation, & Text
Number of Videos | N/A | N/A | N/A | 60 | 78 | 528
Total Frames/Images | 1,000 | 158,690 | 6,282 | 2,757,723 | 1,961,100 | 464,035
Lesion Classes | 1 (Polyp) | 1 (Polyp) | 1 (Polyp) | 1 (Polyp) | N/A | 14
Bounding Boxes | N/A | 158,690 | 6,282 | 351,264 | N/A | 300,132
Segmentation Masks | 1,000 | N/A | 6,282 | N/A | Yes | 213,067
Language Descriptions | N/A | N/A | N/A | N/A | N/A | Yes (133k Words)

Lesion Categories

The dataset spans 14 lesion categories identified through multi-label keyword matching over clinician-verified text fields: Sessile Polyps (411), Bleeding (252), Ulcers (160), Erythematous (112), Tumors (86), LST/Flat Polyps (85), Pedunculated Polyps (72), Angiectasia (55), Diverticulum (51), Mucosal Abnormalities (51), Crohn's (7), Hemorrhoids (5), Parasites (4), and Other (1).
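The multi-label keyword matching described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the keyword lists in `CATEGORY_KEYWORDS`, the function name `match_categories`, and the fallback to "Other" when nothing matches are all assumptions, since the page does not list the actual vocabulary.

```python
import re

# Hypothetical keyword lists for a few categories; the real vocabulary
# used to build Colon-Bench is not given on this page.
CATEGORY_KEYWORDS = {
    "Sessile Polyps": ["sessile"],
    "Pedunculated Polyps": ["pedunculated", "stalk"],
    "Bleeding": ["bleeding", "hemorrhage"],
    "Ulcers": ["ulcer"],
    "Diverticulum": ["diverticulum", "diverticula"],
}


def match_categories(text: str) -> list[str]:
    """Return every category whose keywords occur in the text (multi-label)."""
    lowered = text.lower()
    labels = []
    for category, keywords in CATEGORY_KEYWORDS.items():
        # Match at a word start so "ulcer" also catches "ulcers" and "ulcerated".
        if any(re.search(r"\b" + re.escape(kw), lowered) for kw in keywords):
            labels.append(category)
    # Assumed fallback: descriptions with no keyword hit go to "Other".
    return labels or ["Other"]
```

Because a single description can mention several findings, one clip may receive several labels, which is why the per-category counts sum to more than the number of videos. A plain keyword match like this does not handle negation ("no stalk"), so a real pipeline would need an extra check for it.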

Lesion Distribution and Qualitative Results

Colon-Bench Leaderboard

Summary of model performance across the four benchmark tasks: video Visual Question Answering (VQA) accuracy with visual box prompts (prompted) and without them (unprompted), binary lesion classification accuracy/precision/recall/F1, and open-vocabulary video segmentation IoU/Dice. For segmentation, 3 box detections from each MLLM are used to prompt the same EdgeTAM tracker. Best scores per metric are shown in bold green.

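Model | VQA Accuracy (Prompted) | VQA Accuracy (Unprompted) | Lesion Classification (Accuracy) | Lesion Classification (Precision) | Lesion Classification (Recall) | Lesion Classification (F1) | Segmentation (IoU) | Segmentation (Dice)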
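For reference, the classification and segmentation metrics named in the leaderboard can be computed as in the sketch below. It uses the standard definitions only. How Colon-Bench aggregates them (per frame, per clip, or per video) is not stated here, and the function names and the convention of scoring two empty masks as 1.0 are assumptions.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = lesion)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def mask_iou_dice(gt_mask, pred_mask):
    """IoU and Dice for two flattened binary masks of equal length."""
    inter = sum(1 for g, p in zip(gt_mask, pred_mask) if g and p)
    gt_area = sum(1 for g in gt_mask if g)
    pred_area = sum(1 for p in pred_mask if p)
    union = gt_area + pred_area - inter
    # Assumed convention: two empty masks count as a perfect match.
    iou = inter / union if union else 1.0
    dice = 2 * inter / (gt_area + pred_area) if gt_area + pred_area else 1.0
    return iou, dice
```

Dice is never lower than IoU for the same pair of masks (Dice = 2·IoU / (1 + IoU)), so the two segmentation columns rank models identically on any single mask pair and can differ only after averaging.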

Results Visualization

Detailed VQA results for prompted and unprompted splits.

VQA Prompted Results VQA Unprompted Results

Comparisons

Colon-Bench Qualitative Detection Comparisons. Each row shows a different colonoscopy video frame: (1) erythematous region, (2) bleeding, (3) tumor, (4) ulcer, (5) pedunculated polyp, (6) sessile polyp. Columns show the input, ground-truth bounding box, and predictions from each model. Correct detections are in green; false positives are in red. These detections are used in the downstream box-level detection or prompted VQA via the EdgeTAM tracker.
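One common way to decide whether a predicted box counts as correct or as a false positive is an IoU threshold against the ground-truth box. The sketch below illustrates that idea only: the caption does not state the criterion used for the green/red coloring, so the 0.5 threshold and the function names are assumptions.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes give no intersection area.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_detection(pred_box, gt_box, threshold=0.5):
    """Label a prediction 'correct' or 'false positive' (threshold is assumed)."""
    return "correct" if box_iou(pred_box, gt_box) >= threshold else "false positive"
```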

Detection Comparisons

Annotation Pipeline Overview

Colon-Bench Pipeline

The Colon-Bench annotation pipeline operates in three phases: (1) Temporal Proposals: A VLM detection agent scans full-procedure colonoscopy videos to identify candidate lesion windows. (2) Spatial Annotation: Bounding-box tracking via EdgeTAM and AI-driven confirmation with visual cues add dense spatial annotations while progressively filtering false positives. (3) Human-in-the-Loop Review: A reviewing physician validates the pre-rendered clips with spatial overlays and rejects only 11.6% of the presented windows, which indicates strong agreement with the AI filters. The accepted annotations are then used to build the Colon-Bench benchmark for MLLM evaluation, covering multiple video tasks: Visual Question Answering (VQA), binary lesion classification, and Open-Vocabulary Video Object Segmentation (OV-VOS).
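The control flow of the three phases can be sketched as a chain of filters. This is a structural illustration under stated assumptions, not the released pipeline: `Window`, `run_pipeline`, and the four callables (`propose`, `track`, `confirm`, `review`) are hypothetical names standing in for the VLM detection agent, the EdgeTAM tracker, the AI confirmation step, and the physician review.

```python
from dataclasses import dataclass, field


@dataclass
class Window:
    """A candidate lesion window, as a frame range plus tracked boxes."""
    start: int
    end: int
    boxes: dict = field(default_factory=dict)  # frame index -> (x1, y1, x2, y2)


def run_pipeline(num_frames, propose, track, confirm, review):
    """Run proposals through tracking, AI confirmation, and human review."""
    accepted, presented = [], 0
    for window in propose(num_frames):      # Phase 1: temporal proposals
        window.boxes = track(window)        # Phase 2: bounding-box tracking
        if not window.boxes or not confirm(window):
            continue                        # Phase 2: AI filter drops false positives
        presented += 1
        if review(window):                  # Phase 3: human-in-the-loop review
            accepted.append(window)
    # Share of presented windows that the reviewer rejected.
    rejection_rate = 1 - len(accepted) / presented if presented else 0.0
    return accepted, rejection_rate


# Usage with stub stages (all behavior here is made up for illustration).
accepted, rate = run_pipeline(
    num_frames=100,
    propose=lambda n: [Window(0, 10), Window(20, 30), Window(40, 50)],
    track=lambda w: {f: (0, 0, 1, 1) for f in range(w.start, w.end)},
    confirm=lambda w: w.start != 20,
    review=lambda w: w.start != 40,
)
```

Ordering the stages this way means the reviewer only sees windows that already passed the automated filters. The reported 11.6% figure is a rejection rate computed over those presented windows.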

Colon-Bench Video Examples

Annotation Tool Demo

VQA Examples

Prompted VQA

Unprompted VQA

Segmentation Examples

Three paired comparisons show each video without masks and with the segmentation masks overlaid.

Human Annotation Examples

Colon-Skill

We analyze common VQA errors from MLLMs to construct a novel colon-skill prompting strategy: a structured SKILL.md context file distilled from error patterns across lesion categories and failure modes. The skill is supplied to the MLLMs as additional context during VQA benchmarking, improving zero-shot performance by up to 9.7%.

SKILL.md — Colon-Skill

Universal Anti-Error Rules

  • Do not hallucinate stalks: The most common morphological error is calling a sessile (broad-based) polyp "pedunculated." If the lesion sits flatly on a haustral fold without a distinct, narrowed fibrous tether, it is a sessile or flat-elevated lesion.
  • Correctly interpret NBI (Narrow-Band Imaging): Under NBI, healthy mucosa appears greenish/cyan. Adenomas typically appear brownish with regular pit patterns, while hyperplastic or SSLs appear pale or whitish. Do not confuse white-light colors with NBI colors.
  • Differentiate holes from masses: Models consistently mistake diverticula (dark, hollow outpouchings) for depressed flat lesions or dark polyps. If a finding is a perfectly circular dark shadow with smooth margins, it is a hole/pocket, not a lesion.
  • Be conservative with size estimates: Models routinely overestimate diminutive lesions. Without a tool in view for reference, subtle, dome-shaped nodules on folds are usually 3–5 mm (diminutive). Lesions occupying a large portion of the lumen are >15 mm.
  • Accurately identify interventions:
    • Water jet: Used for irrigation/clearing mucus or blood (not for coagulation).
    • Needle catheter: Injects fluid to create a blue submucosal cushion (lifts flat lesions).
    • Cold/Hot Snare: Wire loop used for Endoscopic Mucosal Resection (EMR) or polypectomy.

Lesion Morphology Cues by Category

  • Sessile Polyps (Paris Is / IIa): Broad base, typically on the crest of a haustral fold. Often pale, isochromatic, or slightly yellowish. Surface is smooth or subtly granular.
  • Sessile Serrated Lesions (SSL): Flat, subtle, pale/isochromatic. Key visual signatures include a "cloud-like" surface texture, indistinct/blurred borders, and a distinct "mucus cap."
  • Angioectasia / Angiodysplasia: Completely flush with mucosa. Bright, cherry-red, "fern-like" or stellate pattern of tortuous, dilated submucosal blood vessels.
  • Pedunculated Polyps (Paris Ip): Distinct stalk attached to the colonic wall. Head is bulbous, multi-lobulated, and noticeably redder than the stalk.
  • Ulcers: Deep or shallow punched-out depressions. White/yellow necrotic slough or fibrin exudate in the center, surrounded by raised, rolled, erythematous margins.
  • Diverticula: Distinct, dark, circular pockets with smooth, well-defined rims. Often seen in clusters.

Common Confusion Traps

  • Angioectasia vs. Suction Artifact vs. Dieulafoy: Suction artifacts are tiny, non-specific red spots. Dieulafoy/ulcers involve tissue defects. Angioectasia is a flat network of fern-like, branching dilated vessels without a mucosal defect.
  • Diverticulum vs. Depressed Lesion (Paris IIc): A diverticulum is a true anatomical hole (deep black center). A Paris IIc lesion is a mucosal depression with a visible base and irregular margins.
  • Mischaracterizing pale flat lesions: If a flat lesion is pale with a "cloud-like" surface and lacks rolled margins or central necrosis, it is an SSL, not an ulcer or tumor.
  • "Lobulated" vs. "Smooth" surface: Smooth dome-shaped pale nodules are typically hyperplastic or small sessile polyps. Lobulated/cerebriform surfaces with redder hues indicate adenomatous polyps.
  • Anatomical folds mistaken for masses: Prominent haustral folds curve predictably. Do not label a normal fold as a "sessile mass" unless there is a localized change in color, vascularity, or elevation.

Fast VQA Decision Checklist

  1. Assess the geometry: Mass (protruding), defect (depressed/ulcerated), hole (diverticulum), or flat vascular anomaly?
  2. Evaluate the attachment: Broadly attached to a fold (sessile) or hanging by a tether (pedunculated)?
  3. Check the lighting mode: White light (pink/red hues) or NBI/BLI (green/brown/cyan hues)?
  4. Examine the surface & margins: Smooth, cloud-like, granular, or lobulated? Margins sharp, rolled, or indistinct?
  5. Identify active tools or residue: Mucus/stool, active bleeding, wire loop (snare), or blue fluid (submucosal injection)?

Colon-Skill. The SKILL.md file supplied to the MLLMs as additional context in the VQA benchmarks. The skill is extracted by analyzing error patterns across lesion categories and examples of failure modes.
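Supplying the skill as context amounts to prepending the SKILL.md text to each VQA query. The sketch below shows one way this could look. It assumes a multiple-choice question format and a plain-text prompt, neither of which is specified on this page, and `build_vqa_prompt` and the section labels are hypothetical.

```python
def build_vqa_prompt(question, options, skill_text=None):
    """Assemble a multiple-choice VQA prompt, optionally prefixed by the skill."""
    parts = []
    if skill_text:
        # Skill context goes first so the model reads the rules before the question.
        parts.append("Reference guidance (SKILL.md):\n" + skill_text.strip())
    parts.append("Question: " + question)
    # Label the options A, B, C, ... on consecutive lines.
    parts.append("\n".join(f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options)))
    parts.append("Answer with a single option letter.")
    return "\n\n".join(parts)
```

Calling the function with and without `skill_text` on the same questions gives the two conditions needed to measure the effect of the skill in a zero-shot setting, since no model weights change between them.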

Colon-Skill Results

Colon-Skill improves prompted VQA accuracy by up to 9.7%.

Colon-Skill VQA Results

License

Colon-Bench is intended strictly for academic research, and any form of commercial use is prohibited. Copyright for all videos is retained by their owners. The dataset as a whole is licensed under the Creative Commons Attribution (CC BY) license, consistent with the licensing of the original REAL-Colon dataset.

BibTeX

@InProceedings{ HamdiAbd_ColonBench_MICCAI2026,
   author = { Hamdi, Abdullah AND Yang, Changchun AND Gao, Xin },
   title = { { Colon-Bench: An Agentic Workflow for Scalable Dense Lesion Annotation in Full-Procedure Colonoscopy Videos } },
   booktitle = {Medical Image Computing and Computer Assisted Intervention -- MICCAI 2026},
   year = {2026},
   publisher = {Springer Nature Switzerland},
}