🏥 MedVidBench Leaderboard
MedVidBench is a comprehensive benchmark for evaluating Video-Language Models on medical and surgical video understanding. It covers 8 tasks across 8 surgical datasets with 6,245 test samples, evaluated on 10 metrics including LLM-based caption judging.
📄 Paper | 🌐 Project Page | 💾 Dataset | 🤗 Model | 💻 GitHub | 🎮 Demo
Official Rankings (Verified)
Models on this leaderboard have been independently verified by the benchmark maintainers. We evaluate top community submissions by requesting model API access and running our evaluation pipeline directly.
This ensures reproducible and trustworthy results.
| Rank | Model | Org | CVS_acc | NAP_acc | SA_acc | STG_mIoU | TAG_mIoU@0.3 | TAG_mIoU@0.5 | DVC_F1 | DVC_llm | VS_llm | RC_llm | Verified (Date) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 🥇 1 | ✅ Qwen3VL-4B-MedGRPO | UII America (Ours) | 0.898 | 0.473 | 0.285 | 0.176 | 0.504 | 0.441 | 0.480 | 3.950 | 4.227 | 3.861 | 2026-04-15 |
| 🥈 2 | ✅ Qwen3.5-4B-SFT | UII America (Ours) | 0.897 | 0.576 | 0.354 | 0.190 | 0.482 | 0.429 | 0.451 | 3.741 | 4.238 | 3.746 | 2026-04-15 |
| 🥉 3 | ✅ Qwen2.5VL-7B-MedGRPO | UII America (Ours) | 0.896 | 0.405 | 0.254 | 0.202 | 0.216 | 0.156 | 0.214 | 3.797 | 4.184 | 3.442 | 2026-04-15 |
| 4 | ✅ Qwen3VL-4B-SFT | UII America (Ours) | 0.895 | 0.466 | 0.270 | 0.133 | 0.465 | 0.403 | 0.435 | 3.862 | 4.180 | 3.752 | 2026-04-15 |
| 5 | ✅ Qwen2.5VL-7B-SFT | UII America (Ours) | 0.894 | 0.442 | 0.218 | 0.177 | 0.142 | 0.091 | 0.165 | 3.665 | 3.596 | 2.757 | 2026-04-15 |
| 6 | ✅ Qwen3.5-4B | Qwen AI | 0.309 | 0.231 | 0.276 | 0.051 | 0.074 | 0.040 | 0.142 | 2.699 | 3.491 | 3.037 | 2026-04-15 |
| 7 | ✅ Gemini-3.1-flash-lite | Google | 0.242 | 0.406 | 0.225 | 0.059 | 0.072 | 0.049 | 0.174 | 3.198 | 3.737 | 3.492 | 2026-04-15 |
| 8 | ✅ GPT-5.4 | OpenAI | 0.164 | 0.393 | 0.267 | 0.004 | 0.086 | 0.055 | 0.178 | 3.403 | 3.976 | 3.714 | 2026-04-15 |
| 9 | ✅ Qwen2.5VL-7B | Qwen AI | 0.105 | 0.151 | 0.010 | 0.020 | 0.006 | 0.068 | 0.075 | 2.512 | 2.452 | 2.090 | 2026-04-15 |
| 10 | ✅ Gemini-2.5-Flash | Google | 0.101 | 0.228 | 0.107 | 0.047 | 0.045 | 0.021 | 0.084 | 2.387 | 2.352 | 1.912 | 2026-04-15 |
| 11 | ✅ GPT-4.1 | OpenAI | 0.018 | 0.250 | 0.087 | 0.014 | 0.096 | 0.005 | 0.101 | 2.438 | 2.490 | 2.080 | 2026-04-15 |
| 12 | ✅ Qwen3VL-4B | Qwen AI | 0.000 | 0.178 | 0.006 | 0.000 | 0.039 | 0.034 | 0.128 | 1.939 | 2.926 | 2.853 | 2026-04-15 |
| 13 | ✅ Qwen2.5VL-7B-Surg-CholecT50 | NVIDIA | 0.000 | 0.302 | 0.000 | 0.000 | 0.019 | 0.013 | 0.051 | 1.945 | 2.101 | 2.986 | 2026-04-15 |
| 14 | ✅ VideoChat-R1.5-7B | OpenGVLab | 0.000 | 0.270 | 0.006 | 0.000 | 0.009 | 0.005 | 0.026 | 1.723 | 3.034 | 3.086 | 2026-04-15 |
■ Best ■ 2nd Best ✅ = Verified by benchmark maintainers via model API access
How to get on the Official Leaderboard
- Submit your model predictions via the "Community Submissions" tab
- Top performers will be contacted by the benchmark maintainers
- Provide model API access so we can independently verify results
- Once verified, your model is added to the Official Leaderboard
For questions, contact us via GitHub.
Community Submissions
Community members run inference on their own machines and upload predictions via the 📤 Submit Results tab. Predictions are then evaluated on our server against the private ground truth.
| Rank | Model | Org | CVS_acc | NAP_acc | SA_acc | STG_mIoU | TAG_mIoU@0.3 | TAG_mIoU@0.5 | DVC_F1 | DVC_llm | VS_llm | RC_llm | Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 🥇 1 | Qwen3VL-4B-MedGRPO | UII America (Ours) | 0.898 | 0.473 | 0.285 | 0.176 | 0.504 | 0.441 | 0.480 | 3.950 | 4.227 | 3.861 | 2025-01-14 |
| 🥈 2 | Qwen3.5-4B-SFT | UII America (Ours) | 0.897 | 0.576 | 0.354 | 0.190 | 0.482 | 0.429 | 0.451 | 3.741 | 4.238 | 3.746 | 2026-04-13 |
| 🥉 3 | Qwen2.5VL-7B-MedGRPO | UII America (Ours) | 0.896 | 0.405 | 0.254 | 0.202 | 0.216 | 0.156 | 0.214 | 3.797 | 4.184 | 3.442 | 2025-01-14 |
| 4 | Qwen3VL-4B-SFT | UII America (Ours) | 0.895 | 0.466 | 0.270 | 0.133 | 0.465 | 0.403 | 0.435 | 3.862 | 4.180 | 3.752 | 2025-01-14 |
| 5 | Qwen2.5VL-7B-SFT | UII America (Ours) | 0.894 | 0.442 | 0.218 | 0.177 | 0.142 | 0.091 | 0.165 | 3.665 | 3.596 | 2.757 | 2025-01-14 |
| 6 | Qwen3.5-4B | Qwen AI | 0.309 | 0.231 | 0.276 | 0.051 | 0.074 | 0.040 | 0.142 | 2.699 | 3.491 | 3.037 | 2026-04-13 |
| 7 | Gemini-3.1-flash-lite | Google | 0.242 | 0.406 | 0.225 | 0.059 | 0.072 | 0.049 | 0.174 | 3.198 | 3.737 | 3.492 | 2026-04-13 |
| 8 | GPT-5.4 | OpenAI | 0.164 | 0.393 | 0.267 | 0.004 | 0.086 | 0.055 | 0.178 | 3.403 | 3.976 | 3.714 | 2026-04-13 |
| 9 | Qwen2.5VL-7B | Qwen AI | 0.105 | 0.151 | 0.010 | 0.020 | 0.006 | 0.068 | 0.075 | 2.512 | 2.452 | 2.090 | 2025-01-14 |
| 10 | Gemini-2.5-Flash | Google | 0.101 | 0.228 | 0.107 | 0.047 | 0.045 | 0.021 | 0.084 | 2.387 | 2.352 | 1.912 | 2025-01-14 |
| 11 | GPT-4.1 | OpenAI | 0.018 | 0.250 | 0.087 | 0.014 | 0.096 | 0.005 | 0.101 | 2.438 | 2.490 | 2.080 | 2025-01-14 |
| 12 | Qwen3VL-4B | Qwen AI | 0.000 | 0.178 | 0.006 | 0.000 | 0.039 | 0.034 | 0.128 | 1.939 | 2.926 | 2.853 | 2025-01-14 |
| 13 | Qwen2.5VL-7B-Surg-CholecT50 | NVIDIA | 0.000 | 0.302 | 0.000 | 0.000 | 0.019 | 0.013 | 0.051 | 1.945 | 2.101 | 2.986 | 2025-01-14 |
| 14 | VideoChat-R1.5-7B | OpenGVLab | 0.000 | 0.270 | 0.006 | 0.000 | 0.009 | 0.005 | 0.026 | 1.723 | 3.034 | 3.086 | 2025-01-14 |
■ Best ■ 2nd Best 🥇 1st 🥈 2nd 🥉 3rd overall
Submit Your Model Results
Evaluation is a two-step process:
| Step | What happens | Time |
|---|---|---|
| Step 1 | Upload predictions -- evaluates CVS, NAP, SA, STG, TAG, DVC_F1 | ~2-5 min |
| Step 2 | Run LLM Judge -- evaluates DVC_llm, VS_llm, RC_llm caption quality | ~10-20 min (background) |
Step 1: Upload Predictions
Upload only your model's predictions on the MedVidBench test set (6,245 samples).
Expected file format:

```json
[
  {
    "id": "video_id&&start&&end&&fps",
    "qa_type": "tal",
    "prediction": "Your model's answer here"
  },
  {
    "id": "another_video&&0&&10&&1.0",
    "qa_type": "video_summary",
    "prediction": "The surgeon performs..."
  }
]
```
Required fields:
- `id`: Sample identifier (matches the test data from the HuggingFace dataset)
- `qa_type`: Task type (`tal` / `stg` / `next_action` / `dense_captioning` / `video_summary` / `region_caption` / `skill_assessment` / `cvs_assessment`)
- `prediction`: Your model's answer (text output)
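The packed `id` string can be split back into its parts with a small helper. This is an illustrative sketch only, following the format shown above; it is not part of the benchmark tooling:

```python
def parse_sample_id(sample_id: str):
    """Split a packed id like "video_id&&start&&end&&fps" into its parts."""
    video_id, start, end, fps = sample_id.split("&&")
    return video_id, float(start), float(end), float(fps)

print(parse_sample_id("another_video&&0&&10&&1.0"))
# ('another_video', 0.0, 10.0, 1.0)
```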
Important: Submit predictions only (no ground truth needed). The server merges them with the private ground truth and evaluates securely.
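As a rough sketch of assembling the file, assuming you load the test samples from the HuggingFace dataset and plug in your own inference code (`run_model` here is a hypothetical stand-in):

```python
import json

def build_submission(samples, run_model):
    """Collect model outputs in the expected submission format."""
    return [
        {
            "id": s["id"],               # packed id from the test data
            "qa_type": s["qa_type"],     # one of the task types listed above
            "prediction": run_model(s),  # your model's text answer
        }
        for s in samples
    ]

# Toy stand-ins so the sketch runs end to end.
samples = [{"id": "another_video&&0&&10&&1.0", "qa_type": "video_summary"}]
preds = build_submission(samples, run_model=lambda s: "The surgeon performs...")
with open("predictions.json", "w") as f:
    json.dump(preds, f, indent=2)
```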
Step 2: Run LLM Judge (Caption Metrics)
After Step 1 completes, the caption metrics (DVC_llm, VS_llm, RC_llm) will show as 0.0. Run the LLM Judge in this tab to compute them using GPT-4.1 or Gemini.
- Enter the exact model name you used in Step 1
- The evaluation runs in the background -- you can close the browser and come back later
- Check progress anytime with the "Check Status" button
About MedVidBench
MedVidBench is a comprehensive benchmark for evaluating Video-Language Models on medical and surgical video understanding, introduced in the MedGRPO paper. It spans 8 tasks across 8 surgical datasets with 6,245 test samples.
Benchmark Tasks and Evaluation Metrics
- TAL (Temporal Action Localization): mIoU@0.3 / mIoU@0.5 - temporal localization quality at IoU thresholds 0.3 and 0.5 (the TAG_mIoU columns in the tables)
- STG (Spatiotemporal Grounding): mIoU - mean Intersection over Union (spatial + temporal)
- Next Action Prediction (NAP): Accuracy - classification accuracy
- DVC (Dense Video Captioning): LLM Judge (DVC_llm) + F1 for temporal localization (DVC_F1)
- VS (Video Summary): LLM Judge - caption quality scoring
- RC (Region Caption): LLM Judge - caption quality scoring
- Skill Assessment (SA): Accuracy - surgical skill level classification (JIGSAWS)
- CVS Assessment: Accuracy - Critical View of Safety assessment (Cholec80_CVS)
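For intuition, here is a minimal sketch of temporal IoU and a thresholded score in the spirit of the TAG_mIoU@0.3/0.5 columns. The official evaluation scripts are the authority on the exact definition; this only illustrates the idea:

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0.0 else 0.0

def score_at_iou(preds, gts, threshold):
    """Fraction of samples whose segment IoU meets the threshold
    (one common reading of a metric reported "@0.3" / "@0.5")."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    return sum(iou >= threshold for iou in ious) / len(ious)

preds = [(2.0, 8.0), (10.0, 14.0)]
gts = [(3.0, 9.0), (20.0, 25.0)]
print(score_at_iou(preds, gts, 0.5))  # 0.5: the pair IoUs are ~0.71 and 0.0
```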
LLM Judge Details
Caption tasks (DVC, VS, RC) use GPT-4.1 or Gemini-Pro with rubric-based scoring (1-5 scale) across 5 key aspects: R2 (Relevance & Medical Terminology), R4 (Actionable Surgical Actions), R5 (Comprehensive Detail Level), R7 (Anatomical & Instrument Precision), R8 (Clinical Context & Coherence). The final score is the average across these 5 aspects.
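The averaging step itself is simple. As an illustration (the scores are made up; the aspect names come from the rubric above):

```python
# Hypothetical judge output for one caption, each aspect scored 1-5.
aspect_scores = {"R2": 4, "R4": 3, "R5": 5, "R7": 4, "R8": 4}
final_score = sum(aspect_scores.values()) / len(aspect_scores)
print(final_score)  # 4.0 -- the average across the 5 aspects
```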
Test Set Statistics
- Total samples: 6,245
- Source datasets: 8 (AVOS, CholecT50, CholecTrack20, Cholec80_CVS, CoPESD, EgoSurgery, NurViD, JIGSAWS)
- Video frames: ~103,742
- Training samples: 21,060
Citation
```bibtex
@article{su2024medgrpo,
  title={MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding},
  author={Su, Yuhao and Choudhuri, Anwesa and Gao, Zhongpai and Planche, Benjamin and Nguyen, Van Nguyen and Zheng, Meng and Shen, Yuhan and Innanje, Arun and Chen, Terrence and Elhamifar, Ehsan and Wu, Ziyan},
  journal={arXiv preprint arXiv:2512.06581},
  year={2025}
}
```
License
- Dataset: CC BY-NC-SA 4.0 (Non-commercial, Share-alike)
- Leaderboard Code: Apache 2.0
- Evaluation Scripts: MIT
Contact
For questions or issues, open an issue on GitHub or visit the project page.
Admin Panel
Manage both the Official Leaderboard (verified models) and Community Submissions.
Note: Admin password is set via ADMIN_PASSWORD environment variable in HuggingFace Spaces settings.