🏥 MedVidBench Leaderboard
MedVidBench is a comprehensive benchmark for evaluating Video-Language Models on medical and surgical video understanding. It covers 8 tasks across 8 surgical datasets with 6,245 test samples, evaluated on 10 metrics including LLM-based caption judging.
📄 Paper · 🌐 Project Page · 💾 Dataset · 🤗 Model · 💻 GitHub · 🎮 Demo
Official Rankings (Verified)
Models on this leaderboard have been independently verified by the benchmark maintainers: we evaluate top community submissions by requesting model API access and running our evaluation pipeline directly, which ensures reproducible and trustworthy results.
| Rank | Model | Team | CVS_acc | NAP_acc | SA_acc | STG_mIoU | TAG_mIoU@0.3 | TAG_mIoU@0.5 | DVC_F1 | DVC_llm | VS_llm | RC_llm | Verified (date) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 🥇1 | uAI-NEXUS-MedVLM-1.0b-4B-RL | UII | 0.898 | 0.473 | 0.285 | 0.176 | 0.504 | 0.441 | 0.480 | 3.950 | 4.227 | 3.861 | 2026-04-15 |
| 🥈2 | uAI-NEXUS-MedVLM-1.0c-4B-SFT | UII | 0.897 | 0.576 | 0.354 | 0.190 | 0.482 | 0.429 | 0.451 | 3.741 | 4.238 | 3.746 | 2026-04-15 |
| 🥉3 | uAI-NEXUS-MedVLM-1.0d-4B-SFT | UII | 0.893 | 0.513 | 0.290 | 0.138 | 0.386 | 0.319 | 0.367 | 3.644 | 4.246 | 3.644 | 2026-05-15 |
| 4 | uAI-NEXUS-MedVLM-1.0b-4B-SFT | UII | 0.895 | 0.466 | 0.270 | 0.133 | 0.465 | 0.403 | 0.435 | 3.862 | 4.180 | 3.752 | 2026-04-15 |
| 5 | uAI-NEXUS-MedVLM-1.0a-7B-RL | UII | 0.896 | 0.405 | 0.254 | 0.202 | 0.216 | 0.156 | 0.214 | 3.797 | 4.184 | 3.442 | 2026-04-15 |
| 6 | uAI-NEXUS-MedVLM-1.0a-7B-SFT | UII | 0.894 | 0.442 | 0.218 | 0.177 | 0.142 | 0.091 | 0.165 | 3.665 | 3.596 | 2.757 | 2026-04-15 |
| 7 | GPT-5.4* | OpenAI | 0.164 | 0.393 | 0.267 | 0.004 | 0.086 | 0.055 | 0.178 | 3.403 | 3.976 | 3.714 | 2026-04-15 |
| 9 | Gemini-3.1-flash-lite* | Google | 0.242 | 0.406 | 0.225 | 0.059 | 0.072 | 0.049 | 0.174 | 3.198 | 3.737 | 3.492 | 2026-04-15 |
| 17 | Qwen3.5-4B* | Alibaba | 0.309 | 0.231 | 0.276 | 0.051 | 0.074 | 0.040 | 0.142 | 2.699 | 3.491 | 3.037 | 2026-04-15 |
| 23 | GPT-4.1* | OpenAI | 0.018 | 0.250 | 0.087 | 0.014 | 0.096 | 0.005 | 0.101 | 2.438 | 2.490 | 2.080 | 2026-04-15 |
| 25 | Gemini-2.5-Flash* | Google | 0.101 | 0.228 | 0.107 | 0.047 | 0.045 | 0.021 | 0.084 | 2.387 | 2.352 | 1.912 | 2026-04-15 |
| 27 | Qwen3VL-4B* | Alibaba | 0.000 | 0.178 | 0.006 | 0.000 | 0.039 | 0.034 | 0.128 | 1.939 | 2.926 | 2.853 | 2026-04-15 |
| 28 | Qwen2.5VL-7B* | Alibaba | 0.105 | 0.151 | 0.010 | 0.020 | 0.006 | 0.068 | 0.075 | 2.512 | 2.452 | 2.090 | 2026-04-15 |
| 29 | Qwen2.5VL-7B-Surg-CholecT50 | NVIDIA | 0.000 | 0.302 | 0.000 | 0.000 | 0.019 | 0.013 | 0.051 | 1.945 | 2.101 | 2.986 | 2026-04-15 |
| 30 | VideoChat-R1.5-7B* | OpenGVLab | 0.000 | 0.270 | 0.006 | 0.000 | 0.009 | 0.005 | 0.026 | 1.723 | 3.034 | 3.086 | 2026-04-15 |
■ Best · ■ 2nd Best · ✅ = user submission verified by maintainers via model API · * = off-the-shelf models
How to get on the Official Leaderboard
1. Submit your model predictions via the "Community Submissions" tab.
2. Top performers will be contacted by the benchmark maintainers.
3. Provide model API access so we can independently verify results.
4. Once verified, your model is added to the Official Leaderboard.
For questions, contact us via GitHub.
Community Submissions
Community members run inference on their own machines and upload predictions via the 📤 Submit Results tab. Predictions are then scored on our server against private ground truth.
| Rank | Model | Team | CVS_acc | NAP_acc | SA_acc | STG_mIoU | TAG_mIoU@0.3 | TAG_mIoU@0.5 | DVC_F1 | DVC_llm | VS_llm | RC_llm | Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 🥇1 | uAI-NEXUS-MedVLM-1.0b-4B-RL | UII | 0.898 | 0.473 | 0.285 | 0.176 | 0.504 | 0.441 | 0.480 | 3.950 | 4.227 | 3.861 | 2025-01-14 |
| 🥈2 | uAI-NEXUS-MedVLM-1.0c-4B-SFT | UII | 0.897 | 0.576 | 0.354 | 0.190 | 0.482 | 0.429 | 0.451 | 3.741 | 4.238 | 3.746 | 2026-04-13 |
| 🥉3 | uAI-NEXUS-MedVLM-1.0d-4B-SFT | UII | 0.893 | 0.513 | 0.290 | 0.138 | 0.386 | 0.319 | 0.367 | 3.644 | 4.246 | 3.644 | 2026-05-15 |
| 4 | uAI-NEXUS-MedVLM-1.0b-4B-SFT | UII | 0.895 | 0.466 | 0.270 | 0.133 | 0.465 | 0.403 | 0.435 | 3.862 | 4.180 | 3.752 | 2025-01-14 |
| 5 | uAI-NEXUS-MedVLM-1.0a-7B-RL | UII | 0.896 | 0.405 | 0.254 | 0.202 | 0.216 | 0.156 | 0.214 | 3.797 | 4.184 | 3.442 | 2025-01-14 |
| 6 | uAI-NEXUS-MedVLM-1.0a-7B-SFT | UII | 0.894 | 0.442 | 0.218 | 0.177 | 0.142 | 0.091 | 0.165 | 3.665 | 3.596 | 2.757 | 2025-01-14 |
| 7 | GPT-5.4* | OpenAI | 0.164 | 0.393 | 0.267 | 0.004 | 0.086 | 0.055 | 0.178 | 3.403 | 3.976 | 3.714 | 2026-04-13 |
| 8 | Qwen3.6-27B | happybunny | 0.479 | 0.382 | 0.230 | 0.094 | 0.127 | 0.084 | 0.166 | 3.111 | 3.814 | 3.078 | 2026-04-25 |
| 9 | Gemini-3.1-flash-lite* | Google | 0.242 | 0.406 | 0.225 | 0.059 | 0.072 | 0.049 | 0.174 | 3.198 | 3.737 | 3.492 | 2026-04-13 |
| 10 | test | test | 0.891 | 0.400 | 0.242 | 0.168 | 0.209 | 0.152 | 0.215 | 0.000 | 0.000 | 0.000 | 2026-05-10 |
| 11 | test2 | test | 0.891 | 0.400 | 0.242 | 0.168 | 0.209 | 0.152 | 0.215 | 0.000 | 0.000 | 0.000 | 2026-05-10 |
| 12 | test5 | test | 0.891 | 0.400 | 0.242 | 0.168 | 0.209 | 0.152 | 0.215 | 0.000 | 0.000 | 0.000 | 2026-05-10 |
| 13 | Qwen3.6-27B-Think | happybunny | 0.374 | 0.257 | 0.249 | 0.000 | 0.124 | 0.084 | 0.176 | 2.200 | 4.246 | 2.459 | 2026-04-25 |
| 14 | test4 | test | 0.890 | 0.394 | 0.256 | 0.170 | 0.208 | 0.150 | 0.215 | 0.000 | 0.000 | 0.000 | 2026-05-10 |
| 15 | uAI-NEXUS-MedVLM-1.0a-7B-SFT-test01 | uAI | 0.968 | 0.385 | 0.425 | 0.000 | 0.078 | 0.052 | 0.134 | 2.033 | 1.328 | 1.706 | 2026-05-11 |
| 16 | uAI-NEXUS-MedVLM-1.0a-7B-SFT-test | uAI | 0.968 | 0.385 | 0.425 | 0.000 | 0.078 | 0.052 | 0.134 | 2.016 | 1.327 | 1.705 | 2026-05-11 |
| 17 | Qwen3.5-4B* | Alibaba | 0.309 | 0.231 | 0.276 | 0.051 | 0.074 | 0.040 | 0.142 | 2.699 | 3.491 | 3.037 | 2026-04-13 |
| 18 | LongCat-Next | Meituan | 0.228 | 0.324 | 0.286 | 0.001 | 0.010 | 0.007 | 0.070 | 3.424 | 3.587 | 3.510 | 2026-05-13 |
| 19 | MDeFRPO-test | hhh | 0.885 | 0.382 | 0.263 | 0.169 | 0.078 | 0.049 | 0.102 | 0.000 | 0.000 | 0.000 | 2026-04-29 |
| 20 | gemma-4-26b-a4b-it | AJ | 0.241 | 0.260 | 0.207 | 0.047 | 0.046 | 0.032 | 0.074 | 2.750 | 3.471 | 3.041 | 2026-04-23 |
| 21 | Lingshu-7B | lingshu-medical-mllm | 0.048 | 0.200 | 0.272 | 0.061 | 0.014 | 0.005 | 0.084 | 2.178 | 3.658 | 3.184 | 2026-05-04 |
| 22 | qwen36_27b_thinkv1 | ada | 0.523 | 0.294 | 0.000 | 0.000 | 0.029 | 0.018 | 0.257 | 0.855 | 2.028 | 1.816 | 2026-05-14 |
| 23 | GPT-4.1* | OpenAI | 0.018 | 0.250 | 0.087 | 0.014 | 0.096 | 0.005 | 0.101 | 2.438 | 2.490 | 2.080 | 2025-01-14 |
| 24 | test3 | test | 0.000 | 0.246 | 0.550 | 0.000 | 0.066 | 0.038 | 0.160 | 0.000 | 0.000 | 0.000 | 2026-05-10 |
| 25 | Gemini-2.5-Flash* | Google | 0.101 | 0.228 | 0.107 | 0.047 | 0.045 | 0.021 | 0.084 | 2.387 | 2.352 | 1.912 | 2025-01-14 |
| 26 | qwen35at | ali | 0.555 | 0.310 | 0.000 | 0.000 | 0.026 | 0.020 | 0.026 | 0.378 | 2.413 | 2.227 | 2026-05-12 |
| 27 | Qwen3VL-4B* | Alibaba | 0.000 | 0.178 | 0.006 | 0.000 | 0.039 | 0.034 | 0.128 | 1.939 | 2.926 | 2.853 | 2025-01-14 |
| 28 | Qwen2.5VL-7B* | Alibaba | 0.105 | 0.151 | 0.010 | 0.020 | 0.006 | 0.068 | 0.075 | 2.512 | 2.452 | 2.090 | 2025-01-14 |
| 29 | Qwen2.5VL-7B-Surg-CholecT50 | NVIDIA | 0.000 | 0.302 | 0.000 | 0.000 | 0.019 | 0.013 | 0.051 | 1.945 | 2.101 | 2.986 | 2025-01-14 |
| 30 | VideoChat-R1.5-7B* | OpenGVLab | 0.000 | 0.270 | 0.006 | 0.000 | 0.009 | 0.005 | 0.026 | 1.723 | 3.034 | 3.086 | 2025-01-14 |
| 31 | Qwen3.5-35B-A3B | alibaba | 0.564 | 0.284 | 0.000 | 0.000 | 0.019 | 0.015 | 0.000 | 0.000 | 2.061 | 2.123 | 2026-05-12 |
| 32 | qwen36_27b_thinkv2 | ada | 0.000 | 0.197 | 0.000 | 0.000 | 0.012 | 0.009 | 0.257 | 0.771 | 1.517 | 2.233 | 2026-05-14 |
| 33 | BAGEL | bytedance | 0.019 | 0.196 | 0.000 | 0.000 | 0.009 | 0.005 | 0.000 | 0.000 | 2.975 | 2.518 | 2026-05-09 |
| 34 | BAGEL-7B | bytedance | 0.019 | 0.196 | 0.000 | 0.000 | 0.009 | 0.005 | 0.000 | 0.000 | 2.979 | 2.514 | 2026-05-08 |
| 35 | sensenova-u1 | sensetime | 0.000 | 0.094 | 0.000 | 0.000 | 0.021 | 0.012 | 0.000 | 0.000 | 1.119 | 1.184 | 2026-05-10 |
■ Best · ■ 2nd Best · 🥇 1st · 🥈 2nd · 🥉 3rd overall · ✅ = user submission verified by maintainers via model API · * = off-the-shelf models
Submit Your Model Results
Evaluation is a two-step process:
| Step | What happens | Time |
|---|---|---|
| Step 1 | Upload predictions -- evaluates CVS, NAP, SA, STG, TAG, DVC_F1 | ~2-5 min |
| Step 2 | Run LLM Judge -- evaluates DVC_llm, VS_llm, RC_llm caption quality | ~10-20 min (background) |
Step 1: Upload Predictions
Upload your model's predictions for the MedVidBench test set (6,245 samples).
Expected file format:

```json
[
  {
    "id": "video_id&&start&&end&&fps",
    "qa_type": "tal",
    "prediction": "Your model's answer here"
  },
  {
    "id": "another_video&&0&&10&&1.0",
    "qa_type": "video_summary",
    "prediction": "The surgeon performs..."
  }
]
```
Required fields:
- `id`: Sample identifier (matches the test data from the HuggingFace dataset)
- `qa_type`: Task type, one of `tal`, `stg`, `next_action`, `dense_captioning`, `video_summary`, `region_caption`, `skill_assessment`, `cvs_assessment`
- `prediction`: Your model's answer (text output)
Important: Submit predictions only (no ground truth needed). The server merges them with the private ground truth and evaluates securely.
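Before uploading, it can help to sanity-check the file locally. The following is a minimal sketch, not the official validator; the allowed `qa_type` values and the four-field `id` layout come from the format above, while the file name and the strictness of the checks are assumptions:

```python
import json

# Allowed task types, per the "Required fields" list above.
QA_TYPES = {"tal", "stg", "next_action", "dense_captioning",
            "video_summary", "region_caption", "skill_assessment",
            "cvs_assessment"}

def validate_predictions(path: str) -> None:
    """Check that a submission file is a JSON list of well-formed records."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    assert isinstance(records, list), "top-level value must be a JSON list"
    for i, rec in enumerate(records):
        assert isinstance(rec, dict), f"record {i} must be a JSON object"
        missing = {"id", "qa_type", "prediction"} - rec.keys()
        assert not missing, f"record {i} is missing fields: {missing}"
        # Assumes every id follows the video_id&&start&&end&&fps layout
        # shown in the example above.
        parts = rec["id"].split("&&")
        assert len(parts) == 4, f"record {i}: unexpected id format {rec['id']!r}"
        assert rec["qa_type"] in QA_TYPES, f"record {i}: bad qa_type {rec['qa_type']!r}"
        assert isinstance(rec["prediction"], str), f"record {i}: prediction must be text"
    print(f"OK: {len(records)} predictions")

validate_predictions("predictions.json")  # hypothetical file name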
Step 2: Run LLM Judge (Caption Metrics)
After Step 1 completes, the caption metrics (`DVC_llm`, `VS_llm`, `RC_llm`) will show as 0.0. Run the LLM Judge here to compute them using GPT-4.1/Gemini.
- Enter the exact model name you used in Step 1
- The evaluation runs in the background -- you can close the browser and come back later
- Check progress anytime with the "Check Status" button
About MedVidBench
MedVidBench is a comprehensive benchmark for evaluating Video-Language Models on medical and surgical video understanding, introduced in the MedGRPO paper. It spans 8 tasks across 8 surgical datasets with 6,245 test samples.
How Models Are Ranked
Models are ranked by their average rank across all 10 metrics; a lower average rank is better. For each metric we rank every model (1 = best; ties share the smaller rank), then average those per-metric ranks. This scheme is robust to different metric scales (accuracy on 0–1 vs. LLM-judge scores on 1–5) and rewards models that are strong across tasks rather than exceptional on one.
Global ranking across views: the rank shown is computed against the union of all submissions (official ∪ community), so the same model gets the same rank number in either the Official or the Community table, even though each table only displays a subset of rows. The rank column shows each row's position in the full global ranking, not its position within the visible subset, which is why rank numbers in the Official table can skip values.
Tiebreakers, applied in order when two models have the same average rank (a sketch of the full procedure follows this list):
- Number of metrics won outright: a model that is #1 on more metrics beats one that ties closely on many.
- Sum of per-metric ranks: catches near-ties where the rounded averages come out equal.
- Sum of normalized scores: favors the model with marginally higher absolute scores.
- Model name (alphabetical): the final fallback, for full determinism.
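For concreteness, here is a minimal sketch of this ranking procedure in Python. The metric names come from the leaderboard tables; treating every metric as higher-is-better and using raw rather than normalized score sums in the third tiebreaker are simplifying assumptions:

```python
from statistics import mean

# Metric columns from the leaderboard tables; all treated as higher-is-better.
METRICS = ["CVS_acc", "NAP_acc", "SA_acc", "STG_mIoU", "TAG_mIoU@0.3",
           "TAG_mIoU@0.5", "DVC_F1", "DVC_llm", "VS_llm", "RC_llm"]

def rank_models(scores: dict) -> list:
    """scores: {model_name: {metric: value}} -> model names, best first."""
    models = list(scores)
    per_metric = {m: {} for m in models}
    for metric in METRICS:
        ordered = sorted(models, key=lambda m: scores[m][metric], reverse=True)
        # Ties share the smaller (better) rank: record the first position
        # at which each score value appears.
        first_pos = {}
        for pos, m in enumerate(ordered, start=1):
            first_pos.setdefault(scores[m][metric], pos)
        for m in models:
            per_metric[m][metric] = first_pos[scores[m][metric]]
    def key(m):
        ranks = list(per_metric[m].values())
        return (mean(ranks),                          # primary: lower average rank
                -sum(r == 1 for r in ranks),          # tiebreak 1: outright metric wins
                sum(ranks),                           # tiebreak 2: sum of per-metric ranks
                -sum(scores[m][k] for k in METRICS),  # tiebreak 3: score sum (unnormalized here)
                m)                                    # tiebreak 4: name, for determinism
    return sorted(models, key=key)
```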
Benchmark Tasks and Evaluation Metrics
- CVS Assessment (`CVS_acc`): Accuracy on Critical View of Safety scoring (Cholec80_CVS)
- Next Action Prediction (`NAP_acc`): Classification accuracy for the next surgical step
- Skill Assessment (`SA_acc`): Surgical skill level classification accuracy (JIGSAWS)
- Spatiotemporal Grounding (`STG_mIoU`): Mean IoU over the joint spatial + temporal region
- Temporal Action Grounding (`TAG_mIoU@0.3`, `TAG_mIoU@0.5`): Mean IoU over temporal segments, computed at two IoU thresholds, 0.3 and 0.5 (see the sketch after this list)
- Dense Video Captioning (`DVC_F1`, `DVC_llm`): F1 over predicted vs. ground-truth temporal windows, plus LLM-judged caption quality
- Video Summary (`VS_llm`): LLM-judged caption quality score
- Region Caption (`RC_llm`): LLM-judged caption quality score
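To make the grounding metrics concrete, here is a minimal sketch of temporal IoU and a thresholded mIoU for TAG. The official evaluation scripts may match and average segments differently; in particular, zeroing sub-threshold IoUs (rather than, say, counting a hit rate) is an assumption:

```python
def temporal_iou(pred, gt):
    """IoU of two time segments, each given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def miou_at_threshold(pairs, threshold):
    """Mean IoU over (pred, gt) pairs, treating IoUs below threshold as 0."""
    ious = [temporal_iou(p, g) for p, g in pairs]
    kept = [iou if iou >= threshold else 0.0 for iou in ious]
    return sum(kept) / len(kept) if kept else 0.0

# One well-localized prediction (IoU ~0.81) and one complete miss (IoU 0.0).
pairs = [((2.0, 9.5), (3.0, 10.0)), ((0.0, 4.0), (5.0, 8.0))]
print(miou_at_threshold(pairs, 0.3))  # ~0.41
```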
LLM Judge Details
Caption tasks (DVC, VS, RC) use GPT-4.1 or Gemini-Pro with rubric-based scoring (1-5 scale) across 5 key aspects: R2 (Relevance & Medical Terminology), R4 (Actionable Surgical Actions), R5 (Comprehensive Detail Level), R7 (Anatomical & Instrument Precision), R8 (Clinical Context & Coherence). The final score is the average across these 5 aspects.
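As a worked example with hypothetical rubric values, a caption scored R2 = 4, R4 = 3, R5 = 5, R7 = 4, R8 = 4 would receive a final judge score of 4.0:

```python
# Hypothetical rubric scores for one caption; the final LLM-judge score
# is the plain average across the five aspects.
aspects = {"R2": 4, "R4": 3, "R5": 5, "R7": 4, "R8": 4}
final_score = sum(aspects.values()) / len(aspects)  # -> 4.0
```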
Test Set Statistics
- Total samples: 6,245
- Source datasets: 8 (AVOS, CholecT50, CholecTrack20, Cholec80_CVS, CoPESD, EgoSurgery, NurViD, JIGSAWS)
- Video frames: ~103,742
- Training samples: 51,505
Citation
If you use our model or benchmark (MedVidBench / uAI-NEXUS-MedVLM), please cite our paper in any published work or public repository:
```bibtex
@inproceedings{su2026medgrpo,
  title     = {{MedGRPO}: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding},
  author    = {Su, Yuhao and Choudhuri, Anwesa and Gao, Zhongpai and Planche, Benjamin and
               Nguyen, Van Nguyen and Zheng, Meng and Shen, Yuhan and Innanje, Arun and
               Chen, Terrence and Elhamifar, Ehsan and Wu, Ziyan},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}
```
License
- Dataset: CC BY-NC-SA 4.0 (Non-commercial, Share-alike)
- Leaderboard Code: Apache 2.0
- Evaluation Scripts: MIT
Contact
For questions or issues, open an issue on GitHub or visit the project page.
Admin Panel
Manage both the Official Leaderboard (verified models) and Community Submissions.
Note: The admin password is set via the `ADMIN_PASSWORD` environment variable in the HuggingFace Spaces settings.