🏥 MedVidBench Leaderboard
MedVidBench is a comprehensive benchmark for evaluating Video-Language Models on medical and surgical video understanding. It covers 8 tasks across 8 surgical datasets with 6,245 test samples, evaluated on 10 metrics including LLM-based caption judging.
📄 Paper · 🌐 Project Page · 💾 Dataset · 🤗 Model · 💻 GitHub · 🎮 Demo
Official Rankings (Verified)
Models on this leaderboard have been independently verified by the benchmark maintainers: we evaluate top community submissions by requesting model API access and running our evaluation pipeline directly, which ensures reproducible and trustworthy results.
| Rank | Model | Team | CVS_acc | NAP_acc | SA_acc | STG_mIoU | TAG_mIoU@0.3 | TAG_mIoU@0.5 | DVC_F1 | DVC_llm | VS_llm | RC_llm | Verified (date) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 🥇1 | uAI-NEXUS-MedVLM-1.0b-4B-RL | UII | 0.898 | 0.473 | 0.285 | 0.176 | 0.504 | 0.441 | 0.480 | 3.950 | 4.227 | 3.861 | 2026-04-15 |
| 🥈2 | uAI-NEXUS-MedVLM-1.0c-4B-SFT | UII | 0.897 | 0.576 | 0.354 | 0.190 | 0.482 | 0.429 | 0.451 | 3.741 | 4.238 | 3.746 | 2026-04-15 |
| 🥉3 | uAI-NEXUS-MedVLM-1.0d-4B-SFT | UII | 0.893 | 0.513 | 0.290 | 0.138 | 0.386 | 0.319 | 0.367 | 3.644 | 4.246 | 3.644 | 2026-05-15 |
| 4 | uAI-NEXUS-MedVLM-1.0b-4B-SFT | UII | 0.895 | 0.466 | 0.270 | 0.133 | 0.465 | 0.403 | 0.435 | 3.862 | 4.180 | 3.752 | 2026-04-15 |
| 5 | uAI-NEXUS-MedVLM-1.0a-7B-RL | UII | 0.896 | 0.405 | 0.254 | 0.202 | 0.216 | 0.156 | 0.214 | 3.797 | 4.184 | 3.442 | 2026-04-15 |
| 6 | uAI-NEXUS-MedVLM-1.0a-7B-SFT | UII | 0.894 | 0.442 | 0.218 | 0.177 | 0.142 | 0.091 | 0.165 | 3.665 | 3.596 | 2.757 | 2026-04-15 |
| 7 | GPT-5.4* | OpenAI | 0.164 | 0.393 | 0.267 | 0.004 | 0.086 | 0.055 | 0.178 | 3.403 | 3.976 | 3.714 | 2026-04-15 |
| 9 | Gemini-3.1-flash-lite* | Google | 0.242 | 0.406 | 0.225 | 0.059 | 0.072 | 0.049 | 0.174 | 3.198 | 3.737 | 3.492 | 2026-04-15 |
| 17 | Qwen3.5-4B* | Alibaba | 0.309 | 0.231 | 0.276 | 0.051 | 0.074 | 0.040 | 0.142 | 2.699 | 3.491 | 3.037 | 2026-04-15 |
| 23 | GPT-4.1* | OpenAI | 0.018 | 0.250 | 0.087 | 0.014 | 0.096 | 0.005 | 0.101 | 2.438 | 2.490 | 2.080 | 2026-04-15 |
| 25 | Gemini-2.5-Flash* | Google | 0.101 | 0.228 | 0.107 | 0.047 | 0.045 | 0.021 | 0.084 | 2.387 | 2.352 | 1.912 | 2026-04-15 |
| 27 | Qwen3VL-4B* | Alibaba | 0.000 | 0.178 | 0.006 | 0.000 | 0.039 | 0.034 | 0.128 | 1.939 | 2.926 | 2.853 | 2026-04-15 |
| 28 | Qwen2.5VL-7B* | Alibaba | 0.105 | 0.151 | 0.010 | 0.020 | 0.006 | 0.068 | 0.075 | 2.512 | 2.452 | 2.090 | 2026-04-15 |
| 29 | Qwen2.5VL-7B-Surg-CholecT50 | NVIDIA | 0.000 | 0.302 | 0.000 | 0.000 | 0.019 | 0.013 | 0.051 | 1.945 | 2.101 | 2.986 | 2026-04-15 |
| 30 | VideoChat-R1.5-7B* | OpenGVLab | 0.000 | 0.270 | 0.006 | 0.000 | 0.009 | 0.005 | 0.026 | 1.723 | 3.034 | 3.086 | 2026-04-15 |
■ Best · ■ 2nd Best · ✅ = user submission verified by maintainers via model API · * = off-the-shelf models
How to get on the Official Leaderboard
1. Submit your model predictions via the "Community Submissions" tab.
2. Top performers will be contacted by the benchmark maintainers.
3. Provide model API access so we can independently verify results.
4. Once verified, your model is added to the Official Leaderboard.
For questions, contact us via GitHub.
Community Submissions
Community members run inference on their own machines and upload predictions via the 📤 Submit Results tab. Predictions are then scored on our server against private ground truth.
| Rank | Model | Team | CVS_acc | NAP_acc | SA_acc | STG_mIoU | TAG_mIoU@0.3 | TAG_mIoU@0.5 | DVC_F1 | DVC_llm | VS_llm | RC_llm | Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 🥇1 | uAI-NEXUS-MedVLM-1.0b-4B-RL | UII | 0.898 | 0.473 | 0.285 | 0.176 | 0.504 | 0.441 | 0.480 | 3.950 | 4.227 | 3.861 | 2025-01-14 |
| 🥈2 | uAI-NEXUS-MedVLM-1.0c-4B-SFT | UII | 0.897 | 0.576 | 0.354 | 0.190 | 0.482 | 0.429 | 0.451 | 3.741 | 4.238 | 3.746 | 2026-04-13 |
| 🥉3 | uAI-NEXUS-MedVLM-1.0d-4B-SFT | UII | 0.893 | 0.513 | 0.290 | 0.138 | 0.386 | 0.319 | 0.367 | 3.644 | 4.246 | 3.644 | 2026-05-15 |
| 4 | uAI-NEXUS-MedVLM-1.0b-4B-SFT | UII | 0.895 | 0.466 | 0.270 | 0.133 | 0.465 | 0.403 | 0.435 | 3.862 | 4.180 | 3.752 | 2025-01-14 |
| 5 | uAI-NEXUS-MedVLM-1.0a-7B-RL | UII | 0.896 | 0.405 | 0.254 | 0.202 | 0.216 | 0.156 | 0.214 | 3.797 | 4.184 | 3.442 | 2025-01-14 |
| 6 | uAI-NEXUS-MedVLM-1.0a-7B-SFT | UII | 0.894 | 0.442 | 0.218 | 0.177 | 0.142 | 0.091 | 0.165 | 3.665 | 3.596 | 2.757 | 2025-01-14 |
| 7 | GPT-5.4* | OpenAI | 0.164 | 0.393 | 0.267 | 0.004 | 0.086 | 0.055 | 0.178 | 3.403 | 3.976 | 3.714 | 2026-04-13 |
| 8 | Qwen3.6-27B | happybunny | 0.479 | 0.382 | 0.230 | 0.094 | 0.127 | 0.084 | 0.166 | 3.111 | 3.814 | 3.078 | 2026-04-25 |
| 9 | Gemini-3.1-flash-lite* | Google | 0.242 | 0.406 | 0.225 | 0.059 | 0.072 | 0.049 | 0.174 | 3.198 | 3.737 | 3.492 | 2026-04-13 |
| 10 | test | test | 0.891 | 0.400 | 0.242 | 0.168 | 0.209 | 0.152 | 0.215 | 0.000 | 0.000 | 0.000 | 2026-05-10 |
| 11 | test2 | test | 0.891 | 0.400 | 0.242 | 0.168 | 0.209 | 0.152 | 0.215 | 0.000 | 0.000 | 0.000 | 2026-05-10 |
| 12 | test5 | test | 0.891 | 0.400 | 0.242 | 0.168 | 0.209 | 0.152 | 0.215 | 0.000 | 0.000 | 0.000 | 2026-05-10 |
| 13 | Qwen3.6-27B-Think | happybunny | 0.374 | 0.257 | 0.249 | 0.000 | 0.124 | 0.084 | 0.176 | 2.200 | 4.246 | 2.459 | 2026-04-25 |
| 14 | test4 | test | 0.890 | 0.394 | 0.256 | 0.170 | 0.208 | 0.150 | 0.215 | 0.000 | 0.000 | 0.000 | 2026-05-10 |
| 15 | uAI-NEXUS-MedVLM-1.0a-7B-SFT-test01 | uAI | 0.968 | 0.385 | 0.425 | 0.000 | 0.078 | 0.052 | 0.134 | 2.033 | 1.328 | 1.706 | 2026-05-11 |
| 16 | uAI-NEXUS-MedVLM-1.0a-7B-SFT-test | uAI | 0.968 | 0.385 | 0.425 | 0.000 | 0.078 | 0.052 | 0.134 | 2.016 | 1.327 | 1.705 | 2026-05-11 |
| 17 | Qwen3.5-4B* | Alibaba | 0.309 | 0.231 | 0.276 | 0.051 | 0.074 | 0.040 | 0.142 | 2.699 | 3.491 | 3.037 | 2026-04-13 |
| 18 | LongCat-Next | Meituan | 0.228 | 0.324 | 0.286 | 0.001 | 0.010 | 0.007 | 0.070 | 3.424 | 3.587 | 3.510 | 2026-05-13 |
| 19 | MDeFRPO-test | hhh | 0.885 | 0.382 | 0.263 | 0.169 | 0.078 | 0.049 | 0.102 | 0.000 | 0.000 | 0.000 | 2026-04-29 |
| 20 | gemma-4-26b-a4b-it | AJ | 0.241 | 0.260 | 0.207 | 0.047 | 0.046 | 0.032 | 0.074 | 2.750 | 3.471 | 3.041 | 2026-04-23 |
| 21 | Lingshu-7B | lingshu-medical-mllm | 0.048 | 0.200 | 0.272 | 0.061 | 0.014 | 0.005 | 0.084 | 2.178 | 3.658 | 3.184 | 2026-05-04 |
| 22 | qwen36_27b_thinkv1 | ada | 0.523 | 0.294 | 0.000 | 0.000 | 0.029 | 0.018 | 0.257 | 0.855 | 2.028 | 1.816 | 2026-05-14 |
| 23 | GPT-4.1* | OpenAI | 0.018 | 0.250 | 0.087 | 0.014 | 0.096 | 0.005 | 0.101 | 2.438 | 2.490 | 2.080 | 2025-01-14 |
| 24 | test3 | test | 0.000 | 0.246 | 0.550 | 0.000 | 0.066 | 0.038 | 0.160 | 0.000 | 0.000 | 0.000 | 2026-05-10 |
| 25 | Gemini-2.5-Flash* | Google | 0.101 | 0.228 | 0.107 | 0.047 | 0.045 | 0.021 | 0.084 | 2.387 | 2.352 | 1.912 | 2025-01-14 |
| 26 | qwen35at | ali | 0.555 | 0.310 | 0.000 | 0.000 | 0.026 | 0.020 | 0.026 | 0.378 | 2.413 | 2.227 | 2026-05-12 |
| 27 | Qwen3VL-4B* | Alibaba | 0.000 | 0.178 | 0.006 | 0.000 | 0.039 | 0.034 | 0.128 | 1.939 | 2.926 | 2.853 | 2025-01-14 |
| 28 | Qwen2.5VL-7B* | Alibaba | 0.105 | 0.151 | 0.010 | 0.020 | 0.006 | 0.068 | 0.075 | 2.512 | 2.452 | 2.090 | 2025-01-14 |
| 29 | Qwen2.5VL-7B-Surg-CholecT50 | NVIDIA | 0.000 | 0.302 | 0.000 | 0.000 | 0.019 | 0.013 | 0.051 | 1.945 | 2.101 | 2.986 | 2025-01-14 |
| 30 | VideoChat-R1.5-7B* | OpenGVLab | 0.000 | 0.270 | 0.006 | 0.000 | 0.009 | 0.005 | 0.026 | 1.723 | 3.034 | 3.086 | 2025-01-14 |
| 31 | Qwen3.5-35B-A3B | alibaba | 0.564 | 0.284 | 0.000 | 0.000 | 0.019 | 0.015 | 0.000 | 0.000 | 2.061 | 2.123 | 2026-05-12 |
| 32 | qwen36_27b_thinkv2 | ada | 0.000 | 0.197 | 0.000 | 0.000 | 0.012 | 0.009 | 0.257 | 0.771 | 1.517 | 2.233 | 2026-05-14 |
| 33 | BAGEL | bytedance | 0.019 | 0.196 | 0.000 | 0.000 | 0.009 | 0.005 | 0.000 | 0.000 | 2.975 | 2.518 | 2026-05-09 |
| 34 | BAGEL-7B | bytedance | 0.019 | 0.196 | 0.000 | 0.000 | 0.009 | 0.005 | 0.000 | 0.000 | 2.979 | 2.514 | 2026-05-08 |
| 35 | sensenova-u1 | sensetime | 0.000 | 0.094 | 0.000 | 0.000 | 0.021 | 0.012 | 0.000 | 0.000 | 1.119 | 1.184 | 2026-05-10 |
■ Best · ■ 2nd Best · 🥇 1st · 🥈 2nd · 🥉 3rd overall · ✅ = user submission verified by maintainers via model API · * = off-the-shelf models
Submit Your Model Results
Evaluation is a two-step process:
| Step | What happens | Time |
|---|---|---|
| Step 1 | Upload predictions -- evaluates CVS, NAP, SA, STG, TAG, DVC_F1 | ~2-5 min |
| Step 2 | Run LLM Judge -- evaluates DVC_llm, VS_llm, RC_llm caption quality | ~10-20 min (background) |
Step 1: Upload Predictions
Upload your model's predictions for the MedVidBench test set (6,245 samples).
Expected file format:

```json
[
  {
    "id": "video_id&&start&&end&&fps",
    "qa_type": "tal",
    "prediction": "Your model's answer here"
  },
  {
    "id": "another_video&&0&&10&&1.0",
    "qa_type": "video_summary",
    "prediction": "The surgeon performs..."
  }
]
```
Required fields:
- `id`: Sample identifier (matches the test data from the HuggingFace dataset)
- `qa_type`: Task type, one of `tal`, `stg`, `next_action`, `dense_captioning`, `video_summary`, `region_caption`, `skill_assessment`, `cvs_assessment`
- `prediction`: Your model's answer (text output)
Important: Submit predictions only (no ground truth needed). The server merges them with the private ground truth and evaluates securely.
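Before uploading, it can help to sanity-check the file locally. The following is a minimal sketch, not the official validator; the allowed `qa_type` values and the four-field `id` layout come from the format above, while the file name and the strictness of the checks are assumptions:

```python
import json

# Allowed task types, per the "Required fields" list above.
QA_TYPES = {"tal", "stg", "next_action", "dense_captioning",
            "video_summary", "region_caption", "skill_assessment",
            "cvs_assessment"}

def validate_predictions(path: str) -> None:
    """Check that a submission file is a JSON list of well-formed records."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    assert isinstance(records, list), "top-level value must be a JSON list"
    for i, rec in enumerate(records):
        assert isinstance(rec, dict), f"record {i} must be a JSON object"
        missing = {"id", "qa_type", "prediction"} - rec.keys()
        assert not missing, f"record {i} is missing fields: {missing}"
        # Assumes every id follows the video_id&&start&&end&&fps layout
        # shown in the example above.
        parts = rec["id"].split("&&")
        assert len(parts) == 4, f"record {i}: unexpected id format {rec['id']!r}"
        assert rec["qa_type"] in QA_TYPES, f"record {i}: bad qa_type {rec['qa_type']!r}"
        assert isinstance(rec["prediction"], str), f"record {i}: prediction must be text"
    print(f"OK: {len(records)} predictions")

validate_predictions("predictions.json")  # hypothetical file name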
Step 2: Run LLM Judge (Caption Metrics)
After Step 1 completes, the caption metrics (`DVC_llm`, `VS_llm`, `RC_llm`) will show as 0.0. Run the LLM Judge here to compute them using GPT-4.1/Gemini.
- Enter the exact model name you used in Step 1
- The evaluation runs in the background -- you can close the browser and come back later
- Check progress anytime with the "Check Status" button
About MedVidBench
MedVidBench is a comprehensive benchmark for evaluating Video-Language Models on medical and surgical video understanding, introduced in the MedGRPO paper. It spans 8 tasks across 8 surgical datasets with 6,245 test samples.
How Models Are Ranked
Models are ranked by their average rank across all 10 metrics; a lower average rank is better. For each metric we rank every model (1 = best; ties share the smaller rank), then average those per-metric ranks. This scheme is robust to different metric scales (accuracy on 0–1 vs. LLM-judge scores on 1–5) and rewards models that are strong across tasks rather than exceptional on one.
Global ranking across views: the rank shown is computed against the union of all submissions (official ∪ community), so the same model gets the same rank number in either the Official or the Community table, even though each table only displays a subset of rows. The rank column shows each row's position in the full global ranking, not its position within the visible subset, which is why rank numbers in the Official table can skip values.
Tiebreakers, applied in order when two models have the same average rank (a sketch of the full procedure follows this list):
- Number of metrics won outright: a model that is #1 on more metrics beats one that ties closely on many.
- Sum of per-metric ranks: catches near-ties where the rounded averages come out equal.
- Sum of normalized scores: favors the model with marginally higher absolute scores.
- Model name (alphabetical): the final fallback, for full determinism.
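For concreteness, here is a minimal sketch of this ranking procedure in Python. The metric names come from the leaderboard tables; treating every metric as higher-is-better and using raw rather than normalized score sums in the third tiebreaker are simplifying assumptions:

```python
from statistics import mean

# Metric columns from the leaderboard tables; all treated as higher-is-better.
METRICS = ["CVS_acc", "NAP_acc", "SA_acc", "STG_mIoU", "TAG_mIoU@0.3",
           "TAG_mIoU@0.5", "DVC_F1", "DVC_llm", "VS_llm", "RC_llm"]

def rank_models(scores: dict) -> list:
    """scores: {model_name: {metric: value}} -> model names, best first."""
    models = list(scores)
    per_metric = {m: {} for m in models}
    for metric in METRICS:
        ordered = sorted(models, key=lambda m: scores[m][metric], reverse=True)
        # Ties share the smaller (better) rank: record the first position
        # at which each score value appears.
        first_pos = {}
        for pos, m in enumerate(ordered, start=1):
            first_pos.setdefault(scores[m][metric], pos)
        for m in models:
            per_metric[m][metric] = first_pos[scores[m][metric]]
    def key(m):
        ranks = list(per_metric[m].values())
        return (mean(ranks),                          # primary: lower average rank
                -sum(r == 1 for r in ranks),          # tiebreak 1: outright metric wins
                sum(ranks),                           # tiebreak 2: sum of per-metric ranks
                -sum(scores[m][k] for k in METRICS),  # tiebreak 3: score sum (unnormalized here)
                m)                                    # tiebreak 4: name, for determinism
    return sorted(models, key=key)
```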
Benchmark Tasks and Evaluation Metrics
- CVS Assessment (`CVS_acc`): Accuracy on Critical View of Safety scoring (Cholec80_CVS)
- Next Action Prediction (`NAP_acc`): Classification accuracy for the next surgical step
- Skill Assessment (`SA_acc`): Surgical skill level classification accuracy (JIGSAWS)
- Spatiotemporal Grounding (`STG_mIoU`): Mean IoU over the joint spatial + temporal region
- Temporal Action Grounding (`TAG_mIoU@0.3`, `TAG_mIoU@0.5`): Mean IoU over temporal segments, computed at two IoU thresholds, 0.3 and 0.5 (see the sketch after this list)
- Dense Video Captioning (`DVC_F1`, `DVC_llm`): F1 over predicted vs. ground-truth temporal windows, plus LLM-judged caption quality
- Video Summary (`VS_llm`): LLM-judged caption quality score
- Region Caption (`RC_llm`): LLM-judged caption quality score
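To make the grounding metrics concrete, here is a minimal sketch of temporal IoU and a thresholded mIoU for TAG. The official evaluation scripts may match and average segments differently; in particular, zeroing sub-threshold IoUs (rather than, say, counting a hit rate) is an assumption:

```python
def temporal_iou(pred, gt):
    """IoU of two time segments, each given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def miou_at_threshold(pairs, threshold):
    """Mean IoU over (pred, gt) pairs, treating IoUs below threshold as 0."""
    ious = [temporal_iou(p, g) for p, g in pairs]
    kept = [iou if iou >= threshold else 0.0 for iou in ious]
    return sum(kept) / len(kept) if kept else 0.0

# One well-localized prediction (IoU ~0.81) and one complete miss (IoU 0.0).
pairs = [((2.0, 9.5), (3.0, 10.0)), ((0.0, 4.0), (5.0, 8.0))]
print(miou_at_threshold(pairs, 0.3))  # ~0.41
```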
LLM Judge Details
Caption tasks (DVC, VS, RC) use GPT-4.1 or Gemini-Pro with rubric-based scoring (1-5 scale) across 5 key aspects: R2 (Relevance & Medical Terminology), R4 (Actionable Surgical Actions), R5 (Comprehensive Detail Level), R7 (Anatomical & Instrument Precision), R8 (Clinical Context & Coherence). The final score is the average across these 5 aspects.
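As a worked example with hypothetical rubric values, a caption scored R2 = 4, R4 = 3, R5 = 5, R7 = 4, R8 = 4 would receive a final judge score of 4.0:

```python
# Hypothetical rubric scores for one caption; the final LLM-judge score
# is the plain average across the five aspects.
aspects = {"R2": 4, "R4": 3, "R5": 5, "R7": 4, "R8": 4}
final_score = sum(aspects.values()) / len(aspects)  # -> 4.0
```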
Test Set Statistics
- Total samples: 6,245
- Source datasets: 8 (AVOS, CholecT50, CholecTrack20, Cholec80_CVS, CoPESD, EgoSurgery, NurViD, JIGSAWS)
- Video frames: ~103,742
- Training samples: 51,505
Citation
If you use our model or benchmark (MedVidBench / uAI-NEXUS-MedVLM), please cite our paper in any published work or public repository:
```bibtex
@inproceedings{su2026medgrpo,
  title     = {{MedGRPO}: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding},
  author    = {Su, Yuhao and Choudhuri, Anwesa and Gao, Zhongpai and Planche, Benjamin and
               Nguyen, Van Nguyen and Zheng, Meng and Shen, Yuhan and Innanje, Arun and
               Chen, Terrence and Elhamifar, Ehsan and Wu, Ziyan},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}
```
License
- Dataset: CC BY-NC-SA 4.0 (Non-commercial, Share-alike)
- Leaderboard Code: Apache 2.0
- Evaluation Scripts: MIT
Contact
For questions or issues, open an issue on GitHub or visit the project page.
Admin Panel
Manage both the Official Leaderboard (verified models) and Community Submissions.
Note: The admin password is set via the `ADMIN_PASSWORD` environment variable in the HuggingFace Spaces settings.