Abstract

Recent studies have shown that long chain-of-thought (CoT) reasoning can significantly enhance the performance of large language models (LLMs) on complex tasks. However, this benefit has yet to be demonstrated in the domain of video understanding, since most existing benchmarks lack the reasoning depth required to demonstrate the advantages of extended CoT chains. While recent efforts have proposed benchmarks aimed at video reasoning, the tasks are often knowledge-driven and do not rely heavily on visual content. To bridge this gap, we introduce VideoReasonBench, a benchmark designed to evaluate vision-centric, complex video reasoning. To ensure visual richness and high reasoning complexity, each video in VideoReasonBench depicts a sequence of fine-grained operations on a latent state that is only visible in part of the video. The questions evaluate three escalating levels of video reasoning skills: recalling observed visual information, inferring the content of latent states, and predicting information beyond the video. Under this task setting, models have to precisely recall multiple operations in the video and perform step-by-step reasoning to arrive at the correct final answers. Using VideoReasonBench, we comprehensively evaluate 18 state-of-the-art multimodal LLMs (MLLMs), finding that most perform poorly on complex video reasoning (e.g., GPT-4o achieves only 6.9% accuracy), while the thinking-enhanced Gemini-2.5-Pro significantly outperforms others with 56.0% accuracy. Our investigations on "test-time scaling" further reveal that an extended thinking budget, while offering little or no benefit on existing video benchmarks, is essential for improving performance on VideoReasonBench.

Overview

Videos

Each video features a latent state and a sequence of operations. The latent state is revealed either at the beginning or the end of the video. Throughout the rest of the video, the latent state remains invisible while the operations drive state transitions. There are six types of video demonstrations:

Number: Sliding number puzzle.
Circle: A red circle moving on a grid, flipping the color of black and white pieces.
Cup: Swapping cups and the content beneath.
File: Creating, deleting, copying, and moving files within/between paths.
Card: Adding or removing cards to/from piles.
Chip: Adding or removing chips to/from cups.

Questions

The questions assess video reasoning skills across three levels:

Level 1 (Recall): Precisely recall the sequential visual observations from the video.
Level 2 (Infer): Infer latent information that is not directly observable from the video.
Level 3 (Predict): Predict new information beyond the video.

VideoReasonBench Leaderboard

To submit your model results, please email them to liuyuanxin@stu.pku.edu.cn. You may provide the results in either JSON or XLSX format (see our code repo for how to evaluate).

Human Expert (240 sample subset)

Model	Act. Params	Think	Recall Order	Recall Count	Infer State	Compare State	Predict State	Predict Operation	Overall
Human	-	223.2s	87.5	90.0	80.0	75.0	67.5	42.5	73.8
Proprietary Models
Gemini-2.5-Pro-0506	-	✓	69.2	70.4	63.3	56.7	42.1	34.6	56.0
Gemini-2.5-Flash-0417	-	✓	44.6	41.7	27.9	27.1	13.8	9.6	27.4
Gemini-2.5-Flash-0417	-	✗	22.5	34.2	19.6	20.4	8.8	7.1	18.8
Seed1.5-VL	20B	✓	24.2	27.1	3.8	7.9	3.8	2.1	11.5
o4-mini	-	✓	14.2	20.4	7.1	11.7	6.2	4.6	10.7
Gemini-2.0-Flash	-	✗	18.3	22.5	6.7	6.7	5.0	3.3	10.4
GPT-4o	-	✗	14.2	15.8	4.2	6.2	0.8	0.0	6.9
Open-source Models
Flagship Models
Qwen2.5-VL	72B	✗	12.5	17.1	4.2	4.2	2.9	2.1	7.2
InternVL3	78B	✗	11.2	14.6	0.8	2.1	3.8	2.1	5.8
LLaVA-Video	72B	✗	0.0	0.0	0.0	0.0	0.4	0.0	0.1
LLaVA-OneVision	72B	✗	0.0	0.0	0.0	0.0	0.8	0.0	0.1
Efficient Models
Kimi-VL-A3B-Instruct	3B	✗	1.7	3.3	1.2	0.4	1.7	0.0	1.4
Qwen2.5-VL	7B	✗	3.8	0.8	0.4	0.0	2.1	0.8	1.3
MiniCPM-o 2.6	8B	✗	1.2	0.4	0.4	0.8	1.2	0.4	0.8
MiniCPM-V 2.6	8B	✗	2.1	0.4	0.4	0.0	1.2	0.4	0.8
InternVL3	8B	✗	0.4	0.8	0.0	0.4	1.7	0.0	0.6
LLaVA-OneVision	7B	✗	0.0	0.0	0.4	0.0	0.4	0.8	0.3
LLaVA-Video	7B	✗	0.0	0.0	0.0	0.0	0.0	0.0	0.0
mPLUG-Owl3	7B	✗	0.0	0.0	0.0	0.0	0.0	0.0	0.0
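
The Overall column appears to be the unweighted mean of the six skill columns. The sketch below checks this for three rows copied from the table. Grouping the columns into Recall, Infer, and Predict pairs follows the column names, and the helper names `overall` and `per_level` are illustrative rather than taken from the benchmark's code.

```python
from statistics import mean

SKILLS = ["Recall Order", "Recall Count", "Infer State",
          "Compare State", "Predict State", "Predict Operation"]

# Rows copied from the leaderboard: six skill accuracies, then the reported Overall.
ROWS = {
    "Human": ([87.5, 90.0, 80.0, 75.0, 67.5, 42.5], 73.8),
    "Gemini-2.5-Pro-0506": ([69.2, 70.4, 63.3, 56.7, 42.1, 34.6], 56.0),
    "GPT-4o": ([14.2, 15.8, 4.2, 6.2, 0.8, 0.0], 6.9),
}


def overall(scores):
    """Unweighted mean over the six skills (assumed definition of Overall)."""
    return mean(scores)


def per_level(scores):
    """Average consecutive column pairs: [Recall, Infer, Predict]."""
    return [mean(scores[i:i + 2]) for i in range(0, len(SKILLS), 2)]
```

Because the table values are rounded to one decimal, the recomputed mean can differ from the reported Overall by up to about 0.05.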

VideoReasonBench Examples

VideoReasonBench Examples - Number

Task Description: The video presents a sliding puzzle on a 3x3 board. The board is filled with numbered squares, with one empty space. Initially, all numbered squares are visually masked with a blue overlay. Then, the puzzle undergoes a series of movements: only the squares adjacent to the empty space can be shifted into it. The video ends by showing the final arrangement of the numbers on the board. Please carefully watch the video and answer the following question:

Recall Order

Question:

{Task Description}

What are the 3rd to 5th blue squares being moved? For each moved blue square, provide the following details:

The order in which it was moved (e.g., 1st, 2nd, 3rd, etc)

The coordinates before it was moved (e.g., (a,1), (c,2), etc)

The direction it was moved (e.g., left, right, up, down)

{Answer Prompt}

Ground truth: 3rd: (b,2) up, 4th: (c,2) up, 5th: (c,3) left

Recall Count

Question:

{Task Description}

How many times was the 'downward' move performed in the video? For each occurrence, provide the coordinate of the square (e.g., (a,1), (c,2)...) before the move.

{Answer Prompt}

Ground truth: 1: (a,3)

Infer State

Question:

{Task Description}

Assuming the empty square is represented by '0', what is the arrangement of numbers on the board at the start of the video? Provide the coordinates of each square along with the corresponding number (e.g., (a,1): 3, (a,2): 0, (b,1): 1, (b,2): 2).

{Answer Prompt}

Ground truth: (a,1): 2, (a,2): 1, (a,3): 7, (b,1): 4, (b,2): 3, (b,3): 0, (c,1): 6, (c,2): 5, (c,3): 8

Compare State

Question:

{Task Description}

Assuming the empty square is represented by '0', compare the number arrangements on the board at the start and end of the video. What are the squares where the numbers differ between the two boards? Provide their coordinates along with the corresponding number at the start of the video (e.g., (a,1): 3, (b,1): 1).

{Answer Prompt}

Ground truth: (a,2): 1, (a,3): 7, (b,2): 3, (b,3): 0, (c,2): 5, (c,3): 8

Predict State

Question:

{Task Description}

If the arrangement of numbers on the board is currently in the same state as it was at the start of the video, and the following moves are executed: `rightward, leftward, upward`, what will be the arrangement of numbers on the board? Assume that the empty square is represented by '0'. Provide the coordinates of each square along with the corresponding number (e.g., (a,1): 3, (a,2): 0, (b,1): 1, (b,2): 2).

{Answer Prompt}

Ground truth: (a,1): 2, (a,2): 1, (a,3): 7, (b,1): 4, (b,2): 3, (b,3): 8, (c,1): 6, (c,2): 5, (c,3): 0

Predict Operation

Question:

{Task Description}

If the arrangement of numbers on the board is currently in the same state as it was at the start of the video, what sequence of moves (left, right, up, down) should be executed to achieve the desired number arrangement: `(a,1): 2, (a,2): 1, (a,3): 7, (b,1): 4, (b,2): 5, (b,3): 3, (c,1): 6, (c,2): 8, (c,3): 0`? Assume that the empty square is represented by '0'. Note that moves cannot push any square beyond the board boundary.

{Answer Prompt}

Answer Prompt: Provide a summary of the final answer after 'Final Answer:'
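
The Predict State example for the Number puzzle can be replayed with a small simulator. This is a sketch under assumptions that are not stated in the task text: the letter is read as the row (a at the top), the number as the column (1 at the left), and each move names the direction in which a numbered square slides into the empty space. This reading reproduces the listed ground truth; the function and constant names are hypothetical.

```python
# Where the sliding square sits relative to the empty space, as (d_row, d_col).
# Assumption: rows a..c run top to bottom, columns 1..3 run left to right.
SOURCE_OFFSET = {"up": (1, 0), "down": (-1, 0), "left": (0, 1), "right": (0, -1)}

# Start-of-video board from the Infer State ground truth (0 is the empty space).
START = {
    ("a", 1): 2, ("a", 2): 1, ("a", 3): 7,
    ("b", 1): 4, ("b", 2): 3, ("b", 3): 0,
    ("c", 1): 6, ("c", 2): 5, ("c", 3): 8,
}


def apply_moves(board, moves):
    """Return a new board after sliding squares into the empty space."""
    board = dict(board)
    empty = next(pos for pos, value in board.items() if value == 0)
    for move in moves:
        d_row, d_col = SOURCE_OFFSET[move]
        src = (chr(ord(empty[0]) + d_row), empty[1] + d_col)
        if src not in board:
            raise ValueError(f"move {move!r} has no square to slide")
        # The square at src slides into the empty space; src becomes empty.
        board[empty], board[src] = board[src], 0
        empty = src
    return board


predicted = apply_moves(START, ["right", "left", "up"])
```

Under the same assumptions, `apply_moves(START, ["right", "up", "left"])` reaches the arrangement requested in the Predict Operation example. It is one valid sequence, not necessarily the reference answer.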

VideoReasonBench Examples - Cup

Task Description: The video presents a 'Tricky Cup' puzzle on a 3x3 board. The board is filled with blue cups, each hiding either a yellow coin or nothing underneath. At the start, all cups are briefly lifted to reveal what's beneath them. Then, the cups begin a series of moves: each move swaps the positions of two cups, along with their hidden contents. Please carefully watch the video and answer the following question:

Recall Order

Question:

{Task Description}

Assume that each time two cups swap their positions, it counts as one move. What are the 2nd to 4th moves shown in the video? For each move, provide the move number and the coordinates of the two cups that swapped positions. Format your response like this: 1st: (a1, b2), 2nd: (c2, b1), 3rd: (a3, c1).

{Answer Prompt}

Ground truth: 2nd: (a2, c2), 3rd: (b2, c1), 4th: (b1, c3)

Recall Count

Question:

{Task Description}

How many times were the cups in the row 'b' involved in the swaps? For each instance, provide the coordinate(s) of the cup(s) before the swap occurred. Format your response like this: 1st: a1, 2nd: a3, 3rd: (a1,a2) (Use a single coordinate for individual cups, or a tuple for multiple cups involved in the same swap.)

{Answer Prompt}

Ground truth: 1st: b2, 2nd: b2, 3rd: b1

Infer State

Question:

{Task Description}

What are the positions of all the coins at the end of the video? Provide the coordinates of each coin (e.g., a2, b1, c3).

{Answer Prompt}

Ground truth: a2, b1, c1, c3

Compare State

Question:

{Task Description}

Compare the distribution of contents beneath the cups at the start and end of the video. What are the positions where the contents beneath the cups differ between the two boards? Provide their coordinates along with the corresponding content at the end of the video. Format your response like this: a1: empty, b3: coin.

{Answer Prompt}

Ground truth: a1: empty, a2: coin, c1: coin, c2: empty

Predict State

Question:

{Task Description}

If the distribution of coins on the board is currently in the same state as it was at the end of the video, and the following cup swaps are executed in order: `(a2, b2), (b2, b3), (b2, c2), (b2, b3), (a2, c3)`, what will be the new distribution of the coins? Provide the coordinates of the coins (e.g., a1, b2).

{Answer Prompt}

Ground truth: a2, b1, b2, c1

Predict Operation

Question:

{Task Description}

If the distribution of coins on the board is currently in the same state as it was at the end of the video, what sequence of cup swaps should be executed to achieve the desired distribution of coins: `b1, c1, c2, c3`? Format your response as a list of coordinate pairs, such as: (a1, b2), (c3, b1). Each pair represents a single swap between two cups.

{Answer Prompt}

Answer Prompt: Provide a summary of the final answer after 'Final Answer:'
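
The Cup examples reduce to tracking which positions hide a coin. A minimal sketch, with the hypothetical helper `apply_swaps`, replays the Predict State swaps starting from the Infer State ground truth:

```python
def apply_swaps(coins, swaps):
    """Return the set of coin positions after swapping cups pairwise.

    A swap only changes the coin layout when exactly one of the two
    cups hides a coin; in that case the coin moves to the other position.
    """
    coins = set(coins)
    for first, second in swaps:
        if (first in coins) != (second in coins):
            coins ^= {first, second}  # move the coin across
    return coins


# Coin positions at the end of the video (Infer State ground truth).
END_OF_VIDEO = {"a2", "b1", "c1", "c3"}

# Swaps from the Predict State question, in order.
SWAPS = [("a2", "b2"), ("b2", "b3"), ("b2", "c2"), ("b2", "b3"), ("a2", "c3")]

predicted = apply_swaps(END_OF_VIDEO, SWAPS)
```

The result matches the Predict State ground truth. For the Predict Operation example, the single swap `(a2, c2)` is one sequence that yields the desired distribution under this model.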

VideoReasonBench Examples - Circle

Task Description: The video presents a 3x3 grid. At the beginning of the video, all positions on the grid are filled with either a black or white piece. Then, these pieces are visually hidden but still remain in their original positions. A red circle then moves across the grid. Each time the red circle passes by a position on the grid (excluding the starting position), the color of the piece at that position *and* the colors of its immediate orthogonal neighbors (up, down, left, and right) are flipped: black becomes white, and white becomes black. Note that diagonal neighbors are *not* affected. Neighbors are only considered if they exist within the grid's boundaries. Please carefully watch the video and answer the following question:

Recall Order

Question:

{Task Description}

Assume that each time the red circle moves from one grid intersection to an adjacent one (horizontally or vertically), it counts as one move. What are the directions (left, right, up, down) of the 2nd to 3rd moves made by the red circle in the video? List them in order.

{Answer Prompt}

Ground truth: down, down

Recall Count

Question:

{Task Description}

Assume that each time the red circle moves from one grid intersection to an adjacent one (horizontally or vertically), it counts as one move. Given the movement direction `left`, how many times does the red circle perform this move? For each occurrence, provide the coordinate of the position before the move (e.g., (a,1), (c,2), etc).

{Answer Prompt}

Ground truth: 2: (a,2), (c,2)

Infer State

Question:

{Task Description}

What is the arrangement of the black and white pieces on the grid at the end of the video? Provide each piece's coordinates and color using the format: (column, row): color (e.g., (a,1): black, (c,2): white).

{Answer Prompt}

Ground truth: (a, 0): black, (a, 1): black, (a, 2): black, (b, 0): black, (b, 1): white, (b, 2): black, (c, 0): white, (c, 1): white, (c, 2): black

Compare State

Question:

{Task Description}

Assume that each time the red circle moves from one grid intersection to an adjacent one (horizontally or vertically), it counts as one move. Compare the arrangement of black and white pieces on the grid at the start and end of the video. What are the coordinates where the piece colors differ between the two grids? Provide these coordinates along with the corresponding piece color at the end of the video, using the format: (column, row): color (e.g., (a,1): black, (c,2): white).

{Answer Prompt}

Ground truth: (a,0): black, (a,2): black, (b,0): black, (c,2): black

Predict State

Question:

{Task Description}

Assume that each time the red circle moves from one grid intersection to an adjacent one (horizontally or vertically), it counts as one move. If the arrangement of black and white pieces and the position of the red circle on the grid is currently in the same state as it was at the end of the video, and the following moves are executed: `right, left, up, left, right`, what will be the arrangement of black and white pieces on the grid? Provide each piece's coordinates and color using the format: (column, row): color (e.g., (a,1): black; (c,2): white).

{Answer Prompt}

Ground truth: (a,0): white, (a,1): black, (a,2): black, (b,0): white, (b,1): white, (b,2): white, (c,0): white, (c,1): white, (c,2): black

Predict Operation

Question:

{Task Description}

Assume that each time the red circle moves from one grid intersection to an adjacent one (horizontally or vertically), it counts as one move. The red circle cannot move beyond the grid boundary. If the arrangement of black and white pieces and the position of the red circle on the grid is currently in the same state as it was at the end of the video, what sequence of moves (left, right, up, down) should be executed by the red circle to achieve the desired arrangement of black and white pieces: `(a,0): black, (a,1): white, (a,2): black, (b,0): black, (b,1): black, (b,2): black, (c,0): black, (c,1): black, (c,2): white`? List them in order.

{Answer Prompt}

Answer Prompt: Provide a summary of the final answer after 'Final Answer:'
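
The flipping rule in the Circle task can be sketched as a short simulator. Several details here are assumptions inferred from the examples rather than stated in the task text: the letter is treated as the vertical axis (a at the top), the number as the horizontal axis (0 at the left), and the red circle is taken to rest at (c,1) when the video ends. With these choices the sketch reproduces the Predict State ground truth; all names are hypothetical.

```python
LETTERS = "abc"
STEP = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

# End-of-video arrangement from the Infer State ground truth.
END_GRID = {
    ("a", 0): "black", ("a", 1): "black", ("a", 2): "black",
    ("b", 0): "black", ("b", 1): "white", ("b", 2): "black",
    ("c", 0): "white", ("c", 1): "white", ("c", 2): "black",
}
END_CIRCLE = ("c", 1)  # assumed, not given in the text


def apply_moves(grid, circle, moves):
    """Move the circle; flip each visited piece and its orthogonal neighbors."""
    grid = dict(grid)
    row, col = LETTERS.index(circle[0]), circle[1]
    for move in moves:
        d_row, d_col = STEP[move]
        row, col = row + d_row, col + d_col
        if not (0 <= row < 3 and 0 <= col < 3):
            raise ValueError(f"move {move!r} leaves the grid")
        for r, c in [(row, col), (row - 1, col), (row + 1, col),
                     (row, col - 1), (row, col + 1)]:
            if 0 <= r < 3 and 0 <= c < 3:  # neighbors outside the grid are skipped
                key = (LETTERS[r], c)
                grid[key] = "white" if grid[key] == "black" else "black"
    return grid, (LETTERS[row], col)


predicted, final_circle = apply_moves(
    END_GRID, END_CIRCLE, ["right", "left", "up", "left", "right"]
)
```

Visiting the same position twice cancels its flips, which is why only an odd number of visits to a position changes the final arrangement.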

VideoReasonBench Examples - File

Task Description: The video demonstrates a series of file manipulation commands executed in the Linux command line. To ensure accurate understanding, note these assumptions:
`touch` commands: All files created by `touch` do not exist in the target directory prior to the command's execution.
`rm -r` commands: All files deleted by `rm -r` do exist in the target directory prior to the command's execution.
`cp` and `mv` commands: All source files used by `cp` and `mv` do exist in the source directory prior to the command's execution. The destination path for `cp` and `mv` commands does not contain the target files prior to the command.
Please carefully watch the video and answer the following question:

Recall Order

Question:

{Task Description}

What are the 1st to 3rd `touch` commands shown in the video? Provide the order of each command (e.g., 1st, 2nd, 3rd, etc) along with the command content.

{Answer Prompt}

Ground truth:
1st: touch path0/{s.csv,q.json,l.py,w.csv,n.csv,j.json,q.py}
2nd: touch path0/{f.csv,a.json,z.json,w.py,x.py,l.txt,a.csv}
3rd: touch path0/{e.csv,i.json,t.txt,d.json}

Recall Count

Question:

{Task Description}

How many different `.csv` files were involved in the `rm` commands throughout the video? Provide the file count along with the specific file names (e.g., 2 `.txt` files: a.txt, b.txt).

{Answer Prompt}

Ground truth: 2 `.csv` files: b.csv, n.csv

Infer State

Question:

{Task Description}

At the end of the video, how many `.py` files remain in `path0/`? Provide the file count along with the specific file names (e.g., 2 `.txt` files: a.txt, b.txt).

{Answer Prompt}

Ground truth: 4 `.py` files: v.py, x.py, q.py, w.py

Compare State

Question:

{Task Description}

What files were in `path0/` at the start of the video, but were not there at the end of the video?

{Answer Prompt}

Ground truth: b.csv, g.txt, o.txt, p.py

Predict State

Question:

{Task Description}

If the paths currently contain exactly the same files as they did at the end of the video, and we run the command `touch path0/{x.json,z.txt,f.py,w.json,h.py,t.json} & rm -rf path0/{s.csv,w.py,j.json,d.json} & rm -rf path0/{a.csv} & rm -rf path0/{a.json} & rm -rf path0/{w.csv,x.txt,t.txt,h.py}`, which `.csv` files would be in `path0`?

{Answer Prompt}

Ground truth: f.csv, n.csv, v.csv, d.csv, h.csv, e.csv

Predict Operation

Question:

{Task Description}

If the paths currently contain exactly the same files as they did at the end of the video, to ensure that `path0` contains exactly the following files: `l.txt, w.py, v.py, q.py, a.csv, d.csv, i.json, w.csv, z.json, x.txt, f.csv, y.txt, e.csv, v.csv, j.json, q.json, a.json, n.csv, d.py, d.json, f.py, t.txt`, what sequence of commands should be executed?
Rules:
1. You may only use the commands `touch` and `rm -rf`.
2. You may use at most two commands.
3. Files specified in `touch` must not appear in `rm -rf` command, and vice versa (i.e., no overlap).

Response Format:
If multiple commands are used, separate them with `&`. For example, `touch path0/{a.txt,b.txt} & rm -rf path0/{c.py,d.json}`.

{Answer Prompt}

Answer Prompt: Provide a summary of the final answer after 'Final Answer:'
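
The `touch` and `rm -rf` commands in the File examples can be modeled as set operations on each path. The sketch below is illustrative only: `START` is a hypothetical directory listing chosen so that the Predict State command yields the listed `.csv` files, since the full end-of-video state is not given here. Following the response format above, `&` is treated as a sequential separator (in a real shell it would background the preceding command), and only the `path/{a,b,c}` brace form is parsed.

```python
import re

COMMAND = re.compile(r"(touch|rm -rf) (\w+)/\{([^}]*)\}")

# Hypothetical listing, NOT the actual end-of-video state.
START = {"path0": {"f.csv", "n.csv", "v.csv", "d.csv", "h.csv", "e.csv",
                   "s.csv", "a.csv", "w.csv", "w.py", "j.json"}}

# Command string from the Predict State question.
SCRIPT = ("touch path0/{x.json,z.txt,f.py,w.json,h.py,t.json} & "
          "rm -rf path0/{s.csv,w.py,j.json,d.json} & rm -rf path0/{a.csv} & "
          "rm -rf path0/{a.json} & rm -rf path0/{w.csv,x.txt,t.txt,h.py}")


def run(state, script):
    """Apply touch / rm -rf commands to a {path: set_of_files} mapping."""
    state = {path: set(files) for path, files in state.items()}
    for part in script.split("&"):
        match = COMMAND.fullmatch(part.strip())
        if match is None:
            raise ValueError(f"unsupported command: {part.strip()!r}")
        op, path, names = match.groups()
        files = set(names.split(","))
        bucket = state.setdefault(path, set())
        if op == "touch":
            bucket |= files   # create files
        else:
            bucket -= files   # rm -rf silently ignores missing files
    return state


remaining_csv = {f for f in run(START, SCRIPT)["path0"] if f.endswith(".csv")}
```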

VideoReasonBench Examples - Card

Task Description: The video showcases a sequence of operations involving one or more piles of cards. It begins by displaying the initial arrangement of cards in each pile from top to bottom. The cards are then turned face down, after which a series of actions is carried out. Note that there are only two types of actions: adding one card to the top of the pile or removing one card from the bottom of the pile. Please carefully watch the video and answer the following question:

Recall Order

Question:

{Task Description}

What are the 1st to 3rd cards being added to any pile throughout the video? For each card, provide the following details:
1. The order (e.g., 1st or 2nd)
2. The suit and value (e.g., 6 of Hearts)
3. The pile involved (e.g., pile0, pile1)

Format your response like this:
1st: 6 of Hearts, pile0
2nd: Jack of Spades, pile1.

{Answer Prompt}

Ground truth: 1st: Queen of Clubs, pile0, 2nd: Ace of Hearts, pile0, 3rd: Jack of Diamonds, pile0

Recall Count

Question:

{Task Description}

How many cards were added to `pile0` throughout the video? For each card, provide its suit and value (e.g., 6 of Hearts)
Format your response like this:
2 cards: 6 of Hearts, King of Clubs.

{Answer Prompt}

Ground truth: 4 cards: Jack of Clubs, Jack of Diamonds, Queen of Clubs, Ace of Hearts

Infer State

Question:

{Task Description}

At the end of the video, what cards are in `pile0`? List them in order from top to bottom, including both the value and suit of each card.
Format your response like this:
6 of Hearts, King of Clubs, 3 of Spades.

{Answer Prompt}

Ground truth: Jack of Clubs, Jack of Diamonds, Ace of Hearts, Queen of Clubs, King of Diamonds, 8 of Diamonds, 3 of Diamonds, 8 of Clubs

Compare State

Question:

{Task Description}

What cards were in `pile0` at the start of the video, but were not there at the end of the video? For each card, provide its suit and value.
Format your response like this:
6 of Hearts, King of Clubs, 3 of Spades.

{Answer Prompt}

Ground truth: 7 of Spades

Predict State

Question:

{Task Description}

If the piles currently contain exactly the same cards as they did at the end of the video, and now we perform these actions in order: `remove 8 of Clubs from pile0, remove 3 of Diamonds from pile0, remove 8 of Diamonds from pile0, remove King of Diamonds from pile0, add Jack of Spades to pile0`. What cards would be in `pile0`? List them in order from top to bottom, including both the value and suit of each card.
Format your response like this:
6 of Hearts, King of Clubs, 3 of Spades.

{Answer Prompt}

Ground truth: Jack of Spades, Jack of Clubs, Jack of Diamonds, Ace of Hearts, Queen of Clubs

Predict Operation

Question:

{Task Description}

If the piles currently contain exactly the same cards as they did at the end of the video, to ensure that `pile0` contains exactly the following cards from top to bottom: `7 of Diamonds, Jack of Clubs, Jack of Diamonds, Ace of Hearts`, what sequence of actions should be performed?
Rules:
1. Each action must either add a card to a pile or remove a card from a pile.
2. You may only add cards to the top of a pile or remove cards from the bottom of a pile.

Response Format:
List the actions in sequence, specifying the action, card, and pile. Separate each action with a comma.
For example, `add 6 of Hearts to pile0, remove King of Clubs from pile0`

{Answer Prompt}

Answer Prompt: Provide a summary of the final answer after 'Final Answer:'
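
Because cards are added to the top and removed from the bottom, each pile behaves like a queue. A minimal sketch using `collections.deque` replays the Predict State actions on the Infer State ground truth; the function name and the `(action, card)` tuple encoding are illustrative choices.

```python
from collections import deque

# pile0 at the end of the video, top to bottom (Infer State ground truth).
END_PILE0 = [
    "Jack of Clubs", "Jack of Diamonds", "Ace of Hearts", "Queen of Clubs",
    "King of Diamonds", "8 of Diamonds", "3 of Diamonds", "8 of Clubs",
]


def apply_actions(pile, actions):
    """Apply ("add" | "remove", card) actions to a pile listed top to bottom."""
    pile = deque(pile)
    for kind, card in actions:
        if kind == "add":
            pile.appendleft(card)          # new cards go on top
        elif kind == "remove":
            if not pile or pile[-1] != card:
                raise ValueError(f"{card} is not at the bottom of the pile")
            pile.pop()                     # cards leave from the bottom
        else:
            raise ValueError(f"unknown action: {kind}")
    return list(pile)


predicted = apply_actions(END_PILE0, [
    ("remove", "8 of Clubs"), ("remove", "3 of Diamonds"),
    ("remove", "8 of Diamonds"), ("remove", "King of Diamonds"),
    ("add", "Jack of Spades"),
])
```

Under the same rules, removing the bottom five cards and then adding the 7 of Diamonds is one sequence that produces the pile requested in the Predict Operation example.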

VideoReasonBench Examples - Chip

Task Description: The video showcases a sequence of operations involving one or more cup(s) and chips. It begins by showing the initial chips contained in each cup. Then, a series of actions are carried out. Note that there are only two types of actions: adding one chip to a cup or removing one chip from a cup. Please carefully watch the video and answer the following question:

Recall Order

Question:

{Task Description}

Disregarding the process of revealing all chips in the cup(s), what are the 1st to 3rd chips being added to any cup throughout the video? For each chip, provide the following details:
1. The order (e.g., 1st or 2nd)
2. The value (e.g., 20)
3. The cup involved (e.g., cup0, cup1)
Format your response like this:
1st: 100, cup0
2nd: 20, cup1

{Answer Prompt}

Ground truth: 1st: 100, cup0, 2nd: 10, cup0, 3rd: 5, cup0

Recall Count

Question:

{Task Description}

Disregarding the process of revealing all chips in the cup(s), how many chips were added to `cup0` throughout the video? For each chip, provide its value (order does not matter).
Format your response like this:
4 chips: 20, 5, 100, 100.

{Answer Prompt}

Ground truth: 3 chips: 100, 10, 5

Infer State

Question:

{Task Description}

At the end of the video, how many chips were in `cup0`? For each chip, provide its value (order does not matter).
Format your response like this:
4 chips: 20, 5, 100, 100.

{Answer Prompt}

Ground truth: 4 chips: 20, 100, 10, 5

Compare State

Question:

{Task Description}

At which point in the video is the total value of chips in `cup0` higher, at start or end? Also, what is the difference in value between the two times?
Format your response like this:
{time_with_higher_value}, {difference_in_value} (e.g., start, 115).

{Answer Prompt}

Ground truth: end, 5

Predict State

Question:

{Task Description}

If the cups currently contain exactly the same chips as they did at the end of the video, and now we perform these actions in order: `remove 5 from cup0, remove 20 from cup0, remove 10 from cup0, add 5 to cup0, remove 100 from cup0`. How many chips would be in `cup0`? For each chip, provide its value (order does not matter).
Format your response like this:
4 chips: 20, 5, 100, 100.

{Answer Prompt}

Ground truth: 1 chips: 5

Predict Operation

Question:

{Task Description}

If the cups currently contain exactly the same chips as they did at the end of the video, to ensure that `cup0` contains exactly the following chips: `100, 10, 5, 5, 10, 20` (order does not matter), what sequence of actions should be performed?
Rules:
1. Each action must either add a chip to a cup or remove a chip from a cup.
2. Available chips for addition are: 5, 10, 20, 50, 100.
3. You may only remove a chip if it is already present in the cup.

Response Format:
List the actions in sequence, specifying the action, chip, and cup. Separate each action with a comma.
For example, `add 20 to cup0, remove 50 from cup0`

{Answer Prompt}

Answer Prompt: Provide a summary of the final answer after 'Final Answer:'
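
Since chip order does not matter, a cup can be modeled as a multiset. The sketch below uses `collections.Counter` to replay the Predict State actions and to derive one possible action list for the Predict Operation example; `apply_actions` and `plan` are hypothetical helper names.

```python
from collections import Counter

# cup0 at the end of the video (Infer State ground truth).
END_CUP0 = [20, 100, 10, 5]


def apply_actions(cup, actions):
    """Apply ("add" | "remove", value) actions; return the sorted chip values."""
    cup = Counter(cup)
    for kind, value in actions:
        if kind == "add":
            cup[value] += 1
        elif kind == "remove":
            if cup[value] == 0:
                raise ValueError(f"no {value} chip in the cup")
            cup[value] -= 1
        else:
            raise ValueError(f"unknown action: {kind}")
    return sorted(cup.elements())


def plan(current, target):
    """One action list turning `current` into `target` (removals first)."""
    current, target = Counter(current), Counter(target)
    removes = sorted((current - target).elements())
    adds = sorted((target - current).elements())
    return [("remove", v) for v in removes] + [("add", v) for v in adds]


predicted = apply_actions(END_CUP0, [
    ("remove", 5), ("remove", 20), ("remove", 10), ("add", 5), ("remove", 100),
])
suggested = plan(END_CUP0, [100, 10, 5, 5, 10, 20])
```

The multiset difference gives a shortest action list, but any sequence that ends with the same chips would satisfy the rules as stated.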

VideoReasonBench

Can MLLMs Perform Vision-Centric Complex Video Reasoning?