EmbRACE-3K
Embodied Reasoning and Action in Complex Environments

1The University of Hong Kong, 2Tsinghua University, 3LIGHTSPEED, 4Beijing Normal University
*Equal Contribution, Project Lead, Corresponding Author

Preview

This demo showcases step-by-step decision-making by a fine-tuned Qwen2.5-VL-7B model, highlighting its ability to reason and act through perception-driven, closed-loop interaction.

Qwen Evaluation Demo

An overview of EmbRACE-3K, which consists of 3.1k tasks and 26k decision steps covering diverse environments and multi-stage tasks that require perception, reasoning, and action.

EmbRACE-3K Dataset Teaser

Abstract

Recent vision-language models (VLMs) show strong results on offline image and video understanding, but their performance in interactive, embodied environments remains limited. In closed-loop settings, an agent acts from a first-person view, where each decision alters future observations. Even leading models such as GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro struggle with spatial reasoning and long-horizon planning. We present EmbRACE-3K, a dataset of over 3,000 language-guided tasks in diverse Unreal Engine environments. Each task spans multiple steps, with egocentric views, high-level instructions, grounded actions, and natural language rationales. We benchmark VLMs on three core skills: exploration, dynamic spatial-semantic reasoning, and multi-stage goal execution. In zero-shot tests, all models achieve success rates below 20 percent, showing clear room for improvement. Fine-tuning Qwen2.5-VL-7B with supervised learning and reinforcement learning yields consistent gains across all task types, demonstrating the value of EmbRACE-3K for developing embodied intelligence.
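To make the closed-loop setting concrete, the minimal sketch below shows how one episode unfolds: the agent receives an instruction, observes an egocentric view, and each action it takes determines the next observation. The environment and agent interfaces (reset, step, decide) are illustrative assumptions, not the actual EmbRACE-3K API.

# Minimal sketch of the closed-loop setting, assuming hypothetical env/agent
# interfaces (reset, step, decide); this is not the actual EmbRACE-3K API.

def run_episode(env, agent, instruction, max_steps=50):
    """Roll out one language-guided task in a closed loop."""
    observation = env.reset()  # initial egocentric first-person view
    for _ in range(max_steps):
        # Each decision is conditioned on the latest observation, so every
        # action changes what the agent will see next.
        action, rationale = agent.decide(instruction, observation)
        observation, done, success = env.step(action)
        if done:
            return success
    return False  # ran out of steps without completing the task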

Data Collection

The EmbRACE-3K dataset is built in four stages: (1) sampling diverse 6-DoF agent poses and their egocentric views in virtual environments, (2) generating grounded task instructions with Gemini, (3) collecting human demonstrations, and (4) annotating each action with step-wise natural language reasoning that explains the agent's decisions and improves interpretability.
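The rough, self-contained sketch below mirrors these four stages in plain Python. Every helper is a placeholder stand-in written for illustration only; none of it is the actual tooling, prompts, or environment APIs used to build the dataset.

# Placeholder sketch of the four-stage collection pipeline described above.
import random

def sample_agent_pose():
    # (1) Sample a diverse 6-DoF pose (position + orientation) in the scene.
    return {"xyz": [round(random.uniform(-10, 10), 2) for _ in range(3)],
            "rpy": [round(random.uniform(-180, 180), 1) for _ in range(3)]}

def generate_instruction(ego_view_path):
    # (2) Placeholder for prompting Gemini with the rendered egocentric view.
    return "Walk to the red door at the end of the hallway and open it."

def collect_demonstration(instruction):
    # (3) Placeholder for a human demonstration recorded as (view, action) steps.
    return [{"ego_view": f"step_{i:02d}.png", "action": a}
            for i, a in enumerate(["move_forward", "turn_left", "open_door"])]

def annotate_reasoning(instruction, step):
    # (4) Placeholder for step-wise natural language rationale annotation.
    return f"Given the goal '{instruction}', a reasonable next move is {step['action']}."

def build_trajectory():
    pose = sample_agent_pose()
    instruction = generate_instruction("step_00.png")
    steps = collect_demonstration(instruction)
    for step in steps:
        step["thinking"] = annotate_reasoning(instruction, step)
    return {"start_pose": pose, "instruction": instruction, "steps": steps}

print(build_trajectory())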

Data Collection Pipeline

Data Statistics

Data Example

The EmbRACE-3K dataset provides six types of embodied tasks, each represented as a trajectory of step-by-step interaction. These tasks are performed in a closed-loop manner, where the agent's perception at each step determines its next action, forming a tight feedback cycle between observation and decision-making. For every step, the dataset includes an instruction, an egocentric view image, thinking, and an action. Below are a few demo examples.
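For concreteness, the snippet below sketches what a single decision step might look like. The field names follow the components listed above (instruction, egocentric view image, thinking, action); the exact schema, keys, and values are assumptions rather than the released file format.

# Illustrative shape of one decision step (schema and values are assumed).
step_example = {
    "instruction": "Find the blue sofa in the living room and sit on it.",
    "ego_view": "trajectory_0123/step_04.png",  # egocentric observation at this step
    "thinking": "The hallway ahead leads toward the living room, so keep moving forward.",
    "action": "move_forward",
}

# A task is a sequence of such steps executed in a closed loop: the action
# taken at one step determines the egocentric view observed at the next.
trajectory = [step_example]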


Baseline Comparison

We show the qualitative performance of several strong vision-language models (VLMs), namely GPT-4o, Gemini 2.5 Pro, Qwen2.5-VL 7B (Original), and Qwen2.5-VL 7B (SFT + RL), on a variety of tasks from the EmbRACE-3K dataset. You can select a task type to preview how the different baselines behave in the same environment. The label in the upper-right corner of each video (e.g., x10) indicates its playback speed; the videos are accelerated for more efficient viewing.


BibTeX

@article{lin2025embrace3k,
  title={EmbRACE-3K: Embodied Reasoning and Action in Complex Environments},
  author={Lin, Mingxian and Huang, Wei and Li, Yitang and Jiang, Chengjie and Wu, Kui and Zhong, Fangwei and Qian, Shengju and Wang, Xin and Qi, Xiaojuan},
  journal={arXiv preprint arXiv:2507.10548},
  year={2025}
}

Contact

Please contact us at lmx.mingxian@gmail.com if you have any questions.