GAIA (General AI Assistants) is a benchmark of real-world tasks that require multi-step reasoning, web research, file inspection, and arithmetic. Tasks are grouped into three difficulty levels and evaluated by exact-match comparison to a reference answer.
QitOS provides GaiaAdapter to convert GAIA dataset rows into Task objects. The canonical execution path is qit bench run, while examples/benchmarks/gaia_eval.py remains available as a thin wrapper around the same official result contract.
Setup
Install benchmark dependencies
pip install "qitos[benchmarks]"
Authenticate with HuggingFace (the GAIA dataset is gated)
export HF_TOKEN="hf_..."
Set your model API key
export OPENAI_API_KEY="sk-..."
# or for a custom endpoint:
export OPENAI_BASE_URL="https://api.siliconflow.cn/v1/"
Loading tasks
Use GaiaAdapter to load the dataset and convert rows to Task objects:
from qitos.benchmark import GaiaAdapter
adapter = GaiaAdapter()
# Load from HuggingFace (requires HF_TOKEN)
records = adapter.load_huggingface_records(split="validation")
tasks = adapter.to_tasks(records, split="validation", limit=10)
print(tasks[0].id) # e.g. "7bd4f145-3dfe-..."
print(tasks[0].objective) # The question text
print(tasks[0].inputs["level"]) # 1, 2, or 3
print(tasks[0].inputs["attachments"]) # list of file paths
Or use the one-line convenience loader:
from qitos.benchmark.gaia.adapter import load_gaia_tasks
tasks = load_gaia_tasks(split="validation", limit=20)
Loading from a local snapshot
If you have downloaded the dataset locally, load from disk to avoid repeated HuggingFace requests:
adapter = GaiaAdapter(local_dir="data/gaia")
# Download snapshot once
adapter.snapshot_dataset(local_dir="data/gaia", hf_token="hf_...")
# Load from local cache thereafter
records = adapter.load_local_records(split="validation", local_dir="data/gaia")
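When wiring this into a script, a small helper (hypothetical, not part of GaiaAdapter) can decide whether the snapshot is usable before choosing a loader, assuming the snapshot stores each split in its own subdirectory:

```python
from pathlib import Path

def gaia_source(split: str, local_dir: str = "data/gaia") -> str:
    """Return "local" when a snapshot directory exists for the split,
    otherwise "huggingface". Hypothetical helper, not part of GaiaAdapter."""
    split_dir = Path(local_dir) / split
    return "local" if split_dir.is_dir() else "huggingface"

# With no snapshot on disk, fall back to HuggingFace.
print(gaia_source("validation", local_dir="/nonexistent"))
```

A script would then call `load_local_records` or `load_huggingface_records` based on the returned value.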
Configuration
GaiaAdapter accepts the following parameters:
| Parameter | Default | Description |
|---|---|---|
| dataset_name | "gaia-benchmark/GAIA" | HuggingFace repo ID |
| annotated_dataset_name | "smolagents/GAIA-annotated" | Annotated variant repo ID |
| local_dir | "data/gaia" | Local snapshot directory |
| config_name | "2023_all" | Dataset config passed to load_dataset |
| default_subset | None | Optional subset filter |
| default_max_steps | 24 | Step budget per task |
| include_raw_record | True | Attach raw row to task.metadata |
Running the evaluation
Start with the official CLI:
qit bench run \
--benchmark gaia \
--split validation \
--limit 50 \
--root data/gaia \
--output results/gaia_validation.jsonl \
--model-name "Qwen/Qwen3-8B"
Then aggregate and inspect:
qit bench eval --input results/gaia_validation.jsonl --json
qita board --logdir runs
For the benchmark-specific reference wrapper, use examples/benchmarks/gaia_eval.py. It runs an OpenDeepResearch-style ReAct agent equipped with web search, URL visiting, file reading, and command execution.
Run a single task:
python examples/benchmarks/gaia_eval.py \
--gaia-split validation \
--gaia-index 0 \
--max-steps 16 \
--model-name "Qwen/Qwen3-8B" \
--api-key "$OPENAI_API_KEY"
Run the full benchmark:
python examples/benchmarks/gaia_eval.py \
--run-all \
--gaia-split validation \
--limit 50 \
--concurrency 4 \
--max-steps 16 \
--output-jsonl results/gaia_validation.jsonl \
--trace-logdir runs \
--model-name "Qwen/Qwen3-8B" \
--api-key "$OPENAI_API_KEY"
Resume an interrupted run:
python examples/benchmarks/gaia_eval.py \
--run-all \
--resume \
--output-jsonl results/gaia_validation.jsonl \
--gaia-split validation \
--api-key "$OPENAI_API_KEY"
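The --resume flag plausibly works by skipping tasks whose results are already on disk. A sketch of that mechanism, under the assumption that completed work is keyed on the task_id field of each JSONL line (the actual script may differ):

```python
import json
from pathlib import Path

def completed_task_ids(output_jsonl: str) -> set:
    """Collect task_ids already recorded, so a resumed run can skip them."""
    path = Path(output_jsonl)
    if not path.exists():
        return set()
    done = set()
    with path.open() as fh:
        for line in fh:
            line = line.strip()
            if line:
                done.add(json.loads(line)["task_id"])
    return done

# A resumed run would then filter:
#   tasks = [t for t in tasks if t.id not in done]
```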
Pass --gaia-use-annotated to load the smolagents/GAIA-annotated variant, which includes pre-normalized answers for faster iteration.
Agent architecture
The evaluation script builds an OpenDeepResearchGaiaAgent with the following toolset:
from qitos.kit import CodingToolSet
from qitos.kit.tool.browser import (
ArchiveSearch,
FindInPage,
FindNext,
PageDown,
PageUp,
VisitURL,
WebSearch,
)
registry.register(WebSearch())
registry.register(VisitURL())
registry.register(PageDown())
registry.register(FindInPage())
registry.register(FindNext())
registry.register(ArchiveSearch())
registry.include(
CodingToolSet(
workspace_root=workspace_root,
include_notebook=False,
enable_lsp=False,
enable_tasks=False,
enable_web=False,
expose_modern_names=False,
)
)
The agent uses a ReActTextParser that expects output in the Thought: / Action: format. You can swap in any AgentModule subclass and pass it to Engine.run(task); the adapter produces standard Task objects.
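For illustration, a minimal parser for that format; this is a sketch of the shape the parser expects, not the actual ReActTextParser implementation:

```python
import re

def parse_react(text: str) -> dict:
    """Split one ReAct-style completion into its thought and action parts."""
    m = re.search(r"Thought:\s*(.*?)\s*Action:\s*(.*)", text, re.DOTALL)
    if m is None:
        raise ValueError("output does not match Thought:/Action: format")
    return {"thought": m.group(1), "action": m.group(2).strip()}

out = parse_react("Thought: I should search the web.\nAction: web_search(query='GAIA benchmark')")
print(out["action"])
```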
Task structure
Each Task produced by GaiaAdapter contains:
Task(
id="7bd4f145-...", # GAIA task_id or generated fallback
objective="What is the ...", # Question text
inputs={
"benchmark": "GAIA",
"split": "validation",
"question": "...",
"reference_answer": "42", # For evaluation, not passed to agent
"level": 1, # 1, 2, or 3
"attachments": ["data/gaia/validation/file.pdf"],
},
resources=[TaskResource(kind="file", path="...", required=False)],
env_spec=EnvSpec(type="host", capabilities=["fs.read_text", "cmd.run", "network.http"]),
budget=TaskBudget(max_steps=24),
)
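Because level lives in the inputs payload, a run can be sliced by difficulty before executing anything; a sketch using plain dicts shaped like that payload:

```python
from collections import Counter

# Plain dicts standing in for the "inputs" payload of each Task.
inputs = [
    {"question": "...", "level": 1},
    {"question": "...", "level": 2},
    {"question": "...", "level": 1},
]

# Count tasks per GAIA difficulty level (missing levels count as 0).
by_level = Counter(item["level"] for item in inputs)
print(by_level[1], by_level[2], by_level[3])  # 2 1 0
```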
Expected output
Each run appends one JSON line to the output file:
{
"task_id": "7bd4f145-3dfe-4c57-a0b2-abcdef123456",
"split": "validation",
"question": "What is the largest prime factor of ...",
"reference_answer": "17",
"prediction": "17",
"stop_reason": "final",
"steps": 8,
"error": null,
"latency_seconds": 14.2,
"trace_run_dir": "runs/qitos_gaia_odr_7bd4f145_20250101_120000"
}
Interpreting stop_reason:
- "final": the agent produced a Final Answer.
- "max_steps": the agent hit the step budget without answering.
- "exception": a runtime error occurred; check the error field.
Compute accuracy by comparing prediction to reference_answer with exact-match normalization (strip whitespace, lowercase). Then inspect trace runs with qita:
qita board --logdir runs
qita replay --run runs/qitos_gaia_odr_7bd4f145_20250101_120000
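The scoring rule above can be put into code. A minimal sketch that applies strip-and-lowercase normalization to each JSONL record; the official GAIA scorer applies further normalization (e.g. for numbers and lists), so treat this as a rough local check:

```python
import json

def normalize(answer: str) -> str:
    """Exact-match normalization as described: strip whitespace, lowercase."""
    return answer.strip().lower()

def score(jsonl_path: str) -> float:
    """Fraction of records whose prediction matches the reference answer."""
    total = correct = 0
    with open(jsonl_path) as fh:
        for line in fh:
            rec = json.loads(line)
            total += 1
            pred = rec["prediction"]
            if pred is not None and normalize(pred) == normalize(rec["reference_answer"]):
                correct += 1
    return correct / total if total else 0.0
```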