Building an AI Coding Agent Benchmark: A Complete Walkthrough from SWE-bench Principles to a Self-Built Evaluation Framework

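By 2026, AI coding agents have moved from "assisted completion" to "autonomously completing complex tasks": tools such as Cursor, Claude Code, and Codex process billions of lines of code every month. That leaves every team with a sharp question: how do you know how reliable your AI agent really is? On SWE-bench-style evaluations, even leading agents resolve only about half of real GitHub issues (49.2% in the reference figures later in this article), which means more than half of the tasks fail or produce incorrect code. If you are building, selecting, or integrating an AI coding agent, a systematic evaluation framework is the basis for making the right decision.

📌 Remember: Evaluating an AI agent is not about whether it can write code. It is about whether the code it writes runs correctly in production, introduces no new bugs, and breaks no existing functionality. The two are fundamentally different.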

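🔬 1. SWE-bench Evaluation Principles in Depth

1.1 The Core Design of SWE-bench

SWE-bench (Software Engineering Benchmark) was introduced in 2023 by researchers at Princeton NLP and is one of the most widely cited benchmarks for AI coding agents. Its core idea is simple: extract issues and their corresponding pull requests from real open-source GitHub projects, have the agent generate a code change from the issue description, and then verify the change by running the project's own tests.

What makes this design effective is that the tests are not written by the benchmark authors. They come from the project itself, including the tests the maintainers merged with the fixing pull request. The benchmark therefore measures an agent's ability to solve real engineering problems rather than its ability to answer exam-style questions.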

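// Core SWE-bench evaluation flow (simplified illustration)
// swe-bench-eval-flow.js
// cloneRepo, applyPatch, and runTests are placeholders for your own helpers.

async function evaluateAgent(agent, task) {
  // 1. Clone the target repository at the specified commit
  const repo = await cloneRepo(task.repo, task.baseCommit);

  // 2. Hand the issue description to the agent as input
  const patch = await agent.solve({
    issue: task.issueBody,
    repo: repo,
    language: task.language,
  });

  // 3. Apply the patch generated by the agent
  await applyPatch(repo, patch);

  // 4. Add the tests from the reference PR, then run them.
  //    The agent never sees task.testPatch.
  await applyPatch(repo, task.testPatch);
  const testResult = await runTests(repo);

  // 5. Decide whether the task passed
  return {
    passed: testResult.exitCode === 0,
    duration: testResult.duration,
    patch: patch,
    testOutput: testResult.stdout,
  };
}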
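The exit-code check in the simplified flow above is coarser than the criterion SWE-bench records per instance. Each instance lists FAIL_TO_PASS tests (failing before the fix, expected to pass after it) and PASS_TO_PASS tests (passing before and expected to keep passing). A minimal sketch of that check follows; `TestStatus`, `ResolutionCheck`, and `isResolved` are illustrative names, not part of the official harness.

```typescript
// Sketch of a SWE-bench-style resolution check (names are hypothetical).
type TestStatus = 'passed' | 'failed' | 'error' | 'skipped';

interface ResolutionCheck {
  resolved: boolean;
  stillFailing: string[]; // FAIL_TO_PASS tests that did not pass
  regressions: string[];  // PASS_TO_PASS tests that no longer pass
}

function isResolved(
  failToPass: string[],
  passToPass: string[],
  results: Record<string, TestStatus>
): ResolutionCheck {
  // A test missing from the results counts as not passed.
  const passed = (name: string): boolean => results[name] === 'passed';
  const stillFailing = failToPass.filter(t => !passed(t));
  const regressions = passToPass.filter(t => !passed(t));
  return {
    resolved: stillFailing.length === 0 && regressions.length === 0,
    stillFailing,
    regressions,
  };
}
```

Keeping the two lists separate makes it possible to report "fixed the bug but broke something else" as a distinct outcome from "did not fix the bug".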

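1.2 Composition of the SWE-bench Dataset

The SWE-bench dataset is drawn from 12 high-quality open-source Python projects. Each sample contains the following fields:

Field	Description	Example
`instance_id`	Unique identifier	`django__django-16527`
`repo`	Repository	`django/django`
`base_commit`	Commit the issue is reproduced on	`a1b2c3d...`
`problem_statement`	Issue description	“QuerySet.bulk_create() fails with…”
`hints_text`	Hints (optional)	Discussion from the PR
`patch`	Reference patch	Code change in diff format
`test_patch`	Test patch used for verification	Test cases that must pass

SWE-bench Lite is a reduced version with 300 samples selected to be more self-contained, which makes it better suited to quick evaluation. The human-validated subset is SWE-bench Verified (500 samples).

⚠️ Warning: SWE-bench results carry a data contamination risk. If the agent's training data includes the code and issues of these open-source projects, the results will be inflated. When building your own framework, use issues and PRs created after the training cutoff of the models you evaluate.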
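The time-split idea in the warning can be sketched as a simple filter. This is a minimal illustration, assuming each task carries an ISO 8601 `createdAt` date and that you know (or can conservatively estimate) the training cutoff of the model under test; `DatedTask` and `filterAfterCutoff` are hypothetical names.

```typescript
// Keep only tasks created strictly after the model's training cutoff.
interface DatedTask {
  id: string;
  createdAt: string; // ISO 8601 date of the issue or PR (assumed field)
}

function filterAfterCutoff<T extends DatedTask>(tasks: T[], trainingCutoff: string): T[] {
  const cutoff = Date.parse(trainingCutoff);
  if (Number.isNaN(cutoff)) {
    throw new Error(`Invalid cutoff date: ${trainingCutoff}`);
  }
  return tasks.filter(task => {
    const created = Date.parse(task.createdAt);
    // Tasks with unparseable dates are dropped rather than trusted.
    return !Number.isNaN(created) && created > cutoff;
  });
}
```

When comparing several models, the latest cutoff among them is the safe choice, since a task that is clean for one model may already be in another model's training data.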

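1.3 Why You Need Your Own Evaluation Framework

SWE-bench is authoritative, but it has three clear limitations:

The original dataset covers only Python projects, while your agent may need to handle TypeScript, Java, Go, and other languages
It evaluates only issue-driven code changes and does not cover code review, refactoring, test generation, or other agent capabilities
It is expensive to run: every sample needs a full repository checkout and test run, so a single full evaluation takes hours

For real agent development and selection, you therefore need a self-built framework that is customizable, low-cost, and multi-dimensional.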

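🏗️ 2. Building the Core Architecture from Scratch

2.1 Overall Architecture

The framework in this article is organized around five pieces: task definitions, a sandbox executor, a scoring engine, a dataset builder, and a comparison and CI layer. The interfaces below define the data they exchange: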

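// benchmark-architecture.ts: core data types of the evaluation framework
// QualityMetric is referenced but not defined in this article; define it to suit your own checks.

interface BenchmarkTask {
  id: string;
  type: 'bug-fix' | 'feature' | 'refactor' | 'test-gen' | 'code-review';
  language: string;
  repository: string;
  baseCommit: string;
  description: string;
  context: TaskContext;        // Relevant files and dependency information
  expected: ExpectedResult;    // Expected outcome (patch, tests, scoring criteria)
  difficulty: 'easy' | 'medium' | 'hard';
  tags: string[];
}

interface TaskContext {
  relevantFiles: string[];     // Files the agent should focus on
  dependencies: string[];      // Project dependencies
  testCommand: string;         // Command that runs the tests
  buildCommand: string;        // Command that builds the project
  maxTokens: number;           // Token budget ceiling
}

interface ExpectedResult {
  patch?: string;              // Reference code change
  testFiles?: string[];        // Test files expected to be generated
  mustPassTests: string[];     // Tests that must pass
  mustNotBreakTests: string[]; // Existing tests that must not break
  qualityMetrics: QualityMetric[];
}

// Only `total` is used elsewhere in this article; add prompt/completion splits if needed.
interface TokenUsage {
  total: number;
}

interface BenchmarkResult {
  taskId: string;
  agentId: string;
  passed: boolean;
  score: number;               // Overall score, 0-100
  metrics: {
    correctness: number;       // Functional correctness
    completeness: number;      // Task completeness
    codeQuality: number;       // Code quality
    efficiency: number;        // Execution efficiency
    safety: number;            // Safety (whether vulnerabilities were introduced)
  };
  tokenUsage: TokenUsage;
  duration: number;            // Milliseconds
  patch: string;
  testOutput: string;
  errors: string[];
}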

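2.2 Sandboxed Execution Environment

The most important safety requirement of the framework is that agent-generated code runs in an isolated sandbox. A bad patch must never affect the host system.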

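// sandbox.ts: Docker-based sandbox executor

import Docker from 'dockerode';

interface ExecutionResult {
  exitCode: number;
  stdout: string;
  stderr: string;
  duration: number;    // Milliseconds
  memoryPeak: number;  // Bytes; not measured in this sketch
}

class SandboxExecutor {
  private docker: Docker;

  constructor() {
    this.docker = new Docker({ socketPath: '/var/run/docker.sock' });
  }

  async execute(task: BenchmarkTask, patch: string): Promise<ExecutionResult> {
    // Create an isolated container. The image must already contain the
    // repository at /repo and all dependencies, because networking is disabled.
    const container = await this.docker.createContainer({
      Image: `benchmark-${task.language}:latest`,
      Cmd: ['sh', '-c', this.buildCommand(task, patch)],
      HostConfig: {
        NetworkMode: 'none',            // No network (the patch cannot call external APIs)
        Memory: 2 * 1024 * 1024 * 1024, // 2 GB memory limit
        CpuPeriod: 100000,
        CpuQuota: 200000,               // Quota / period = 2 CPU cores
        ReadonlyRootfs: false,
        AutoRemove: false,              // Logs are read after exit, so remove manually
      },
      Env: [
        `TASK_ID=${task.id}`,
        `TIMEOUT=300`,  // 5-minute timeout
      ],
    });

    try {
      await container.start();
      // Wait for completion or timeout
      return await this.waitForCompletion(container, 300_000);
    } finally {
      await container.remove({ force: true }).catch(() => undefined);
    }
  }

  private buildCommand(task: BenchmarkTask, patch: string): string {
    return [
      `cd /repo`,
      `git checkout ${task.baseCommit}`,
      // printf avoids the backslash interpretation that some `echo` builtins perform
      `printf '%s\\n' '${this.escapePatch(patch)}' | git apply -`,
      task.context.buildCommand,
      task.context.testCommand,
    ].join(' && ');
  }

  // Escape single quotes so the patch survives inside a single-quoted shell string
  private escapePatch(patch: string): string {
    return patch.replace(/'/g, `'\\''`);
  }

  private async waitForCompletion(
    container: Docker.Container,
    timeout: number
  ): Promise<ExecutionResult> {
    const startedAt = Date.now();
    let timer: ReturnType<typeof setTimeout> | undefined;

    const timedOut = new Promise<never>((_, reject) => {
      timer = setTimeout(() => {
        container.kill().catch(() => undefined);
        reject(new Error(`Execution timeout: ${timeout}ms`));
      }, timeout);
    });

    try {
      const data = await Promise.race([container.wait(), timedOut]);
      // Without a TTY the log buffer interleaves stdout and stderr frames,
      // each with an 8-byte header; a production version should demultiplex it.
      const logs = await container.logs({ stdout: true, stderr: true });
      return {
        exitCode: data.StatusCode,
        stdout: logs.toString(),
        stderr: '',
        duration: Date.now() - startedAt,
        memoryPeak: 0,
      };
    } finally {
      clearTimeout(timer);
    }
  }
}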

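💡 Tip: In production, consider gVisor or Firecracker instead of plain Docker containers for stronger isolation. Ordinary containers share the host kernel, and Docker's default seccomp profile still allows a large set of system calls, which is not enough when running untrusted code.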
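As a rough sketch of what a tighter container configuration might look like, the function below builds the `HostConfig` portion of a container-create request. The field names (`Runtime`, `CapDrop`, `SecurityOpt`, `PidsLimit`, and so on) are Docker Engine API fields; the specific limits, the `buildHostConfig` name, and the assumption that gVisor is installed and registered with Docker under the runtime name `runsc` are all illustrative.

```typescript
// Local type so the sketch stays self-contained; it mirrors a subset of
// the Docker Engine API HostConfig object.
interface SandboxHostConfig {
  NetworkMode: string;
  Memory: number;
  CpuPeriod: number;
  CpuQuota: number;
  PidsLimit: number;
  CapDrop: string[];
  SecurityOpt: string[];
  Runtime?: string;
}

function buildHostConfig(opts: {
  useGvisor: boolean;
  memoryBytes: number;
  cpus: number;
}): SandboxHostConfig {
  const period = 100_000;
  const config: SandboxHostConfig = {
    NetworkMode: 'none',                 // no network access
    Memory: opts.memoryBytes,
    CpuPeriod: period,
    CpuQuota: Math.round(opts.cpus * period),
    PidsLimit: 256,                      // assumed cap against fork bombs
    CapDrop: ['ALL'],                    // drop all Linux capabilities
    SecurityOpt: ['no-new-privileges'],  // block privilege escalation via setuid
  };
  if (opts.useGvisor) {
    config.Runtime = 'runsc';            // assumes gVisor is registered under this name
  }
  return config;
}
```

The result can be passed as the `HostConfig` value when creating the container; whether tests still run with all capabilities dropped depends on the project, so this is worth verifying per image.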

2.3 Multi-Dimensional Scoring

A single pass/fail metric is not enough to assess an agent. A mature framework scores along several dimensions:

// scorer.ts: multi-dimensional scoring engine
// Helpers such as scoreCompleteness, parseTestResults, hasLongFunctions,
// and hasDeepNesting are not shown in this article.

interface ScorerConfig {
  weights: Record<string, number>;
  thresholds: Record<string, number>;
}

interface ScoreBreakdown {
  total: number;                      // 0-100
  breakdown: Record<string, number>;  // Each dimension in [0, 1]
  passed: boolean;
}

// The scorer needs the patch and token usage alongside the sandbox output.
type ScoredRun = ExecutionResult & { patch: string; tokenUsage: TokenUsage };

class AgentScorer {
  private config: ScorerConfig;

  constructor(config?: Partial<ScorerConfig>) {
    this.config = {
      weights: {
        correctness: 0.40,   // Functional correctness (highest weight)
        completeness: 0.20,  // Task completeness
        codeQuality: 0.15,   // Code quality
        efficiency: 0.10,    // Token and time efficiency
        safety: 0.15,        // Safety
        ...config?.weights,
      },
      thresholds: {
        minCorrectness: 0.8,
        maxTokenBudget: 100000,
        ...config?.thresholds,
      },
    };
  }

  async score(result: ScoredRun, task: BenchmarkTask): Promise<ScoreBreakdown> {
    const scores = {
      correctness: await this.scoreCorrectness(result, task),
      completeness: await this.scoreCompleteness(result, task),
      codeQuality: await this.scoreCodeQuality(result.patch),
      efficiency: this.scoreEfficiency(result),
      safety: await this.scoreSafety(result.patch, task),
    };

    // Weighted total. Each dimension is in [0, 1] and the default weights sum
    // to 1, so scale by 100 to match the 0-100 range of BenchmarkResult.score.
    const totalScore = Object.entries(scores).reduce(
      (sum, [key, value]) => sum + value * (this.config.weights[key] ?? 0),
      0
    );

    return {
      total: Math.round(totalScore * 10000) / 100,
      breakdown: scores,
      passed: scores.correctness >= this.config.thresholds.minCorrectness,
    };
  }

  // Functional correctness: based on the test run
  private async scoreCorrectness(
    result: ExecutionResult,
    task: BenchmarkTask
  ): Promise<number> {
    const passedTests = this.parseTestResults(result.stdout);
    const requiredTests = task.expected.mustPassTests;
    const protectedTests = task.expected.mustNotBreakTests;

    // Share of required tests that passed
    const requiredPassed = requiredTests.filter(t => passedTests.includes(t)).length;
    const requiredScore = requiredTests.length > 0
      ? requiredPassed / requiredTests.length
      : 1;

    // Share of protected tests that were broken
    const brokenCount = protectedTests.filter(t => !passedTests.includes(t)).length;
    const brokenPenalty = protectedTests.length > 0
      ? brokenCount / protectedTests.length
      : 0;

    return Math.max(0, requiredScore - brokenPenalty);
  }

  // Only lines the patch adds should count against it
  private addedLines(patch: string): string {
    return patch
      .split('\n')
      .filter(line => line.startsWith('+') && !line.startsWith('+++'))
      .join('\n');
  }

  // Code quality: coarse static heuristics
  private async scoreCodeQuality(patch: string): Promise<number> {
    const added = this.addedLines(patch);
    const issues: string[] = [];

    // Match `any` as a type, not as a substring of words like "many"
    if (/:\s*any\b|\bas any\b|<any>/.test(added)) issues.push('uses the any type');
    if (added.includes('TODO')) issues.push('contains unfinished TODOs');
    if (added.includes('console.log')) issues.push('contains debug logging');
    if (added.includes('// @ts-ignore')) issues.push('skips type checking');
    if (this.hasLongFunctions(patch)) issues.push('contains overly long functions (>50 lines)');
    if (this.hasDeepNesting(patch)) issues.push('nesting is too deep (>4 levels)');

    // Deduct 0.1 per issue
    return Math.max(0, 1 - issues.length * 0.1);
  }

  // Efficiency: token consumption and time
  private scoreEfficiency(result: ScoredRun): number {
    const budget = this.config.thresholds.maxTokenBudget;
    const tokenScore = Math.max(0, 1 - result.tokenUsage.total / budget);
    const timeScore = Math.max(0, 1 - result.duration / 300000); // 5-minute baseline
    return (tokenScore + timeScore) / 2;
  }

  // Safety: substring heuristics that will produce false positives
  // (for example RegExp exec calls); use them as a signal for review.
  private async scoreSafety(patch: string, _task: BenchmarkTask): Promise<number> {
    const added = this.addedLines(patch);
    const vulnerabilities: string[] = [];

    // SQL injection risk
    if (added.includes('query(') && !added.includes('parameterized'))
      vulnerabilities.push('possible SQL injection');

    // XSS risk
    if (added.includes('innerHTML') && !added.includes('sanitize'))
      vulnerabilities.push('possible XSS');

    // Command or code injection risk
    if (added.includes('exec(') || added.includes('eval('))
      vulnerabilities.push('possible command injection');

    // Hard-coded secrets
    if (/['"][A-Za-z0-9+/=]{32,}['"]/.test(added))
      vulnerabilities.push('possible hard-coded secret');

    return Math.max(0, 1 - vulnerabilities.length * 0.2);
  }
}
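To make the weighting concrete, here is a small worked example of the weighted total using the default weights from the scorer (0.40, 0.20, 0.15, 0.10, 0.15). The per-dimension scores are made-up inputs for illustration, and `weightedScore` is a hypothetical standalone helper, not a method of the class above.

```typescript
type Dimension = 'correctness' | 'completeness' | 'codeQuality' | 'efficiency' | 'safety';

const DEFAULT_WEIGHTS: Record<Dimension, number> = {
  correctness: 0.40,
  completeness: 0.20,
  codeQuality: 0.15,
  efficiency: 0.10,
  safety: 0.15,
};

// Combine per-dimension scores in [0, 1] into a 0-100 total.
function weightedScore(
  scores: Record<Dimension, number>,
  weights: Record<Dimension, number> = DEFAULT_WEIGHTS
): number {
  const dims = Object.keys(weights) as Dimension[];
  const weightSum = dims.reduce((sum, d) => sum + weights[d], 0);
  if (Math.abs(weightSum - 1) > 1e-9) {
    throw new Error(`Weights must sum to 1, got ${weightSum}`);
  }
  const total = dims.reduce((sum, d) => sum + scores[d] * weights[d], 0);
  return Math.round(total * 10000) / 100;
}

// 1.0*0.40 + 0.8*0.20 + 0.7*0.15 + 0.5*0.10 + 1.0*0.15 = 0.865, i.e. 86.5 out of 100
const example = weightedScore({
  correctness: 1.0,
  completeness: 0.8,
  codeQuality: 0.7,
  efficiency: 0.5,
  safety: 1.0,
});
```

Validating that the weights sum to 1 matters once callers override individual weights, because a partial override silently changes the scale of the total.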

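📊 3. Dataset Design and CI/CD Integration

3.1 Designing a Custom Evaluation Dataset

A high-quality dataset covers several task types and difficulty levels. The recommended composition is:

Task type	Share	Difficulty distribution	Notes
Bug fix	35%	Easy 30%, medium 50%, hard 20%	The most essential capability
Feature implementation	25%	Easy 20%, medium 40%, hard 40%	Tests design ability
Refactoring	15%	Medium 60%, hard 40%	Tests code comprehension
Test generation	15%	Easy 40%, medium 40%, hard 20%	Tests edge-case thinking
Code review	10%	Medium 50%, hard 50%	Tests security awareness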
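The shares in the table can be turned into concrete per-type task counts for a dataset of a given size. The sketch below uses largest-remainder rounding so the counts always add up to the requested total; `allocateTasks` and `SHARE_PERCENT` are illustrative names.

```typescript
type TaskType = 'bug-fix' | 'feature' | 'refactor' | 'test-gen' | 'code-review';

// Shares from the table above, as whole percentages.
const SHARE_PERCENT: Record<TaskType, number> = {
  'bug-fix': 35,
  'feature': 25,
  'refactor': 15,
  'test-gen': 15,
  'code-review': 10,
};

function allocateTasks(total: number): Record<TaskType, number> {
  if (!Number.isInteger(total) || total < 0) {
    throw new Error('total must be a non-negative integer');
  }
  const types = Object.keys(SHARE_PERCENT) as TaskType[];
  const exact = types.map(t => (total * SHARE_PERCENT[t]) / 100);
  const counts = exact.map(x => Math.floor(x));
  let remaining = total - counts.reduce((a, b) => a + b, 0);

  // Hand leftover slots to the types with the largest fractional parts.
  const order = exact
    .map((x, i) => ({ i, frac: x - Math.floor(x) }))
    .sort((a, b) => b.frac - a.frac);
  for (const { i } of order) {
    if (remaining <= 0) break;
    counts[i] = (counts[i] ?? 0) + 1;
    remaining--;
  }

  const result = {} as Record<TaskType, number>;
  types.forEach((t, i) => { result[t] = counts[i] ?? 0; });
  return result;
}
```

For a 40-task dataset this gives 14 bug fixes, 10 features, 6 refactors, 6 test-generation tasks, and 4 code reviews. The same approach can be applied again within each type to split by difficulty.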

The best way to build the dataset is to extract real issues and PRs from your own projects:

// dataset-builder.ts: build an evaluation dataset from a GitHub repository
// detectLanguage and extractTestNames are helpers not shown in this article.

import { Octokit } from '@octokit/rest';

// Shared test-file check so code and test patches never overlap
const isTestFile = (filename: string): boolean =>
  filename.includes('test') || filename.includes('spec');

// The API's `patch` field holds only the hunks, so prepend file headers to
// make the result usable with `git apply`. This covers modified files only;
// added, deleted, and renamed files need different headers.
const withHeaders = (f: { filename: string; patch?: string }): string =>
  `diff --git a/${f.filename} b/${f.filename}\n--- a/${f.filename}\n+++ b/${f.filename}\n${f.patch}`;

class DatasetBuilder {
  private octokit: Octokit;

  constructor(token: string) {
    this.octokit = new Octokit({ auth: token });
  }

  async buildFromRepo(
    owner: string,
    repo: string,
    options: {
      minStars?: number;
      afterDate?: string;
      maxSamples?: number;
      includeTestPatch?: boolean;
    } = {}
  ): Promise<BenchmarkTask[]> {
    const tasks: BenchmarkTask[] = [];

    // Fetch closed PRs; only merged ones have a definite "correct answer".
    // This loads every closed PR up front, which can be slow on large repositories.
    const prs = await this.octokit.paginate(
      this.octokit.pulls.list,
      {
        owner,
        repo,
        state: 'closed',
        sort: 'updated',
        direction: 'desc',
        per_page: 100,
      }
    );

    for (const pr of prs) {
      if (!pr.merged_at) continue;
      if (options.afterDate && new Date(pr.merged_at) < new Date(options.afterDate)) continue;
      if (tasks.length >= (options.maxSamples || 100)) break;

      // Fetch the files changed by the PR (binary or very large files have no patch)
      const files = (await this.octokit.paginate(
        this.octokit.pulls.listFiles,
        { owner, repo, pull_number: pr.number }
      )).filter(f => f.patch !== undefined);

      // Keep only PRs that change test files, so the result is verifiable
      const testFiles = files.filter(f => isTestFile(f.filename));
      if (testFiles.length === 0) continue;

      // Split into code patch and test patch
      const codePatch = files
        .filter(f => !isTestFile(f.filename))
        .map(withHeaders)
        .join('\n');

      const testPatch = testFiles.map(withHeaders).join('\n');

      // Use the PR body as the task description; if the PR links an issue,
      // fetching that issue's text is usually a better prompt.
      const issueBody = pr.body || '';

      tasks.push({
        id: `${owner}__${repo}-${pr.number}`,
        type: this.classifyTask(issueBody, codePatch),
        language: this.detectLanguage(files),
        repository: `https://github.com/${owner}/${repo}`,
        // Base branch commit recorded on the PR; verify the patch applies cleanly to it
        baseCommit: pr.base.sha,
        description: issueBody,
        context: {
          relevantFiles: files.map(f => f.filename),
          dependencies: [],
          testCommand: `npm test`,  // Adjust per project
          buildCommand: `npm run build`,
          maxTokens: 50000,
        },
        expected: {
          patch: codePatch,
          testFiles: testFiles.map(f => f.filename),
          mustPassTests: this.extractTestNames(testPatch),
          mustNotBreakTests: [],
          qualityMetrics: [],
        },
        difficulty: this.estimateDifficulty(codePatch, issueBody),
        tags: pr.labels?.map(l => l.name) || [],
      });
    }

    return tasks;
  }

  // Keyword heuristic; the first matching category wins
  private classifyTask(description: string, patch: string): BenchmarkTask['type'] {
    const desc = description.toLowerCase();
    if (desc.includes('fix') || desc.includes('bug') || desc.includes('error'))
      return 'bug-fix';
    if (desc.includes('add') || desc.includes('feature') || desc.includes('implement'))
      return 'feature';
    if (desc.includes('refactor') || desc.includes('clean') || desc.includes('optimize'))
      return 'refactor';
    if (desc.includes('test') || desc.includes('coverage'))
      return 'test-gen';
    return 'code-review';
  }

  private estimateDifficulty(patch: string, description: string): 'easy' | 'medium' | 'hard' {
    // Count added and removed lines only, not headers or context
    const linesChanged = patch.split('\n').filter(l =>
      (l.startsWith('+') || l.startsWith('-')) &&
      !l.startsWith('+++ ') && !l.startsWith('--- ')
    ).length;
    const filesChanged = (patch.match(/^diff --git/gm) || []).length;

    if (linesChanged < 20 && filesChanged <= 2) return 'easy';
    if (linesChanged < 100 && filesChanged <= 5) return 'medium';
    return 'hard';
  }
}

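⚡ Key takeaway: The quality of the dataset determines how far the results can be trusted. Ideally every task is verified by a person to confirm that the reference answer really solves the problem and introduces no new bugs. An automatically built dataset needs a manual spot-check rate of at least 20%.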
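One way to pick the 20% for manual review is a seeded random sample, so the same review set can be regenerated later. This is a minimal sketch: `mulberry32` is a small, widely used non-cryptographic PRNG, and `sampleForReview` is a hypothetical helper name.

```typescript
// Small seeded PRNG (mulberry32); returns values in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Pick ceil(rate * n) items using a seeded Fisher-Yates shuffle.
function sampleForReview<T>(items: T[], rate: number, seed: number): T[] {
  if (rate < 0 || rate > 1) throw new Error('rate must be between 0 and 1');
  const random = mulberry32(seed);
  const copy = [...items];
  for (let i = copy.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    const tmp = copy[i] as T;
    copy[i] = copy[j] as T;
    copy[j] = tmp;
  }
  return copy.slice(0, Math.ceil(items.length * rate));
}
```

Recording the seed alongside the dataset version keeps the spot check auditable: anyone can reproduce which tasks were reviewed.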

3.2 CI/CD Integration: A Continuous Evaluation Pipeline

Integrating agent evaluation into CI/CD closes the loop of "model update → automatic evaluation → quality gate":

# .github/workflows/agent-benchmark.yml
name: AI Agent Benchmark

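on:
  # Triggered when the agent, prompts, or model configuration change
  push:
    paths:
      - 'agent/**'
      - 'prompts/**'
      - 'models/**'
  # Also run on pull requests, so the "Comment on PR" step can fire
  pull_request:
    paths:
      - 'agent/**'
      - 'prompts/**'
      - 'models/**'
  # Scheduled run every Monday at 09:00 UTC
  schedule:
    - cron: '0 9 * * 1'
  # Manual trigger
  workflow_dispatch:
    inputs:
      dataset:
        description: 'Evaluation dataset'
        default: 'lite'
        type: choice
        options: ['lite', 'full', 'custom']

jobs:
  benchmark:
    # Larger-runner label; it must exist in your organization, otherwise use ubuntu-latest
    runs-on: ubuntu-latest-8-cores
    timeout-minutes: 120
    permissions:
      contents: read
      pull-requests: write
    strategy:
      fail-fast: false
      matrix:
        # Example agent IDs; they must match what benchmark/run.ts accepts
        agent: ['claude-sonnet', 'gpt-4.1', 'deepseek-v4']

    steps:
      - uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '22'

      - name: Install dependencies
        run: npm ci

      - name: Run benchmark
        id: benchmark
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          npx tsx benchmark/run.ts \
            --dataset ${{ inputs.dataset || 'lite' }} \
            --agent ${{ matrix.agent }} \
            --output results.json

      - name: Check quality gate
        run: |
          SCORE=$(jq '.averageScore' results.json)
          echo "Average score: $SCORE"
          
          # Quality gate: the overall score must be at least 70
          if (( $(echo "$SCORE < 70" | bc -l) )); then
            echo "❌ Quality gate failed! Score $SCORE is below threshold 70"
            exit 1
          fi
          
          echo "✅ Quality gate passed with score $SCORE"

      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          # Artifact names must be unique per matrix job
          name: benchmark-results-${{ matrix.agent }}
          path: results.json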

      - name: Comment on PR
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const results = require('./results.json');
            const body = `## 🤖 AI Agent Benchmark Results
            
            | Metric | Score |
            |--------|-------|
            | Correctness | ${results.metrics.correctness}% |
            | Completeness | ${results.metrics.completeness}% |
            | Code Quality | ${results.metrics.codeQuality}% |
            | Efficiency | ${results.metrics.efficiency}% |
            | Safety | ${results.metrics.safety}% |
            | **Total** | **${results.averageScore}%** |
            
            Tasks: ${results.passed}/${results.total} passed`;
            
            github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: body,
            });
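The shell-based gate in the workflow checks only the average score. If you would rather keep the gate logic next to the benchmark code, it could be sketched in TypeScript as below. The `averageScore` and `metrics` fields follow the `results.json` shape the workflow reads; the per-metric minimums and the `checkQualityGate` name are assumptions added for illustration.

```typescript
interface BenchmarkSummary {
  averageScore: number;             // 0-100
  metrics: Record<string, number>;  // per-dimension percentages, 0-100
}

interface GateConfig {
  minAverage: number;
  minPerMetric?: Record<string, number>; // optional floors, e.g. { safety: 60 }
}

// Returns a list of human-readable failures; an empty list means the gate passed.
function checkQualityGate(summary: BenchmarkSummary, gate: GateConfig): string[] {
  const failures: string[] = [];
  if (summary.averageScore < gate.minAverage) {
    failures.push(`average score ${summary.averageScore} is below ${gate.minAverage}`);
  }
  for (const [metric, minimum] of Object.entries(gate.minPerMetric ?? {})) {
    const value = summary.metrics[metric];
    if (value === undefined) {
      failures.push(`metric ${metric} is missing from the results`);
    } else if (value < minimum) {
      failures.push(`${metric} ${value} is below ${minimum}`);
    }
  }
  return failures;
}
```

A runner script would exit with a non-zero status when the returned list is non-empty, which fails the CI step in the same way as the `exit 1` in the shell version.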

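🔧 4. Advanced Evaluation Techniques and Pitfalls

4.1 Five Common Pitfalls in Agent Evaluation

In practice, the following five problems can badly distort evaluation results:

Pitfall 1: Data contamination

If the agent's training data contains the code or issues in the evaluation dataset, the results will be inflated. The fix is a time split: use only project data created after the training cutoff of the models being evaluated.

Pitfall 2: Incomplete test cases

Many GitHub projects have limited test coverage, often below 60%, so an agent can produce code that passes the tests but still has bugs. The fix is to add edge-case tests to every evaluation task.

Pitfall 3: Environment differences

Different Node.js, Python, and system library versions across machines make test results irreproducible. The fix is a fully containerized evaluation environment with every dependency pinned in the Dockerfile.

Pitfall 4: Random variation

LLM output is stochastic when temperature > 0, so the same task can give different results across runs. The fix is to run each task 3-5 times and take the median, or to set temperature = 0 (which reduces variation but does not always make a hosted model fully deterministic).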
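A minimal sketch of aggregating repeated runs of one task, as suggested for Pitfall 4: it reports the median score together with the pass rate, since a single "median result" hides how often the agent actually succeeded. `RunOutcome`, `median`, and `aggregateRuns` are illustrative names.

```typescript
interface RunOutcome {
  passed: boolean;
  score: number; // 0-100
}

function median(values: number[]): number {
  if (values.length === 0) throw new Error('median of an empty list is undefined');
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? (sorted[mid] as number)
    : ((sorted[mid - 1] as number) + (sorted[mid] as number)) / 2;
}

function aggregateRuns(runs: RunOutcome[]): {
  medianScore: number;
  passRate: number;
  passedMajority: boolean;
} {
  const passes = runs.filter(r => r.passed).length;
  return {
    medianScore: median(runs.map(r => r.score)),
    passRate: passes / runs.length,
    passedMajority: passes * 2 > runs.length, // strict majority of runs passed
  };
}
```

With three runs scoring 60, 90, and 70 and two passes, the aggregate is a median of 70 with a pass rate of 2/3.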

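Pitfall 5: Over-optimizing for the benchmark

An agent can be tuned to the patterns of the evaluation dataset, doing well on the benchmark but poorly in real use. The fix is to keep a private hold-out test set for final validation.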
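One simple way to carve out a hold-out set is to hash each task ID and assign tasks by hash bucket, so a task never moves between the public and hold-out sets as the dataset grows. The sketch below assumes string task IDs; `fnv1a32` implements the FNV-1a hash over UTF-16 code units (so it matches the canonical byte-wise hash only for ASCII IDs), and `splitHoldOut` is a hypothetical helper.

```typescript
// 32-bit FNV-1a over UTF-16 code units.
function fnv1a32(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// A task is held out when its hash falls in the lowest `fraction` of the range.
function isHoldOut(taskId: string, fraction: number): boolean {
  return fnv1a32(taskId) / 4294967296 < fraction;
}

function splitHoldOut<T extends { id: string }>(
  tasks: T[],
  fraction: number
): { publicSet: T[]; holdOut: T[] } {
  const publicSet: T[] = [];
  const holdOut: T[] = [];
  for (const task of tasks) {
    (isHoldOut(task.id, fraction) ? holdOut : publicSet).push(task);
  }
  return { publicSet, holdOut };
}
```

The realized hold-out share only approximates `fraction` on small datasets, so it is worth checking the actual counts after splitting.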

4.2 Comparing Multiple Agents Side by Side

One of the most valuable uses of the framework is comparing the capabilities of different agents:

// compare-agents.ts: side-by-side comparison of multiple agents
// BenchmarkRunner, ComparisonReport, getMedian, scoreByDifficulty, and
// scoreByType are not shown in this article.

interface AgentConfig {
  id: string;
  name: string;
  model: string;
  provider: string;
  temperature: number;
  maxTokens: number;
  systemPrompt?: string;
}

class AgentComparator {
  constructor(
    private benchmark: BenchmarkRunner,
    private scorer: AgentScorer
  ) {}

  async compare(
    agents: AgentConfig[],
    dataset: BenchmarkTask[]
  ): Promise<ComparisonReport> {
    const results: Map<string, BenchmarkResult[]> = new Map();

    for (const agent of agents) {
      const agentResults: BenchmarkResult[] = [];

      for (const task of dataset) {
        // Run each task 3 times
        const runs: BenchmarkResult[] = [];
        for (let i = 0; i < 3; i++) {
          const result = await this.benchmark.run(agent, task);
          runs.push(result);
        }

        // Keep the median run
        const median = this.getMedian(runs);
        agentResults.push(median);
      }

      results.set(agent.id, agentResults);
    }

    return this.generateReport(agents, results, dataset);
  }

  private generateReport(
    agents: AgentConfig[],
    results: Map<string, BenchmarkResult[]>,
    dataset: BenchmarkTask[]
  ): ComparisonReport {
    const report: ComparisonReport = {
      timestamp: new Date().toISOString(),
      agents: [],
    };

    for (const agent of agents) {
      const agentResults = results.get(agent.id)!;
      const passed = agentResults.filter(r => r.passed).length;
      const total = agentResults.length;

      report.agents.push({
        id: agent.id,
        name: agent.name,
        model: agent.model,
        overallScore: this.average(agentResults.map(r => r.score)),
        passRate: (passed / total * 100).toFixed(1) + '%',
        avgTokens: this.average(agentResults.map(r => r.tokenUsage.total)),
        avgDuration: this.average(agentResults.map(r => r.duration)),
        byDifficulty: {
          easy: this.scoreByDifficulty(agentResults, dataset, 'easy'),
          medium: this.scoreByDifficulty(agentResults, dataset, 'medium'),
          hard: this.scoreByDifficulty(agentResults, dataset, 'hard'),
        },
        byType: {
          'bug-fix': this.scoreByType(agentResults, dataset, 'bug-fix'),
          'feature': this.scoreByType(agentResults, dataset, 'feature'),
          'refactor': this.scoreByType(agentResults, dataset, 'refactor'),
          'test-gen': this.scoreByType(agentResults, dataset, 'test-gen'),
          'code-review': this.scoreByType(agentResults, dataset, 'code-review'),
        },
      });
    }

    // Sort by overall score, highest first
    report.agents.sort((a, b) => b.overallScore - a.overallScore);

    return report;
  }

  // Mean rounded to two decimals; returns 0 for an empty list
  private average(values: number[]): number {
    if (values.length === 0) return 0;
    return Math.round(values.reduce((a, b) => a + b, 0) / values.length * 100) / 100;
  }
}

// Usage example (assumes `runner: BenchmarkRunner` and `dataset: BenchmarkTask[]`
// already exist; top-level await requires an ES module)
const comparator = new AgentComparator(runner, new AgentScorer());

const report = await comparator.compare(
  [
    { id: 'claude-sonnet', name: 'Claude Sonnet 4', model: 'claude-sonnet-4', provider: 'anthropic', temperature: 0, maxTokens: 8192 },
    { id: 'gpt-4.1', name: 'GPT-4.1', model: 'gpt-4.1', provider: 'openai', temperature: 0, maxTokens: 8192 },
    { id: 'deepseek-v4', name: 'DeepSeek V4', model: 'deepseek-v4', provider: 'deepseek', temperature: 0, maxTokens: 8192 },
  ],
  dataset
);

console.log(JSON.stringify(report, null, 2));

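The table below gives typical reference figures for three mainstream agents on SWE-bench Lite:

Agent	Pass rate	Avg. tokens	Avg. duration	Bug fix	Feature implementation	Refactoring
Claude Sonnet 4	49.2%	45,000	120s	55.3%	42.1%	48.7%
GPT-4.1	45.8%	52,000	95s	51.2%	39.8%	44.2%
DeepSeek V4	42.1%	38,000	85s	47.6%	36.5%	41.8%

⚠️ Warning: These figures are typical reference values. Actual performance varies significantly with the benchmark version, dataset composition, and prompting strategy. Do not base a technology choice on one dataset alone; validate on at least 2-3 different evaluation sets.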

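4.3 Controlling Evaluation Cost

The cost of large-scale evaluation is not negligible. The following strategies help reduce it:

Strategy	Cost saving	When to use
Tiered evaluation	60-70%	Screen quickly with the lite dataset, then evaluate precisely with the full dataset
Caching intermediate results	30-40%	Do not rerun the same task with the same agent configuration
Parallel execution	About 80% less wall-clock time	Run different tasks in multiple sandboxes in parallel
Sampled evaluation	50-80%	Randomly sample 20% of a large dataset as a quick evaluation set
Local models	100% of API fees	Run local models with Ollama for preliminary evaluation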
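For the caching strategy in the table, the key question is what counts as "the same task with the same agent configuration". A minimal sketch of a cache key follows, assuming the key should change whenever the task's base commit or any agent setting changes; `stableStringify` and `cacheKey` are illustrative helpers. The stable serialization sorts object keys so that property order does not produce spurious cache misses.

```typescript
// JSON-like serialization with sorted object keys.
function stableStringify(value: unknown): string {
  if (value === null || typeof value !== 'object') {
    const primitive = JSON.stringify(value);
    return primitive === undefined ? 'undefined' : primitive;
  }
  if (Array.isArray(value)) {
    return `[${value.map(item => stableStringify(item)).join(',')}]`;
  }
  const obj = value as Record<string, unknown>;
  const keys = Object.keys(obj).sort();
  return `{${keys.map(k => `${JSON.stringify(k)}:${stableStringify(obj[k])}`).join(',')}}`;
}

// Key for one (task, agent configuration) pair.
function cacheKey(
  taskId: string,
  baseCommit: string,
  agentConfig: Record<string, unknown>
): string {
  return `${taskId}@${baseCommit}::${stableStringify(agentConfig)}`;
}
```

Note that caching only makes sense for deterministic or already-aggregated results; if each task is run several times to average out randomness, the cached value should be the aggregate rather than a single run.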

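💡 Summary and Recommended Tools

Building an evaluation framework for AI coding agents is not a one-off job but a process of continuous iteration. The core recommendations are:

Three principles you must follow:

✅ The dataset must come from real projects: hand-written "exam questions" cannot reflect the complexity of real engineering work
✅ Sandbox isolation must be done properly: agent-generated code is untrusted and must run in a fully isolated environment
✅ Results must be reproducible: pin every environment version, use fixed random seeds, and record complete execution logs

Three mistakes you must avoid:

❌ Do not look only at pass rate: an agent with a 50% pass rate and very poor code quality is worse than one with a 45% pass rate and clean code
❌ Do not evaluate on training data: use a time split to keep evaluation data out of the agent's training set
❌ Do not ignore the safety score: by this article's estimate, roughly 15% of AI-generated code has potential security problems

Recommended toolchain:

Tool	Purpose	Link
SWE-bench	Standard benchmark	github.com/princeton-nlp/SWE-bench
Dockerode	Sandbox execution	npmjs.com/package/dockerode
Vitest	Test framework	vitest.dev
Octokit	GitHub API	github.com/octokit/octokit.js
LiteLLM	Unified multi-model API	litellm.ai

⚡ Key takeaway: Evaluating AI coding agents is not a nice-to-have; without it you are flying blind. Spending 1-2 weeks on an evaluation framework can prevent many wrong technology choices and production incidents over the following 12 months. The framework itself is one of your most important pieces of AI infrastructure.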