Building a Production-Grade JSON API Error Handling Framework: RFC 9457 and Automatic Client Recovery

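According to Postman's 2025 State of the API report, more than 68% of production API failures stem from poor error handling: inconsistent error formats keep clients from recovering automatically, and vague error messages cost developers hours of debugging. If your JSON API still returns responses like { "error": "something went wrong" }, your system is quietly accumulating technical debt. The RFC 9457 Problem Details specification offers a standardized solution, but knowing the specification is not enough. You need a complete error handling framework that spans server and client.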

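🔐 1. RFC 9457 Problem Details in Depth

RFC 9457 (published in 2023, obsoleting RFC 7807) defines a standardized JSON format for error responses from HTTP APIs. Its core value: machines can parse the error, and humans can understand it.

1.1 Structure and Fields

A Problem Details JSON object has the following core members: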

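Field	Type	Required	Description
`type`	string (URI reference)	No	URI identifying the problem type, ideally resolving to its documentation; treated as "about:blank" when absent
`title`	string	No	Short, human-readable summary of the problem type
`status`	number	No	HTTP status code
`detail`	string	No	Human-readable explanation specific to this occurrence (as opposed to the generic `title`)
`instance`	string (URI reference)	No	URI reference identifying this specific occurrence of the problem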

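📌 Remember: title describes "what kind of error this is", while detail describes "what went wrong this time". For example, title might be "Request validation failed" and detail might be "Field 'email' is not a valid email address".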

A compliant RFC 9457 Problem Details response looks like this:

{
  "type": "https://api.example.com/errors/validation",
  "title": "Request validation failed",
  "status": 422,
  "detail": "The value 'not-an-email' for field 'email' is not a valid email address",
  "instance": "/api/users/123",
  "errors": [
    { "field": "email", "code": "INVALID_FORMAT", "message": "Not a valid email address" }
  ]
}

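⚠️ Warning: RFC 9457 registers the media type application/problem+json for this format, so error responses should be sent with that Content-Type rather than plain application/json. Clients can then recognize an error response from the Content-Type alone and switch to a unified error handling path.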
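
The Content-Type check described above can be sketched as follows. This is a minimal illustration, assuming the client only has the raw header value; the helper name `isProblemJson` is hypothetical. It strips parameters such as `charset` and compares the media type case-insensitively.

```javascript
// Hypothetical helper: returns true when a Content-Type header value
// denotes application/problem+json, ignoring case and parameters.
function isProblemJson(contentTypeHeader) {
  if (typeof contentTypeHeader !== 'string') return false;
  // Drop parameters such as "; charset=utf-8".
  const mediaType = contentTypeHeader.split(';')[0].trim().toLowerCase();
  return mediaType === 'application/problem+json';
}

console.log(isProblemJson('application/problem+json; charset=utf-8')); // true
console.log(isProblemJson('application/json'));                        // false
```

An exact comparison is stricter than a substring test such as `includes('problem+json')`, which is the trade-off chosen here.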

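1.2 Designing Extension Members

The specification allows custom extension members, and in production these are the most valuable part. A recommended set of extension members:

// Extension member design pattern
const errorResponse = {
  // RFC 9457 standard members
  type: "https://api.example.com/errors/rate-limit",
  title: "Rate limit exceeded",
  status: 429,
  detail: "You sent 150 requests in the last 60 seconds, exceeding the limit of 100",
  instance: "/api/data/export",

  // Custom extension members
  retryAfter: 30,                    // suggested wait before retrying, in seconds
  requestId: "req_8f3a2b1c",        // request tracing ID
  errorCode: "RATE_LIMIT_EXCEEDED", // machine-readable error code
  timestamp: "2026-06-11T10:30:00Z", // when the error occurred
  documentation: "https://api.example.com/docs/errors#rate-limit" // documentation link
};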

⚡ Key takeaway: standard members provide interoperability, and extension members cover business needs. A practical error response scheme needs both.
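
One practical concern when combining the two is that an extension must never overwrite a standard member. The sketch below shows one way to guard against that; the helper name `withExtensions` and the choice to skip `undefined` values are assumptions for illustration, not something RFC 9457 prescribes.

```javascript
// The five members defined by RFC 9457.
const STANDARD_MEMBERS = new Set(['type', 'title', 'status', 'detail', 'instance']);

// Hypothetical helper: copies extension members onto a problem object,
// skipping any key that would overwrite a standard member and any
// extension whose value is undefined.
function withExtensions(problem, extensions = {}) {
  const result = { ...problem };
  for (const [key, value] of Object.entries(extensions)) {
    if (STANDARD_MEMBERS.has(key) || value === undefined) continue;
    result[key] = value;
  }
  return result;
}

const merged = withExtensions(
  { type: 'https://api.example.com/errors/rate-limit', title: 'Rate limit exceeded', status: 429 },
  { retryAfter: 30, status: 200 } // the stray "status" is ignored
);
console.log(merged.status, merged.retryAfter); // prints: 429 30
```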

🚀 2. Building a Unified Error Handling Framework

2.1 Error Classification and Codes

A production-grade API needs a structured error classification scheme. I recommend a three-level code structure of "domain + type + specific error":

// Error code scheme: DOMAIN_TYPE_DETAIL
// Level 1: error domain (Auth, Validation, Resource, System, ...)
// Level 2: error type (NOT_FOUND, EXPIRED, CONFLICT, ...)
// Level 3: specific error (optional, for finer distinctions)

const ERROR_REGISTRY = {
  // Auth domain (AUTH_)
  AUTH_TOKEN_EXPIRED: {
    type: "https://api.example.com/errors/auth/token-expired",
    title: "Access token expired",
    status: 401,
    retryable: true
  },
  AUTH_TOKEN_INVALID: {
    type: "https://api.example.com/errors/auth/token-invalid",
    title: "Access token invalid",
    status: 401,
    retryable: false
  },
  AUTH_INSUFFICIENT_SCOPE: {
    type: "https://api.example.com/errors/auth/insufficient-scope",
    title: "Insufficient permissions",
    status: 403,
    retryable: false
  },

  // Validation domain (VALIDATION_)
  VALIDATION_FAILED: {
    type: "https://api.example.com/errors/validation/failed",
    title: "Request validation failed",
    status: 422,
    retryable: false
  },
  VALIDATION_MISSING_FIELD: {
    type: "https://api.example.com/errors/validation/missing-field",
    title: "Missing required field",
    status: 422,
    retryable: false
  },

  // Resource domain (RESOURCE_)
  RESOURCE_NOT_FOUND: {
    type: "https://api.example.com/errors/resource/not-found",
    title: "Resource not found",
    status: 404,
    retryable: false
  },
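  RESOURCE_CONFLICT: {
    type: "https://api.example.com/errors/resource/conflict",
    title: "Resource state conflict",
    status: 409,
    // A conflict (e.g. a duplicate email) needs a changed request, so an
    // automatic retry of the identical request will not succeed
    retryable: false
  },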

  // Rate limit domain (RATE_LIMIT_)
  RATE_LIMIT_EXCEEDED: {
    type: "https://api.example.com/errors/rate-limit/exceeded",
    title: "Rate limit exceeded",
    status: 429,
    retryable: true
  },

  // System domain (SYSTEM_)
  SYSTEM_INTERNAL: {
    type: "https://api.example.com/errors/system/internal",
    title: "Internal server error",
    status: 500,
    retryable: true
  },
  SYSTEM_SERVICE_UNAVAILABLE: {
    type: "https://api.example.com/errors/system/service-unavailable",
    title: "Service temporarily unavailable",
    status: 503,
    retryable: true
  }
};
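
Because the type URIs in the registry follow the same layout as the codes, one could derive a URI from a code instead of typing it by hand. The following is only a sketch under that assumption: `ERROR_DOMAINS`, `TYPE_BASE`, and `typeUriFor` are hypothetical names, and the domain list must be kept in sync with the registry manually.

```javascript
// Assumed list of domains. RATE_LIMIT contains an underscore, so the
// domain cannot be found by splitting on the first "_".
const ERROR_DOMAINS = ['AUTH', 'VALIDATION', 'RESOURCE', 'RATE_LIMIT', 'SYSTEM'];
const TYPE_BASE = 'https://api.example.com/errors';

// Hypothetical helper: splits a code into domain and remainder, then
// builds a type URI with the layout <base>/<domain>/<rest>.
function typeUriFor(errorCode) {
  const domain = ERROR_DOMAINS.find(d => errorCode.startsWith(`${d}_`));
  if (!domain) throw new Error(`Unknown error domain in code: ${errorCode}`);
  const rest = errorCode.slice(domain.length + 1);
  const slug = s => s.toLowerCase().replace(/_/g, '-');
  return `${TYPE_BASE}/${slug(domain)}/${slug(rest)}`;
}

console.log(typeUriFor('AUTH_TOKEN_EXPIRED'));
// https://api.example.com/errors/auth/token-expired
console.log(typeUriFor('RATE_LIMIT_EXCEEDED'));
// https://api.example.com/errors/rate-limit/exceeded
```

A check like this could also run in a unit test to catch registry entries whose `type` drifts from their code.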

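2.2 Server-Side Middleware (Express/Hono)

Below is a complete error handling middleware for Express. The same structure carries over to Hono by mapping errors inside its app.onError handler.

// Unified error handling middleware (Node.js / Express)
import express from 'express';

// Custom API error class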
class ApiError extends Error {
  constructor(errorCode, detail, extensions = {}) {
    const registry = ERROR_REGISTRY[errorCode];
    if (!registry) throw new Error(`Unknown error code: ${errorCode}`);

    super(detail || registry.title);
    this.errorCode = errorCode;
    this.type = registry.type;
    this.title = registry.title;
    this.status = registry.status;
    this.retryable = registry.retryable;
    this.detail = detail;
    this.extensions = extensions;
  }
}

// Factory functions: create common errors quickly
const Errors = {
  tokenExpired: (detail) => new ApiError('AUTH_TOKEN_EXPIRED', detail),
  tokenInvalid: (detail) => new ApiError('AUTH_TOKEN_INVALID', detail),
  forbidden: (detail) => new ApiError('AUTH_INSUFFICIENT_SCOPE', detail),
  validationFailed: (detail, fields) =>
    new ApiError('VALIDATION_FAILED', detail, { errors: fields }),
  notFound: (resource, id) =>
    new ApiError('RESOURCE_NOT_FOUND', `${resource} '${id}' not found`),
  conflict: (detail) => new ApiError('RESOURCE_CONFLICT', detail),
  rateLimited: (retryAfter) =>
    new ApiError('RATE_LIMIT_EXCEEDED', `Please retry in ${retryAfter} seconds`, { retryAfter }),
  internal: (requestId) =>
    new ApiError('SYSTEM_INTERNAL', `Internal error, request ID: ${requestId}`, { requestId }),
};

// RFC 9457 error response builder
function buildProblemResponse(err, req) {
  const problem = {
    type: err.type,
    title: err.title,
    status: err.status,
    detail: err.detail,
    instance: req.originalUrl,
    errorCode: err.errorCode,
    retryable: err.retryable,
    timestamp: new Date().toISOString(),
    requestId: req.headers['x-request-id'] || `req_${Date.now().toString(36)}`
  };

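  // Merge extension members
  if (err.extensions) {
    Object.assign(problem, err.extensions);
  }

  // Hide internal error details in production
  if (err.status >= 500 && process.env.NODE_ENV === 'production') {
    problem.detail = 'Internal server error. Please contact support and quote the request ID.';
    // requestId is kept on purpose: it is an opaque correlation ID that
    // support and monitoring need to locate the failing request
  }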

  return problem;
}

// Express error handling middleware
function problemDetailsMiddleware(err, req, res, _next) {
  // Handle known API errors
  if (err instanceof ApiError) {
    const problem = buildProblemResponse(err, req);
    return res.status(err.status)
      .type('application/problem+json')
      .json(problem);
  }

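  // Handle Joi / Zod validation errors
  if (err.isJoi || err.name === 'ZodError') {
    // Joi exposes err.details (entries carry .type);
    // Zod exposes err.issues (entries carry .code)
    const issues = err.details ?? err.issues ?? [];
    const fields = issues.map(d => ({
      field: d.path?.join('.'),
      code: (d.type ?? d.code)?.toUpperCase(),
      message: d.message
    }));
    const problem = buildProblemResponse(
      Errors.validationFailed('Request validation failed', fields), req
    );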
    return res.status(422)
      .type('application/problem+json')
      .json(problem);
  }

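  // Unknown error → 500
  console.error('Unhandled error:', err);
  // Resolve the request ID first so the error never carries "undefined"
  const requestId = req.headers['x-request-id'] || `req_${Date.now().toString(36)}`;
  const problem = buildProblemResponse(Errors.internal(requestId), req);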
  return res.status(500)
    .type('application/problem+json')
    .json(problem);
}

// Usage example
const app = express();
app.use(express.json());

// Business route
app.get('/api/users/:id', async (req, res, next) => {
  try {
    const user = await findUser(req.params.id);
    if (!user) throw Errors.notFound('User', req.params.id);
    res.json({ data: user });
  } catch (err) {
    next(err); // hand over to the error handling middleware
  }
});

// Register the error handling middleware (must come after all routes)
app.use(problemDetailsMiddleware);

app.listen(3000);
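
The try/catch plus next(err) pattern in the route above has to be repeated in every async handler on Express 4. A small wrapper can remove that repetition. This is a sketch, not part of Express: `asyncHandler` is a hypothetical name, and on Express 5 it should be unnecessary because rejected promises from handlers are forwarded to the error middleware automatically.

```javascript
// Hypothetical wrapper: forwards a rejected promise (or a synchronous
// throw) from a route handler to next(), so the error handling
// middleware receives it.
function asyncHandler(handler) {
  return function wrapped(req, res, next) {
    // Starting from a resolved promise also captures synchronous throws.
    return Promise.resolve()
      .then(() => handler(req, res, next))
      .catch(next);
  };
}

// Illustration without Express: a failing handler and a stand-in next().
const failing = asyncHandler(async () => {
  throw new Error('boom');
});
failing({}, {}, err => console.log('forwarded to next():', err.message));
```

With the wrapper, the route shown earlier could be registered as `app.get('/api/users/:id', asyncHandler(async (req, res) => { ... }))` without its own try/catch.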

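2.3 Documenting Error Responses in OpenAPI

Integrate error responses into the OpenAPI document so that the API documentation includes error descriptions automatically:

# OpenAPI 3.1 error response definitions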
paths:
  /api/users/{id}:
    get:
      responses:
        "200":
          description: User returned successfully
          content:
            application/json:
              schema:
                $ref: "#/components/schemas/User"
        "404":
          description: User not found
          content:
            application/problem+json:
              schema:
                $ref: "#/components/schemas/ProblemDetail"
              example:
                type: "https://api.example.com/errors/resource/not-found"
                title: "Resource not found"
                status: 404
                detail: "User 'usr_abc123' not found"
                errorCode: "RESOURCE_NOT_FOUND"
                retryable: false
        "429":
          description: Rate limit exceeded
          content:
            application/problem+json:
              schema:
                allOf:
                  - $ref: "#/components/schemas/ProblemDetail"
                  - type: object
                    properties:
                      retryAfter:
                        type: integer
                        description: Suggested wait before retrying, in seconds
components:
  schemas:
    ProblemDetail:
      type: object
      properties:
        type:
          type: string
          format: uri
        title:
          type: string
        status:
          type: integer
        detail:
          type: string
        instance:
          type: string
          format: uri-reference  # relative values such as /api/users/123 are allowed
        errorCode:
          type: string
        retryable:
          type: boolean
        timestamp:
          type: string
          format: date-time

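💡 Tip: Define every possible error response for each endpoint in the OpenAPI document. This is more effective than any prose description, and Swagger UI renders the error examples automatically.

💡 3. Client-Side Error Recovery Strategies

3.1 Exponential Backoff with Jitter

For retryable errors (retryable: true), the client should retry automatically using exponential backoff:

// Unified API client (with automatic retry and error recovery)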
class ApiClient {
  constructor(baseUrl, options = {}) {
    this.baseUrl = baseUrl;
    this.maxRetries = options.maxRetries ?? 3;
    this.baseDelay = options.baseDelay ?? 1000;    // base delay: 1 second
    this.maxDelay = options.maxDelay ?? 30000;      // maximum delay: 30 seconds
    this.onAuthError = options.onAuthError ?? null; // token refresh callback
  }

  // Exponential backoff + random jitter
  calculateDelay(attempt, retryAfter) {
    // Prefer the server-specified retryAfter when present
    if (retryAfter) return retryAfter * 1000;

    // Exponential backoff: base * 2^attempt + random jitter
    const exponential = this.baseDelay * Math.pow(2, attempt);
    const jitter = exponential * 0.5 * Math.random(); // up to 50% random jitter
    return Math.min(exponential + jitter, this.maxDelay);
  }

  sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
  }

  async request(path, options = {}) {
    let lastError;
    let attempt = 0;

    while (attempt <= this.maxRetries) {
      try {
        const response = await fetch(`${this.baseUrl}${path}`, {
          ...options,
          headers: {
            'Content-Type': 'application/json',
            'Accept': 'application/json, application/problem+json',
            ...options.headers
          }
        });

        // Successful response
        if (response.ok) {
          const contentType = response.headers.get('content-type');
          if (contentType?.includes('application/json')) {
            return await response.json();
          }
          return await response.text();
        }

        // Parse the Problem Details error response
        const contentType = response.headers.get('content-type') || '';
        let problem;

        if (contentType.includes('problem+json') || contentType.includes('application/json')) {
          problem = await response.json();
        } else {
          problem = { status: response.status, title: response.statusText };
        }

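        // Token expired → try to refresh the token
        if (problem.errorCode === 'AUTH_TOKEN_EXPIRED' && this.onAuthError) {
          const refreshed = await this.onAuthError();
          if (refreshed) {
            attempt++;
            lastError = problem; // ensures the final throw is never undefined
            // options.headers is re-read on every iteration, so onAuthError
            // must update the Authorization value there for the retry to
            // carry the new token
            continue;
          }
        }

        // Decide whether the error is retryable
        if (problem.retryable && attempt < this.maxRetries) {
          const delay = this.calculateDelay(attempt, problem.retryAfter);
          console.warn(
            `[API Retry] ${problem.errorCode}: retry ${attempt + 1}, ` +
            `waiting ${Math.round(delay)}ms`
          );
          await this.sleep(delay);
          attempt++;
          lastError = problem;
          continue;
        }

        // Not retryable: throw immediately
        throw problem;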

      } catch (err) {
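        // Network error (fetch itself failed) → retryable.
        // fetch rejects with a TypeError, but the message text differs between
        // runtimes, so only the error type is checked here.
        if (err instanceof TypeError) {
          if (attempt < this.maxRetries) {
            const delay = this.calculateDelay(attempt);
            console.warn(`[API Retry] Network error: retry ${attempt + 1}, waiting ${Math.round(delay)}ms`);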
            await this.sleep(delay);
            attempt++;
            lastError = err;
            continue;
          }
        }
        throw err;
      }
    }

    throw lastError;
  }
}

// Usage example
const api = new ApiClient('https://api.example.com', {
  maxRetries: 3,
  baseDelay: 1000,
  onAuthError: async () => {
    // Token refresh logic
    const result = await refreshAccessToken();
    return result.success;
  }
});

// API call with automatic retries
try {
  const user = await api.request('/api/users/123');
  console.log('User data:', user);
} catch (error) {
  if (error.errorCode) {
    // Problem Details error
    console.error(`[${error.errorCode}] ${error.detail}`);
  } else {
    console.error('Request failed:', error);
  }
}

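⚠️ Warning: Not every 5xx error should be retried. A 500 may be a code bug (retrying is pointless), while a 503 is usually a transient failure (retrying makes sense). Use the retryable field in ERROR_REGISTRY to control this precisely.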
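
To see what the backoff formula in calculateDelay produces, the sketch below re-implements it as a standalone function with the random source injected, so the lower and upper bound of each delay can be printed deterministically. The function name `backoffDelay` and the injectable `random` option are assumptions added for illustration.

```javascript
// Same formula as calculateDelay (without the retryAfter shortcut),
// with the random source injected so the schedule can be inspected.
function backoffDelay(attempt, { baseDelay = 1000, maxDelay = 30000, random = Math.random } = {}) {
  const exponential = baseDelay * 2 ** attempt;
  const jitter = exponential * 0.5 * random(); // up to 50% extra
  return Math.min(exponential + jitter, maxDelay);
}

// Lower and upper bound of each delay for attempts 0..5.
for (let attempt = 0; attempt <= 5; attempt++) {
  const min = backoffDelay(attempt, { random: () => 0 });
  const max = backoffDelay(attempt, { random: () => 1 });
  console.log(`attempt ${attempt}: ${min}ms to ${max}ms`);
}
```

With the defaults, attempts 0 through 4 wait roughly 1 to 1.5 s, 2 to 3 s, 4 to 6 s, 8 to 12 s, and 16 to 24 s. From attempt 5 onward the 30 s cap applies, which also removes the jitter; a variant that applies the jitter after capping would keep retries spread out at the ceiling.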

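3.2 Degradation and Caching

For business-critical endpoints, enable a degradation strategy once retries are exhausted:

// API request with a degradation strategy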
async function fetchWithFallback(api, path, fallbackData, cacheKey) {
  try {
    const data = await api.request(path);

    // Success → update the cache
    if (cacheKey) {
      localStorage.setItem(cacheKey, JSON.stringify({
        data,
        timestamp: Date.now()
      }));
    }
    return { data, source: 'api' };

  } catch (error) {
    console.warn(`API request failed: ${error.errorCode}, trying fallback`);

    // Fallback 1: use the local cache
    if (cacheKey) {
      const cached = localStorage.getItem(cacheKey);
      if (cached) {
        const { data, timestamp } = JSON.parse(cached);
        const age = Date.now() - timestamp;
        // Cached data is usable for up to 1 hour
        if (age < 3600_000) {
          console.warn(`Using cached data (${Math.round(age / 60000)} minutes old)`);
          return { data, source: 'cache', stale: true };
        }
      }
    }

    // Fallback 2: use the preset fallback data
    if (fallbackData) {
      console.warn('Using fallback data');
      return { data: fallbackData, source: 'fallback' };
    }

    throw error; // no fallback available, rethrow
  }
}
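
The cache lookup in fetchWithFallback assumes the stored entry always parses. A slightly more defensive read could look like the sketch below, which treats a corrupted or expired entry as a cache miss. The helper name `readFreshCache` and the in-memory `fakeStorage` stand-in are hypothetical; in a browser the `storage` argument would be localStorage.

```javascript
// Hypothetical helper: returns { data, age } if the entry parses and is
// younger than maxAgeMs, otherwise null. `storage` is any object with a
// getItem(key) method, such as localStorage.
function readFreshCache(storage, key, maxAgeMs, now = Date.now()) {
  const raw = storage.getItem(key);
  if (raw === null || raw === undefined) return null;
  try {
    const entry = JSON.parse(raw);
    if (typeof entry?.timestamp !== 'number') return null;
    const age = now - entry.timestamp;
    return age >= 0 && age < maxAgeMs ? { data: entry.data, age } : null;
  } catch {
    return null; // corrupted entry: treat as a cache miss
  }
}

// In-memory stand-in for localStorage, for illustration only.
const memory = new Map([
  ['user:123', JSON.stringify({ data: { id: 123 }, timestamp: 1000 })],
  ['broken', '{not json']
]);
const fakeStorage = { getItem: key => (memory.has(key) ? memory.get(key) : null) };

console.log(readFreshCache(fakeStorage, 'user:123', 3600000, 2000)); // fresh entry, age 1000
console.log(readFreshCache(fakeStorage, 'user:123', 500, 2000));     // null (too old)
console.log(readFreshCache(fakeStorage, 'broken', 3600000, 2000));   // null (unparseable)
```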

3.3 User-Friendly Error Presentation

Convert machine-readable errors into messages users can understand:

// Presentation layer: Problem Details → user-friendly UI message
function getErrorDisplay(problem) {
  const displays = {
    AUTH_TOKEN_EXPIRED: {
      icon: '🔒',
      message: 'Your session has expired. Please sign in again',
      action: { label: 'Sign in again', handler: 'redirectToLogin' },
      level: 'warning'
    },
    AUTH_INSUFFICIENT_SCOPE: {
      icon: '🚫',
      message: 'You do not have permission to perform this action',
      action: { label: 'Contact administrator', handler: 'openSupport' },
      level: 'error'
    },
    VALIDATION_FAILED: {
      icon: '📝',
      message: problem.detail || 'Please check your input',
      action: null, // shown inline in the form, no global notice needed
      level: 'info',
      fields: problem.errors // field-level errors for the form component
    },
    RESOURCE_NOT_FOUND: {
      icon: '🔍',
      message: problem.detail || 'The requested resource could not be found',
      action: { label: 'Back to home', handler: 'goHome' },
      level: 'warning'
    },
    RATE_LIMIT_EXCEEDED: {
      icon: '⏳',
      message: `Too many requests. Please retry in ${problem.retryAfter || 60} seconds`,
      action: null,
      level: 'info'
    },
    SYSTEM_SERVICE_UNAVAILABLE: {
      icon: '🔧',
      message: 'The service is temporarily unavailable. Please try again later',
      action: { label: 'Retry', handler: 'retry' },
      level: 'error'
    }
  };

  return displays[problem.errorCode] || {
    icon: '❌',
    message: 'An unknown error occurred. Please try again later',
    action: { label: 'Retry', handler: 'retry' },
    level: 'error'
  };
}

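⚠️ 4. Production Pitfalls

4.1 Common Anti-Patterns ❌

In real projects, the following error handling approaches are the most common pitfalls:

❌ Anti-pattern 1: returning 200 for every error

// ❌ Wrong: wrapping an error in HTTP 200
res.json({ code: -1, message: "User not found" });
// Clients cannot use the HTTP status code for automatic handling
// Interceptors, monitoring tools, and CDNs cannot recognize the error

✅ Correct: use the proper HTTP status code plus Problem Details:

// ✅ Correct
throw Errors.notFound('User', userId);
// → 404 + application/problem+json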

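❌ Anti-pattern 2: exposing internal implementation details

// ❌ Wrong
res.status(500).json({
  error: "QueryFailedError: duplicate key value violates unique constraint \"users_email_key\""
  // Reveals the database type, table name, and constraint name
});

✅ Correct: hide the details in production and expose only useful information:

// ✅ Correct
throw new ApiError('RESOURCE_CONFLICT', 'This email address is already registered', {
  field: 'email',
  code: 'DUPLICATE_VALUE'
});
// → 409 + tells the user which field is at fault without exposing database details

❌ Anti-pattern 3: no error classification, everything is a 500

// ❌ Wrong
catch (err) {
  res.status(500).json({ error: err.message });
  // Validation errors and server crashes both return 500?
  // The client cannot tell "your problem" from "my problem"
}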

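💡 Tip: 4xx errors are "the client's problem" (your request is wrong), and 5xx errors are "the server's problem" (my service failed). Mixing them up defeats the client's automatic retry strategy: retrying a client error wastes resources (429 being the notable exception), and not retrying a transient server error gives up a chance to recover.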
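
When a response carries no explicit retryable member (for example, from a third-party API), a client can fall back to a default derived from the status code. The sketch below shows one possible default; the function name `classifyStatus` and the exact set of retryable statuses are assumptions, and other teams may reasonably choose differently.

```javascript
// Hypothetical defaults, used only when a response has no explicit
// "retryable" member. 408 and 429 are client-side statuses that are
// usually worth retrying; 500 is deliberately left out.
const RETRYABLE_STATUSES = new Set([408, 429, 502, 503, 504]);

function classifyStatus(status) {
  const owner = status >= 500 ? 'server' : status >= 400 ? 'client' : 'none';
  return { owner, retryable: RETRYABLE_STATUSES.has(status) };
}

for (const status of [422, 429, 500, 503]) {
  console.log(status, classifyStatus(status));
}
```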

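4.2 Error Propagation in Distributed Systems

In a microservice architecture, errors must propagate correctly along the service chain:

// Distributed error propagation: wrap upstream errors as Problem Details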
async function callUpstreamService(serviceUrl, requestId) {
  try {
    const response = await fetch(serviceUrl, {
      headers: { 'X-Request-ID': requestId }
    });

    if (!response.ok) {
      const contentType = response.headers.get('content-type') || '';

      // If the upstream returned Problem Details, keep the original error information
      if (contentType.includes('problem+json')) {
        const upstreamProblem = await response.json();

        // Wrap the upstream error and preserve the error chain
        throw new ApiError('SYSTEM_SERVICE_UNAVAILABLE',
          `Upstream service error: ${upstreamProblem.detail}`, {
            upstreamError: {
              service: new URL(serviceUrl).hostname,
              status: upstreamProblem.status,
              errorCode: upstreamProblem.errorCode
            },
            requestId
          }
        );
      }

      throw Errors.internal(requestId);
    }

    return await response.json();
  } catch (err) {
    if (err instanceof ApiError) throw err;
    // Network error
    throw new ApiError('SYSTEM_SERVICE_UNAVAILABLE',
      'Failed to connect to upstream service', { requestId });
  }
}

4.3 Error Monitoring and Alerting

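// Error monitoring middleware (Sentry or a custom monitoring backend).
// reportToMonitoring and recordMetric are assumed to be provided by
// your monitoring integration.
function errorMonitoringMiddleware(err, req, res, _next) {
  // Normalize unknown errors so that status and errorCode are always defined
  const apiError = err instanceof ApiError
    ? err
    : new ApiError('SYSTEM_INTERNAL', 'Unhandled internal error');

  // Build the Problem Details response
  const problem = buildProblemResponse(apiError, req);

  // 5xx errors → report to the monitoring system
  if (apiError.status >= 500) {
    reportToMonitoring({
      level: 'error',
      errorCode: apiError.errorCode,
      message: err.message, // original message and stack, for internal use only
      stack: err.stack,
      requestId: problem.requestId,
      path: req.originalUrl,
      method: req.method,
      userAgent: req.headers['user-agent'],
      timestamp: problem.timestamp
    });
  }

  // 429 rate limiting → record a metric (for capacity planning)
  if (apiError.status === 429) {
    recordMetric('rate_limit_hit', {
      path: req.originalUrl,
      clientId: req.headers['x-client-id']
    });
  }

  // Return the standard response
  res.status(apiError.status)
    .type('application/problem+json')
    .json(problem);
}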

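✅ 5. Best Practices Summary

Practice	Recommendation	Notes
Use the RFC 9457 standard format	✅ Recommended	Machine-parseable, ecosystem-compatible
Content-Type: application/problem+json	✅ Recommended	Distinguishes error responses from normal ones
Three-level error codes	✅ Recommended	DOMAIN_TYPE_DETAIL; clear and maintainable
Explicit retryable field	✅ Recommended	Guides the client's automatic retry strategy
Exponential backoff + jitter	✅ Recommended	Avoids retry storms (thundering herd)
Hide internal details in production	✅ Recommended	Prevents information leakage
Returning only { "error": "…" }	❌ Avoid	No structured information, cannot be handled automatically
Returning 500 for every error	❌ Avoid	Clients cannot tell error types apart
Retrying every error indiscriminately	❌ Avoid	Wastes resources and can worsen an outage
Exposing database or stack trace details	❌ Avoid	Security risk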

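🔧 6. Recommended Tools

🔴 AJV: a high-performance JSON Schema validator; its validation errors can be mapped to Problem Details with a small adapter
🟡 Zod: a TypeScript-first runtime validation library; combined with zod-to-json-schema it can generate JSON Schema
🟢 Sentry: an error tracking platform; Problem Details members such as errorCode and requestId can be attached as structured context
🔵 OpenAPI Generator: generates client SDKs with typed error definitions from an OpenAPI document
🟣 Hono: a lightweight web framework with a built-in HTTPException; Problem Details responses can be produced from its app.onError handler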

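A unified error handling framework is not a nice-to-have; it is core infrastructure for a production-grade API. RFC 9457 supplies the standardized format, but the real value lies in an error classification scheme that lets machines handle errors automatically and in clear error messages that let developers locate problems quickly. If you are designing a new API, adopt this framework from day one. If you are maintaining an existing API, migrate gradually: enable Problem Details for new endpoints first, then use the adapter pattern to stay compatible with the legacy error format.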
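
The adapter mentioned above can be sketched as a pure function that converts a legacy body of the form { code, message } into a Problem Details object. Everything here is illustrative: `LEGACY_CODE_MAP`, `adaptLegacyError`, and the meaning assigned to each legacy code are assumptions, since a real legacy API defines its own codes.

```javascript
// Assumed mapping from legacy numeric codes to Problem Details fields.
// A real migration would fill this table from the legacy API's docs.
const LEGACY_CODE_MAP = {
  '-1': {
    status: 404, errorCode: 'RESOURCE_NOT_FOUND', title: 'Resource not found',
    type: 'https://api.example.com/errors/resource/not-found'
  },
  '-2': {
    status: 422, errorCode: 'VALIDATION_FAILED', title: 'Request validation failed',
    type: 'https://api.example.com/errors/validation/failed'
  }
};
const LEGACY_DEFAULT = {
  status: 500, errorCode: 'SYSTEM_INTERNAL', title: 'Internal server error',
  type: 'https://api.example.com/errors/system/internal'
};

// Hypothetical adapter: legacy { code, message } → Problem Details.
function adaptLegacyError(legacyBody, instance) {
  const mapped = LEGACY_CODE_MAP[String(legacyBody?.code)] ?? LEGACY_DEFAULT;
  return {
    type: mapped.type,
    title: mapped.title,
    status: mapped.status,
    detail: legacyBody?.message,
    instance,
    errorCode: mapped.errorCode
  };
}

console.log(adaptLegacyError({ code: -1, message: 'User not found' }, '/api/users/123'));
```

On the server, the same function could wrap legacy handlers so they emit application/problem+json; on the client, it could normalize responses from endpoints that have not been migrated yet.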