頁面擷取

透過頁面截圖和 DOM 擷取啟用視覺理解。

讓 Agent 能夠「看到」目前頁面,透過擷取頁面狀態,包含 Markdown 內容、截圖和可操作元素。

概觀

頁面擷取會提取以下頁面狀態:

  • Markdown:以 Markdown 格式呈現的頁面文字內容
  • 截圖:頁面的視覺呈現(base64 JPEG)
  • 可操作元素:頁面上所有可互動元素的結構化清單

安裝相依套件

Terminal
bun add html2canvas

完整原始碼

將以下程式碼儲存為 src/lib/page-capture.ts

src/lib/page-capture.ts
/**
 * Page Capture Utilities
 * Captures page information for Lens OS Agent context
 */

import type { PageState, ActionableElement } from "@lens-os/sdk";

// ==================== DOM to Markdown ====================

export function extractPageMarkdown(maxLength: number = 8000): string {
  if (typeof document === "undefined") return "";

  const lines: string[] = [];

  const mainContent =
    document.querySelector("main") ||
    document.querySelector("article") ||
    document.querySelector('[role="main"]') ||
    document.body;

  if (!mainContent) return "";

  processNode(mainContent, lines);

  let result = lines.join("\n").trim();
  if (result.length > maxLength) {
    result = result.substring(0, maxLength) + "\n\n[Content truncated...]";
  }

  return result;
}

function processNode(node: Node, lines: string[], depth: number = 0): void {
  if (node instanceof HTMLElement) {
    const style = window.getComputedStyle(node);
    if (style.display === "none" || style.visibility === "hidden") return;
  }

  if (node instanceof HTMLElement) {
    const skipTags = ["SCRIPT", "STYLE", "SVG", "NOSCRIPT", "IFRAME", "CANVAS", "VIDEO", "AUDIO"];
    if (skipTags.includes(node.tagName)) return;
  }

  if (node.nodeType === Node.TEXT_NODE) {
    const text = node.textContent?.trim();
    if (text) lines.push(text);
    return;
  }

  if (node instanceof HTMLElement) {
    const tag = node.tagName;

    if (/^H[1-6]$/.test(tag)) {
      const level = parseInt(tag[1]);
      const text = node.textContent?.trim();
      if (text) {
        lines.push("");
        lines.push(`${"#".repeat(level)} ${text}`);
        lines.push("");
      }
      return;
    }

    if (tag === "P") {
      const text = node.textContent?.trim();
      if (text) { lines.push(""); lines.push(text); lines.push(""); }
      return;
    }

    if (tag === "A") {
      const href = node.getAttribute("href");
      const text = node.textContent?.trim();
      if (text && href) lines.push(`[${text}](${href})`);
      else if (text) lines.push(text);
      return;
    }

    if (tag === "IMG") {
      const alt = node.getAttribute("alt");
      const src = node.getAttribute("src");
      if (alt || src) lines.push(`![${alt || "image"}](${src || ""})`);
      return;
    }

    if (tag === "UL" || tag === "OL") {
      lines.push("");
      const items = node.querySelectorAll(":scope > li");
      items.forEach((li, index) => {
        const text = li.textContent?.trim();
        if (text) lines.push(`${tag === "OL" ? `${index + 1}.` : "-"} ${text}`);
      });
      lines.push("");
      return;
    }

    if (tag === "TABLE") {
      lines.push(""); lines.push("[Table]");
      node.querySelectorAll("tr").forEach((row, rowIndex) => {
        const cells = Array.from(row.querySelectorAll("th, td")).map(c => c.textContent?.trim() || "");
        if (cells.some(t => t)) {
          lines.push(`| ${cells.join(" | ")} |`);
          if (rowIndex === 0 && row.querySelector("th"))
            lines.push(`| ${cells.map(() => "---").join(" | ")} |`);
        }
      });
      lines.push("");
      return;
    }

    if (tag === "BUTTON") {
      const text = node.textContent?.trim();
      if (text) lines.push(`[Button: ${text}]`);
      return;
    }

    if (tag === "INPUT") {
      const type = node.getAttribute("type") || "text";
      const label = node.getAttribute("aria-label") || node.getAttribute("placeholder") || node.getAttribute("name");
      lines.push(`[Input ${type}: ${label || "unnamed"}]`);
      return;
    }

    for (const child of Array.from(node.childNodes)) {
      processNode(child, lines, depth + 1);
    }
  }
}

// ==================== Screenshot Capture ====================

export async function captureScreenshot(): Promise<string> {
  if (typeof window === "undefined") return "";

  try {
    const html2canvas = (await import("html2canvas")).default;

    const canvas = await html2canvas(document.body, {
      logging: false,
      useCORS: true,
      allowTaint: true,
      scale: 0.5,
      width: window.innerWidth,
      height: window.innerHeight,
      windowWidth: window.innerWidth,
      windowHeight: window.innerHeight,
    });

    return canvas.toDataURL("image/jpeg", 0.7);
  } catch (error) {
    console.error("[PageCapture] Screenshot failed:", error);
    return "";
  }
}

// ==================== Capture Page State ====================

export async function capturePageState(
  options: {
    includeScreenshot?: boolean;
    maxMarkdownLength?: number;
    maxElements?: number;
  } = {}
): Promise<PageState> {
  const {
    includeScreenshot = true,
    maxMarkdownLength = 8000,
    maxElements = 50,
  } = options;

  const [markdown, screenshot, actionableElements] = await Promise.all([
    Promise.resolve(extractPageMarkdown(maxMarkdownLength)),
    includeScreenshot ? captureScreenshot() : Promise.resolve(""),
    Promise.resolve(extractActionableElements(maxElements)),
  ]);

  return {
    url: typeof window !== "undefined" ? window.location.href : "",
    title: typeof document !== "undefined" ? document.title : "",
    markdown,
    screenshot,
    actionableElements,
    timestamp: new Date(),
  };
}

// ==================== Actionable Elements ====================

export function extractActionableElements(maxElements: number = 50): ActionableElement[] {
  if (typeof document === "undefined") return [];

  const elements: ActionableElement[] = [];
  const seen = new Set<string>();

  const selectors = [
    "button:not([disabled])", "a[href]",
    'input:not([type="hidden"]):not([disabled])',
    "select:not([disabled])", "textarea:not([disabled])",
    '[role="button"]:not([disabled])', '[role="link"]',
    '[role="menuitem"]', "[onclick]", '[tabindex="0"]',
  ];

  for (const el of Array.from(document.querySelectorAll(selectors.join(", ")))) {
    if (elements.length >= maxElements) break;

    const htmlEl = el as HTMLElement;
    const style = window.getComputedStyle(htmlEl);
    if (style.display === "none" || style.visibility === "hidden") continue;

    const rect = htmlEl.getBoundingClientRect();
    if (rect.width === 0 || rect.height === 0) continue;

    const selector = generateSelector(htmlEl);
    if (seen.has(selector)) continue;
    seen.add(selector);

    const type = getElementType(htmlEl);
    const text = getElementText(htmlEl);

    elements.push({
      id: `el-${elements.length}`,
      type,
      selector,
      text: text || undefined,
      placeholder: htmlEl.getAttribute("placeholder") || undefined,
      description: generateDescription(htmlEl, type, text),
    });
  }

  return elements;
}

function getElementType(el: HTMLElement): ActionableElement["type"] {
  const tag = el.tagName.toLowerCase();
  if (tag === "button" || el.getAttribute("role") === "button") return "button";
  if (tag === "a" || el.getAttribute("role") === "link") return "link";
  if (tag === "input") return "input";
  if (tag === "select") return "select";
  if (tag === "textarea") return "textarea";
  return "button";
}

function getElementText(el: HTMLElement): string {
  const ariaLabel = el.getAttribute("aria-label");
  if (ariaLabel) return ariaLabel.trim();
  const title = el.getAttribute("title");
  if (title) return title.trim();
  const text = el.textContent?.trim() || "";
  if (text && text.length <= 100) return text;
  if (text) return text.substring(0, 100) + "...";
  if (el instanceof HTMLInputElement) return el.value || el.placeholder || "";
  return "";
}

function generateDescription(el: HTMLElement, type: ActionableElement["type"], text: string): string {
  switch (type) {
    case "button": return text ? `Button: "${text}"` : `Button (${el.tagName.toLowerCase()})`;
    case "link": {
      const href = el.getAttribute("href") || "";
      return text ? `Link: "${text}" → ${href.length > 50 ? href.slice(0, 50) + "..." : href}` : `Link → ${href}`;
    }
    case "input": {
      const inputType = el.getAttribute("type") || "text";
      const name = el.getAttribute("name") || el.getAttribute("id") || el.getAttribute("placeholder") || "";
      return `Input (${inputType}): ${name || "unnamed"}`;
    }
    case "select": return `Dropdown: ${el.getAttribute("name") || el.getAttribute("id") || "unnamed"}`;
    case "textarea": return `Text area: ${el.getAttribute("name") || el.getAttribute("id") || "unnamed"}`;
    default: return text || "Interactive element";
  }
}

function generateSelector(el: HTMLElement): string {
  if (el.id) return `#${CSS.escape(el.id)}`;
  const testId = el.getAttribute("data-testid") || el.getAttribute("data-cy");
  if (testId) return `[data-testid="${CSS.escape(testId)}"]`;

  const parts: string[] = [];
  let current: HTMLElement | null = el;
  let depth = 0;

  while (current && depth < 3) {
    const tag = current.tagName.toLowerCase();
    const classes = Array.from(current.classList)
      .filter(c => !c.startsWith("css-") && !c.includes("--") && c.length < 30)
      .slice(0, 2);

    let part = tag;
    if (classes.length > 0) part += "." + classes.map(c => CSS.escape(c)).join(".");

    if (current.parentElement) {
      const siblings = Array.from(current.parentElement.children).filter(s => s.tagName === current!.tagName);
      if (siblings.length > 1) part += `:nth-of-type(${siblings.indexOf(current) + 1})`;
    }

    parts.unshift(part);
    current = current.parentElement;
    depth++;
  }

  return parts.join(" > ");
}

各段說明

1. extractPageMarkdown — DOM 轉 Markdown

從 DOM 提取頁面文字並轉換成 Markdown 格式,讓 LLM 更容易理解頁面結構。優先從 mainarticle[role="main"] 等語意元素開始擷取,遞迴遍歷子節點,根據 HTML 標籤轉換成對應的 Markdown 語法(標題、段落、連結、列表、表格等),並跳過 SCRIPTSTYLE、隱藏元素等非內容節點。

2. captureScreenshot — 視覺截圖

使用 html2canvas 動態載入(tree-shaking 友善)擷取當前 viewport 截圖,輸出為 base64 JPEG。設定 scale: 0.5quality: 0.7 降低圖片大小,減少傳輸時間與 token 成本。

3. capturePageState — 整合入口

並行執行 Markdown 擷取、截圖、可操作元素擷取三個操作,並回傳包含 URL、標題、時間戳的完整 PageState 物件給 Agent。

4. extractActionableElements — 互動元素清單

查詢頁面上所有可互動元素(按鈕、連結、輸入框、下拉選單等),過濾隱藏或零尺寸元素,為每個元素生成唯一 CSS selector 與描述,讓 Agent 知道頁面上有哪些可以操作的 UI。

整合到聊天元件

handleSend 中,於送出訊息前先呼叫 capturePageState,將結果透過 pageState 參數傳入 sendMessage

整合範例
// 在 handleSend 中整合頁面擷取
const handleSend = async () => {
  let currentPage: PageState | null = null;

  if (captureEnabled) {
    setIsCapturing(true);
    try {
      currentPage = await capturePageState({
        includeScreenshot: true,
        maxMarkdownLength: 6000,
        maxElements: 30,
      });
    } catch (err) {
      console.error("[PageCapture] 擷取失敗:", err);
    } finally {
      setIsCapturing(false);
    }
  }

  await sendMessage(message, {
    currentUrl: typeof window !== "undefined" ? window.location.href : "",
    pageState: currentPage ?? undefined,
  });
};

效能提示

Tip

降低截圖解析度:使用 scale: 0.5 來減少圖片大小,降低傳輸時間和 token 成本。

Tip

限制 Markdown 長度:設定 maxMarkdownLength: 6000 以避免過長的內容影響 LLM 效能。

Tip

限制元素數量:設定 maxElements: 30 以只擷取最重要的互動元素。