頁面擷取
透過頁面截圖和 DOM 擷取啟用視覺理解。
讓 Agent 能夠「看到」目前頁面,透過擷取頁面狀態,包含 Markdown 內容、截圖和可操作元素。
概觀
頁面擷取會提取以下頁面狀態:
- Markdown:以 Markdown 格式呈現的頁面文字內容
- 截圖:頁面的視覺呈現(base64 JPEG)
- 可操作元素:頁面上所有可互動元素的結構化清單
安裝相依套件
Terminal
bun add html2canvas完整原始碼
將以下程式碼儲存為 src/lib/page-capture.ts:
src/lib/page-capture.ts
/**
* Page Capture Utilities
* Captures page information for Lens OS Agent context
*/
import type { PageState, ActionableElement } from "@lens-os/sdk";
// ==================== DOM to Markdown ====================
export function extractPageMarkdown(maxLength: number = 8000): string {
if (typeof document === "undefined") return "";
const lines: string[] = [];
const mainContent =
document.querySelector("main") ||
document.querySelector("article") ||
document.querySelector('[role="main"]') ||
document.body;
if (!mainContent) return "";
processNode(mainContent, lines);
let result = lines.join("\n").trim();
if (result.length > maxLength) {
result = result.substring(0, maxLength) + "\n\n[Content truncated...]";
}
return result;
}
function processNode(node: Node, lines: string[], depth: number = 0): void {
if (node instanceof HTMLElement) {
const style = window.getComputedStyle(node);
if (style.display === "none" || style.visibility === "hidden") return;
}
if (node instanceof HTMLElement) {
const skipTags = ["SCRIPT", "STYLE", "SVG", "NOSCRIPT", "IFRAME", "CANVAS", "VIDEO", "AUDIO"];
if (skipTags.includes(node.tagName)) return;
}
if (node.nodeType === Node.TEXT_NODE) {
const text = node.textContent?.trim();
if (text) lines.push(text);
return;
}
if (node instanceof HTMLElement) {
const tag = node.tagName;
if (/^H[1-6]$/.test(tag)) {
const level = parseInt(tag[1]);
const text = node.textContent?.trim();
if (text) {
lines.push("");
lines.push(`${"#".repeat(level)} ${text}`);
lines.push("");
}
return;
}
if (tag === "P") {
const text = node.textContent?.trim();
if (text) { lines.push(""); lines.push(text); lines.push(""); }
return;
}
if (tag === "A") {
const href = node.getAttribute("href");
const text = node.textContent?.trim();
if (text && href) lines.push(`[${text}](${href})`);
else if (text) lines.push(text);
return;
}
if (tag === "IMG") {
const alt = node.getAttribute("alt");
const src = node.getAttribute("src");
if (alt || src) lines.push(``);
return;
}
if (tag === "UL" || tag === "OL") {
lines.push("");
const items = node.querySelectorAll(":scope > li");
items.forEach((li, index) => {
const text = li.textContent?.trim();
if (text) lines.push(`${tag === "OL" ? `${index + 1}.` : "-"} ${text}`);
});
lines.push("");
return;
}
if (tag === "TABLE") {
lines.push(""); lines.push("[Table]");
node.querySelectorAll("tr").forEach((row, rowIndex) => {
const cells = Array.from(row.querySelectorAll("th, td")).map(c => c.textContent?.trim() || "");
if (cells.some(t => t)) {
lines.push(`| ${cells.join(" | ")} |`);
if (rowIndex === 0 && row.querySelector("th"))
lines.push(`| ${cells.map(() => "---").join(" | ")} |`);
}
});
lines.push("");
return;
}
if (tag === "BUTTON") {
const text = node.textContent?.trim();
if (text) lines.push(`[Button: ${text}]`);
return;
}
if (tag === "INPUT") {
const type = node.getAttribute("type") || "text";
const label = node.getAttribute("aria-label") || node.getAttribute("placeholder") || node.getAttribute("name");
lines.push(`[Input ${type}: ${label || "unnamed"}]`);
return;
}
for (const child of Array.from(node.childNodes)) {
processNode(child, lines, depth + 1);
}
}
}
// ==================== Screenshot Capture ====================
export async function captureScreenshot(): Promise<string> {
if (typeof window === "undefined") return "";
try {
const html2canvas = (await import("html2canvas")).default;
const canvas = await html2canvas(document.body, {
logging: false,
useCORS: true,
allowTaint: true,
scale: 0.5,
width: window.innerWidth,
height: window.innerHeight,
windowWidth: window.innerWidth,
windowHeight: window.innerHeight,
});
return canvas.toDataURL("image/jpeg", 0.7);
} catch (error) {
console.error("[PageCapture] Screenshot failed:", error);
return "";
}
}
// ==================== Capture Page State ====================
export async function capturePageState(
options: {
includeScreenshot?: boolean;
maxMarkdownLength?: number;
maxElements?: number;
} = {}
): Promise<PageState> {
const {
includeScreenshot = true,
maxMarkdownLength = 8000,
maxElements = 50,
} = options;
const [markdown, screenshot, actionableElements] = await Promise.all([
Promise.resolve(extractPageMarkdown(maxMarkdownLength)),
includeScreenshot ? captureScreenshot() : Promise.resolve(""),
Promise.resolve(extractActionableElements(maxElements)),
]);
return {
url: typeof window !== "undefined" ? window.location.href : "",
title: typeof document !== "undefined" ? document.title : "",
markdown,
screenshot,
actionableElements,
timestamp: new Date(),
};
}
// ==================== Actionable Elements ====================
export function extractActionableElements(maxElements: number = 50): ActionableElement[] {
if (typeof document === "undefined") return [];
const elements: ActionableElement[] = [];
const seen = new Set<string>();
const selectors = [
"button:not([disabled])", "a[href]",
'input:not([type="hidden"]):not([disabled])',
"select:not([disabled])", "textarea:not([disabled])",
'[role="button"]:not([disabled])', '[role="link"]',
'[role="menuitem"]', "[onclick]", '[tabindex="0"]',
];
for (const el of Array.from(document.querySelectorAll(selectors.join(", ")))) {
if (elements.length >= maxElements) break;
const htmlEl = el as HTMLElement;
const style = window.getComputedStyle(htmlEl);
if (style.display === "none" || style.visibility === "hidden") continue;
const rect = htmlEl.getBoundingClientRect();
if (rect.width === 0 || rect.height === 0) continue;
const selector = generateSelector(htmlEl);
if (seen.has(selector)) continue;
seen.add(selector);
const type = getElementType(htmlEl);
const text = getElementText(htmlEl);
elements.push({
id: `el-${elements.length}`,
type,
selector,
text: text || undefined,
placeholder: htmlEl.getAttribute("placeholder") || undefined,
description: generateDescription(htmlEl, type, text),
});
}
return elements;
}
function getElementType(el: HTMLElement): ActionableElement["type"] {
const tag = el.tagName.toLowerCase();
if (tag === "button" || el.getAttribute("role") === "button") return "button";
if (tag === "a" || el.getAttribute("role") === "link") return "link";
if (tag === "input") return "input";
if (tag === "select") return "select";
if (tag === "textarea") return "textarea";
return "button";
}
function getElementText(el: HTMLElement): string {
const ariaLabel = el.getAttribute("aria-label");
if (ariaLabel) return ariaLabel.trim();
const title = el.getAttribute("title");
if (title) return title.trim();
const text = el.textContent?.trim() || "";
if (text && text.length <= 100) return text;
if (text) return text.substring(0, 100) + "...";
if (el instanceof HTMLInputElement) return el.value || el.placeholder || "";
return "";
}
function generateDescription(el: HTMLElement, type: ActionableElement["type"], text: string): string {
switch (type) {
case "button": return text ? `Button: "${text}"` : `Button (${el.tagName.toLowerCase()})`;
case "link": {
const href = el.getAttribute("href") || "";
return text ? `Link: "${text}" → ${href.length > 50 ? href.slice(0, 50) + "..." : href}` : `Link → ${href}`;
}
case "input": {
const inputType = el.getAttribute("type") || "text";
const name = el.getAttribute("name") || el.getAttribute("id") || el.getAttribute("placeholder") || "";
return `Input (${inputType}): ${name || "unnamed"}`;
}
case "select": return `Dropdown: ${el.getAttribute("name") || el.getAttribute("id") || "unnamed"}`;
case "textarea": return `Text area: ${el.getAttribute("name") || el.getAttribute("id") || "unnamed"}`;
default: return text || "Interactive element";
}
}
function generateSelector(el: HTMLElement): string {
if (el.id) return `#${CSS.escape(el.id)}`;
const testId = el.getAttribute("data-testid") || el.getAttribute("data-cy");
if (testId) return `[data-testid="${CSS.escape(testId)}"]`;
const parts: string[] = [];
let current: HTMLElement | null = el;
let depth = 0;
while (current && depth < 3) {
const tag = current.tagName.toLowerCase();
const classes = Array.from(current.classList)
.filter(c => !c.startsWith("css-") && !c.includes("--") && c.length < 30)
.slice(0, 2);
let part = tag;
if (classes.length > 0) part += "." + classes.map(c => CSS.escape(c)).join(".");
if (current.parentElement) {
const siblings = Array.from(current.parentElement.children).filter(s => s.tagName === current!.tagName);
if (siblings.length > 1) part += `:nth-of-type(${siblings.indexOf(current) + 1})`;
}
parts.unshift(part);
current = current.parentElement;
depth++;
}
return parts.join(" > ");
}各段說明
1. extractPageMarkdown — DOM 轉 Markdown
從 DOM 提取頁面文字並轉換成 Markdown 格式,讓 LLM 更容易理解頁面結構。優先從 main、article、[role="main"] 等語意元素開始擷取,遞迴遍歷子節點,根據 HTML 標籤轉換成對應的 Markdown 語法(標題、段落、連結、列表、表格等),並跳過 SCRIPT、STYLE、隱藏元素等非內容節點。
2. captureScreenshot — 視覺截圖
使用 html2canvas 動態載入(tree-shaking 友善)擷取當前 viewport 截圖,輸出為 base64 JPEG。設定 scale: 0.5 與 quality: 0.7 降低圖片大小,減少傳輸時間與 token 成本。
3. capturePageState — 整合入口
並行執行 Markdown 擷取、截圖、可操作元素擷取三個操作,並回傳包含 URL、標題、時間戳的完整 PageState 物件給 Agent。
4. extractActionableElements — 互動元素清單
查詢頁面上所有可互動元素(按鈕、連結、輸入框、下拉選單等),過濾隱藏或零尺寸元素,為每個元素生成唯一 CSS selector 與描述,讓 Agent 知道頁面上有哪些可以操作的 UI。
整合到聊天元件
在 handleSend 中,於送出訊息前先呼叫 capturePageState,將結果透過 pageState 參數傳入 sendMessage:
整合範例
// 在 handleSend 中整合頁面擷取
const handleSend = async () => {
let currentPage: PageState | null = null;
if (captureEnabled) {
setIsCapturing(true);
try {
currentPage = await capturePageState({
includeScreenshot: true,
maxMarkdownLength: 6000,
maxElements: 30,
});
} catch (err) {
console.error("[PageCapture] 擷取失敗:", err);
} finally {
setIsCapturing(false);
}
}
await sendMessage(message, {
currentUrl: typeof window !== "undefined" ? window.location.href : "",
pageState: currentPage ?? undefined,
});
};效能提示
Tip
降低截圖解析度:使用
scale: 0.5 來減少圖片大小,降低傳輸時間和 token 成本。Tip
限制 Markdown 長度:設定
maxMarkdownLength: 6000 以避免過長的內容影響 LLM 效能。Tip
限制元素數量:設定
maxElements: 30 以只擷取最重要的互動元素。