# jez500/seleniumbase-scrapper

This project exposes an HTTP API for SeleniumBase, allowing you to fetch web page content and metadata via HTTP requests.
Its API is modeled on scrapper (which uses Playwright) and uses the same parameters and response format, so the two should be interchangeable.
Start the container with the API server:
```bash
docker run -d -p 8000:8000 --name seleniumbase_scrapper jez500/seleniumbase-scrapper
```
The API will be automatically started and available on port 8000.
You can set default values for all API parameters using environment variables. This is useful for configuring the behavior of the API without changing query parameters for each request.
Example with environment variables:
```bash
docker run -d -p 8000:8000 \
  -e DEFAULT_CACHE=true \
  -e DEFAULT_FULL_CONTENT=false \
  -e DEFAULT_SCREENSHOT=false \
  -e DEFAULT_INCOGNITO=true \
  -e DEFAULT_TIMEOUT=60000 \
  -e DEFAULT_SLEEP=1000 \
  --name seleniumbase_scrapper jez500/seleniumbase-scrapper
```
All available environment variables:
- DEFAULT_CACHE (default: false)
- DEFAULT_CACHE_TTL (default: 3600; cache time-to-live in seconds, i.e. 60 minutes)
- DEFAULT_FULL_CONTENT (default: false)
- DEFAULT_SCREENSHOT (default: false)
- DEFAULT_USER_SCRIPTS (default: empty)
- DEFAULT_USER_SCRIPTS_TIMEOUT (default: 0)
- DEFAULT_INCOGNITO (default: true)
- DEFAULT_TIMEOUT (default: 60000)
- DEFAULT_WAIT_UNTIL (default: domcontentloaded)
- DEFAULT_SLEEP (default: 0)
- DEFAULT_RESOURCE (default: empty, all resources allowed)
- DEFAULT_VIEWPORT_WIDTH (default: empty)
- DEFAULT_VIEWPORT_HEIGHT (default: empty)
- DEFAULT_SCREEN_WIDTH (default: empty)
- DEFAULT_SCREEN_HEIGHT (default: empty)
- DEFAULT_DEVICE (default: Desktop Chrome)
- DEFAULT_SCROLL_DOWN (default: 0)
- DEFAULT_IGNORE_HTTPS_ERRORS (default: true)
- DEFAULT_USER_AGENT (default: empty)
- DEFAULT_LOCALE (default: empty)
- DEFAULT_TIMEZONE (default: empty)
- DEFAULT_HTTP_CREDENTIALS (default: empty)
- DEFAULT_EXTRA_HTTP_HEADERS (default: empty)

User scripts allow you to execute custom JavaScript code on the page after it loads but before article extraction begins. This is useful for tasks such as removing ad elements or automatically accepting cookie banners.
To use user scripts:
1. Place your script files in the api/user_scripts directory.
2. List the script names in the user-scripts parameter, separated by commas.

Example user script (api/user_scripts/example-remove-ads.js):
```javascript
// Remove common ad elements
(function() {
  const adSelectors = ['.advertisement', '.ad-container', '.ads', '#ad'];
  adSelectors.forEach(selector => {
    document.querySelectorAll(selector).forEach(el => el.remove());
  });
})();
```
Usage:
```bash
curl -X GET "http://localhost:8000/api/article?url=https://example.com&user-scripts=example-remove-ads.js"
```
Multiple scripts can be specified, separated by commas:

```bash
curl -X GET "http://localhost:8000/api/article?url=https://example.com&user-scripts=remove-ads.js,accept-cookies.js"
```
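Since script names are joined with commas and commas are percent-encoded in query strings, a client can build these URLs with a small helper. A minimal Python sketch; `build_article_url` is a hypothetical helper name, not part of the API:

```python
from urllib.parse import urlencode

def build_article_url(base: str, page_url: str, user_scripts=()) -> str:
    """Build a GET /api/article URL, joining user script names with commas."""
    for name in user_scripts:
        if "," in name:  # commas are reserved as separators
            raise ValueError(f"script name may not contain a comma: {name!r}")
    params = {"url": page_url}
    if user_scripts:
        params["user-scripts"] = ",".join(user_scripts)
    return f"{base}/api/article?{urlencode(params)}"

print(build_article_url("http://localhost:8000", "https://example.com",
                        ["remove-ads.js", "accept-cookies.js"]))
# -> http://localhost:8000/api/article?url=https%3A%2F%2Fexample.com&user-scripts=remove-ads.js%2Caccept-cookies.js
```

`urlencode` percent-encodes the comma as `%2C`, which the server decodes back into a separator.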
### GET /api/article

Fetches article content and metadata from a specified URL using SeleniumBase.
Parameters:
All parameters except url have default values that can be set via environment variables (with DEFAULT_ prefix).
| Parameter | Description | Default | Env Variable |
|---|---|---|---|
| url | Page URL. The page should contain the text of the article to be extracted. | (required) | - |
| cache | All scraping results are always saved to disk. This parameter determines whether to read results from the cache or execute a new request. When set to true, existing cached results are returned if available. By default, cache reading is disabled, so each request is processed anew. | false | DEFAULT_CACHE |
| full-content | If set to true, the result includes the full HTML contents of the page (fullContent field in the response). | false | DEFAULT_FULL_CONTENT |
| screenshot | If set to true, the result includes a link to a screenshot of the page (screenshotUri field in the response). Scrapper first attempts to capture the entire scrollable page; if that fails because the image is too large, it captures only the currently visible viewport. | false | DEFAULT_SCREENSHOT |
| user-scripts | Comma-separated list of JavaScript files from the user_scripts directory to run on the page. The scripts run after the page loads but before the article parser starts, so they can, for example, remove ad blocks or automatically click a cookie-acceptance button. Script names cannot include commas, as commas are used as separators. Example: example-remove-ads.js. | (empty) | DEFAULT_USER_SCRIPTS |
| user-scripts-timeout | Waits for the given timeout in milliseconds after user scripts are injected. If your scripts need time to navigate to specific content, set a higher value. The default is 0, meaning no wait. | 0 | DEFAULT_USER_SCRIPTS_TIMEOUT |
| Parameter | Description | Default | Env Variable |
|---|---|---|---|
| incognito | Allows creating incognito browser contexts. Incognito browser contexts don't write any browsing data to disk. | true | DEFAULT_INCOGNITO |
| timeout | Maximum time to navigate to the page, in milliseconds; defaults to 60000 (60 seconds). Pass 0 to disable the timeout. | 60000 | DEFAULT_TIMEOUT |
| wait-until | When to consider navigation succeeded. Events can be: load (operation is finished when the load event is fired), domcontentloaded (operation is finished when the DOMContentLoaded event is fired), networkidle (operation is finished when there are no network connections for at least 500 ms), commit (operation is finished when the network response is received and the document has started loading). | domcontentloaded | DEFAULT_WAIT_UNTIL |
| sleep | Waits for the given timeout in milliseconds after the page has loaded and before parsing the article. In many cases a sleep timeout is unnecessary, but for some websites it can be quite useful. Other waiting mechanisms, such as waiting for selector visibility, are not currently supported. The default is 0, meaning no sleep. | 0 | DEFAULT_SLEEP |
| resource | List of resource types allowed to load on the page. All other resources are blocked and their network requests aborted. By default, all resource types are allowed. Supported types: document, stylesheet, image, media, font, script, texttrack, xhr, fetch, eventsource, websocket, manifest, other. Example: document,stylesheet,fetch. | (empty) | DEFAULT_RESOURCE |
| viewport-width | The viewport width in pixels. Prefer the device parameter over setting this explicitly. | (empty) | DEFAULT_VIEWPORT_WIDTH |
| viewport-height | The viewport height in pixels. Prefer the device parameter over setting this explicitly. | (empty) | DEFAULT_VIEWPORT_HEIGHT |
| screen-width | The page width in pixels. Emulates a consistent window screen size available inside the web page via window.screen. Only used when the viewport is set. | (empty) | DEFAULT_SCREEN_WIDTH |
| screen-height | The page height in pixels. | (empty) | DEFAULT_SCREEN_HEIGHT |
| device | Simulates browser behavior for a specific device: user agent, screen size, viewport, and whether touch is enabled. Individual parameters such as user-agent, viewport-width, and viewport-height can also be used; they override the device settings. | Desktop Chrome | DEFAULT_DEVICE |
| scroll-down | Scrolls down the page by the specified number of pixels. Particularly useful for lazy-loading pages, which load content only as you scroll. Use together with the sleep parameter: set a positive sleep value, otherwise the scroll has no effect. | 0 | DEFAULT_SCROLL_DOWN |
| ignore-https-errors | Whether to ignore HTTPS errors when sending network requests. HTTPS errors are ignored by default. | true | DEFAULT_IGNORE_HTTPS_ERRORS |
| user-agent | Specific user agent. Prefer the device parameter over setting this explicitly. | (empty) | DEFAULT_USER_AGENT |
| locale | User locale, for example en-GB or de-DE. The locale affects the navigator.language value, the Accept-Language request header, and number and date formatting rules. | (empty) | DEFAULT_LOCALE |
| timezone | Changes the timezone of the context. See ICU's metaZones.txt for a list of supported timezone IDs. | (empty) | DEFAULT_TIMEZONE |
| http-credentials | Credentials for HTTP authentication: a string containing a username and password separated by a colon, e.g. username:password. | (empty) | DEFAULT_HTTP_CREDENTIALS |
| extra-http-headers | Additional HTTP headers to send with every request. Example: X-API-Key:***;X-Auth-Token:abcdef. | (empty) | DEFAULT_EXTRA_HTTP_HEADERS |
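All of these parameters are plain query-string keys, so a client can assemble them from a dictionary. A minimal Python sketch using only the standard library; the `article_query` and `fetch_article` helper names are illustrative, and `fetch_article` assumes a container listening on the quick-start port:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def article_query(page_url: str, **params) -> str:
    """Translate keyword arguments into the hyphenated query parameters
    documented above (scroll_down=1000 -> scroll-down=1000, cache=True -> cache=true)."""
    query = {"url": page_url}
    for key, value in params.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        query[key.replace("_", "-")] = value
    return urlencode(query)

def fetch_article(base_url: str, page_url: str, **params) -> dict:
    """GET /api/article and decode the JSON body (requires a running container)."""
    with urlopen(f"{base_url}/api/article?{article_query(page_url, **params)}") as resp:
        return json.load(resp)

print(article_query("https://example.com", cache=True, scroll_down=1000, sleep=2000))
# -> url=https%3A%2F%2Fexample.com&cache=true&scroll-down=1000&sleep=2000
```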
Examples:
Basic usage:
```bash
curl -X GET "http://localhost:8000/api/article?url=https://en.***.org/wiki/web_scraping"
```
With caching enabled:
```bash
curl -X GET "http://localhost:8000/api/article?url=https://example.com&cache=true"
```
With full content and screenshot:
```bash
curl -X GET "http://localhost:8000/api/article?url=https://example.com&full-content=true&screenshot=true"
```
With custom viewport and sleep:
```bash
curl -X GET "http://localhost:8000/api/article?url=https://example.com&viewport-width=1024&viewport-height=768&sleep=2000"
```
With user scripts:
```bash
curl -X GET "http://localhost:8000/api/article?url=https://example.com&user-scripts=example-remove-ads.js&user-scripts-timeout=1000"
```
With scroll for lazy-loading content:
```bash
curl -X GET "http://localhost:8000/api/article?url=https://example.com&scroll-down=1000&sleep=2000"
```
Response Fields:
The response to the /api/article request is a JSON object containing the following fields:
| Field | Description | Type |
|---|---|---|
| byline | author metadata | null or str |
| content | HTML string of processed article content | null or str |
| dir | content direction | null or str |
| excerpt | article description, or a short excerpt from the content | null or str |
| fullContent | full HTML contents of the page | null or str |
| id | unique result ID | str |
| url | page URL after redirects; may not match the query URL | str |
| domain | page's registered domain | str |
| lang | content language | null or str |
| length | length of the extracted article, in characters | null or int |
| date | date the article was extracted, in ISO 8601 format | str |
| query | request parameters | object |
| meta | social meta tags (Open Graph, ***) | object |
| resultUri | URL of the current result; the data here is always taken from cache | str |
| screenshotUri | URL of the page screenshot | null or str |
| siteName | name of the site | null or str |
| textContent | text content of the article, with all HTML tags removed | null or str |
| title | article title | null or str |
| publishedTime | article publication time | null or str |
Example Response:
```json
{
  "id": "13cfc98ddfe0fd340fbccd298ada8c17",
  "url": "[***]",
  "domain": "en.***.org",
  "title": "Web scraping - ***",
  "byline": null,
  "excerpt": null,
  "siteName": null,
  "content": "<article>...</article>",
  "textContent": "Web scraping - ***\nJump to content...",
  "length": 27104,
  "lang": "en",
  "dir": "ltr",
  "publishedTime": null,
  "fullContent": "<html>...</html>",
  "date": "2025-11-11T22:37:42.235424Z",
  "query": { "url": "[***]" },
  "meta": {
    "og_title": "Web scraping - ***",
    "og_type": "website"
  },
  "resultUri": "api://article/13cfc98ddfe0fd340fbccd298ada8c17",
  "screenshotUri": null
}
```
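Because many fields are nullable, consumers should guard against null before formatting. A minimal Python sketch, assuming a decoded response shaped like the example above; the `summarize` helper is hypothetical:

```python
def summarize(article: dict) -> str:
    """One-line summary of an /api/article response, tolerating nullable fields."""
    title = article.get("title") or "(untitled)"
    lang = article.get("lang") or "?"
    length = article.get("length") or 0
    return f"{title} [{lang}] {length} chars from {article['domain']}"

sample = {  # abridged from the example response above
    "domain": "example.com",
    "title": "Web scraping",
    "byline": None,   # nullable, per the fields table
    "length": 27104,
    "lang": "en",
}
print(summarize(sample))  # -> Web scraping [en] 27104 chars from example.com
```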
Error Handling:
Error responses follow this structure:
```json
{
  "detail": [
    {
      "type": "error_type",
      "msg": "some message"
    }
  ]
}
```
For detailed error information, consult the Docker container logs.
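Since `detail` is a list, a client should iterate over it rather than assume a single entry. A minimal Python sketch; `extract_errors` is an illustrative helper, and the possible `type` values are not enumerated in this document:

```python
import json

def extract_errors(body: str):
    """Parse an error response body into (type, msg) pairs."""
    payload = json.loads(body)
    return [(item.get("type"), item.get("msg")) for item in payload.get("detail", [])]

body = '{"detail": [{"type": "error_type", "msg": "some message"}]}'
for err_type, msg in extract_errors(body):
    print(f"{err_type}: {msg}")  # -> error_type: some message
```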
Response Codes:
### GET /health

Health check endpoint to verify the API is running.
Example:
```bash
curl -X GET "http://localhost:8000/health"
```
Response:
```json
{
  "status": "healthy",
  "service": "jez500/seleniumbase-scrapper"
}
```
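This endpoint is convenient for waiting until a freshly started container is ready to serve requests. A minimal Python sketch, assuming the quick-start port mapping; the helper names and retry policy are illustrative:

```python
import json
import time
from urllib.error import URLError
from urllib.request import urlopen

def is_healthy(payload: dict) -> bool:
    """True when a /health payload reports the documented status."""
    return payload.get("status") == "healthy"

def wait_healthy(base_url="http://localhost:8000", attempts=10, delay=1.0) -> bool:
    """Poll GET /health until it reports healthy or attempts run out."""
    for _ in range(attempts):
        try:
            with urlopen(f"{base_url}/health", timeout=5) as resp:
                if is_healthy(json.load(resp)):
                    return True
        except URLError:
            pass  # container may still be starting
        time.sleep(delay)
    return False

print(is_healthy({"status": "healthy", "service": "jez500/seleniumbase-scrapper"}))
# -> True
```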
### GET /

Root endpoint that provides API documentation.
Example:
```bash
curl -X GET "http://localhost:8000/"
```
Response:
```json
{
  "service": "SeleniumBase API",
  "version": "1.0.0",
  "endpoints": {
    "/api/article": {
      "method": "GET",
      "description": "Fetch HTML content from a URL",
      "parameters": {
        "url": "The URL to fetch (required)"
      },
      "example": "/api/article?url=[***]"
    },
    "/health": {
      "method": "GET",
      "description": "Health check endpoint"
    }
  }
}
```
```bash
curl -X GET "http://localhost:8000/api/article?url=https://en.***.org/wiki/Python_(programming_language)"
```
```bash
curl -X GET "http://localhost:8000/api/article?url=https://[***]"
```
```bash
curl -X GET "http://localhost:8000/api/article?url=https://[***]" -o output.html
```
View container logs:

```bash
docker logs seleniumbase_scrapper
```

Stop the container:

```bash
docker stop seleniumbase_scrapper
```

Start a stopped container:

```bash
docker start seleniumbase_scrapper
```

Remove the container:

```bash
docker rm -f seleniumbase_scrapper
```
You can also run the container interactively while still having the API available:
```bash
docker run -it -p 8000:8000 --name seleniumbase_scrapper jez500/seleniumbase-scrapper
```
The API server will start automatically in the background, and you'll have access to a bash shell.
Check that the API is responding:

```bash
curl http://localhost:8000/health
```
```bash
docker exec -it seleniumbase_scrapper bash
# Then check for the API process
ps aux | grep python
```
If port 8000 is already in use, map to a different port:
```bash
docker run -d -p 9000:8000 --name seleniumbase_scrapper jez500/seleniumbase-scrapper
curl -X GET "http://localhost:9000/api/article?url=https://example.com"
```