ScrapeGraph MCP Server

MCP • Official • Open Source • MIT • 49.9

by ScrapeGraphAI • Analytics & Monitoring

An MCP server that connects language models to the ScrapeGraph AI web‑scraping API to perform single‑page scraping, search scraping, multi‑page crawling, and agentic extraction workflows.

Example Use Cases

1. Convert single web pages (including JS-heavy pages) into clean, structured markdown for downstream analysis or summarization.
2. Crawl and extract structured data across multiple pages or an entire site with configurable depth, page limits, and asynchronous result fetching.
3. Run advanced, stepwise agentic scraping workflows (forms, pagination, multi-step navigation) with custom output schemas and persistent sessions.

Description

This server exposes a production-ready Model Context Protocol (MCP) interface for the ScrapeGraph AI API, enabling LLMs and agents to request structured data extraction, markdown conversions, and multi-page crawls. It supports JavaScript-heavy rendering, infinite-scroll handling, asynchronous crawls, and customizable agentic scraping workflows with output schemas. The server provides enterprise-grade error handling and is designed to plug into tools like Claude Desktop and Cursor. A ScrapeGraph API key is required to run it.

Capabilities (14 total)

Tools (8)

markdownify

Convert a webpage into clean, formatted markdown.

This tool fetches any webpage and converts its content into clean, readable markdown format. Useful for extracting content from documentation, articles, and web pages for further processing. Costs 2 credits per page. Read-only operation with no side effects.

Args:
    website_url (str): The complete URL of the webpage to convert to markdown format.
        - Must include protocol (http:// or https://)
        - Supports most web content types (HTML, articles, documentation)
        - Works with both static and dynamic content
        - Examples:
            * https://example.com/page
            * https://docs.python.org/3/tutorial/
            * https://github.com/user/repo/README.md
        - Invalid examples:
            * example.com (missing protocol)
            * ftp://example.com (unsupported protocol)
            * localhost:3000 (missing protocol)

Returns:
    Dictionary containing:
        - markdown: The converted markdown content as a string
        - metadata: Additional information about the conversion (title, description, etc.)
        - status: Success/error status of the operation
        - credits_used: Number of credits consumed (always 2 for this operation)

Raises:
    ValueError: If website_url is malformed or missing protocol
    HTTPError: If the webpage cannot be accessed or returns an error
    TimeoutError: If the webpage takes too long to load (>120 seconds)

website_url:string*
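
The protocol rule above can be checked on the client before any credits are spent. The following is a minimal sketch, not the server's own validation: the helper name validate_website_url is hypothetical, and it only uses the standard library's urllib.parse.

```python
from urllib.parse import urlparse


def validate_website_url(website_url: str) -> str:
    """Hypothetical pre-flight check mirroring the documented URL rule.

    Accepts only URLs with an http or https scheme and a host part.
    """
    parsed = urlparse(website_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(
            f"website_url must start with http:// or https://: {website_url!r}"
        )
    return website_url


# The documented valid and invalid examples behave as described.
validate_website_url("https://docs.python.org/3/tutorial/")
for bad in ("example.com", "ftp://example.com", "localhost:3000"):
    try:
        validate_website_url(bad)
    except ValueError as exc:
        print("rejected:", exc)
```

Rejecting a bad URL locally avoids a round trip that, per the description, would end in a ValueError anyway.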

smartscraper

Extract structured data from a webpage, HTML, or markdown using AI-powered extraction.

This tool uses advanced AI to understand your natural language prompt and extract specific structured data from web content. Supports three input modes: URL scraping, local HTML processing, or local markdown processing. Ideal for extracting product information, contact details, article metadata, or any structured content. Costs 10 credits per page. Read-only operation.

Args:
    user_prompt (str): Natural language instructions describing what data to extract.
        - Be specific about the fields you want for better results
        - Use clear, descriptive language about the target data
        - Examples:
            * "Extract product name, price, description, and availability status"
            * "Find all contact methods: email addresses, phone numbers, and social media links"
            * "Get article title, author, publication date, and summary"
            * "Extract all job listings with title, company, location, and salary"
        - Tips for better results:
            * Specify exact field names you want
            * Mention data types (numbers, dates, URLs, etc.)
            * Include context about where data might be located

    website_url (Optional[str]): The complete URL of the webpage to scrape.
        - Mutually exclusive with website_html and website_markdown
        - Must include protocol (http:// or https://)
        - Supports dynamic and static content
        - Examples:
            * https://example.com/products/item
            * https://news.site.com/article/123
            * https://company.com/contact
        - Default: None (must provide one of the three input sources)

    website_html (Optional[str]): Raw HTML content to process locally.
        - Mutually exclusive with website_url and website_markdown
        - Maximum size: 2MB
        - Useful for processing pre-fetched or generated HTML
        - Use when you already have HTML content from another source
        - Example: "<html><body><h1>Title</h1><p>Content</p></body></html>"
        - Default: None

    website_markdown (Optional[str]): Markdown content to process locally.
        - Mutually exclusive with website_url and website_html
        - Maximum size: 2MB
        - Useful for extracting from markdown documents or converted content
        - Works well with documentation, README files, or converted web content
        - Example: "# Title ## Section Content here..."
        - Default: None

    output_schema (Optional[Union[str, Dict]]): JSON schema defining expected output structure.
        - Can be provided as a dictionary or JSON string
        - Helps ensure consistent, structured output format
        - Optional but recommended for complex extractions
        - Examples:
            * As dict: {'type': 'object', 'properties': {'title': {'type': 'string'}, 'price': {'type': 'number'}}}
            * As JSON string: '{"type": "object", "properties": {"name": {"type": "string"}}}'
            * For arrays: {'type': 'array', 'items': {'type': 'object', 'properties': {...}}}
        - Default: None (AI will infer structure from prompt)

    number_of_scrolls (Optional[int]): Number of infinite scrolls to perform before scraping.
        - Range: 0-50 scrolls
        - Default: 0 (no scrolling)
        - Useful for dynamically loaded content (lazy loading, infinite scroll)
        - Each scroll waits for content to load before continuing
        - Examples:
            * 0: Static content, no scrolling needed
            * 3: Social media feeds, product listings
            * 10: Long articles, extensive product catalogs
        - Note: Increases processing time proportionally

    total_pages (Optional[int]): Number of pages to process for pagination.
        - Range: 1-100 pages
        - Default: 1 (single page only)
        - Automatically follows pagination links when available
        - Useful for multi-page listings, search results, catalogs
        - Examples:
            * 1: Single page extraction
            * 5: First 5 pages of search results
            * 20: Comprehensive catalog scraping
        - Note: Each page counts toward credit usage (10 credits × pages)

    render_heavy_js (Optional[bool]): Enable heavy JavaScript rendering for dynamic sites.
        - Default: false
        - Set to true for Single Page Applications (SPAs), React apps, Vue.js sites
        - Increases processing time but captures client-side rendered content
        - Use when content is loaded dynamically via JavaScript
        - Examples of when to use:
            * React/Angular/Vue applications
            * Sites with dynamic content loading
            * AJAX-heavy interfaces
            * Content that appears after page load
        - Note: Significantly increases processing time (30-60 seconds vs 5-15 seconds)

    stealth (Optional[bool]): Enable stealth mode to avoid bot detection.
        - Default: false
        - Helps bypass basic anti-scraping measures
        - Uses techniques to appear more like a human browser
        - Useful for sites with bot detection systems
        - Examples of when to use:
            * Sites that block automated requests
            * E-commerce sites with protection
            * Sites that require "human-like" behavior
        - Note: May increase processing time and is not 100% guaranteed

Returns:
    Dictionary containing:
        - extracted_data: The structured data matching your prompt and optional schema
        - metadata: Information about the extraction process
        - credits_used: Number of credits consumed (10 per page processed)
        - processing_time: Time taken for the extraction
        - pages_processed: Number of pages that were analyzed
        - status: Success/error status of the operation

Raises:
    ValueError: If no input source provided or multiple sources provided
    HTTPError: If website_url cannot be accessed
    TimeoutError: If processing exceeds timeout limits
    ValidationError: If output_schema is malformed JSON

user_prompt:string* • website_url:any • website_html:any • website_markdown:any • output_schema:any • number_of_scrolls:any • total_pages:any • render_heavy_js:any • stealth:any
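
The "exactly one input source" rule and the two accepted forms of output_schema can be illustrated with a small client-side helper. This is a sketch under assumptions: build_smartscraper_args and estimate_smartscraper_credits are hypothetical names, not part of the server, and the ranges and the 10-credit figure are taken from the description above.

```python
import json
from typing import Any, Dict, Optional, Union

CREDITS_PER_PAGE = 10  # per-page cost stated in the smartscraper description


def build_smartscraper_args(
    user_prompt: str,
    website_url: Optional[str] = None,
    website_html: Optional[str] = None,
    website_markdown: Optional[str] = None,
    output_schema: Optional[Union[str, Dict[str, Any]]] = None,
    total_pages: int = 1,
) -> Dict[str, Any]:
    """Hypothetical helper: assemble and sanity-check smartscraper arguments."""
    sources = {
        "website_url": website_url,
        "website_html": website_html,
        "website_markdown": website_markdown,
    }
    provided = {name: value for name, value in sources.items() if value is not None}
    # The three input sources are documented as mutually exclusive.
    if len(provided) != 1:
        raise ValueError(
            "Provide exactly one of website_url, website_html, website_markdown"
        )
    if not 1 <= total_pages <= 100:
        raise ValueError("total_pages must be between 1 and 100")

    args: Dict[str, Any] = {"user_prompt": user_prompt, "total_pages": total_pages}
    args.update(provided)
    if output_schema is not None:
        # Accept either documented form; a JSON string is parsed into a dict.
        if isinstance(output_schema, str):
            output_schema = json.loads(output_schema)
        args["output_schema"] = output_schema
    return args


def estimate_smartscraper_credits(total_pages: int = 1) -> int:
    """Documented cost: 10 credits per page processed."""
    return CREDITS_PER_PAGE * total_pages


args = build_smartscraper_args(
    "Extract product name, price, description, and availability status",
    website_url="https://example.com/products/item",
    output_schema='{"type": "object", "properties": {"name": {"type": "string"}}}',
    total_pages=5,
)
print(args["output_schema"]["type"], estimate_smartscraper_credits(5))
```

Normalizing the schema to a dict on the client is a design choice of this sketch; the tool itself is documented as accepting both forms.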

searchscraper

Perform AI-powered web searches with structured data extraction.

This tool searches the web based on your query and uses AI to extract structured information from the search results. Ideal for research, competitive analysis, and gathering information from multiple sources. Each website searched costs 10 credits (default 3 websites = 30 credits). Read-only operation but results may vary over time (non-idempotent).

Args:
    user_prompt (str): Search query or natural language instructions for information to find.
        - Can be a simple search query or detailed extraction instructions
        - The AI will search the web and extract relevant data from found pages
        - Be specific about what information you want extracted
        - Examples:
            * "Find latest AI research papers published in 2024 with author names and abstracts"
            * "Search for Python web scraping tutorials with ratings and difficulty levels"
            * "Get current cryptocurrency prices and market caps for top 10 coins"
            * "Find contact information for tech startups in San Francisco"
            * "Search for job openings for data scientists with salary information"
        - Tips for better results:
            * Include specific fields you want extracted
            * Mention timeframes or filters (e.g., "latest", "2024", "top 10")
            * Specify data types needed (prices, dates, ratings, etc.)

    num_results (Optional[int]): Number of websites to search and extract data from.
        - Default: 3 websites (costs 30 credits total)
        - Range: 1-20 websites (recommended to stay under 10 for cost efficiency)
        - Each website costs 10 credits, so total cost = num_results × 10
        - Examples:
            * 1: Quick single-source lookup (10 credits)
            * 3: Standard research (30 credits) - good balance of coverage and cost
            * 5: Comprehensive research (50 credits)
            * 10: Extensive analysis (100 credits)
        - Note: More results provide broader coverage but increase costs and processing time

    number_of_scrolls (Optional[int]): Number of infinite scrolls per searched webpage.
        - Default: 0 (no scrolling on search result pages)
        - Range: 0-10 scrolls per page
        - Useful when search results point to pages with dynamic content loading
        - Each scroll waits for content to load before continuing
        - Examples:
            * 0: Static content pages, news articles, documentation
            * 2: Social media pages, product listings with lazy loading
            * 5: Extensive feeds, long-form content with infinite scroll
        - Note: Increases processing time significantly (adds 5-10 seconds per scroll per page)

Returns:
    Dictionary containing:
        - search_results: Array of extracted data from each website found
        - sources: List of URLs that were searched and processed
        - total_websites_processed: Number of websites successfully analyzed
        - credits_used: Total credits consumed (num_results × 10)
        - processing_time: Total time taken for search and extraction
        - search_query_used: The actual search query sent to search engines
        - metadata: Additional information about the search process

Raises:
    ValueError: If user_prompt is empty or num_results is out of range
    HTTPError: If search engines are unavailable or return errors
    TimeoutError: If search or extraction process exceeds timeout limits
    RateLimitError: If too many requests are made in a short time period

Note:
    - Results may vary between calls due to changing web content (non-idempotent)
    - Search engines may return different results over time
    - Some websites may be inaccessible or block automated access
    - Processing time increases with num_results and number_of_scrolls
    - Consider using smartscraper on specific URLs if you know the target sites

user_prompt:string* • num_results:any • number_of_scrolls:any
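
The cost and timing figures quoted for searchscraper reduce to simple arithmetic. The sketch below works them out; the function names are hypothetical, and the constants (10 credits per website, 5-10 extra seconds per scroll per page, and the documented ranges) are copied from the description above rather than from any published API.

```python
from typing import Tuple

CREDITS_PER_WEBSITE = 10        # "Each website costs 10 credits"
SECONDS_PER_SCROLL = (5, 10)    # "adds 5-10 seconds per scroll per page"


def searchscraper_credits(num_results: int = 3) -> int:
    """Total credits for one search: num_results x 10, within the 1-20 range."""
    if not 1 <= num_results <= 20:
        raise ValueError("num_results must be between 1 and 20")
    return num_results * CREDITS_PER_WEBSITE


def extra_scroll_seconds(num_results: int = 3, number_of_scrolls: int = 0) -> Tuple[int, int]:
    """Rough (min, max) added processing time caused by scrolling."""
    if not 0 <= number_of_scrolls <= 10:
        raise ValueError("number_of_scrolls must be between 0 and 10")
    low, high = SECONDS_PER_SCROLL
    scrolls = num_results * number_of_scrolls
    return scrolls * low, scrolls * high


print(searchscraper_credits())        # default of 3 websites
print(searchscraper_credits(10))      # the "extensive analysis" example
print(extra_scroll_seconds(3, 2))     # 3 websites, 2 scrolls each
```

With the defaults this prints 30, then 100 for ten websites, then (30, 60) as the added seconds for six scrolls in total.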

smartcrawler_initiate

Initiate an asynchronous multi-page web crawling operation with AI extraction or markdown conversion.

This tool starts an intelligent crawler that discovers and processes multiple pages from a starting URL. Choose between AI Extraction Mode (10 credits/page) for structured data or Markdown Mode (2 credits/page) for content conversion. The operation is asynchronous - use smartcrawler_fetch_results to retrieve results. Creates a new crawl request (non-idempotent, non-read-only).

SmartCrawler supports two modes:
    - AI Extraction Mode: Extracts structured data based on your prompt from every crawled page
    - Markdown Conversion Mode: Converts each page to clean markdown format

Args:
    url (str): The starting URL to begin crawling from.
        - Must include protocol (http:// or https://)
        - The crawler will discover and process linked pages from this starting point
        - Should be a page with links to other pages you want to crawl
        - Examples:
            * https://docs.example.com (documentation site root)
            * https://blog.company.com (blog homepage)
            * https://example.com/products (product category page)
            * https://news.site.com/category/tech (news section)
        - Best practices:
            * Use homepage or main category pages as starting points
            * Ensure the starting page has links to content you want to crawl
            * Consider site structure when choosing the starting URL

    prompt (Optional[str]): AI prompt for data extraction.
        - REQUIRED when extraction_mode is 'ai'
        - Ignored when extraction_mode is 'markdown'
        - Describes what data to extract from each crawled page
        - Applied consistently across all discovered pages
        - Examples:
            * "Extract API endpoint name, method, parameters, and description"
            * "Get article title, author, publication date, and summary"
            * "Find product name, price, description, and availability"
            * "Extract job title, company, location, salary, and requirements"
        - Tips for better results:
            * Be specific about fields you want from each page
            * Consider that different pages may have different content structures
            * Use general terms that apply across multiple page types

    extraction_mode (str): Extraction mode for processing crawled pages.
        - Default: "ai"
        - Options:
            * "ai": AI-powered structured data extraction (10 credits per page)
                - Uses the prompt to extract specific data from each page
                - Returns structured JSON data
                - More expensive but provides targeted information
                - Best for: Data collection, research, structured analysis
            * "markdown": Simple markdown conversion (2 credits per page)
                - Converts each page to clean markdown format
                - No AI processing, just content conversion
                - More cost-effective for content archival
                - Best for: Documentation backup, content migration, reading
        - Cost comparison:
            * AI mode: 50 pages = 500 credits
            * Markdown mode: 50 pages = 100 credits

    depth (Optional[int]): Maximum depth of link traversal from the starting URL.
        - Default: unlimited (will follow links until max_pages or no more links)
        - Depth levels:
            * 0: Only the starting URL (no link following)
            * 1: Starting URL + pages directly linked from it
            * 2: Starting URL + direct links + links from those pages
            * 3+: Continues following links to specified depth
        - Examples:
            * 1: Crawl blog homepage + all blog posts
            * 2: Crawl docs homepage + category pages + individual doc pages
            * 3: Deep crawling for comprehensive site coverage
        - Considerations:
            * Higher depth can lead to exponential page growth
            * Use with max_pages to control scope and cost
            * Consider site structure when setting depth

    max_pages (Optional[int]): Maximum number of pages to crawl in total.
        - Default: unlimited (will crawl until no more links or depth limit)
        - Recommended ranges:
            * 10-20: Testing and small sites
            * 50-100: Medium sites and focused crawling
            * 200-500: Large sites and comprehensive analysis
            * 1000+: Enterprise-level crawling (high cost)
        - Cost implications:
            * AI mode: max_pages × 10 credits
            * Markdown mode: max_pages × 2 credits
        - Examples:
            * 10: Quick site sampling (20-100 credits)
            * 50: Standard documentation crawl (100-500 credits)
            * 200: Comprehensive site analysis (400-2000 credits)
        - Note: Crawler stops when this limit is reached, regardless of remaining links

    same_domain_only (Optional[bool]): Whether to crawl only within the same domain.
        - Default: true (recommended for most use cases)
        - Options:
            * true: Only crawl pages within the same domain as starting URL
                - Prevents following external links
                - Keeps crawling focused on the target site
                - Reduces risk of crawling unrelated content
                - Example: Starting at docs.example.com only crawls docs.example.com pages
            * false: Allow crawling external domains
                - Follows links to other domains
                - Can lead to very broad crawling scope
                - May crawl unrelated or unwanted content
                - Use with caution and appropriate max_pages limit
        - Recommendations:
            * Use true for focused site crawling
            * Use false only when you specifically need cross-domain data
            * Always set max_pages when using false to prevent runaway crawling

Returns:
    Dictionary containing:
        - request_id: Unique identifier for this crawl operation (use with smartcrawler_fetch_results)
        - status: Initial status of the crawl request ("initiated" or "processing")
        - estimated_cost: Estimated credit cost based on parameters (actual cost may vary)
        - crawl_parameters: Summary of the crawling configuration
        - estimated_time: Rough estimate of processing time
        - next_steps: Instructions for retrieving results

Raises:
    ValueError: If URL is malformed, prompt is missing for AI mode, or parameters are invalid
    HTTPError: If the starting URL cannot be accessed
    RateLimitError: If too many crawl requests are initiated too quickly

Note:
    - This operation is asynchronous and may take several minutes to complete
    - Use smartcrawler_fetch_results with the returned request_id to get results
    - Keep polling smartcrawler_fetch_results until status is "completed"
    - Actual pages crawled may be less than max_pages if fewer links are found
    - Processing time increases with max_pages, depth, and extraction_mode complexity

url:string* • prompt:any • extraction_mode:string • depth:any • max_pages:any • same_domain_only:any
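
Because both depth and max_pages default to unlimited, it can help to bound a crawl and estimate its worst-case cost before starting it. The following sketch assumes hypothetical helper names (build_crawl_args, max_crawl_credits); the per-page prices and the "prompt required in AI mode" rule come from the description above.

```python
from typing import Any, Dict, Optional

CREDITS_PER_PAGE = {"ai": 10, "markdown": 2}  # per-page costs from the description


def build_crawl_args(
    url: str,
    prompt: Optional[str] = None,
    extraction_mode: str = "ai",
    depth: Optional[int] = None,
    max_pages: Optional[int] = None,
    same_domain_only: bool = True,
) -> Dict[str, Any]:
    """Hypothetical helper: assemble smartcrawler_initiate arguments."""
    if extraction_mode not in CREDITS_PER_PAGE:
        raise ValueError("extraction_mode must be 'ai' or 'markdown'")
    if extraction_mode == "ai" and not prompt:
        raise ValueError("prompt is required when extraction_mode is 'ai'")
    if not same_domain_only and max_pages is None:
        # The description recommends always capping cross-domain crawls.
        raise ValueError("set max_pages when same_domain_only is false")

    args: Dict[str, Any] = {
        "url": url,
        "extraction_mode": extraction_mode,
        "same_domain_only": same_domain_only,
    }
    if extraction_mode == "ai":
        args["prompt"] = prompt  # ignored by the tool in markdown mode
    if depth is not None:
        args["depth"] = depth
    if max_pages is not None:
        args["max_pages"] = max_pages
    return args


def max_crawl_credits(max_pages: Optional[int], extraction_mode: str = "ai") -> Optional[int]:
    """Upper bound on credits, or None when the crawl is unbounded."""
    if max_pages is None:
        return None
    return max_pages * CREDITS_PER_PAGE[extraction_mode]


print(max_crawl_credits(50, "ai"), max_crawl_credits(50, "markdown"))
```

The two printed values match the cost comparison in the description: 500 credits in AI mode and 100 in markdown mode for 50 pages. Treating an uncapped cross-domain crawl as an error is a choice made in this sketch, stricter than the tool's stated recommendation.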

smartcrawler_fetch_results

Retrieve the results of an asynchronous SmartCrawler operation.

This tool fetches the results from a previously initiated crawling operation using the request_id. The crawl request processes asynchronously in the background. Keep polling this endpoint until the status field indicates 'completed'. While processing, you'll receive status updates. Read-only operation that safely retrieves results without side effects.

Args:
    request_id: The unique request ID returned by smartcrawler_initiate. Use this to retrieve the crawling results. Keep polling until status is 'completed'. Example: 'req_abc123xyz'

Returns:
    Dictionary containing:
        - status: Current status of the crawl operation ('processing', 'completed', 'failed')
        - results: Crawled data (structured extraction or markdown) when completed
        - metadata: Information about processed pages, URLs visited, and processing statistics

Keep polling until status is 'completed' to get final results

request_id:string*
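
The "keep polling until completed" instruction can be sketched as a bounded loop. This is an illustration only: poll_crawl_results is a hypothetical helper, and fetch stands in for whatever function your MCP client uses to call smartcrawler_fetch_results. The interval and attempt limit are assumptions, not documented values.

```python
import time
from typing import Any, Callable, Dict


def poll_crawl_results(
    fetch: Callable[[str], Dict[str, Any]],
    request_id: str,
    interval_s: float = 5.0,      # assumed polling interval
    max_attempts: int = 60,       # assumed cap so the loop cannot run forever
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch(request_id) until the documented status is 'completed'."""
    for _ in range(max_attempts):
        result = fetch(request_id)
        status = result.get("status")
        if status == "completed":
            return result
        if status == "failed":
            raise RuntimeError(f"crawl {request_id} failed: {result}")
        sleep(interval_s)  # still 'processing'; wait before the next poll
    raise TimeoutError(f"crawl {request_id} not completed after {max_attempts} polls")


# Usage with a fake fetch function that completes on the third poll.
responses = iter([
    {"status": "processing"},
    {"status": "processing"},
    {"status": "completed", "results": ["page 1", "page 2"]},
])
final = poll_crawl_results(lambda _id: next(responses), "req_abc123xyz", sleep=lambda s: None)
print(final["results"])
```

Passing fetch and sleep in as parameters keeps the loop testable without a live crawl and independent of any particular MCP client library.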

Prompts (2)

web_scraping_guide

A comprehensive guide to using ScrapeGraph's web scraping tools effectively. This prompt provides examples and best practices for each tool in the ScrapeGraph MCP server.

quick_start_examples

Quick start examples for common ScrapeGraph use cases. Ready-to-use examples for immediate productivity.

Resources (4)

api_status

Current status and capabilities of the ScrapeGraph API server. Provides real-time information about available tools, credit usage, and server health.

URI: scrapegraph://api/status • MIME: text/plain

common_use_cases

Common use cases and example implementations for ScrapeGraph tools. Real-world examples with expected inputs and outputs.

URI: scrapegraph://examples/use-cases • MIME: text/plain

parameter_reference_guide

Comprehensive parameter reference guide for all ScrapeGraph MCP tools. Complete documentation of every parameter with examples, constraints, and best practices.

URI: scrapegraph://parameters/reference • MIME: text/plain

tool_comparison_guide

Detailed comparison of ScrapeGraph tools to help choose the right tool for each task. Decision matrix and feature comparison across all available tools.

URI: scrapegraph://tools/comparison • MIME: text/plain

Security

Scanned 4 month(s) ago

Risk Level

LOW

Reads external data but no write or execute capability

Trust Score

D (45/100)

4/17 checks passed

Scores are informational only and provided “as is” without warranty. AgentHotspot assumes no liability for actions taken based on these ratings.

Quick Stats

Service Type: MCP

Pricing Model: Unknown

Capabilities: 8 Tools / 2 Prompts / 4 Resources

Owner: ScrapeGraphAI

Category: Analytics & Monitoring

Dependencies: Standalone
