Integrate Bright Data's powerful web scraping and data extraction capabilities into your Haystack pipelines. This package provides three Haystack components for:
- 🔍 SERP API - Search engine results from Google, Bing, Yahoo, and more
- 🌐 Web Unlocker - Access geo-restricted and bot-protected websites
- 📊 Web Scraper - Extract structured data from 43+ supported websites

Features:
- Seamless Haystack Integration - Works natively with Haystack 2.0+ pipelines
- 43+ Supported Datasets - Extract data from Amazon, LinkedIn, Instagram, Facebook, TikTok, YouTube, and more
- Geo-Targeting - Access content from specific countries
- Anti-Bot Bypass - Automatically handle CAPTCHAs and bot detection
- Structured Data - Get clean, structured JSON data ready for RAG pipelines
- Async Support - Built-in async support for high-performance applications

pip install haystack-brightdata

- Get your Bright Data API key from https://brightdata.com/cp/api_access
- Set the environment variable:
export BRIGHT_DATA_API_KEY="your-api-key-here"

from haystack_brightdata import BrightDataSERP

# Initialize the component
serp = BrightDataSERP()

# Execute a search
result = serp.run(
    query="Haystack AI framework tutorials",
    num_results=10,
    country="us"
)
print(result["results"])  # Parsed results as a JSON string

from haystack_brightdata import BrightDataUnlocker

# Initialize the component
unlocker = BrightDataUnlocker()

# Access a restricted website
result = unlocker.run(
    url="https://example.com",
    country="gb",
    output_format="markdown"
)
print(result["content"])  # Clean markdown content

from haystack_brightdata import BrightDataWebScraper

# Initialize the component
scraper = BrightDataWebScraper()

# Extract Amazon product data
result = scraper.run(
    dataset="amazon_product",
    url="https://www.amazon.com/dp/B08N5WRWNW"
)
print(result["data"])  # Structured data as a JSON string

from haystack import Pipeline
from haystack_brightdata import BrightDataSERP

# Create a pipeline
pipeline = Pipeline()
pipeline.add_component("search", BrightDataSERP())

# Run the pipeline
result = pipeline.run({
    "search": {
        "query": "Python web scraping",
        "num_results": 20
    }
})
print(result["search"]["results"])

BrightDataSERP executes search queries across multiple search engines with geo-targeting and result parsing.
Parameters:
- bright_data_api_key (Optional[str]): API key (defaults to BRIGHT_DATA_API_KEY env var)
- zone (str): Bright Data zone name (default: "serp")
- default_search_engine (str): Default search engine (default: "google")
- default_country (str): Default country code (default: "us")
- default_language (str): Default language code (default: "en")
- default_num_results (int): Default number of results (default: 10)
Outputs:
- results (str): Search results as JSON string (when parse_results=True, default) or raw HTML
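
Because `results` arrives as a string, downstream code usually has to decode it first. The sketch below shows one way to do that with the standard library, falling back to the raw string when the payload is not JSON (for example, raw HTML). The helper name `parse_serp_results` and the shape of the sample payload are assumptions made for illustration; they are not part of this package.

```python
import json
from typing import Any, Dict


def parse_serp_results(result: Dict[str, Any]) -> Any:
    """Decode the `results` output into Python objects.

    Returns the raw value unchanged when it is not valid JSON,
    which covers the raw-HTML case.
    """
    raw = result["results"]
    try:
        return json.loads(raw)
    except (TypeError, ValueError):
        return raw


# Stand-in for the dict returned by `serp.run(...)`.
# The keys inside the JSON payload are invented for this example.
sample = {"results": '{"organic": [{"title": "Haystack", "link": "https://example.com"}]}'}
parsed = parse_serp_results(sample)
print(parsed["organic"][0]["title"])
```

Check the actual payload returned for your zone and search engine before relying on any particular key.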
BrightDataUnlocker accesses geo-restricted and bot-protected websites with automatic CAPTCHA solving.
Parameters:
- bright_data_api_key (Optional[str]): API key (defaults to BRIGHT_DATA_API_KEY env var)
- zone (str): Bright Data zone name (default: "unlocker")
- default_country (str): Default country code (default: "us")
- default_output_format (str): Default output format - html, markdown, or screenshot (default: "html")
Outputs:
- content (str): Web page content in the specified format
BrightDataWebScraper extracts structured data from 43+ supported websites.
Parameters:
- bright_data_api_key (Optional[str]): API key (defaults to BRIGHT_DATA_API_KEY env var)
- default_include_errors (bool): Include errors in output (default: False)
Outputs:
- data (str): Structured data as JSON string
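
Since `data` is a JSON string, it may decode to a single object or to a list of objects depending on the dataset. A minimal sketch for normalizing both cases into a list of records follows. The helper name `load_records` and the sample fields are hypothetical, and the real record layout for each dataset is not documented here.

```python
import json
from typing import Any, Dict, List


def load_records(result: Dict[str, str]) -> List[Any]:
    """Decode the `data` output and always return a list of records."""
    payload = json.loads(result["data"])
    # Wrap a single object so callers can iterate uniformly.
    return payload if isinstance(payload, list) else [payload]


# Stand-in for the dict returned by `scraper.run(...)`; fields are invented.
sample = {"data": '{"title": "Example product", "price": 19.99}'}
for record in load_records(sample):
    print(record["title"])
```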
Helper Methods:
# Get all supported datasets
datasets = BrightDataWebScraper.get_supported_datasets()

# Get info about a specific dataset
info = BrightDataWebScraper.get_dataset_info("amazon_product")

- Amazon: Products, Reviews, Search, Bestsellers
- Walmart: Products, Seller
- eBay, Home Depot, Zara, Etsy, Best Buy
- LinkedIn: Person Profile, Company Profile, Job Listings, Posts, People Search
- Instagram: Profiles, Posts, Reels, Comments
- Facebook: Posts, Marketplace, Company Reviews, Events
- TikTok: Profiles, Posts, Shop, Comments
- YouTube: Profiles, Videos, Comments
- X/Twitter: Posts
- Reddit: Posts
- Crunchbase, ZoomInfo
- Google Maps Reviews, Google Shopping, Google Play Store
- Apple App Store, Zillow, Booking.com
- GitHub, Yahoo Finance, Reuters

serp = BrightDataSERP(zone="my_custom_serp_zone")

result = serp.run(
    query="local restaurants",
    country="fr",  # France
    language="fr",
    num_results=20
)

# Get as markdown
markdown = unlocker.run(url="https://example.com", output_format="markdown")

# Get as screenshot
screenshot = unlocker.run(url="https://example.com", output_format="screenshot")

# LinkedIn people search
result = scraper.run(
    dataset="linkedin_people_search",
    url="https://www.linkedin.com",
    first_name="John",
    last_name="Doe"
)

# Google Maps reviews (last 7 days)
result = scraper.run(
    dataset="google_maps_reviews",
    url="https://www.google.com/maps/place/...",
    days_limit="7"
)

- BRIGHT_DATA_API_KEY - Your Bright Data API key (required)
- REQUESTS_CA_BUNDLE - Custom CA bundle for corporate proxies (optional)
- SSL_CERT_FILE - Alternative SSL certificate file (optional)
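
Since `BRIGHT_DATA_API_KEY` is required, it can help to fail early with a clear message before any component is created. The following is a minimal sketch of such a check using only the standard library. The helper `require_api_key` is hypothetical and not provided by this package; how the components themselves react to a missing key is not specified here.

```python
import os
from typing import Mapping, Optional


def require_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key from the environment or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get("BRIGHT_DATA_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "BRIGHT_DATA_API_KEY is not set; export it before creating components."
        )
    return key


# Demonstration with an explicit mapping instead of the real environment.
print(require_api_key({"BRIGHT_DATA_API_KEY": "your-api-key-here"}))
```

Calling `require_api_key()` with no argument reads from `os.environ`.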
- Python >= 3.8
- haystack-ai >= 2.0.0
- pydantic >= 2.0.0
- requests >= 2.28.0
- aiohttp >= 3.8.0
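
To see which of these requirements are present in the current environment, something like the sketch below may be useful. It relies only on the standard library (`importlib.metadata`, available from Python 3.8) and reports installed versions; it does not compare them against the minimums listed above. The helper name `installed_versions` is an assumption for illustration.

```python
import sys
from importlib import metadata
from typing import Dict, Iterable, Optional

# Distribution names taken from the requirements list above.
REQUIRED = ("haystack-ai", "pydantic", "requests", "aiohttp")


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if missing."""
    found: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


print("Python >= 3.8:", sys.version_info >= (3, 8))
for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'not installed'}")
```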
Check out the examples directory for more detailed examples:
- example_serp.py - SERP API examples
- example_unlocker.py - Web Unlocker examples
- example_scraper.py - Web Scraper examples
- example_pipeline.py - Pipeline integration examples
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- Issues: GitHub Issues
- Bright Data Support: support@brightdata.com
- Haystack Community: Haystack Discord
- Built for Haystack by deepset
- Powered by Bright Data
Note: You need a valid Bright Data subscription to use this package. Get started at brightdata.com.