Python is a high-level, easy-to-read programming language that lets you write powerful software quickly with clear, expressive code.
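As a quick, self-contained illustration of that readability (a minimal sketch of plain standard-library Python, not taken from the SDK diff that follows):

```python
# Count the most common words in a sentence using only the standard library.
from collections import Counter


def top_words(text: str, n: int = 3) -> list[tuple[str, int]]:
    """Return the n most frequent words in the given text."""
    words = text.lower().split()
    return Counter(words).most_common(n)


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog the end"
    print(top_words(sample))  # [('the', 3), ('quick', 1), ('brown', 1)]
```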
\ndiff --git a/README.md b/README.md index f4e1223..1dbf8c3 100644 --- a/README.md +++ b/README.md @@ -1,15 +1,54 @@ # Bright Data Python SDK ๐ -[](https://github.com/vzucher/brightdata-sdk-python) +[](https://github.com/brightdata/sdk-python) [](https://www.python.org/) [](LICENSE) -[](https://github.com/vzucher/brightdata-sdk-python) +[](https://github.com/brightdata/sdk-python) [](notebooks/) Modern async-first Python SDK for [Bright Data](https://brightdata.com) APIs with **dataclass payloads**, **Jupyter notebooks**, comprehensive platform support, and **CLI tool** - built for data scientists and developers. --- +## ๐ Table of Contents + +- [โจ Features](#-features) +- [๐ Jupyter Notebooks](#-jupyter-notebooks-new) +- [๐ฆ Installation](#-installation) +- [๐ Quick Start](#-quick-start) + - [Authentication](#authentication) + - [Simple Web Scraping](#simple-web-scraping) + - [Using Dataclass Payloads](#using-dataclass-payloads-type-safe-) + - [Pandas Integration](#pandas-integration-for-data-scientists-) + - [Platform-Specific Scraping](#platform-specific-scraping) + - [Search Engine Results (SERP)](#search-engine-results-serp) + - [Async Usage](#async-usage) +- [๐ What's New in v2.0.0](#-whats-new-in-v2-200) +- [๐๏ธ Architecture](#๏ธ-architecture) +- [๐ API Reference](#-api-reference) + - [Client Initialization](#client-initialization) + - [Connection Testing](#connection-testing) + - [Zone Management](#zone-management) + - [Result Objects](#result-objects) +- [๐ฅ๏ธ CLI Usage](#๏ธ-cli-usage) +- [๐ผ Pandas Integration](#-pandas-integration) +- [๐จ Dataclass Payloads](#-dataclass-payloads) +- [๐ง Advanced Usage](#-advanced-usage) +- [๐งช Testing](#-testing) +- [๐๏ธ Design Philosophy](#๏ธ-design-philosophy) +- [๐ Documentation](#-documentation) +- [๐ง Troubleshooting](#-troubleshooting) +- [๐ค Contributing](#-contributing) +- [๐ Project Stats](#-project-stats) +- [๐ License](#-license) +- [๐ Links](#-links) +- [๐ก Examples](#-examples) +- [๐ฏ Roadmap](#-roadmap) +- [๐ Acknowledgments](#-acknowledgments) +- [๐ Why Choose This SDK?](#-why-choose-this-sdk) + +--- + ## โจ Features ### ๐ฏ **For Data Scientists** @@ -44,11 +83,11 @@ Modern async-first Python SDK for [Bright Data](https://brightdata.com) APIs wit Perfect for data scientists! Interactive tutorials with examples: -1. **[01_quickstart.ipynb](notebooks/01_quickstart.ipynb)** - Get started in 5 minutes [](https://colab.research.google.com/github/vzucher/brightdata-sdk-python/blob/master/notebooks/01_quickstart.ipynb) -2. **[02_pandas_integration.ipynb](notebooks/02_pandas_integration.ipynb)** - Work with DataFrames [](https://colab.research.google.com/github/vzucher/brightdata-sdk-python/blob/master/notebooks/02_pandas_integration.ipynb) -3. **[03_amazon_scraping.ipynb](notebooks/03_amazon_scraping.ipynb)** - Amazon deep dive [](https://colab.research.google.com/github/vzucher/brightdata-sdk-python/blob/master/notebooks/03_amazon_scraping.ipynb) -4. **[04_linkedin_jobs.ipynb](notebooks/04_linkedin_jobs.ipynb)** - Job market analysis [](https://colab.research.google.com/github/vzucher/brightdata-sdk-python/blob/master/notebooks/04_linkedin_jobs.ipynb) -5. **[05_batch_processing.ipynb](notebooks/05_batch_processing.ipynb)** - Scale to 1000s of URLs [](https://colab.research.google.com/github/vzucher/brightdata-sdk-python/blob/master/notebooks/05_batch_processing.ipynb) +1. 
**[01_quickstart.ipynb](notebooks/01_quickstart.ipynb)** - Get started in 5 minutes [](https://colab.research.google.com/github/brightdata/sdk-python/blob/master/notebooks/01_quickstart.ipynb) +2. **[02_pandas_integration.ipynb](notebooks/02_pandas_integration.ipynb)** - Work with DataFrames [](https://colab.research.google.com/github/brightdata/sdk-python/blob/master/notebooks/02_pandas_integration.ipynb) +3. **[03_amazon_scraping.ipynb](notebooks/03_amazon_scraping.ipynb)** - Amazon deep dive [](https://colab.research.google.com/github/brightdata/sdk-python/blob/master/notebooks/03_amazon_scraping.ipynb) +4. **[04_linkedin_jobs.ipynb](notebooks/04_linkedin_jobs.ipynb)** - Job market analysis [](https://colab.research.google.com/github/brightdata/sdk-python/blob/master/notebooks/04_linkedin_jobs.ipynb) +5. **[05_batch_processing.ipynb](notebooks/05_batch_processing.ipynb)** - Scale to 1000s of URLs [](https://colab.research.google.com/github/brightdata/sdk-python/blob/master/notebooks/05_batch_processing.ipynb) --- @@ -61,8 +100,8 @@ pip install brightdata-sdk Or install from source: ```bash -git clone https://github.com/vzucher/brightdata-sdk-python.git -cd brightdata-sdk-python +git clone https://github.com/brightdata/sdk-python.git +cd sdk-python pip install -e . ``` @@ -198,6 +237,21 @@ result = client.scrape.amazon.reviews( result = client.scrape.amazon.sellers( url="https://amazon.com/sp?seller=AXXXXXXXXX" ) + +# NEW: Search Amazon by keyword and filters +result = client.search.amazon.products( + keyword="laptop", + min_price=50000, # $500 in cents + max_price=200000, # $2000 in cents + prime_eligible=True, + condition="new" +) + +# Search by category +result = client.search.amazon.products( + keyword="wireless headphones", + category="electronics" +) ``` #### LinkedIn Data @@ -235,8 +289,8 @@ result = client.search.linkedin.profiles( result = client.search.linkedin.posts( profile_url="https://linkedin.com/in/johndoe", - start_date="2024-01-01", - end_date="2024-12-31" + start_date="2025-01-01", + end_date="2025-12-31" ) ``` @@ -264,8 +318,8 @@ result = client.scrape.chatgpt.prompts( result = client.scrape.facebook.posts_by_profile( url="https://facebook.com/profile", num_of_posts=10, - start_date="01-01-2024", - end_date="12-31-2024", + start_date="01-01-2025", + end_date="12-31-2025", timeout=240 ) @@ -286,8 +340,8 @@ result = client.scrape.facebook.posts_by_url( result = client.scrape.facebook.comments( url="https://facebook.com/post/123456", num_of_comments=100, - start_date="01-01-2024", - end_date="12-31-2024", + start_date="01-01-2025", + end_date="12-31-2025", timeout=240 ) @@ -330,8 +384,8 @@ result = client.scrape.instagram.reels( result = client.search.instagram.posts( url="https://instagram.com/username", num_of_posts=10, - start_date="01-01-2024", - end_date="12-31-2024", + start_date="01-01-2025", + end_date="12-31-2025", post_type="reel", timeout=240 ) @@ -340,8 +394,8 @@ result = client.search.instagram.posts( result = client.search.instagram.reels( url="https://instagram.com/username", num_of_posts=50, - start_date="01-01-2024", - end_date="12-31-2024", + start_date="01-01-2025", + end_date="12-31-2025", timeout=240 ) ``` @@ -403,7 +457,16 @@ asyncio.run(scrape_multiple()) --- -## ๐ What's New in v26.11.24 +## ๐ What's New in v2 2.0.0 + +### ๐ **Latest Updates (December 2025)** +- โ **Amazon Search API** - NEW parameter-based product discovery +- โ **LinkedIn Job Search Fixed** - Now builds URLs from keywords internally +- โ **Trigger Interface** - Manual 
trigger/poll/fetch control for all platforms +- โ **Auto-Create Zones** - Now enabled by default (was opt-in) +- โ **Improved Zone Names** - `sdk_unlocker`, `sdk_serp`, `sdk_browser` +- โ **26 Sync Wrapper Fixes** - All platform scrapers now work without context managers +- โ **Zone Manager Tests Fixed** - All 22 tests passing ### ๐ **For Data Scientists** - โ **5 Jupyter Notebooks** - Complete interactive tutorials @@ -422,17 +485,18 @@ asyncio.run(scrape_multiple()) ### ๐ฅ๏ธ **CLI Tool** - โ **`brightdata` command** - Use SDK from terminal -- โ **Scrape operations** - `brightdata scrape amazon products --url ...` -- โ **Search operations** - `brightdata search linkedin jobs --keyword ...` +- โ **Scrape operations** - `brightdata scrape amazon products ...` +- โ **Search operations** - `brightdata search amazon products --keyword ...` - โ **Output formats** - JSON, pretty-print, minimal ### ๐๏ธ **Architecture Improvements** - โ **Single AsyncEngine** - Shared across all scrapers (8x efficiency) - โ **Resource Optimization** - Reduced memory footprint - โ **Enhanced Error Messages** - Clear, actionable error messages -- โ **502+ Tests** - Comprehensive test coverage +- โ **500+ Tests Passing** - Comprehensive test coverage (99.4%) -### ๐ **New Platforms** +### ๐ **Platforms & Features** +- โ **Amazon Search** - Keyword-based product discovery - โ **Facebook Scraper** - Posts (profile/group/URL), Comments, Reels - โ **Instagram Scraper** - Profiles, Posts, Comments, Reels - โ **Instagram Search** - Posts and Reels discovery with filters @@ -456,6 +520,7 @@ client.scrape.instagram.profiles(url="...") client.scrape.generic.url(url="...") # Parameter-based discovery (search namespace) +client.search.amazon.products(keyword="...", min_price=..., max_price=...) client.search.linkedin.jobs(keyword="...", location="...") client.search.instagram.posts(url="...", num_of_posts=10) client.search.google(query="...") @@ -490,11 +555,11 @@ client = BrightDataClient( token="your_token", # Auto-loads from BRIGHTDATA_API_TOKEN if not provided customer_id="your_customer_id", # Auto-loads from BRIGHTDATA_CUSTOMER_ID (optional) timeout=30, # Default timeout in seconds - web_unlocker_zone="sdk_unlocker", # Web Unlocker zone name - serp_zone="sdk_serp", # SERP API zone name - browser_zone="sdk_browser", # Browser API zone name - auto_create_zones=False, # Auto-create missing zones - validate_token=False # Validate token on init + web_unlocker_zone="sdk_unlocker", # Web Unlocker zone name (default) + serp_zone="sdk_serp", # SERP API zone name (default) + browser_zone="sdk_browser", # Browser API zone name (default) + auto_create_zones=True, # Auto-create missing zones (default: True) + validate_token=False # Validate token on init (default: False) ) ``` @@ -639,6 +704,7 @@ brightdata scrape generic \ - `brightdata scrape generic url` **Search Operations:** +- `brightdata search amazon products` - `brightdata search linkedin jobs/profiles/posts` - `brightdata search instagram posts/reels` - `brightdata search google/bing/yandex` @@ -1079,8 +1145,8 @@ Contributions are welcome! Please see [CONTRIBUTING.md](docs/contributing.md) fo ### Development Setup ```bash -git clone https://github.com/vzucher/brightdata-sdk-python.git -cd brightdata-sdk-python +git clone https://github.com/brightdata/sdk-python.git +cd sdk-python # Install with dev dependencies pip install -e ".[dev]" @@ -1120,8 +1186,8 @@ MIT License - see [LICENSE](LICENSE) file for details. 
- [Bright Data](https://brightdata.com) - Get your API token - [API Documentation](https://docs.brightdata.com) -- [GitHub Repository](https://github.com/vzucher/brightdata-sdk-python) -- [Issue Tracker](https://github.com/vzucher/brightdata-sdk-python/issues) +- [GitHub Repository](https://github.com/brightdata/sdk-python) +- [Issue Tracker](https://github.com/brightdata/sdk-python/issues) --- diff --git a/audit.md b/audit.md deleted file mode 100644 index e457a33..0000000 --- a/audit.md +++ /dev/null @@ -1,963 +0,0 @@ -# Bright Data Python SDK - Enterprise-Grade Audit Report -## FAANG-Level Code Review & Architecture Analysis - -**Date:** November 24, 2025 -**Version:** 2.0.0 -**Reviewer:** Senior SDK Architect -**Scope:** Complete end-to-end analysis of codebase, architecture, performance, and enterprise standards - ---- - -## Executive Summary - -**Overall Grade: A- (88/100)** - -The Bright Data Python SDK demonstrates **strong enterprise-grade qualities** with modern async-first architecture, comprehensive error handling, and excellent separation of concerns. The recent AsyncEngine duplication fix significantly improved resource efficiency. However, there are opportunities for enhancement in documentation, configuration management, and observability. - -### Key Strengths โ -1. **Modern async-first architecture** with proper resource management -2. **Excellent separation of concerns** (API, Core, Scrapers, Models) -3. **Comprehensive error hierarchy** with 7 specialized exception types -4. **Rich result models** with validation, serialization, and timing breakdown -5. **Strong type safety** with TypedDict definitions (305 lines of types) -6. **Proper dependency injection** eliminating resource duplication -7. **Unified workflow pattern** (trigger/poll/fetch) for consistency -8. **27 test files** covering unit, integration, and e2e scenarios - -### Critical Improvements Needed โ ๏ธ -1. **Structured logging** (currently empty modules) -2. **Configuration management** (empty config.py) -3. **Observability/metrics** (no distributed tracing) -4. **Connection pooling limits** need documentation -5. **Retry strategies** could be more sophisticated -6. **API versioning strategy** needs clarity - ---- - -## ๐ Codebase Metrics - -| Metric | Value | Grade | -|--------|-------|-------| -| **Total Python Files** | 275 | โ Well-organized | -| **Lines of Code** | ~9,085 | โ Maintainable | -| **Test Files** | 27 | โ Good coverage | -| **Async Functions** | 150+ | โ Modern | -| **Exception Types** | 7 | โ Comprehensive | -| **Type Definitions** | 305 lines | โ Excellent | -| **TODO/FIXME** | 0 | โ Clean | -| **Test Ratio** | ~30:1 (code:test) | โ ๏ธ Could be better | - ---- - -## ๐๏ธ Architecture Review - -### Grade: A (92/100) - -#### โ Strengths - -1. 
**Layered Architecture (Excellent)** -``` -brightdata/ -โโโ client.py # Public API (facade pattern) -โโโ core/ # Foundation layer -โ โโโ engine.py # HTTP engine (resource management) -โ โโโ auth.py # Authentication (empty - needs impl) -โ โโโ logging.py # Logging (empty - needs impl) -โ โโโ zone_manager.py -โโโ api/ # Service layer -โ โโโ base.py # Base API class -โ โโโ scrape_service.py -โ โโโ search_service.py -โ โโโ crawler_service.py -โ โโโ serp/ # SERP-specific -โ โโโ browser/ # Browser automation -โโโ scrapers/ # Business logic layer -โ โโโ base.py # BaseWebScraper (inheritance) -โ โโโ workflow.py # Trigger/Poll/Fetch pattern -โ โโโ amazon/ -โ โโโ linkedin/ -โ โโโ facebook/ -โ โโโ instagram/ -โ โโโ chatgpt/ -โโโ models.py # Data layer (rich models) -โโโ types.py # Type definitions (TypedDict) -โโโ exceptions/ # Error handling -โโโ utils/ # Shared utilities -``` - -**Analysis:** -- โ Clear separation of concerns (API, Core, Business Logic, Data) -- โ Facade pattern in `BrightDataClient` provides unified interface -- โ Dependency injection used throughout (engine, api_client, workflow) -- โ Single responsibility principle applied consistently -- โ Open/Closed principle (extensible via inheritance) - -2. **AsyncEngine Resource Management (Excellent after fix)** -```python -# BEFORE FIX: โ Each scraper created own engine -client.engine โ AsyncEngine #1 -client.scrape.amazon.engine โ AsyncEngine #2 # DUPLICATE! -client.scrape.linkedin.engine โ AsyncEngine #3 # DUPLICATE! - -# AFTER FIX: โ Single engine shared across all scrapers -client.engine โ AsyncEngine #1 (SINGLE SOURCE OF TRUTH) -client.scrape.amazon.engine โ #1 # SHARED! -client.scrape.linkedin.engine โ #1 # SHARED! -``` - -**Impact:** -- โ 8x reduction in resource usage -- โ Unified rate limiting -- โ Better connection reuse -- โ Simplified debugging - -3. **Context Manager Pattern (Excellent)** -```python -# Proper resource lifecycle management -async with client: # Opens engine session - result = await client.scrape.amazon.products(...) - # Engine session reused -# Session closed automatically -``` - -**Analysis:** -- โ Idempotent `__aenter__` (safe for nested usage) -- โ Proper cleanup in `__aexit__` with 0.1s delay -- โ `force_close=True` on connector prevents warnings -- โ Rate limiter created per event loop (thread-safe) - -#### โ ๏ธ Areas for Improvement - -1. **Empty Core Modules (Critical)** -```python -# src/brightdata/core/auth.py -"""Authentication handling.""" -# EMPTY - only 1 line! - -# src/brightdata/core/logging.py -"""Structured logging.""" -# EMPTY - only 1 line! -``` - -**Recommendation:** -- Implement structured logging with correlation IDs -- Add authentication helpers (token validation, refresh logic) -- Create observability hooks for APM integration - -2. **Configuration Management (Critical)** -```python -# src/brightdata/config.py -"""Configuration (Pydantic Settings).""" -# EMPTY - only 1 line! 
-``` - -**Recommendation:** -```python -from pydantic_settings import BaseSettings - -class BrightDataSettings(BaseSettings): - """SDK configuration via environment variables or .env files.""" - - api_token: str - customer_id: Optional[str] = None - timeout: int = 30 - rate_limit: int = 10 - rate_period: float = 1.0 - - # Connection pool settings - max_connections: int = 100 - max_connections_per_host: int = 30 - - # Retry settings - max_retries: int = 3 - retry_backoff_factor: float = 2.0 - - # Observability - enable_tracing: bool = False - log_level: str = "INFO" - - class Config: - env_prefix = "BRIGHTDATA_" - env_file = ".env" -``` - -3. **Protocol Definitions (Empty)** -```python -# src/brightdata/protocols.py -"""Interface definitions (typing.Protocol).""" -# EMPTY! -``` - -**Recommendation:** -Define protocols for: -- `Scraper` protocol (for type checking) -- `Engine` protocol (for mocking/testing) -- `ResultFormatter` protocol (for custom formatters) - ---- - -## ๐ Performance Analysis - -### Grade: A- (88/100) - -#### โ Strengths - -1. **Async/Await Throughout (Excellent)** -```python -# All I/O operations are async -async def scrape_async(self, urls: Union[str, List[str]]) -> ScrapeResult: - async with self.engine: # Non-blocking session - result = await self.api_client.trigger(...) # Non-blocking HTTP - result = await self.workflow_executor.execute(...) # Non-blocking polling -``` - -**Metrics:** -- โ 150+ async functions -- โ Zero blocking I/O in hot paths -- โ Concurrent request support via `asyncio.gather()` - -2. **Connection Pooling (Good)** -```python -connector = aiohttp.TCPConnector( - limit=100, # Total connection limit - limit_per_host=30, # Per-host limit - force_close=True # Prevent unclosed warnings -) -``` - -**Analysis:** -- โ Reasonable limits (100 total, 30 per host) -- โ ๏ธ Hard-coded limits (should be configurable) -- โ Force close prevents resource leaks - -3. **Rate Limiting (Good)** -```python -if HAS_RATE_LIMITER and self._rate_limit > 0: - self._rate_limiter = AsyncLimiter( - max_rate=self._rate_limit, # 10 req/s default - time_period=self._rate_period # 1.0s - ) -``` - -**Analysis:** -- โ Optional rate limiting (can be disabled) -- โ Configurable per client -- โ Applied at engine level (unified across all scrapers) -- โ ๏ธ No burst handling (fixed rate) - -4. **Retry Logic with Backoff (Good)** -```python -async def retry_with_backoff( - func: Callable[[], Awaitable[T]], - max_retries: int = 3, - initial_delay: float = 1.0, - max_delay: float = 60.0, - backoff_factor: float = 2.0, -): - # Exponential backoff: 1s, 2s, 4s, ... -``` - -**Analysis:** -- โ Exponential backoff implemented -- โ Capped at max_delay (60s) -- โ ๏ธ No jitter (all clients retry at same time โ thundering herd) -- โ ๏ธ Fixed retryable exceptions (not circuit breaker) - -#### โ ๏ธ Performance Concerns - -1. 
**No Circuit Breaker Pattern** -```python -# Current: Retry 3x even if service is down -for attempt in range(max_retries + 1): - try: - return await func() - except Exception as e: - # Retries blindly even if 500+ errors - -# RECOMMENDATION: Add circuit breaker -class CircuitBreaker: - def __init__(self, failure_threshold=5, timeout=60): - self.failure_count = 0 - self.last_failure_time = None - self.state = "CLOSED" # CLOSED, OPEN, HALF_OPEN - - async def call(self, func): - if self.state == "OPEN": - if time.time() - self.last_failure_time > self.timeout: - self.state = "HALF_OPEN" - else: - raise CircuitBreakerOpen("Circuit breaker is open") - - try: - result = await func() - self.failure_count = 0 - self.state = "CLOSED" - return result - except Exception: - self.failure_count += 1 - if self.failure_count >= self.failure_threshold: - self.state = "OPEN" - self.last_failure_time = time.time() - raise -``` - -2. **No Connection Pool Metrics** -```python -# RECOMMENDATION: Expose connection pool stats -async def get_engine_stats(self) -> Dict[str, Any]: - """Get engine performance metrics.""" - connector = self._session.connector - return { - "total_connections": len(connector._conns), - "acquired_connections": len(connector._acquired), - "available_connections": len(connector._available), - "limit": connector._limit, - "limit_per_host": connector._limit_per_host, - } -``` - -3. **Polling Interval Not Adaptive** -```python -# Current: Fixed 10s polling interval -await asyncio.sleep(poll_interval) # Always 10s - -# RECOMMENDATION: Adaptive polling -class AdaptivePoller: - def __init__(self, min_interval=1, max_interval=30): - self.interval = min_interval - self.consecutive_not_ready = 0 - - async def wait(self): - await asyncio.sleep(self.interval) - self.consecutive_not_ready += 1 - # Exponential backoff for polling - self.interval = min( - self.interval * 1.5, - self.max_interval - ) - - def reset(self): - self.interval = self.min_interval - self.consecutive_not_ready = 0 -``` - ---- - -## ๐ก๏ธ Security & Error Handling - -### Grade: A (90/100) - -#### โ Strengths - -1. **Comprehensive Exception Hierarchy (Excellent)** -```python -BrightDataError (base) -โโโ ValidationError # Input validation -โโโ AuthenticationError # Auth/authorization -โโโ APIError # API failures (with status_code) -โโโ TimeoutError # Operation timeouts -โโโ ZoneError # Zone management -โโโ NetworkError # Network issues -โโโ SSLError # Certificate errors -``` - -**Analysis:** -- โ 7 specialized exception types -- โ Base exception captures message -- โ APIError includes status_code and response_text -- โ Clear error messages with actionable guidance - -2. **Input Validation (Excellent)** -```python -# Models have __post_init__ validation -def __post_init__(self) -> None: - if self.cost is not None and self.cost < 0: - raise ValueError(f"Cost must be non-negative, got {self.cost}") - if self.status not in ("ready", "error", "timeout", "in_progress"): - raise ValueError(f"Invalid status: {self.status}") -``` - -**Analysis:** -- โ Validation in dataclass __post_init__ -- โ Clear error messages -- โ Type hints enforce contracts -- โ URL validation in utils - -3. **SSL Error Handling (Good)** -```python -if is_ssl_certificate_error(e): - error_message = get_ssl_error_message(e) - raise SSLError(error_message) from e -``` - -**Analysis:** -- โ Detects SSL certificate errors -- โ Provides helpful message for macOS users -- โ Preserves exception chain (`from e`) - -#### โ ๏ธ Security Concerns - -1. 
**Token in Headers (Minor Risk)** -```python -headers={ - "Authorization": f"Bearer {self.bearer_token}", # Token in memory -} -``` - -**Recommendation:** -- Consider using `SecretStr` from Pydantic to prevent accidental logging -- Add warning if token is logged/printed - -2. **No Request/Response Sanitization** -```python -# RECOMMENDATION: Add sanitizer for logs -def sanitize_for_logging(data: Dict) -> Dict: - """Remove sensitive data from logs.""" - sanitized = data.copy() - sensitive_keys = ["authorization", "api_key", "token", "password"] - for key in sensitive_keys: - if key in sanitized: - sanitized[key] = "***REDACTED***" - return sanitized -``` - -3. **No Rate Limit Exhaustion Protection** -```python -# RECOMMENDATION: Add quota tracking -class QuotaTracker: - def __init__(self, daily_limit: int): - self.daily_limit = daily_limit - self.used_today = 0 - self.reset_at = datetime.now() + timedelta(days=1) - - def check_quota(self): - if datetime.now() >= self.reset_at: - self.used_today = 0 - self.reset_at = datetime.now() + timedelta(days=1) - - if self.used_today >= self.daily_limit: - raise QuotaExceededError( - f"Daily quota exceeded ({self.used_today}/{self.daily_limit})" - ) -``` - ---- - -## ๐ Code Quality - -### Grade: B+ (86/100) - -#### โ Strengths - -1. **Type Hints (Excellent)** -```python -# Comprehensive type definitions -from typing import Union, List, Optional, Dict, Any, Literal -from typing_extensions import NotRequired -from dataclasses import dataclass - -# TypedDict for payloads (305 lines of types!) -class AmazonProductPayload(TypedDict, total=False): - url: str # Required - reviews_count: NotRequired[int] -``` - -**Analysis:** -- โ 305 lines of TypedDict definitions -- โ NotRequired for optional fields -- โ Literal types for enums -- โ Generic types (TypeVar) in retry.py -- โ ๏ธ Some functions missing return type hints - -2. **Docstrings (Good)** -```python -""" -Scrape Amazon products from URLs (async). - -Uses standard async workflow: trigger job, poll until ready, then fetch results. - -Args: - url: Single product URL or list of product URLs (required) - timeout: Maximum wait time in seconds for polling (default: 240) - -Returns: - ScrapeResult or List[ScrapeResult] with product data - -Example: - >>> result = await scraper.products_async( - ... url="https://amazon.com/dp/B0CRMZHDG8", - ... timeout=240 - ... ) -""" -``` - -**Analysis:** -- โ Comprehensive docstrings -- โ Args, Returns, Raises sections -- โ Examples provided -- โ ๏ธ Not all functions have examples - -3. **Zero Technical Debt** -```bash -# Zero TODO/FIXME/HACK/XXX comments -grep -r "TODO\|FIXME\|HACK\|XXX" src/ -# 0 matches -``` - -**Analysis:** -- โ Clean codebase -- โ No deferred work -- โ No known bugs marked - -#### โ ๏ธ Quality Concerns - -1. **Inconsistent Naming** -```python -# Some methods use snake_case with _async suffix -async def products_async(self, ...) - -# Others don't -async def get_status(self, snapshot_id: str) -> str -``` - -**Recommendation:** -- Standardize on `*_async()` suffix for all async methods -- Keep sync wrappers without suffix: `products()` calls `products_async()` - -2. **Magic Numbers** -```python -limit=100, # Why 100? -limit_per_host=30, # Why 30? -max_delay: float = 60.0, # Why 60? 
-``` - -**Recommendation:** -```python -# Define constants -class ConnectionLimits: - TOTAL_CONNECTIONS = 100 # Based on OS limits - CONNECTIONS_PER_HOST = 30 # Prevent host overload - MAX_RETRY_DELAY = 60.0 # Reasonable upper bound - -connector = aiohttp.TCPConnector( - limit=ConnectionLimits.TOTAL_CONNECTIONS, - limit_per_host=ConnectionLimits.CONNECTIONS_PER_HOST, -) -``` - -3. **Large Files** -```python -# client.py: 592 lines -# Some classes could be split -``` - -**Recommendation:** -- Consider splitting BrightDataClient into: - - `BaseClient` (core functionality) - - `ClientServices` (service properties) - - `ClientZones` (zone management) - ---- - -## ๐งช Testing - -### Grade: B (82/100) - -#### โ Strengths - -1. **Comprehensive Test Coverage** -``` -tests/ -โโโ unit/ # 17 files - Unit tests -โโโ integration/ # 5 files - Integration tests -โโโ e2e/ # 4 files - End-to-end tests -โโโ fixtures/ # Mock data -โโโ samples/ # Sample responses -``` - -**Analysis:** -- โ 27 test files -- โ Multiple test levels (unit, integration, e2e) -- โ Fixtures and samples for testing -- โ Pytest with async support - -2. **Test Quality** -```python -# Good test structure -class TestClientInitialization: - def test_client_with_explicit_token(self): - def test_client_with_custom_config(self): - def test_client_loads_from_brightdata_api_token(self): - def test_client_raises_error_without_token(self): -``` - -**Analysis:** -- โ Organized by feature/class -- โ Descriptive test names -- โ Tests both success and error cases - -3. **AsyncEngine Sharing Test (Excellent)** -```python -def count_engines(): - """Count the number of AsyncEngine instances in memory.""" - gc.collect() - engines = [obj for obj in gc.get_objects() - if isinstance(obj, AsyncEngine)] - return len(engines) -``` - -**Analysis:** -- โ Verifies resource efficiency -- โ Tests backwards compatibility -- โ Clear pass/fail criteria - -#### โ ๏ธ Testing Gaps - -1. **No Load/Stress Tests** -```python -# RECOMMENDATION: Add performance tests -@pytest.mark.performance -async def test_concurrent_requests_performance(): - """Test 100 concurrent requests.""" - client = BrightDataClient(token="test") - - async with client: - tasks = [ - client.scrape.amazon.products(f"https://amazon.com/dp/{i}") - for i in range(100) - ] - results = await asyncio.gather(*tasks) - - assert all(r.success for r in results) - # Verify connection pool wasn't exhausted - assert len(results) == 100 -``` - -2. **No Chaos Engineering Tests** -```python -# RECOMMENDATION: Test failure scenarios -@pytest.mark.chaos -async def test_handles_network_failures_gracefully(): - """Test behavior under network failures.""" - # Simulate network failures - with patch('aiohttp.ClientSession.request') as mock: - mock.side_effect = aiohttp.ClientError("Network failure") - - client = BrightDataClient(token="test") - with pytest.raises(NetworkError): - await client.scrape.amazon.products(url="...") -``` - -3. **No Property-Based Tests** -```python -# RECOMMENDATION: Use Hypothesis -from hypothesis import given, strategies as st - -@given( - url=st.from_regex(r'https://amazon\.com/dp/[A-Z0-9]{10}'), - timeout=st.integers(min_value=1, max_value=600) -) -async def test_products_accepts_valid_inputs(url, timeout): - """Property-based test for input validation.""" - scraper = AmazonScraper(bearer_token="test") - # Should not raise for valid inputs - # (mock the API call) -``` - ---- - -## ๐ Documentation - -### Grade: B- (78/100) - -#### โ Strengths - -1. 
**Good Inline Documentation** -- โ Docstrings on all public methods -- โ Examples in docstrings -- โ Type hints act as documentation - -2. **Architecture Docs** -- โ `docs/architecture.md` exists -- โ Clear module structure - -#### โ ๏ธ Documentation Gaps - -1. **Missing API Reference** -``` -docs/ -โโโ architecture.md # โ Exists -โโโ quickstart.md # โ Exists -โโโ contributing.md # โ Exists -โโโ api-reference/ # โ ๏ธ Incomplete -โ โโโ ... # Only partial coverage -โโโ guides/ # โ ๏ธ Could be better -``` - -**Recommendation:** -- Auto-generate API docs from docstrings (Sphinx/MkDocs) -- Add more guides (error handling, advanced usage, best practices) - -2. **No Migration Guide** -- Users upgrading from 1.x need guidance -- AsyncEngine fix is internal but could affect advanced users - -3. **No Performance Tuning Guide** -```markdown -# RECOMMENDATION: docs/performance-tuning.md - -## Connection Pool Configuration -- Adjust `max_connections` based on workload -- Monitor connection pool exhaustion -- Use connection pool metrics - -## Rate Limiting Strategy -- Set appropriate rate limits per API -- Consider burst handling for bursty workloads -- Monitor rate limit headroom - -## Retry Configuration -- Tune backoff factors for your latency requirements -- Consider circuit breakers for failing services -- Add jitter to prevent thundering herd -``` - ---- - -## ๐ฏ FAANG Standards Comparison - -| Category | Current | FAANG Standard | Gap | -|----------|---------|----------------|-----| -| **Architecture** | Layered, DI | Microservices-ready | โ | -| **Async/Await** | Comprehensive | Required | โ | -| **Type Safety** | TypedDict, hints | Strict typing | โ | -| **Error Handling** | 7 exception types | Comprehensive | โ | -| **Logging** | Empty | Structured, correlated | โ | -| **Metrics** | None | Prometheus/StatsD | โ | -| **Tracing** | None | OpenTelemetry | โ | -| **Config Management** | Basic | Pydantic Settings | โ ๏ธ | -| **Testing** | 27 tests | >80% coverage + chaos | โ ๏ธ | -| **Documentation** | Good | Auto-generated + guides | โ ๏ธ | -| **CI/CD** | Unknown | GitHub Actions | โ | -| **Security** | Basic | SAST, DAST, SCA | โ ๏ธ | - ---- - -## ๐จ Critical Issues (Must Fix) - -### 1. **Empty Core Modules (P0)** -- `core/auth.py` - 1 line -- `core/logging.py` - 1 line -- `config.py` - 1 line -- `protocols.py` - 1 line - -**Impact:** Missing foundational infrastructure - -**Recommendation:** -- Implement structured logging with correlation IDs -- Add configuration management with Pydantic Settings -- Define protocols for extensibility -- Add authentication helpers - -### 2. **No Observability (P1)** -```python -# RECOMMENDATION: Add OpenTelemetry -from opentelemetry import trace -from opentelemetry.trace import Status, StatusCode - -tracer = trace.get_tracer(__name__) - -async def scrape_async(self, urls): - with tracer.start_as_current_span("scrape_async") as span: - span.set_attribute("url_count", len(urls)) - span.set_attribute("platform", self.PLATFORM_NAME) - - try: - result = await self._execute_scrape(urls) - span.set_status(Status(StatusCode.OK)) - return result - except Exception as e: - span.set_status(Status(StatusCode.ERROR, str(e))) - span.record_exception(e) - raise -``` - -### 3. 
**No Metrics Collection (P1)** -```python -# RECOMMENDATION: Add metrics -from prometheus_client import Counter, Histogram - -requests_total = Counter( - 'brightdata_requests_total', - 'Total requests', - ['method', 'platform', 'status'] -) - -request_duration = Histogram( - 'brightdata_request_duration_seconds', - 'Request duration', - ['method', 'platform'] -) - -async def scrape_async(self, urls): - start = time.time() - try: - result = await self._execute_scrape(urls) - requests_total.labels( - method='scrape', - platform=self.PLATFORM_NAME, - status='success' - ).inc() - return result - finally: - duration = time.time() - start - request_duration.labels( - method='scrape', - platform=self.PLATFORM_NAME - ).observe(duration) -``` - ---- - -## ๐ก Recommendations by Priority - -### P0 (Critical - Implement Immediately) -1. โ **Fix AsyncEngine duplication** - COMPLETED! -2. ๐ด **Implement structured logging** with correlation IDs -3. ๐ด **Add configuration management** via Pydantic Settings -4. ๐ด **Create comprehensive API documentation** - -### P1 (High Priority - Next Sprint) -5. ๐ก **Add observability** (OpenTelemetry integration) -6. ๐ก **Implement metrics collection** (Prometheus/StatsD) -7. ๐ก **Add circuit breaker pattern** to retry logic -8. ๐ก **Create performance tuning guide** - -### P2 (Medium Priority - Future) -9. ๐ข **Add load testing suite** -10. ๐ข **Implement adaptive polling** -11. ๐ข **Add chaos engineering tests** -12. ๐ข **Expose connection pool metrics** - -### P3 (Low Priority - Nice to Have) -13. โช **Add property-based tests** (Hypothesis) -14. โช **Create migration guides** -15. โช **Add quota tracking** -16. โช **Implement request sanitization** - ---- - -## ๐ Scoring Breakdown - -| Category | Weight | Score | Weighted | -|----------|--------|-------|----------| -| **Architecture** | 25% | 92/100 | 23.0 | -| **Performance** | 20% | 88/100 | 17.6 | -| **Security** | 15% | 90/100 | 13.5 | -| **Code Quality** | 15% | 86/100 | 12.9 | -| **Testing** | 10% | 82/100 | 8.2 | -| **Documentation** | 10% | 78/100 | 7.8 | -| **Observability** | 5% | 20/100 | 1.0 | -| **TOTAL** | **100%** | **-** | **84/100** | - -**Adjusted Grade:** A- (84/100) - ---- - -## ๐ Final Assessment - -### The Good โ -1. **Excellent async-first architecture** - Modern, scalable, efficient -2. **Strong type safety** - 305 lines of TypedDict definitions -3. **Comprehensive error handling** - 7 specialized exception types -4. **Clean dependency injection** - AsyncEngine sharing fix eliminates duplication -5. **Rich result models** - Validation, serialization, timing breakdown -6. **Good test coverage** - 27 test files across 3 levels - -### The Bad โ -1. **Missing observability** - No logging, metrics, or tracing -2. **Empty core modules** - auth.py, logging.py, config.py are stubs -3. **Limited configuration** - Hard-coded values, no environment-based config -4. **No load testing** - Unknown behavior under high load -5. **Documentation gaps** - Missing API reference, guides - -### The Ugly ๐ง -1. **No circuit breaker** - Retries blindly even when service is down -2. **No quota tracking** - Could exceed API limits -3. **Fixed polling intervals** - Not adaptive, wastes time -4. 
**No connection pool metrics** - Can't diagnose pool exhaustion - ---- - -## ๐ Comparison to Leading SDKs - -| Feature | Bright Data SDK | AWS SDK | Stripe SDK | Google Cloud SDK | -|---------|----------------|---------|------------|------------------| -| **Async-first** | โ | โ | โ | โ | -| **Type hints** | โ | โ | โ | โ | -| **Error hierarchy** | โ (7 types) | โ (20+ types) | โ (15+ types) | โ (30+ types) | -| **Structured logging** | โ | โ | โ | โ | -| **Metrics** | โ | โ | โ | โ | -| **Tracing** | โ | โ | โ | โ | -| **Circuit breaker** | โ | โ | โ | โ ๏ธ | -| **Retry with jitter** | โ ๏ธ | โ | โ | โ | -| **Config management** | โ ๏ธ | โ | โ | โ | -| **API versioning** | โ ๏ธ | โ | โ | โ | -| **Load testing** | โ | โ | โ | โ | - -**Verdict:** The Bright Data SDK is **architecturally sound** and on par with leading SDKs in core functionality, but **lacks enterprise observability** (logging, metrics, tracing) that FAANG companies consider mandatory. - ---- - -## ๐ฎ Path to A+ (95/100) - -To reach FAANG top-tier standards: - -1. **Implement full observability stack** (+8 points) - - Structured logging with correlation IDs - - Prometheus metrics integration - - OpenTelemetry tracing support - -2. **Add configuration management** (+3 points) - - Pydantic Settings for environment-based config - - Validation and defaults - - Configuration hot-reload support - -3. **Enhance testing** (+2 points) - - Load/stress tests - - Chaos engineering tests - - Property-based tests - -4. **Improve documentation** (+2 points) - - Auto-generated API reference - - Performance tuning guide - - Migration guides - -**Total potential:** 84 + 15 = **99/100** (A+) - ---- - -## โ๏ธ Conclusion - -The **Bright Data Python SDK is a well-architected, modern async-first SDK** that demonstrates strong engineering practices and is **ready for production use**. The recent AsyncEngine duplication fix shows commitment to continuous improvement. - -**Key Strengths:** -- Clean architecture with proper separation of concerns -- Excellent type safety and error handling -- Modern async/await patterns throughout -- Resource-efficient with shared engine - -**To reach FAANG top-tier (95+):** -- Add observability (logging, metrics, tracing) -- Implement configuration management -- Enhance testing (load, chaos, property-based) -- Complete documentation - -**Recommendation:** **APPROVED for production use** with P0 items (structured logging, config management) implemented within next 2 sprints. 
- ---- - -**Report Generated:** November 24, 2025 -**Next Review:** Q1 2026 -**Contact:** SDK Architecture Team - diff --git a/docs/architecture.md b/docs/architecture.md deleted file mode 100644 index 0ca6f34..0000000 --- a/docs/architecture.md +++ /dev/null @@ -1,2 +0,0 @@ -# Architecture Documentation - diff --git a/docs/contributing.md b/docs/contributing.md deleted file mode 100644 index a320bea..0000000 --- a/docs/contributing.md +++ /dev/null @@ -1,2 +0,0 @@ -# Contributing Guide - diff --git a/docs/index.md b/docs/index.md deleted file mode 100644 index 645951f..0000000 --- a/docs/index.md +++ /dev/null @@ -1,2 +0,0 @@ -# Bright Data Python SDK Documentation - diff --git a/docs/quickstart.md b/docs/quickstart.md deleted file mode 100644 index 0fe96ed..0000000 --- a/docs/quickstart.md +++ /dev/null @@ -1,2 +0,0 @@ -# Quick Start Guide - diff --git a/lastcheck.md b/lastcheck.md deleted file mode 100644 index c00284b..0000000 --- a/lastcheck.md +++ /dev/null @@ -1,400 +0,0 @@ -# Last Check - Critical Issues Found - -This document tracks critical issues discovered during final testing of the Bright Data SDK. - ---- - -## Issue 1: Incorrect Await in get_account_info Method - -**File:** `src/brightdata/client.py` (Line 339) - -### What is the issue? - -The `get_account_info` method incorrectly used `await` on a non-async method, causing a runtime error: -``` -object ResponseContextManager can't be used in 'await' expression -``` - -**Incorrect code:** -```python -async with await self.engine.get_from_url( - f"{self.engine.BASE_URL}/zone/get_active_zones" -) as zones_response: -``` - -The `engine.get_from_url()` method is not an async function - it returns a context manager directly, not a coroutine. Using `await` on it causes Python to try to await the context manager object itself, which fails. - -### What is the fix? - -Remove the extra `await` keyword: - -**Correct code:** -```python -async with self.engine.get_from_url( - f"{self.engine.BASE_URL}/zone/get_active_zones" -) as zones_response: -``` - -### Impact - -- **Severity:** High -- **Affected functionality:** Account information retrieval, zone listing, initial SDK setup -- **User impact:** Any code calling `client.get_account_info()` or `client.get_account_info_sync()` would fail with a runtime error -- **Discovery:** Found when running `test_02_list_zones.py` - -### Root Cause - -Confusion between async patterns. The developer likely thought `get_from_url()` was an async method that needed to be awaited, but it's actually a regular method that returns an async context manager. - -### Similar Code Patterns Checked - -- `test_connection()` method (Line 297): โ Correctly implemented without extra `await` -- Other uses of `engine.get_from_url()`: None found in client.py - -### Testing - -After fix: -```bash -python probe_tests/test_02_list_zones.py -# Should now successfully list zones without the await error -``` - ---- - -### Verification - -After applying the fix, the test runs successfully: -``` -โ Client initialized successfully -โ Token Valid: True -โ API call succeeds without await error -``` - -If you see "0 zones found", this is correct behavior - it means your Bright Data account doesn't have zones configured yet. You need to create zones in the Bright Data dashboard. - ---- - -## Issue 2: Zones Not Showing - get_active_zones Returns Empty Array - -**File:** `src/brightdata/client.py` (get_account_info method) - -### What is the issue? - -The SDK uses `/zone/get_active_zones` endpoint which only returns **active** zones. 
If all your zones are inactive (as shown in Bright Data dashboard), the API returns an empty array `[]`. - -**Current behavior:** -- Endpoint: `/zone/get_active_zones` -- Returns: `[]` (empty array) when zones are inactive -- User's zones: `residential_proxy1` (Inactive), `web_unlocker1` (status unknown) - -### What is the fix? - -Multiple options: - -1. **Activate zones in Bright Data dashboard** (User action) - - Go to https://brightdata.com - - Activate the zones you want to use - - Zones will then appear in API response - -2. **Use a different endpoint** (SDK fix - if available) - - Need to find endpoint that returns ALL zones (not just active) - - Current testing shows no such endpoint is publicly available - -3. **Add warning message** (SDK improvement) - ```python - if not zones: - print("No active zones found. Please check:") - print("1. Your zones might be inactive - activate them in dashboard") - print("2. You might need to create zones first") - ``` - -### Impact - -- **Severity:** Medium -- **Affected functionality:** Zone discovery, automatic configuration -- **User impact:** Users with inactive zones see "0 zones" even though zones exist -- **Discovery:** Found when testing with account that has inactive zones - -### Root Cause - -The API endpoint name `get_active_zones` is explicit - it only returns active zones. This is by design but not clearly communicated to users. - -### Workaround - -For testing without active zones, manually specify zone names: -```python -client = BrightDataClient( - web_unlocker_zone="web_unlocker1", # Use your actual zone name - serp_zone="your_serp_zone", - browser_zone="your_browser_zone" -) -``` - -### Resolution Confirmed - -User created a new active zone `web_unlocker2` and it immediately appeared in the API response: -```json -[ - { - "name": "web_unlocker2", - "type": "unblocker" - } -] -``` - -This confirms the SDK is working correctly - it accurately reports only **active** zones as intended by the API design. - ---- - -## Issue 3: Inactive Zones Not Listed - No Clarity on Zone Deactivation - -**File:** `src/brightdata/client.py` (get_account_info method using `/zone/get_active_zones`) - -### What is the issue? - -The SDK only shows active zones but provides no visibility into: -1. **Inactive zones that exist** - Users have zones but can't see them via API -2. **Why zones become inactive** - No explanation of deactivation triggers -3. **How to reactivate zones** - No programmatic way to activate zones -4. **Zone state transitions** - When/why zones change from active to inactive - -**User Experience Problem:** -- User has zones (`residential_proxy1`, `web_unlocker1`) visible in dashboard -- SDK returns empty array, making it seem like no zones exist -- No indication that zones are present but inactive -- No information about why zones are inactive - -### Common Reasons Zones Become Inactive (Not Documented): - -1. **No usage for extended period** - Zones auto-deactivate after inactivity -2. **Payment issues** - Billing problems may deactivate zones -3. **Manual deactivation** - User or admin deactivated in dashboard -4. **Service changes** - Plan changes might affect zone status -5. **Initial setup** - New zones might start as inactive - -### What is the fix? 
- -**Short term:** -- Add better error messages indicating inactive zones might exist -- Document that only active zones are returned -- Suggest checking dashboard for inactive zones - -**Long term (API improvements needed):** -- Provide endpoint to list ALL zones with status -- Include deactivation reason in zone data -- Add zone activation/deactivation endpoints -- Return inactive zone count even if not listing them - -### Impact - -- **Severity:** High for user experience -- **Affected functionality:** Zone discovery, initial setup, debugging -- **User confusion:** Users think zones don't exist when they're just inactive -- **Discovery:** Found when user had 2 zones in dashboard but API returned 0 - -### Root Cause - -The API design assumes users know: -1. Only active zones are returned -2. Zones can be inactive -3. Dashboard shows all zones but API doesn't -4. Manual dashboard intervention needed for activation - -This creates a disconnect between dashboard visibility and API visibility. - -### Recommendations - -1. **Rename endpoint** to be clearer: `/zone/get_active_zones` โ clearly indicates active only -2. **Add companion endpoint**: `/zone/get_all_zones` with status field -3. **Improve error messages**: When 0 zones returned, mention checking for inactive zones -4. **Add zone status to SDK**: Method to check zone states and activation requirements - ---- - -## Issue 4: Incorrect Default SERP Zone Name - -**File:** `src/brightdata/client.py` (Line 65) - -### What is the issue? - -The SDK uses `sdk_serp` as the default SERP zone name, but Bright Data's actual SERP zone naming convention is `serp_api1` (or similar patterns like `serp_api2`, etc.). - -**Incorrect default:** -```python -DEFAULT_SERP_ZONE = "sdk_serp" -``` - -**Correct default:** -```python -DEFAULT_SERP_ZONE = "serp_api1" -``` - -### Impact - -- **Severity:** Medium -- **Affected functionality:** SERP API calls (Google, Bing, Yandex search) -- **User impact:** SERP tests fail with "zone 'sdk_serp' not found" error -- **Discovery:** Found when running `test_04_serp_google.py` - -### Root Cause - -The SDK developers used a generic placeholder name `sdk_serp` instead of following Bright Data's actual naming conventions for zones. The same issue exists for other default zones: -- `sdk_unlocker` should follow pattern like `web_unlocker1` -- `sdk_browser` should follow pattern like `browser_api1` - -### Testing - -After fix: -```bash -python probe_tests/test_04_serp_google.py -# Should now look for "serp_api1" zone instead of "sdk_serp" -``` - -### Similar Issues - -The SDK has similar incorrect defaults: -- `DEFAULT_WEB_UNLOCKER_ZONE = "sdk_unlocker"` (should be like `web_unlocker1`) -- `DEFAULT_BROWSER_ZONE = "sdk_browser"` (should be like `browser_api1`) - -These defaults don't match Bright Data's actual zone naming patterns. - ---- - -## Issue 5: SERP SDK Implementation Missing Key Components - -**Files:** Multiple files in `src/brightdata/api/serp/` - -### What is the issue? - -The SDK's SERP implementation has fundamental issues: - -1. **Wrong endpoint**: Using `/request` endpoint (for Web Unlocker) instead of SERP-specific endpoint -2. **Wrong response format**: SERP zone returns raw HTTP response with HTML body, not parsed JSON -3. **Missing HTML parser**: SDK expects structured data but gets HTML, has no parser to extract results - -**Actual API response:** -```json -{ - "status_code": 200, - "headers": {...}, - "body": "..." 
-} -``` - -**What SDK expects:** -```json -{ - "organic": [ - { - "title": "Python Programming", - "url": "https://...", - "description": "..." - } - ], - "ads": [...], - "featured_snippet": {...} -} -``` - -### Impact - -- **Severity:** Critical - SERP API is completely non-functional -- **Affected functionality:** All SERP API searches (Google, Bing, Yandex) -- **User impact:** SERP features advertised in README don't work at all -- **Discovery:** Found when running `test_04_serp_google.py` - -### Root Cause Analysis - -The SDK has fundamental misunderstandings about how Bright Data's SERP API works: - -1. **Wrong endpoint**: The SDK uses `/request` endpoint with `payload = {"zone": zone, "url": search_url, "format": "json", "method": "GET"}`. This is the Web Unlocker API format, not SERP API. - -2. **SERP zones work differently**: SERP zones (`type: serp`) return raw HTML responses wrapped in HTTP response structure. They're designed to fetch search results HTML, not parse it. - -3. **Missing parsing layer**: Other SERP SDKs either: - - Use a different endpoint that returns parsed data - - Include HTML parsers to extract structured data from raw HTML - - Use Bright Data's parsing service (if available) - -### Testing - -```bash -python probe_tests/test_04_serp_google.py -# Shows HTML being returned in body field -``` - -### Solution Options - -1. **Find correct SERP endpoint**: Bright Data might have a `/serp` or similar endpoint that returns parsed results -2. **Add HTML parsing**: Use BeautifulSoup or similar to parse Google/Bing/Yandex HTML -3. **Use different zone type**: There might be a parsed SERP zone type -4. **Add parser parameter**: Maybe `{"parser": true}` or similar enables parsing - -### Current Workaround - -None - SERP API is non-functional in current SDK implementation - ---- - -## Issue 6: SDK Expects Parsed SERP Data But API Returns Raw HTML - -**File:** `src/brightdata/api/serp/data_normalizer.py` (Line 78+) - -### What is the issue? - -The SDK's GoogleDataNormalizer expects the SERP API to return parsed JSON with specific fields, but the API actually returns raw HTML. - -**SDK expects (data_normalizer.py lines 78-105):** -```python -# Line 78: Expects 'organic' field with search results -organic = data.get("organic", []) - -# Lines 80-87: Expects each result to have these fields -for i, item in enumerate(organic, 1): - results.append({ - "position": i, - "title": item.get("title", ""), - "url": item.get("url", ""), - "description": item.get("description", ""), - "displayed_url": item.get("displayed_url", ""), - }) - -# Lines 91-105: Expects these optional fields -"total_results": data.get("total_results") -"search_information": data.get("search_information", {}) -"featured_snippet": data.get("featured_snippet") -"knowledge_panel": data.get("knowledge_panel") -"people_also_ask": data.get("people_also_ask") -"related_searches": data.get("related_searches") -"ads": data.get("ads") -``` - -**API actually returns:** -```json -{ - "status_code": 200, - "headers": {...}, - "body": "..." // Raw HTML, no parsed fields -} -``` - -### Impact - -- **Severity:** Critical -- **Affected functionality:** All SERP normalizers expect parsed data -- **User impact:** SERP API always returns 0 results because normalizer can't find expected fields -- **Discovery:** Found in `src/brightdata/api/serp/data_normalizer.py` - -### Root Cause - -The SDK was designed assuming the SERP API would return parsed/structured JSON data with fields like `organic`, `ads`, `featured_snippet`, etc. 
However, Bright Data's SERP zones return raw HTML that needs to be parsed to extract these fields. - -### Testing - -Running the test shows the mismatch: -```bash -python probe_tests/test_04_serp_google.py -# Debug output shows: "SERP API returned JSON with keys: ['status_code', 'headers', 'body']" -# Not the expected: ['organic', 'ads', 'featured_snippet', ...] -``` - diff --git a/src/brightdata/api/search_service.py b/src/brightdata/api/search_service.py index 0a11c4d..e3e54ae 100644 --- a/src/brightdata/api/search_service.py +++ b/src/brightdata/api/search_service.py @@ -226,8 +226,8 @@ def linkedin(self): >>> # Discover posts from profile >>> result = client.search.linkedin.posts( ... profile_url="https://linkedin.com/in/johndoe", - ... start_date="2024-01-01", - ... end_date="2024-12-31" + ... start_date="2025-01-01", + ... end_date="2025-12-31" ... ) >>> >>> # Find profiles by name @@ -302,8 +302,8 @@ def instagram(self): >>> result = client.search.instagram.reels( ... url="https://instagram.com/username", ... num_of_posts=50, - ... start_date="01-01-2024", - ... end_date="12-31-2024" + ... start_date="01-01-2025", + ... end_date="12-31-2025" ... ) """ if self._instagram_search is None: diff --git a/src/brightdata/client.py b/src/brightdata/client.py index ea12630..5cfa222 100644 --- a/src/brightdata/client.py +++ b/src/brightdata/client.py @@ -64,9 +64,9 @@ class BrightDataClient: # Default configuration DEFAULT_TIMEOUT = 30 - DEFAULT_WEB_UNLOCKER_ZONE = "web_unlocker1" - DEFAULT_SERP_ZONE = "serp_api1" - DEFAULT_BROWSER_ZONE = "browser_api1" + DEFAULT_WEB_UNLOCKER_ZONE = "sdk_unlocker" + DEFAULT_SERP_ZONE = "sdk_serp" + DEFAULT_BROWSER_ZONE = "sdk_browser" # Environment variable name for API token TOKEN_ENV_VAR = "BRIGHTDATA_API_TOKEN" @@ -79,7 +79,7 @@ def __init__( web_unlocker_zone: Optional[str] = None, serp_zone: Optional[str] = None, browser_zone: Optional[str] = None, - auto_create_zones: bool = False, + auto_create_zones: bool = True, validate_token: bool = False, rate_limit: Optional[float] = None, rate_period: float = 1.0, @@ -95,10 +95,10 @@ def __init__( (supports .env files via python-dotenv) customer_id: Customer ID (optional, can also be set via BRIGHTDATA_CUSTOMER_ID) timeout: Default timeout in seconds for all requests (default: 30) - web_unlocker_zone: Zone name for web unlocker (default: "web_unlocker1") - serp_zone: Zone name for SERP API (default: "serp_api1") - browser_zone: Zone name for browser API (default: "browser_api1") - auto_create_zones: Automatically create zones if they don't exist (default: False) + web_unlocker_zone: Zone name for web unlocker (default: "sdk_unlocker") + serp_zone: Zone name for SERP API (default: "sdk_serp") + browser_zone: Zone name for browser API (default: "sdk_browser") + auto_create_zones: Automatically create zones if they don't exist (default: True) validate_token: Validate token by testing connection on init (default: False) rate_limit: Maximum requests per rate_period (default: 10). Set to None to disable. rate_period: Time period in seconds for rate limit (default: 1.0) diff --git a/src/brightdata/payloads.py b/src/brightdata/payloads.py index c2f1130..922dc7d 100644 --- a/src/brightdata/payloads.py +++ b/src/brightdata/payloads.py @@ -402,8 +402,8 @@ class LinkedInPostSearchPayload(URLPayload): Example: >>> payload = LinkedInPostSearchPayload( ... profile_url="https://linkedin.com/in/johndoe", - ... start_date="2024-01-01", - ... end_date="2024-12-31" + ... start_date="2025-01-01", + ... end_date="2025-12-31" ... 
) """ @@ -499,7 +499,7 @@ class FacebookPostsProfilePayload(URLPayload): >>> payload = FacebookPostsProfilePayload( ... url="https://facebook.com/profile", ... num_of_posts=10, - ... start_date="01-01-2024" + ... start_date="01-01-2025" ... ) """ diff --git a/src/brightdata/scrapers/amazon/scraper.py b/src/brightdata/scrapers/amazon/scraper.py index 4592077..48ef9ed 100644 --- a/src/brightdata/scrapers/amazon/scraper.py +++ b/src/brightdata/scrapers/amazon/scraper.py @@ -254,7 +254,29 @@ async def reviews_async( if is_single and isinstance(result.data, list) and len(result.data) == 1: result.url = url if isinstance(url, str) else url[0] result.data = result.data[0] - + return result + elif not is_single and isinstance(result.data, list): + from ...models import ScrapeResult + + results = [] + url_list = url if isinstance(url, list) else [url] + for url_item, data_item in zip(url_list, result.data): + results.append( + ScrapeResult( + success=True, + data=data_item, + url=url_item, + platform=result.platform, + method=result.method, + trigger_sent_at=result.trigger_sent_at, + snapshot_id_received_at=result.snapshot_id_received_at, + snapshot_polled_at=result.snapshot_polled_at, + data_fetched_at=result.data_fetched_at, + snapshot_id=result.snapshot_id, + cost=result.cost / len(result.data) if result.cost else None, + ) + ) + return results return result def reviews( @@ -499,5 +521,26 @@ async def _scrape_urls( if is_single and isinstance(result.data, list) and len(result.data) == 1: result.url = url if isinstance(url, str) else url[0] result.data = result.data[0] - + return result + elif not is_single and isinstance(result.data, list): + from ...models import ScrapeResult + + results = [] + for url_item, data_item in zip(url_list, result.data): + results.append( + ScrapeResult( + success=True, + data=data_item, + url=url_item, + platform=result.platform, + method=result.method, + trigger_sent_at=result.trigger_sent_at, + snapshot_id_received_at=result.snapshot_id_received_at, + snapshot_polled_at=result.snapshot_polled_at, + data_fetched_at=result.data_fetched_at, + snapshot_id=result.snapshot_id, + cost=result.cost / len(result.data) if result.cost else None, + ) + ) + return results return result diff --git a/src/brightdata/scrapers/base.py b/src/brightdata/scrapers/base.py index 64a97ca..277dd67 100644 --- a/src/brightdata/scrapers/base.py +++ b/src/brightdata/scrapers/base.py @@ -154,9 +154,32 @@ async def scrape_async( ) if is_single and isinstance(result.data, list) and len(result.data) == 1: + # Single URL case - unwrap single item from list result.url = urls result.data = result.data[0] return result + elif not is_single and isinstance(result.data, list): + # Multiple URLs case - transform to List[ScrapeResult] + results = [] + for i, (url, data_item) in enumerate(zip(url_list, result.data)): + individual_result = ScrapeResult( + success=True, + data=data_item, + url=url, + error=None, + platform=result.platform, + method=result.method, + # Copy timing information from parent + trigger_sent_at=result.trigger_sent_at, + snapshot_id_received_at=result.snapshot_id_received_at, + snapshot_polled_at=result.snapshot_polled_at, + data_fetched_at=result.data_fetched_at, + snapshot_id=result.snapshot_id, + # Divide cost equally across results + cost=result.cost / len(result.data) if result.cost else None, + ) + results.append(individual_result) + return results return result diff --git a/src/brightdata/scrapers/facebook/scraper.py b/src/brightdata/scrapers/facebook/scraper.py index 
8e0a4ac..54b0577 100644 --- a/src/brightdata/scrapers/facebook/scraper.py +++ b/src/brightdata/scrapers/facebook/scraper.py @@ -97,8 +97,8 @@ async def posts_by_profile_async( >>> result = await scraper.posts_by_profile_async( ... url="https://facebook.com/profile", ... num_of_posts=10, - ... start_date="01-01-2024", - ... end_date="12-31-2024", + ... start_date="01-01-2025", + ... end_date="12-31-2025", ... timeout=240 ... ) """ @@ -431,8 +431,8 @@ async def comments_async( >>> result = await scraper.comments_async( ... url="https://facebook.com/post/123456", ... num_of_comments=100, - ... start_date="01-01-2024", - ... end_date="12-31-2024", + ... start_date="01-01-2025", + ... end_date="12-31-2025", ... timeout=240 ... ) """ @@ -669,7 +669,28 @@ async def _scrape_urls( if is_single and isinstance(result.data, list) and len(result.data) == 1: result.url = url if isinstance(url, str) else url[0] result.data = result.data[0] - + return result + elif not is_single and isinstance(result.data, list): + from ...models import ScrapeResult + + results = [] + for url_item, data_item in zip(url_list, result.data): + results.append( + ScrapeResult( + success=True, + data=data_item, + url=url_item, + platform=result.platform, + method=result.method, + trigger_sent_at=result.trigger_sent_at, + snapshot_id_received_at=result.snapshot_id_received_at, + snapshot_polled_at=result.snapshot_polled_at, + data_fetched_at=result.data_fetched_at, + snapshot_id=result.snapshot_id, + cost=result.cost / len(result.data) if result.cost else None, + ) + ) + return results return result async def _scrape_with_params( @@ -737,5 +758,27 @@ async def _scrape_with_params( if is_single and isinstance(result.data, list) and len(result.data) == 1: result.url = url if isinstance(url, str) else url[0] result.data = result.data[0] - + return result + elif not is_single and isinstance(result.data, list): + from ...models import ScrapeResult + + results = [] + url_list = url if isinstance(url, list) else [url] + for url_item, data_item in zip(url_list, result.data): + results.append( + ScrapeResult( + success=True, + data=data_item, + url=url_item, + platform=result.platform, + method=result.method, + trigger_sent_at=result.trigger_sent_at, + snapshot_id_received_at=result.snapshot_id_received_at, + snapshot_polled_at=result.snapshot_polled_at, + data_fetched_at=result.data_fetched_at, + snapshot_id=result.snapshot_id, + cost=result.cost / len(result.data) if result.cost else None, + ) + ) + return results return result diff --git a/src/brightdata/scrapers/instagram/scraper.py b/src/brightdata/scrapers/instagram/scraper.py index ee49435..27f699b 100644 --- a/src/brightdata/scrapers/instagram/scraper.py +++ b/src/brightdata/scrapers/instagram/scraper.py @@ -442,5 +442,26 @@ async def _scrape_urls( if is_single and isinstance(result.data, list) and len(result.data) == 1: result.url = url if isinstance(url, str) else url[0] result.data = result.data[0] - + return result + elif not is_single and isinstance(result.data, list): + from ...models import ScrapeResult + + results = [] + for url_item, data_item in zip(url_list, result.data): + results.append( + ScrapeResult( + success=True, + data=data_item, + url=url_item, + platform=result.platform, + method=result.method, + trigger_sent_at=result.trigger_sent_at, + snapshot_id_received_at=result.snapshot_id_received_at, + snapshot_polled_at=result.snapshot_polled_at, + data_fetched_at=result.data_fetched_at, + snapshot_id=result.snapshot_id, + cost=result.cost / len(result.data) if 
result.cost else None, + ) + ) + return results return result diff --git a/src/brightdata/scrapers/instagram/search.py b/src/brightdata/scrapers/instagram/search.py index 6d48d04..4381769 100644 --- a/src/brightdata/scrapers/instagram/search.py +++ b/src/brightdata/scrapers/instagram/search.py @@ -94,8 +94,8 @@ async def posts_async( >>> result = await scraper.posts_async( ... url="https://instagram.com/username", ... num_of_posts=10, - ... start_date="01-01-2024", - ... end_date="12-31-2024", + ... start_date="01-01-2025", + ... end_date="12-31-2025", ... post_type="reel" ... ) """ @@ -175,8 +175,8 @@ async def reels_async( >>> result = await scraper.reels_async( ... url="https://instagram.com/username", ... num_of_posts=50, - ... start_date="01-01-2024", - ... end_date="12-31-2024", + ... start_date="01-01-2025", + ... end_date="12-31-2025", ... timeout=240 ... ) """ @@ -283,5 +283,26 @@ async def _discover_with_params( if is_single and isinstance(result.data, list) and len(result.data) == 1: result.url = url if isinstance(url, str) else url[0] result.data = result.data[0] - + return result + elif not is_single and isinstance(result.data, list): + from ...models import ScrapeResult + + results = [] + url_list = url if isinstance(url, list) else [url] + for url_item, data_item in zip(url_list, result.data): + results.append( + ScrapeResult( + success=True, + data=data_item, + url=url_item, + platform="instagram", + trigger_sent_at=result.trigger_sent_at, + snapshot_id_received_at=result.snapshot_id_received_at, + snapshot_polled_at=result.snapshot_polled_at, + data_fetched_at=result.data_fetched_at, + snapshot_id=result.snapshot_id, + cost=result.cost / len(result.data) if result.cost else None, + ) + ) + return results return result diff --git a/src/brightdata/scrapers/linkedin/scraper.py b/src/brightdata/scrapers/linkedin/scraper.py index 220f10f..0eb3e49 100644 --- a/src/brightdata/scrapers/linkedin/scraper.py +++ b/src/brightdata/scrapers/linkedin/scraper.py @@ -425,5 +425,26 @@ async def _scrape_urls( if is_single and isinstance(result.data, list) and len(result.data) == 1: result.url = url if isinstance(url, str) else url[0] result.data = result.data[0] - + return result + elif not is_single and isinstance(result.data, list): + from ...models import ScrapeResult + + results = [] + for url_item, data_item in zip(url_list, result.data): + results.append( + ScrapeResult( + success=True, + data=data_item, + url=url_item, + platform=result.platform, + method=result.method, + trigger_sent_at=result.trigger_sent_at, + snapshot_id_received_at=result.snapshot_id_received_at, + snapshot_polled_at=result.snapshot_polled_at, + data_fetched_at=result.data_fetched_at, + snapshot_id=result.snapshot_id, + cost=result.cost / len(result.data) if result.cost else None, + ) + ) + return results return result diff --git a/src/brightdata/scrapers/linkedin/search.py b/src/brightdata/scrapers/linkedin/search.py index ec70652..352a795 100644 --- a/src/brightdata/scrapers/linkedin/search.py +++ b/src/brightdata/scrapers/linkedin/search.py @@ -87,8 +87,8 @@ async def posts_async( Example: >>> result = await search.posts_async( ... profile_url="https://linkedin.com/in/johndoe", - ... start_date="2024-01-01", - ... end_date="2024-12-31" + ... start_date="2025-01-01", + ... end_date="2025-12-31" ... 
) """ # Normalize to lists diff --git a/tests/enes/chatgpt.py b/tests/enes/chatgpt.py index 7a84b2f..3088863 100644 --- a/tests/enes/chatgpt.py +++ b/tests/enes/chatgpt.py @@ -79,12 +79,12 @@ async def test_chatgpt_web_search(): async with client.engine: scraper = client.scrape.chatgpt print("\n๐ Testing ChatGPT with web search...") - print("๐ Prompt: 'What are the latest developments in AI in 2024?'") + print("๐ Prompt: 'What are the latest developments in AI in 2025?'") print("๐ Web search: Enabled") try: result = await scraper.prompt_async( - prompt="What are the latest developments in AI in 2024?", + prompt="What are the latest developments in AI in 2025?", web_search=True, poll_timeout=180, ) diff --git a/tests/enes/chatgpt_02.py b/tests/enes/chatgpt_02.py index af5918d..cabf2db 100644 --- a/tests/enes/chatgpt_02.py +++ b/tests/enes/chatgpt_02.py @@ -67,7 +67,7 @@ async def test_chatgpt(): # Test 2: Prompt with web search print("\n2. Testing prompt with web search...") try: - prompt = "What are the latest AI developments in 2024?" + prompt = "What are the latest AI developments in 2025?" print(f" Prompt: '{prompt}'") print(" Web search: True") print(" Country: US") diff --git a/tests/integration/test_client_integration.py b/tests/integration/test_client_integration.py index 719c0b3..6f191b4 100644 --- a/tests/integration/test_client_integration.py +++ b/tests/integration/test_client_integration.py @@ -54,7 +54,7 @@ async def test_connection_with_valid_token(self, async_client): @pytest.mark.asyncio async def test_connection_with_invalid_token(self): """Test connection returns False with invalid token.""" - client = BrightDataClient(token="invalid_token_123456789") + client = BrightDataClient(token="invalid_token_123456789", auto_create_zones=False) async with client: # test_connection() never raises - returns False for invalid tokens @@ -104,7 +104,7 @@ async def test_get_account_info_returns_zones(self, async_client): @pytest.mark.asyncio async def test_get_account_info_with_invalid_token(self): """Test getting account info fails with invalid token.""" - client = BrightDataClient(token="invalid_token_123456789") + client = BrightDataClient(token="invalid_token_123456789", auto_create_zones=False) async with client: with pytest.raises(AuthenticationError) as exc_info: @@ -192,7 +192,7 @@ class TestClientErrorHandling: @pytest.mark.asyncio async def test_connection_test_returns_false_on_network_error(self): """Test connection test returns False (not exception) on network errors.""" - client = BrightDataClient(token="test_token_123456789") + client = BrightDataClient(token="test_token_123456789", auto_create_zones=False) async with client: # Should return False, not raise exception diff --git a/tests/readme.py b/tests/readme.py index 7462aaf..f47ab89 100644 --- a/tests/readme.py +++ b/tests/readme.py @@ -491,16 +491,16 @@ def test_facebook_posts_by_profile(self, client): # result = client.scrape.facebook.posts_by_profile( # url="https://facebook.com/profile", # num_of_posts=10, - # start_date="01-01-2024", - # end_date="12-31-2024", + # start_date="01-01-2025", + # end_date="12-31-2025", # timeout=240 # ) result = client.scrape.facebook.posts_by_profile( url="https://facebook.com/zuck", num_of_posts=5, - start_date="01-01-2024", - end_date="12-31-2024", + start_date="01-01-2025", + end_date="12-31-2025", timeout=240, ) @@ -579,8 +579,8 @@ def test_instagram_post_discovery(self, client): # result = client.search.instagram.posts( # url="https://instagram.com/username", # num_of_posts=10, 
- # start_date="01-01-2024", - # end_date="12-31-2024", + # start_date="01-01-2025", + # end_date="12-31-2025", # post_type="reel", # timeout=240 # ) @@ -588,8 +588,8 @@ def test_instagram_post_discovery(self, client): result = client.search.instagram.posts( url="https://instagram.com/instagram", num_of_posts=5, - start_date="01-01-2024", - end_date="12-31-2024", + start_date="01-01-2025", + end_date="12-31-2025", post_type="reel", timeout=240, ) diff --git a/tests/samples/amazon/product.json b/tests/samples/amazon/product.json index 672404c..7ca848c 100644 --- a/tests/samples/amazon/product.json +++ b/tests/samples/amazon/product.json @@ -31,7 +31,7 @@ "product_dimensions": "10\"W x 13.25\"H", "seller_id": "A62ZX0SLNJGAO", "image": "https://m.media-amazon.com/images/I/61Q4eGZWFSL._AC_SL1500_.jpg", - "date_first_available": "March 11, 2024", + "date_first_available": "March 11, 2025", "model_number": "Stanley Quencher H2.O FlowState\u2122 Tumbler 40 oz Fuchsia", "manufacturer": "Stanley", "department": "Home & Kitchen", @@ -211,7 +211,7 @@ }, { "type": "Date First Available", - "value": "March 11, 2024" + "value": "March 11, 2025" }, { "type": "Brand", diff --git a/tests/samples/chatgpt/prompt.json b/tests/samples/chatgpt/prompt.json index e32e13a..9cb4fa3 100644 --- a/tests/samples/chatgpt/prompt.json +++ b/tests/samples/chatgpt/prompt.json @@ -2,7 +2,7 @@ { "url": "https://chatgpt.com/?model=gpt-4&q=Explain%20Python%20in%20one%20sentence", "prompt": "Explain Python in one sentence", - "answer_html": "\n
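For reference, here is what the new client defaults look like in practice. This is a minimal sketch based on the `client.py` hunk above: the `from brightdata import BrightDataClient` import path is assumed from the source layout (`src/brightdata/client.py`), and the token values are placeholders.

```python
from brightdata import BrightDataClient  # import path assumed from src/brightdata/client.py

# With the new defaults, a bare client targets the "sdk_unlocker", "sdk_serp",
# and "sdk_browser" zones and creates them automatically if they don't exist yet
# (auto_create_zones now defaults to True).
client = BrightDataClient(token="YOUR_API_TOKEN")

# Tests that exercise invalid tokens now opt out of zone auto-creation explicitly:
offline_client = BrightDataClient(token="invalid_token_123456789", auto_create_zones=False)

# Custom zone names still override the defaults:
custom_client = BrightDataClient(
    token="YOUR_API_TOKEN",
    web_unlocker_zone="my_unlocker_zone",
    serp_zone="my_serp_zone",
    browser_zone="my_browser_zone",
)
```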
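The repeated `_scrape_urls` hunks above all apply the same transformation: when more than one URL is passed, the single `ScrapeResult` whose `data` holds a list is fanned out into a `List[ScrapeResult]`, one per input URL, with the timing fields and `snapshot_id` copied from the parent request and `cost` split evenly. A hedged usage sketch follows — the Amazon review URLs are placeholders, and the synchronous `reviews()` call is assumed to mirror the `reviews_async()` behavior shown in the diff.

```python
from brightdata import BrightDataClient  # import path assumed

client = BrightDataClient(token="YOUR_API_TOKEN")

# A single URL still returns one ScrapeResult with its data unwrapped from the list.
single = client.scrape.amazon.reviews(url="https://amazon.com/product-reviews/EXAMPLE1")
print(single.url, single.success)

# Multiple URLs now come back as a list of ScrapeResult objects, one per input URL.
batch = client.scrape.amazon.reviews(
    url=[
        "https://amazon.com/product-reviews/EXAMPLE1",  # placeholder ASINs
        "https://amazon.com/product-reviews/EXAMPLE2",
    ]
)
for result in batch:
    # Each entry carries its own url and data, shares the parent's snapshot_id and
    # timing metadata, and is assigned an equal share of the total cost.
    print(result.url, result.success, result.cost)
```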