In this tutorial, we walk you through building an enhanced web scraping tool that leverages BrightData's powerful proxy network alongside Google's Gemini API for intelligent data extraction. You'll learn how to structure your Python project, install and import the required libraries, and encapsulate the scraping logic in a clean, reusable BrightDataScraper class. Whether you're targeting Amazon product pages, bestseller listings, or LinkedIn profiles, the modular scraping methods show how to configure scraping parameters, handle errors gracefully, and return structured JSON results. An optional ReAct-style AI agent integration also shows you how to combine LLM-driven reasoning with real-time scraping, letting you pose natural-language queries for on-the-fly data analysis.
!pip install langchain-brightdata langchain-google-genai langgraph langchain-core google-generativeai
We install all the key libraries needed for the tutorial in one step: langchain-brightdata for BrightData web scraping, langchain-google-genai and google-generativeai for Google Gemini integration, langgraph for agent orchestration, and langchain-core for the core LangChain framework.
import os
import json
from typing import Dict, Any, Optional
from langchain_brightdata import BrightDataWebScraperAPI
from langchain_google_genai import ChatGoogleGenerativeAI
from langgraph.prebuilt import create_react_agent
These imports prepare your environment and core functionality: os and json handle system operations and data serialization, while typing provides structured type hints. You then bring in BrightDataWebScraperAPI for BrightData scraping, ChatGoogleGenerativeAI to interface with Google's Gemini LLM, and create_react_agent to orchestrate these components into a ReAct-style agent.
class BrightDataScraper:
    """Enhanced web scraper using BrightData API"""

    def __init__(self, api_key: str, google_api_key: Optional[str] = None):
        """Initialize scraper with API keys"""
        self.api_key = api_key
        self.scraper = BrightDataWebScraperAPI(bright_data_api_key=api_key)
        if google_api_key:
            self.llm = ChatGoogleGenerativeAI(
                model="gemini-2.0-flash",
                google_api_key=google_api_key
            )
            self.agent = create_react_agent(self.llm, [self.scraper])

    def scrape_amazon_product(self, url: str, zipcode: str = "10001") -> Dict[str, Any]:
        """Scrape Amazon product data"""
        try:
            results = self.scraper.invoke({
                "url": url,
                "dataset_type": "amazon_product",
                "zipcode": zipcode
            })
            return {"success": True, "data": results}
        except Exception as e:
            return {"success": False, "error": str(e)}

    def scrape_amazon_bestsellers(self, region: str = "in") -> Dict[str, Any]:
        """Scrape Amazon bestsellers"""
        try:
            url = f"https://www.amazon.{region}/gp/bestsellers/"
            results = self.scraper.invoke({
                "url": url,
                "dataset_type": "amazon_product"
            })
            return {"success": True, "data": results}
        except Exception as e:
            return {"success": False, "error": str(e)}

    def scrape_linkedin_profile(self, url: str) -> Dict[str, Any]:
        """Scrape LinkedIn profile data"""
        try:
            results = self.scraper.invoke({
                "url": url,
                "dataset_type": "linkedin_person_profile"
            })
            return {"success": True, "data": results}
        except Exception as e:
            return {"success": False, "error": str(e)}

    def run_agent_query(self, query: str) -> None:
        """Run AI agent with natural language query"""
        if not hasattr(self, 'agent'):
            print("Error: Google API key required for agent functionality")
            return
        try:
            for step in self.agent.stream(
                {"messages": query},
                stream_mode="values"
            ):
                step["messages"][-1].pretty_print()
        except Exception as e:
            print(f"Agent error: {e}")

    def print_results(self, results: Dict[str, Any], title: str = "Results") -> None:
        """Pretty print results"""
        print(f"\n{'='*50}")
        print(f"{title}")
        print(f"{'='*50}")
        if results["success"]:
            print(json.dumps(results["data"], indent=2, ensure_ascii=False))
        else:
            print(f"Error: {results['error']}")
        print()
The BrightDataScraper class encapsulates all of the BrightData web scraping logic, plus the optional Gemini-powered intelligence, behind a single reusable interface. Its methods let you easily fetch Amazon product details, bestseller lists, and LinkedIn profiles, handling the API calls, catching errors, and returning structured JSON results. When a Google API key is supplied, it can also stream natural-language "agent" queries. The print_results helper ensures your output is always cleanly formatted for inspection.
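For instance, a minimal usage sketch of the class on its own (the API key string is a placeholder you'd replace with your own credential, and the product URL is just the one used later in this tutorial):

# Minimal usage sketch; the key below is a placeholder, not a real credential.
scraper = BrightDataScraper(api_key="YOUR_BRIGHT_DATA_API_KEY")

result = scraper.scrape_amazon_product("https://www.amazon.com/dp/B08L5TNJHG")
if result["success"]:
    print(result["data"])            # structured product data returned by BrightData
else:
    print("Scrape failed:", result["error"])

Because every method returns the same {"success": ..., "data"/"error": ...} shape, callers can branch on a single key instead of wrapping each call in their own try/except.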
def main():
    """Main execution function"""
    BRIGHT_DATA_API_KEY = "Use Your Own API Key"
    GOOGLE_API_KEY = "Use Your Own API Key"

    scraper = BrightDataScraper(BRIGHT_DATA_API_KEY, GOOGLE_API_KEY)

    print("🛍️ Scraping Amazon India Bestsellers...")
    bestsellers = scraper.scrape_amazon_bestsellers("in")
    scraper.print_results(bestsellers, "Amazon India Bestsellers")

    print("📦 Scraping Amazon Product...")
    product_url = "https://www.amazon.com/dp/B08L5TNJHG"
    product_data = scraper.scrape_amazon_product(product_url, "10001")
    scraper.print_results(product_data, "Amazon Product Data")

    print("👤 Scraping LinkedIn Profile...")
    linkedin_url = "https://www.linkedin.com/in/satyanadella/"
    linkedin_data = scraper.scrape_linkedin_profile(linkedin_url)
    scraper.print_results(linkedin_data, "LinkedIn Profile Data")

    print("🤖 Running AI Agent Query...")
    agent_query = """
    Scrape Amazon product data for https://www.amazon.com/dp/B0D2Q9397Y?th=1
    in New York (zipcode 10001) and summarize the key product details.
    """
    scraper.run_agent_query(agent_query)
The main() function ties everything together: it sets your BrightData and Google API keys, instantiates the BrightDataScraper, and then demonstrates each capability in turn. It scrapes Amazon India's bestsellers, fetches details for a specific product, retrieves a LinkedIn profile, and finally runs a natural-language agent query, printing neatly formatted results after each step.
if __name__ == "__main__":
    print("Installing required packages...")
    os.system("pip install -q langchain-brightdata langchain-google-genai langgraph")

    os.environ["BRIGHT_DATA_API_KEY"] = "Use Your Own API Key"

    main()
Finally, this entry-point block ensures that, when run as a standalone script, the required scraping libraries are installed quietly and the BrightData API key is set in the environment before main() kicks off all the scraping and agent workflows.
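Hardcoding credentials is fine for a quick demo, but in practice you'll likely want to read them from the environment instead. Here is a brief sketch of that pattern; the variable names BRIGHT_DATA_API_KEY and GOOGLE_API_KEY are conventions chosen for this tutorial, not requirements of either library:

import os

# Sketch: pull credentials from environment variables instead of hardcoding them.
bright_data_key = os.getenv("BRIGHT_DATA_API_KEY")
google_key = os.getenv("GOOGLE_API_KEY")   # optional; only agent features need it

if not bright_data_key:
    raise RuntimeError("Set BRIGHT_DATA_API_KEY before running the scraper")

scraper = BrightDataScraper(bright_data_key, google_key)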
In conclusion, by the end of this tutorial you will have a ready-to-use Python script that automates tedious data-collection tasks, abstracts away low-level API details, and optionally taps generative AI for advanced query handling. You can extend this foundation by adding support for other dataset types (as sketched below), integrating additional LLMs, or deploying the scraper as part of a larger data pipeline. With these building blocks in place, you're now equipped to gather, analyze, and present web data more efficiently, whether for market research, competitive intelligence, or custom AI-powered applications.
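As one example of such an extension, a generic helper like the one below could wrap any dataset type the BrightData API exposes. Note that scrape_dataset is a name invented for this sketch, and the "instagram_profiles" value in the usage line is purely illustrative; check BrightData's documentation for the dataset types your account actually supports.

# Hypothetical extension sketch: a generic method for any BrightData dataset type.
# "scrape_dataset" is a name invented for this example; verify supported
# dataset_type values against BrightData's documentation before relying on them.
def scrape_dataset(self, url: str, dataset_type: str, **params) -> Dict[str, Any]:
    """Scrape an arbitrary dataset type supported by the BrightData API."""
    try:
        results = self.scraper.invoke({"url": url, "dataset_type": dataset_type, **params})
        return {"success": True, "data": results}
    except Exception as e:
        return {"success": False, "error": str(e)}

# Attach it to the class, then call it with an illustrative dataset type:
BrightDataScraper.scrape_dataset = scrape_dataset
# result = scraper.scrape_dataset(some_url, "instagram_profiles")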
