An Asynchronous Web Data Extraction Coding Guide Using Crawl4AI: An Open-Source Web Crawling Toolbox Designed for LLM Workflows

by Brenden Burgess


In this tutorial, we show how to use Crawl4AI, a modern Python-based web crawling toolbox, to extract structured data from web pages directly in Google Colab. Leveraging asyncio for asynchronous I/O, httpx for HTTP requests, and Crawl4AI's AsyncHTTPCrawlerStrategy, we bypass headless browsers entirely while still parsing complex HTML via JsonCssExtractionStrategy. With only a few lines of code, you install the dependencies (crawl4ai, httpx), configure HTTPCrawlerConfig to request only gzip/deflate (avoiding Brotli issues), define your CSS-to-JSON schema, and orchestrate the crawl with AsyncWebCrawler and CrawlerRunConfig. Finally, the extracted JSON data is loaded into pandas for immediate analysis or export.

What distinguishes Crawl4AI is its unified API, which switches seamlessly between browser-based (Playwright) and HTTP-only strategies, its robust error hooks, and its declarative extraction schemas. Unlike traditional headless-browser workflows, Crawl4AI lets you choose the lightest, most efficient backend, making it ideal for scalable data pipelines, on-the-fly ETL in notebooks, or feeding LLM and analytics tools with clean JSON/CSV outputs.

!pip install -U crawl4ai httpx

First, we install (or upgrade) Crawl4AI, the core asynchronous crawling framework, alongside httpx. This high-performance HTTP client provides all the building blocks we need for lightweight, asynchronous web scraping directly in Colab.
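To confirm the installation took in the Colab runtime, a quick sanity check like the one below can help; it assumes both packages expose a __version__ attribute, which is standard but not guaranteed, so the lookup falls back gracefully.

# Quick sanity check after installation (assumes both packages expose __version__).
import crawl4ai, httpx

print("crawl4ai:", getattr(crawl4ai, "__version__", "unknown"))
print("httpx:", getattr(httpx, "__version__", "unknown"))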

import asyncio, json, pandas as pd
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, HTTPCrawlerConfig
from crawl4ai.async_crawler_strategy import AsyncHTTPCrawlerStrategy
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

We import Python's core async and data-handling modules, asyncio for concurrency, json for parsing, and pandas for tabular storage, alongside the essentials from Crawl4AI: AsyncWebCrawler to drive the crawl, HTTPCrawlerConfig and AsyncHTTPCrawlerStrategy for the browser-free HTTP backend, and JsonCssExtractionStrategy to map CSS selectors to structured JSON.

http_cfg = HTTPCrawlerConfig(
    method="GET",
    headers={
        "User-Agent":      "crawl4ai-bot/1.0",
        "Accept-Encoding": "gzip, deflate"
    },
    follow_redirects=True,
    verify_ssl=True
)
crawler_strategy = AsyncHTTPCrawlerStrategy(browser_config=http_cfg)

Here, we instantiate an HTTPCrawlerConfig to define our HTTP crawler's behavior: a GET request with a custom user agent, gzip/deflate encoding only, automatic redirects, and SSL verification. We then plug it into AsyncHTTPCrawlerStrategy, allowing Crawl4AI to crawl via plain HTTP calls rather than a full browser.
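Before wiring these headers into the crawler, a one-off httpx request with the same settings can confirm that the target responds with a 200 and a gzip/deflate (not Brotli) body. This standalone check is an optional sketch, not part of the original pipeline.

import httpx

# One-off request using the same headers as http_cfg to verify the response
# is served with gzip/deflate rather than Brotli.
resp = httpx.get(
    "https://quotes.toscrape.com/page/1/",
    headers={"User-Agent": "crawl4ai-bot/1.0", "Accept-Encoding": "gzip, deflate"},
    follow_redirects=True,
)
print(resp.status_code, resp.headers.get("content-encoding"))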

schema = {
    "name": "Quotes",
    "baseSelector": "div.quote",
    "fields": (
        {"name": "quote",  "selector": "span.text",      "type": "text"},
        {"name": "author", "selector": "small.author",   "type": "text"},
        {"name": "tags",   "selector": "div.tags a.tag", "type": "text"}
    )
}
extraction_strategy = JsonCssExtractionStrategy(schema, verbose=False)
run_cfg = CrawlerRunConfig(extraction_strategy=extraction_strategy)

We define a JSON-CSS extraction schema targeting each quote block (div.quote) and its child elements (span.text, small.author, div.tags a.tag), initialize a JsonCssExtractionStrategy with that schema, and wrap it in a CrawlerRunConfig so Crawl4AI knows exactly what structured data to extract on each request.
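For orientation, here is a rough sketch of the kind of record we expect the strategy to emit for quotes.toscrape.com. The field names mirror the schema above, while the values and the exact handling of multi-match selectors such as the tags field depend on the extraction strategy, so treat this as illustrative only.

# Illustrative shape of one extracted record (values are examples, not real output).
expected_record = {
    "quote":  "The world as we have created it is a process of our thinking.",
    "author": "Albert Einstein",
    "tags":   "change",  # multi-match selectors may yield one value or several
}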

async def crawl_quotes_http(max_pages=5):
    # Crawl pages with the HTTP-only strategy and collect extracted quote records.
    all_items = []
    async with AsyncWebCrawler(crawler_strategy=crawler_strategy) as crawler:
        for p in range(1, max_pages+1):
            url = f"https://quotes.toscrape.com/page/{p}/"
            try:
                res = await crawler.arun(url=url, config=run_cfg)
            except Exception as e:
                print(f"❌ Page {p} failed outright: {e}")
                continue

            if not res.extracted_content:
                print(f"❌ Page {p} returned no content, skipping")
                continue

            try:
                items = json.loads(res.extracted_content)
            except Exception as e:
                print(f"❌ Page {p} JSON-parse error: {e}")
                continue

            print(f"✅ Page {p}: {len(items)} quotes")
            all_items.extend(items)

    return pd.DataFrame(all_items)

This asynchronous function orchestrates the HTTP-only crawl: it opens an AsyncWebCrawler with our AsyncHTTPCrawlerStrategy, iterates through each page URL, safely awaits crawler.arun(), handles any request or JSON-parsing errors, and collects the extracted quote records into a single pandas DataFrame for downstream analysis.

df = asyncio.get_event_loop().run_until_complete(crawl_quotes_http(max_pages=3))
df.head()

Finally, we run the crawl_quotes_http coroutine on Colab's existing asyncio event loop, fetching three pages of quotes, and then display the first rows of the resulting pandas DataFrame to verify that our crawler returned structured data as expected.
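From here, standard pandas operations cover quick analysis or export; the snippet below counts quotes per author and writes the table to CSV (the filename is arbitrary).

# Quick follow-up analysis and export with standard pandas calls.
print(df["author"].value_counts().head())   # most-quoted authors in the sample
df.to_csv("quotes.csv", index=False)        # arbitrary filename for export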

In conclusion, by combining Google Colab's zero-config environment with Python's asynchronous ecosystem and Crawl4AI's flexible crawling strategies, we have built a fully automated pipeline that scrapes and structures web data in minutes. Whether you need to spin up a quick quotes dataset, build a refreshable news archive, or feed a RAG workflow, Crawl4AI's blend of httpx, asyncio, JsonCssExtractionStrategy, and AsyncHTTPCrawlerStrategy offers both simplicity and scalability. Beyond pure-HTTP crawls, you can instantly switch to Playwright-driven browser automation without rewriting your extraction logic, underscoring why Crawl4AI stands out as a production-ready framework for modern web data extraction.
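As a rough sketch of that switch, the same run configuration and extraction schema can be reused while only the crawler construction changes. This assumes that omitting crawler_strategy falls back to Crawl4AI's default Playwright-based backend, so treat it as illustrative rather than a guaranteed API contract.

# Minimal sketch of swapping backends without touching the extraction logic.
# Assumption: AsyncWebCrawler() with no crawler_strategy uses the default
# Playwright-based browser backend, while passing AsyncHTTPCrawlerStrategy
# keeps the pure-HTTP path used throughout this tutorial.
async def fetch_with_backend(url, use_browser=False):
    crawler = (
        AsyncWebCrawler()                                         # browser-based (Playwright)
        if use_browser
        else AsyncWebCrawler(crawler_strategy=crawler_strategy)   # HTTP-only
    )
    async with crawler:
        return await crawler.arun(url=url, config=run_cfg)        # same run_cfg either way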


Check out the Colab Notebook.



Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in materials science, he explores new advancements and creates opportunities to contribute.
