Google AI Releases LangExtract: An Open Source Python Library That Extracts Structured Data from Unstructured Text Documents

by Brenden Burgess


In today's data-driven world, valuable insights are often buried in unstructured text, whether clinical notes, lengthy legal contracts, or customer feedback threads. Extracting meaningful, traceable information from these documents is both a technical and a practical challenge. Google's new open source Python library, LangExtract, is designed to address this gap directly, using LLMs such as Gemini to deliver powerful, automated extraction with traceability and transparency at its core.

1. Declarative and traceable extraction

LangExtract lets users define custom extraction tasks using natural language instructions and high-quality few-shot examples. This allows developers and analysts to specify exactly which entities, relationships, or facts to extract, and in what structure. Crucially, every extracted piece of information is linked directly back to its source text, which supports validation, auditing, and end-to-end traceability.
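To make the idea of linking each extraction to its source text concrete, here is a minimal sketch in plain Python. It does not use LangExtract itself: the `GroundedSpan` class and the `ground` helper are hypothetical names invented for this illustration, and the matching is a simple verbatim search rather than whatever alignment the library performs internally.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GroundedSpan:
    """Hypothetical record tying an extracted string to its source offsets."""
    label: str
    text: str
    start: Optional[int]  # None when the text is not found verbatim
    end: Optional[int]


def ground(label: str, extracted: str, source: str) -> GroundedSpan:
    """Locate the first verbatim occurrence of `extracted` in `source`."""
    start = source.find(extracted)
    if start == -1:
        # Nothing to point at in the source: flag it instead of guessing.
        return GroundedSpan(label, extracted, None, None)
    return GroundedSpan(label, extracted, start, start + len(extracted))


source = "Patient was given 250 mg amoxicillin orally twice daily."
span = ground("medication", "amoxicillin", source)
unmatched = ground("medication", "ibuprofen", source)

# A grounded span can always be checked against the original text.
print(span, source[span.start:span.end])
print(unmatched)
```

Storing offsets rather than only the extracted string is what makes an audit possible: a reviewer can jump to the exact position in the document, and an extraction with no position stands out immediately.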

2. Domain versatility

The library works not just in tech demos but in critical real-world domains, including health (clinical notes, medical reports), finance (summaries, risk documents), law (contracts), research literature, and even the arts (analyzing Shakespeare). Original use cases include automatic extraction of medications, dosages, and administration details from clinical documents, as well as relationships and emotions from plays or literature.

3. Schema enforcement with LLMs

Powered by Gemini and compatible with other LLMs, LangExtract enforces custom output schemas (such as JSON), so results are not just accurate but immediately usable in downstream databases, analytics, or AI pipelines. It addresses traditional LLM weaknesses around hallucination and schema drift by grounding outputs in both the user's instructions and the actual source text.
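The following sketch illustrates, under stated assumptions, what guarding against schema drift and hallucination can look like on the consuming side. It is not LangExtract's internal mechanism: `validate_extractions`, `ALLOWED_CLASSES`, and `REQUIRED_KEYS` are hypothetical names, and the field names simply mirror the `extraction_class`, `extraction_text`, and `attributes` arguments used in the workflow example in this article.

```python
import json

# Assumed schema for this illustration only.
ALLOWED_CLASSES = {"character", "emotion", "relationship"}
REQUIRED_KEYS = {"extraction_class", "extraction_text", "attributes"}


def validate_extractions(raw_json: str, source: str) -> list:
    """Keep only records that match the schema and quote the source verbatim."""
    valid = []
    for record in json.loads(raw_json):
        if not isinstance(record, dict) or not REQUIRED_KEYS.issubset(record):
            continue  # schema drift: a required field is missing
        if record["extraction_class"] not in ALLOWED_CLASSES:
            continue  # schema drift: class not declared in the schema
        if not isinstance(record["attributes"], dict):
            continue  # schema drift: attributes has the wrong type
        if record["extraction_text"] not in source:
            continue  # possible hallucination: text absent from the source
        valid.append(record)
    return valid


source = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"
raw = json.dumps([
    {"extraction_class": "character", "extraction_text": "Lady Juliet",
     "attributes": {"emotional_state": "longing"}},
    {"extraction_class": "location", "extraction_text": "the stars",
     "attributes": {}},                                   # unknown class
    {"extraction_class": "emotion", "extraction_text": "deep sorrow",
     "attributes": {"feeling": "grief"}},                 # not in the source
    {"extraction_class": "emotion", "extraction_text": "aching"},  # no attributes
])

clean = validate_extractions(raw, source)
print(clean)
```

Only the first record survives here. The other three each fail one check, which is the kind of silent drift a downstream database or pipeline would otherwise have to absorb.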

4. Scalability and visualization

5. Installation and use

Install easily with pip:

pip install langextract

Example workflow (extracting character information from Shakespeare):

import langextract as lx
import textwrap

# 1. Define your prompt
prompt = textwrap.dedent("""
Extract characters, emotions, and relationships in order of appearance.
Use exact text for extractions. Do not paraphrase or overlap entities.
Provide meaningful attributes for each entity to add context.
""")

# 2. Give a high-quality example
# examples must be a list of ExampleData objects (square brackets, not parentheses)
examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks? It is the east, and Juliet is the sun.",
        extractions=[
            lx.data.Extraction(extraction_class="character", extraction_text="ROMEO", attributes={"emotional_state": "wonder"}),
            lx.data.Extraction(extraction_class="emotion", extraction_text="But soft!", attributes={"feeling": "gentle awe"}),
            lx.data.Extraction(extraction_class="relationship", extraction_text="Juliet is the sun", attributes={"type": "metaphor"}),
        ],
    )
]

# 3. Extract from new text
input_text = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"

result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-pro"
)

# 4. Save and visualize results
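# Pass the result as a list, and write to the current directory so the
# file name given to visualize() below resolves to the saved file.
lx.io.save_annotated_documents([result], output_name="extraction_results.jsonl", output_dir=".")
html_content = lx.visualize("extraction_results.jsonl")
with open("visualization.html", "w") as f:
    # In a notebook, visualize() may return a display object; use its .data if present.
    f.write(getattr(html_content, "data", html_content))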

The result is structured, source-anchored JSON output, plus an interactive HTML visualization for easy review and demonstration.
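Once results are saved as JSONL, they can be consumed with nothing more than the standard library. The sketch below is an assumption-laden illustration: the record layout (a top-level `extractions` list whose items carry `extraction_class`, `extraction_text`, and `attributes`) is inferred from the argument names in the workflow above rather than taken from a documented file format, the `load_extractions` helper is hypothetical, and the sample file is hand-written rather than real library output.

```python
import json
import tempfile
from pathlib import Path


def load_extractions(jsonl_path):
    """Yield (class, text, attributes) for each extraction in a JSONL file."""
    with open(jsonl_path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # tolerate blank lines
            document = json.loads(line)  # one document per line
            for item in document.get("extractions", []):
                yield (
                    item.get("extraction_class"),
                    item.get("extraction_text"),
                    item.get("attributes") or {},
                )


# Hand-written sample record standing in for real output (assumed layout).
sample = {
    "text": "Lady Juliet gazed longingly at the stars, her heart aching for Romeo",
    "extractions": [
        {"extraction_class": "character", "extraction_text": "Lady Juliet",
         "attributes": {"emotional_state": "longing"}},
        {"extraction_class": "emotion", "extraction_text": "aching",
         "attributes": {"feeling": "heartache"}},
    ],
}
sample_path = Path(tempfile.mkdtemp()) / "sample_results.jsonl"
sample_path.write_text(json.dumps(sample) + "\n", encoding="utf-8")

rows = list(load_extractions(sample_path))
for extraction_class, extraction_text, attributes in rows:
    print(extraction_class, extraction_text, attributes)
```

If the real file uses different key names, only the three `item.get(...)` calls need to change.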

Specialized and real-world applications

The team even provides a demo called RadExtract for structuring radiology reports, highlighting not just what was extracted but exactly where the information appeared in the original input.

How LangExtract compares

| Feature | Traditional approaches | LangExtract approach |
| --- | --- | --- |
| Schema consistency | Often manual, error-prone | Enforced via instructions and few-shot examples |
| Result traceability | Minimal | All outputs linked to the input text |
| Scaling to long texts | Windowed, lossy | Chunked, parallel extraction, then aggregation |
| Visualization | Custom, usually absent | Built-in, interactive HTML reports |
| Deployment | Rigid, model-specific | Gemini-first, open to other LLMs and on-premises |
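The "chunked, parallel extraction, then aggregation" pattern in the comparison above can be pictured with a small standard-library sketch. This is an illustration of the general technique, not LangExtract's implementation: `split_into_chunks`, `extract_from_chunk`, and `extract_parallel` are hypothetical names, and a keyword search stands in for the LLM call. Fixed-size windows can cut an entity in half, so a real chunker would more likely split on sentence boundaries or overlap neighboring chunks.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(text, size):
    """Return (offset, chunk) pairs covering the text in fixed-size windows."""
    return [(i, text[i:i + size]) for i in range(0, len(text), size)]


def extract_from_chunk(offset_and_chunk, keyword="Romeo"):
    """Hypothetical stand-in for an LLM call: find a keyword in one chunk."""
    offset, chunk = offset_and_chunk
    hits = []
    pos = chunk.find(keyword)
    while pos != -1:
        # Convert the chunk-local position back to a document-level offset.
        hits.append({"text": keyword, "start": offset + pos})
        pos = chunk.find(keyword, pos + 1)
    return hits


def extract_parallel(text, size=40, workers=4):
    """Extract from every chunk concurrently, then merge in document order."""
    chunks = split_into_chunks(text, size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_chunk = list(pool.map(extract_from_chunk, chunks))
    merged = [hit for hits in per_chunk for hit in hits]
    return sorted(merged, key=lambda hit: hit["start"])


text = "Romeo loved Juliet. " * 5
hits = extract_parallel(text)
print([hit["start"] for hit in hits])
```

Carrying each chunk's offset through to the aggregation step is what keeps the merged results traceable to positions in the full document rather than positions inside a chunk.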

In summary

LangExtract ushers in a new era of extracting structured, actionable data from text, delivering:

  • Declarative and explainable extraction
  • Traceable results supported by the source context
  • Instant visualization for rapid iteration
  • Easy integration into any Python workflow

Check out the GitHub page and technical blog.


