
Using Google’s LangExtract and Gemma for Structured Data Extraction

Documents like insurance policies, medical records, and compliance reports are notoriously long and tedious to parse.

Important details (e.g., coverage limits and obligations in insurance policies) are buried in dense, unstructured text that is challenging for the average person to sift through and digest.

Large language models (LLMs), already known for their versatility, serve as powerful tools to cut through this complexity, pulling out the key facts and turning messy documents into clear, structured information.

In this article, we explore Google’s LangExtract framework and its open-source LLM, Gemma 3, which together make extracting structured information from unstructured text accurate and efficient.

To bring this to life, we will walk through a demo on parsing an insurance policy, showing how details like exclusions can be surfaced effectively.

Contents

(1) Understanding LangExtract and Gemma
(2) Under the Hood of LangExtract
(3) Example Walkthrough

The accompanying GitHub repo can be found here.


(1) Understanding LangExtract and Gemma

(i) LangExtract

LangExtract is an open-source Python library (released under Google’s GitHub) that uses LLMs to extract structured information from messy unstructured text based on user-defined instructions.

It enables LLMs to excel at named entity recognition (such as coverage limits, exclusions, and clauses) and relationship extraction (logically linking each clause to its conditions) by efficiently grouping related entities.

Its popularity stems from its simplicity, as just a few lines of code are enough to perform structured information extraction. Beyond its simplicity, several key features make LangExtract stand out:

  • Exact Source Alignment: Each extracted item is linked back to its precise location in the original text, ensuring full traceability.
  • Built for Long Documents: Handles the “needle-in-a-haystack” problem with smart chunking, parallel processing, and iterative passes that maximize recall by surfacing additional entities.
  • Broad Model Compatibility: Works seamlessly with different LLMs, from cloud-based models like Gemini to local open-source options.
  • Domain Agnostic: Adapts to any domain with only a handful of examples, removing the need for costly fine-tuning.
  • Consistent Structured Outputs: Uses few-shot examples and controlled generation (only for certain LLMs like Gemini) to enforce a stable output schema and produce reliable, structured results.
  • Interactive Visualization: Generates an interactive HTML file to visualize and review extracted entities in their original context.

(ii) Gemma 3

Gemma is a family of lightweight, state-of-the-art open LLMs from Google, built from the same research used to create the Gemini models.

Gemma 3 is the latest release in the Gemma family and is available in five parameter sizes: 270M, 1B, 4B, 12B, and 27B. It is also billed as the most capable model currently able to run on a single GPU.

It can handle prompt inputs of up to 128K tokens, allowing us to process many multi-page articles (or hundreds of images) in a single prompt.

In this article, we will use the Gemma 3 4B model (4-billion parameter variant), deployed locally via Ollama.


(2) Under the Hood of LangExtract

LangExtract comes with many standard features expected in modern LLM frameworks, such as document ingestion, preprocessing (e.g., tokenization), prompt management, and output handling.

What caught my attention are the three capabilities that support optimized long-context information extraction:

  • Smart chunking
  • Parallel processing
  • Multiple extraction passes

To see how these were implemented, I dug into the source code and traced how they work under the hood.

(i) Chunking strategies

LangExtract uses smart chunking strategies to improve extraction quality over a single inference pass on a large document.

The goal is to split documents into smaller, focused chunks of manageable context size, so that each chunk remains well-formed, easy to understand, and keeps the relevant text intact.

Instead of mindlessly cutting at character limits, it respects sentences, paragraphs, and newlines. 

Here is a summary of the key behaviors in the chunking strategy:

  • Sentence- and paragraph-aware: Chunks are formed from whole sentences where possible (by respecting text delimiters like paragraph breaks), so that the context stays intact.
  • Handles long sentences: If a sentence is too long, it is broken at natural points like newlines. Only if necessary will it split inside a sentence.
  • Edge case handling: If a single word or token is longer than the limit, it becomes a chunk to avoid errors.
  • Token-based splitting: All cuts respect token boundaries, so words are never split mid-way.
  • Context preservation: Each chunk carries metadata (token and character positions) that map it back to the source document.
  • Efficient processing: Chunks can be grouped into batches and processed in parallel, so quality gains do not add extra latency.

As a result, LangExtract creates well-formed chunks that pack in as much context as possible while avoiding messy splits, which helps the LLM maintain extraction quality across large documents.
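
To make the idea concrete, here is a toy sketch of sentence-aware chunking. It is a simplified illustration of the behavior described above, not LangExtract's actual implementation (the real chunker works on tokens rather than characters and carries positional metadata):

import re

def sentence_aware_chunks(text: str, max_chars: int = 1000) -> list[str]:
    # Split on sentence boundaries, then pack whole sentences into chunks
    # no larger than max_chars; hard-cut only when a single sentence is too long.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        if len(sentence) > max_chars:
            # Edge case: a single oversized sentence becomes its own chunk(s).
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(sentence[i:i + max_chars] for i in range(0, len(sentence), max_chars))
        elif current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = (current + " " + sentence).strip()
    if current:
        chunks.append(current)
    return chunks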


(ii) Parallel processing

LangExtract supports parallel processing at LLM inference time (as seen in its model provider scripts), which keeps extraction quality high over long documents (i.e., good entity coverage and attribute assignment) without significantly increasing overall latency.

When given a list of text chunks, the max_workers parameter controls how many requests run concurrently, so up to max_workers chunks are sent to the LLM at the same time.
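
Conceptually, the fan-out works like the toy sketch below, where run_inference is a stand-in for a single LLM call (illustrative only, not LangExtract's internal code):

from concurrent.futures import ThreadPoolExecutor

def run_inference(chunk: str) -> str:
    # Stand-in for one LLM call that extracts entities from a single chunk.
    return f"entities extracted from: {chunk[:30]}..."

chunks = ["chunk one ...", "chunk two ...", "chunk three ..."]

# Fan the chunks out across a worker pool (analogous to max_workers);
# results come back in the original chunk order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_inference, chunks))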


(iii) Multiple extraction passes

The purpose of iterative extraction passes is to improve recall by capturing entities that might be missed in any single run.

In essence, it adopts a multi-sample and merge strategy, where extraction is run multiple times independently, relying on the LLM’s stochastic nature to surface entities that might be missed in a run.

Afterwards, results from all passes are merged. If two extractions cover the same region of text, the version from the earlier pass is kept.

This approach boosts recall by capturing additional entities across runs, while resolving conflicts by a first-pass-wins rule. The downside is that it reprocesses tokens multiple times, which can increase costs.
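
The first-pass-wins merge can be illustrated with a small sketch (the spans and values are hypothetical, not LangExtract's internals):

# Each pass maps a character span in the source text to an extraction.
pass_1 = {(120, 180): "exclusion: wear and tear"}
pass_2 = {(120, 180): "exclusion: wear & tear",         # overlaps pass 1
          (310, 372): "exclusion: unlicensed driver"}    # newly found in pass 2

merged: dict[tuple[int, int], str] = {}
for extraction_pass in (pass_1, pass_2):     # earlier passes take priority
    for span, extraction in extraction_pass.items():
        merged.setdefault(span, extraction)  # keep the first version seen

print(merged)
# {(120, 180): 'exclusion: wear and tear', (310, 372): 'exclusion: unlicensed driver'}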


(3) Example Walkthrough

Let’s put LangExtract and Gemma to the test on a sample motor insurance policy document, found publicly on the MSIG Singapore website.

Check out the accompanying GitHub repo to follow along.

Preview of the MSIG Motor Insurance document | Source: MSIG Singapore

(i) Initial Setup

LangExtract can be installed from PyPI with:

pip install langextract

We then download and run Gemma 3 (4B model) locally with Ollama.

Ollama is an open-source tool that simplifies running LLMs on our computer or a local server. It allows us to interact with these models without needing an Internet connection or relying on cloud services.

To install Ollama, visit the Downloads page and choose the installer for your operating system. Once done, verify the installation by running ollama --version in your terminal.

Important: Ensure your local device has GPU access for Ollama, as this dramatically accelerates performance. 

After Ollama is installed, we get the service running by opening the application (macOS or Windows) or running ollama serve (Linux).

To download Gemma 3 (4B) locally (3.3 GB in size), run ollama pull gemma3:4b, then run ollama list to verify that the model is available on your system.


(ii) PDF Parsing and Processing

The first step is to read the PDF policy document and parse the contents using PyMuPDF (installed with pip install PyMuPDF).

We create a Document class storing a piece of text and associated metadata, and a PDFProcessor class for the overall document parsing.
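
A minimal sketch of these two classes, assuming PyMuPDF's block-level text extraction, might look like this (a re-creation for illustration; the exact code in the repo may differ, and the PDF file name is hypothetical):

import fitz  # PyMuPDF
from dataclasses import dataclass, field

@dataclass
class Document:
    text: str
    metadata: dict = field(default_factory=dict)

class PDFProcessor:
    def __init__(self, pdf_path: str):
        self.pdf_path = pdf_path
        self.documents: list[Document] = []

    def load_documents(self) -> list[Document]:
        # Go through each page, extract text blocks, and store them as
        # Document objects with positional metadata.
        with fitz.open(self.pdf_path) as pdf:
            for page_num, page in enumerate(pdf, start=1):
                for block in page.get_text("blocks"):
                    x0, y0, x1, y1, block_text = block[:5]
                    if not block_text.strip():
                        continue
                    self.documents.append(Document(
                        text=block_text.strip(),
                        metadata={
                            "page": page_num,
                            "bbox": (x0, y0, x1, y1),
                            "page_width": page.rect.width,
                            "page_height": page.rect.height,
                        },
                    ))
        return self.documents

    def get_all_text(self) -> str:
        # Combine all extracted text into one string, with page markers.
        pages: dict[int, list[str]] = {}
        for doc in self.documents:
            pages.setdefault(doc.metadata["page"], []).append(doc.text)
        return "\n\n".join(
            f"--- Page {num} ---\n" + "\n".join(blocks)
            for num, blocks in sorted(pages.items())
        )

    def get_page_text(self, page_num: int) -> str:
        # Return only the text from a specific page.
        return "\n".join(d.text for d in self.documents if d.metadata["page"] == page_num)

processor = PDFProcessor("msig_motor_policy.pdf")  # hypothetical file name
processor.load_documents()
policy_text = processor.get_all_text()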

Here is an explanation of the code above:

  • load_documents: Goes through each page, extracts text blocks, and saves them as Document objects. Each block includes the text and metadata (e.g., page number, coordinates with page width/height). 
    The coordinates capture where the text appears on the page, preserving layout information such as whether it is a header, body text, or footer.
  • get_all_text: Combines all extracted text into one string, with clear markers separating pages.
  • get_page_text: Gets only the text from a specific page.

(iii) Prompt Engineering

The next step is to provide instructions to guide the LLM in the extraction process via LangExtract.

We begin with a system prompt that specifies the structured information we want to extract, focusing on the policy exclusion clauses.
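
An illustrative version of such a prompt (a re-creation, not the exact prompt from the repo) could read:

prompt_description = """
Extract the policy exclusion clauses from the motor insurance policy text.

For each exclusion, return:
- the exact clause text as it appears in the document
- a plain-English explanation of what is excluded

Return the output strictly as valid JSON: a list of objects with the keys
"extraction_class", "extraction_text", and "attributes".
Do not include any text outside the JSON.
"""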

In the prompt above, I explicitly specified JSON as the expected response format. Without this, we will likely hit a langextract.resolver.ResolverParsingError.

The issue is that Gemma does not include built-in structured-output enforcement, so by default, it outputs unstructured text in natural language. It may then inadvertently include extra text or malformed JSON, potentially breaking the strict JSON parsers in LangExtract.

However, if we use LLMs like Gemini that support schema-constrained decoding (i.e., can be configured for structured output), the JSON-formatting instructions in the prompt can be omitted.

Next, we introduce few-shot prompting by providing an example of what exclusion clauses mean in the context of insurance.

LangExtract’s ExampleData class serves as a template that shows the LLM worked examples of how text should map to structured outputs, informing it what to extract and how to format it.

It contains a list of Extraction objects representing the desired output, where each one is a container class comprising attributes of a single piece of extracted information.
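
A minimal sketch of one such example follows (the clause text and attributes are made up for illustration rather than taken from the MSIG policy):

import langextract as lx

examples = [
    lx.data.ExampleData(
        text="We will not pay for loss or damage caused by war, invasion or terrorism.",
        extractions=[
            lx.data.Extraction(
                extraction_class="exclusion",
                extraction_text="loss or damage caused by war, invasion or terrorism",
                attributes={
                    "explanation": "The policy does not cover damage arising from war or terrorism."
                },
            )
        ],
    )
]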


(iv) Extraction Run

With our PDF parser and prompts set up, we are ready to run the extraction with LangExtract’s extract method:
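
The call below is a sketch; the parameter values shown (such as max_char_buffer=1000 and extraction_passes=2) are illustrative, and the repo may use different settings:

import langextract as lx

result = lx.extract(
    text_or_documents=policy_text,          # full text from PDFProcessor.get_all_text()
    prompt_description=prompt_description,  # system prompt defined earlier
    examples=examples,                      # few-shot examples defined earlier
    model_id="gemma3:4b",
    model_url="http://localhost:11434",     # Ollama's default local endpoint
    fence_output=False,
    use_schema_constraints=False,           # schema constraints not yet supported for Ollama
    max_char_buffer=1000,                   # max characters per inference chunk
    extraction_passes=2,                    # rerun extraction to improve recall
)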

Here is an explanation of the parameters above:

  • We pass our input text, prompt, and few-shot examples into the text_or_documents, prompt_description, and examples parameters respectively.
  • We pass the model version gemma3:4b into model_id.
  • The model_url defaults to Ollama’s local endpoint (http://localhost:11434). Ensure that the Ollama service is already running on your local machine.
  • We set fence_output and use_schema_constraints to False, since Gemma is not geared for structured output and LangExtract does not yet support schema constraints for Ollama.
  • max_char_buffer sets the maximum number of characters per inference chunk. Smaller values improve accuracy (by reducing context size) but increase the number of LLM calls.
  • extraction_passes sets the number of extraction passes used to improve recall.

On my 8GB VRAM GPU, the 10-page document took <10 minutes to complete parsing and extraction.


(v) Save and Postprocess Output

We finally save the output using LangExtract’s io module:
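
For example (the output file names are illustrative), we can write the results to JSONL and also generate the interactive HTML review page mentioned earlier:

import langextract as lx

# Save the annotated extractions to JSONL for downstream processing.
lx.io.save_annotated_documents(
    [result],
    output_name="exclusion_extractions.jsonl",
    output_dir=".",
)

# Render the interactive HTML visualization from the saved results.
html_content = lx.visualize("exclusion_extractions.jsonl")
with open("exclusion_extractions.html", "w") as f:
    f.write(html_content.data if hasattr(html_content, "data") else html_content)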

Custom post-processing is then applied to beautify the result for easy viewing, and here is a snippet of the output:

We can see that the LLM responses contain structured extractions from the original text, grouping them by class (specifically exclusions) and providing both the source text line and a plain-English explanation.

This format makes complex insurance clauses easier to interpret, offering a clear mapping between formal policy language and simple summaries.


(4) Wrapping it up

In this article, we explored how LangExtract’s chunking, parallel processing, and iterative passes, combined with Gemma 3’s capabilities, enable reliable extraction of structured data from lengthy documents.

These techniques demonstrate how the right combination of models and extraction strategies can turn long, complex documents into structured insights that are accurate, traceable, and ready for practical use.

Before You Go

I invite you to follow my GitHub and LinkedIn pages for more engaging and practical content. Meanwhile, have fun extracting structured information with LangExtract and Gemma 3!
