Having developed raw LLM workflows for structured extraction tasks, I have observed several pitfalls over time. In one of my projects, I built two independent workflows, one using Grok and one using OpenAI, to see which performed better at structured extraction. That was when I noticed that both were omitting facts in seemingly random places, and the extracted fields did not align with my schema.
To counter these issues, I set up special handling and validation checks that would make the LLM revisit the document (like a second pass) so that missing facts could be caught and added back to the output document. However, multiple validation runs were causing me to exceed my API limits. Moreover, prompt fine-tuning was a real bottleneck. Every time I modified the prompt to ensure that the LLM didn’t miss a fact, a new issue would get introduced. An important constraint I noticed was that while one LLM worked well for a set of prompts, the other wouldn’t perform that well with the same set of instructions. These issues prompted me to look for an orchestration engine that could automatically fine-tune my prompts to match the LLM’s prompting style, handle fact omissions, and ensure that my output was aligned with my schema.
I recently came across LangExtract and tried it out. The library addressed several issues I was facing, particularly around schema alignment and fact completeness. In this article, I explain the basics of LangExtract and how it can augment raw LLM workflows for structured extraction problems. I also aim to share my experience with LangExtract using an example.
Why LangExtract?
When you set up a raw LLM workflow (say, using OpenAI to gather structured attributes from your corpus), you have to establish a chunking strategy to optimize token usage. You also need special handling for missing values and formatting inconsistencies. And when it comes to prompt engineering, you end up adding or removing instructions with every iteration in an attempt to fine-tune the results and handle discrepancies.
LangExtract helps manage the above by effectively orchestrating prompts and outputs between the user and the LLM. It refines the prompt before passing it to the LLM. In cases where the input text or documents are large, it chunks the data and feeds it to the LLM while keeping each request within the context window of the chosen model (these limits vary widely from one model to another). Where speed is crucial, parallelization can be set up; where token limits or rate quotas are a constraint, sequential execution can be used instead. I will break down the workings of LangExtract along with its data structures in the next section.
Data Structures and Workflow in LangExtract
Below is a diagram showing the data structures in LangExtract and the flow of data from the input stream to the output stream.
(Image by the Author)
LangExtract stores examples as a list of custom class objects. Each example object has a 'text' property, which holds a sample snippet (here, text from a news article). Each example also carries one or more extractions: 'extraction_class' is the category label for that snippet, which the LLM learns to assign during execution (for instance, a news article about a cloud provider would be tagged under 'Cloud Infrastructure'), and 'extraction_text' is the reference output you provide. This reference output guides the LLM in inferring the closest output you would expect for a similar news snippet. Separately, the 'text_or_documents' parameter holds the actual dataset that requires structured extraction (in my example, the input documents are news articles).
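For reference, here is a minimal sketch of how one such example object could look; the snippet text, class name, and attribute names are purely illustrative.

```python
import langextract as lx

# A minimal, illustrative few-shot example object (company, class, and attributes are made up).
example = lx.data.ExampleData(
    text="NimbusWorks expanded its data-center footprint by 20% this quarter.",
    extractions=[
        lx.data.Extraction(
            extraction_class="Cloud Infrastructure",                  # category for this snippet
            extraction_text="expanded its data-center footprint by 20%",  # reference output span
            attributes={"company": "NimbusWorks", "metric": "capacity growth", "value": "20%"},
        )
    ],
)
```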
Few-shot prompting instructions are sent to the LLM of choice (model_id) through LangExtract. LangExtract's core extract() function gathers the prompt and examples and passes them to the LLM after adjusting the prompt internally to match the prompting style of the chosen LLM and to prevent model-specific discrepancies. The LLM then returns results one document at a time to LangExtract, which in turn yields them through a generator object. A generator is a transient stream: it yields each extracted value exactly once, much like a digital thermometer that gives you the current reading but doesn't store past readings for future reference. If the value in the generator object isn't captured immediately, it is lost.
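A tiny plain-Python illustration of this transience, independent of LangExtract:

```python
# A generator yields each value once; re-reading an exhausted generator gives nothing back.
def readings():
    yield 98.6  # like a thermometer's current reading

gen = readings()
print(list(gen))  # [98.6] - captured on the first (and only) read
print(list(gen))  # []     - the stream is exhausted; the value is gone
```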
Note that the 'max_workers' and 'extraction_passes' parameters are discussed in detail in the section 'Best Practices for Using LangExtract Effectively'.
Now that we’ve seen how LangExtract works and the data structures used by it, let’s move on to applying LangExtract in a real-world scenario.
A Hands-on Implementation of LangExtract
The use case involves gathering news articles related to the technology business domain from the Tech Xplore RSS feeds (https://techxplore.com/feeds/). We use feedparser and Trafilatura for URL parsing and extraction of article text. Prompts and examples are created by the user and fed to LangExtract, which performs orchestration to ensure that the prompt is tuned for the LLM being used. The LLM processes the data based on the prompt instructions and the examples provided, and returns the data to LangExtract. LangExtract then performs post-processing before displaying the results to the end user. Below is a diagram showing how data flows from the input source (RSS feeds) into LangExtract, and finally through the LLM to yield structured extractions.

Below are the libraries that have been used for this demonstration.
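A sketch of the imports, assuming each package has been installed via pip:

```python
# pip install langextract feedparser trafilatura pandas
import feedparser           # parse the RSS feed entries
import trafilatura          # download pages and extract the main article text
import pandas as pd         # assemble the final results into a dataframe
import langextract as lx    # orchestrate prompts, chunking, and structured extraction
```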
We begin by assigning the Tech Xplore RSS feed URL to the variable 'feed_url'. We then define a 'keywords' list containing keywords related to tech business. We define three functions to parse and scrape news articles from the feed. The function 'get_article_urls()' parses the RSS feed and retrieves each article's title and URL (link); feedparser is used to accomplish this. The 'extract_text()' function uses Trafilatura to extract the article text from each article URL returned by feedparser. The function 'filter_articles_by_keywords()' filters the retrieved articles based on the keywords list we defined.
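A sketch of these helpers is shown below; the feed URL and keyword list are placeholders you would adapt to your own run.

```python
feed_url = "https://techxplore.com/feeds/"  # placeholder: substitute the specific Tech Xplore feed you want
keywords = ["acquisition", "funding", "revenue", "chip", "cloud", "partnership"]  # illustrative tech-business terms

def get_article_urls(feed_url):
    """Parse the RSS feed and return (title, link) pairs for each entry."""
    feed = feedparser.parse(feed_url)
    return [(entry.title, entry.link) for entry in feed.entries]

def extract_text(url):
    """Fetch a page and extract the main article text with Trafilatura."""
    downloaded = trafilatura.fetch_url(url)
    return trafilatura.extract(downloaded) if downloaded else None

def filter_articles_by_keywords(articles, keywords):
    """Keep only articles whose text mentions at least one keyword."""
    filtered = []
    for title, url in articles:
        text = extract_text(url)
        if text and any(kw.lower() in text.lower() for kw in keywords):
            filtered.append({"title": title, "url": url, "text": text})
    return filtered

articles = get_article_urls(feed_url)
print(f"Found {len(articles)} articles in the RSS feed")
filtered_articles = filter_articles_by_keywords(articles, keywords)
print(f"Filtered articles: {len(filtered_articles)}")
```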
Upon running the above, we get the output:
Found 30 articles in the RSS feed
Filtered articles: 15
Now that the list of 'filtered_articles' is available, we go ahead and set up the prompt. Here, we give instructions that let the LLM understand the type of news insights we are interested in. As explained in the section "Data Structures and Workflow in LangExtract", we set up a list of examples using 'data.ExampleData()', an inbuilt data structure in LangExtract. In this case, we use few-shot prompting with multiple examples.
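A sketch of that setup follows; the prompt wording, example snippets, company names, and attribute keys are illustrative stand-ins for the ones used in my project.

```python
prompt = (
    "Extract business-relevant facts from technology news articles. "
    "For each fact, identify a category, quote the exact supporting text from the article, "
    "and provide attributes such as company, metric, and value."
)

examples = [
    lx.data.ExampleData(
        text="Acme Cloud reported a 30% rise in quarterly revenue driven by AI workloads.",
        extractions=[
            lx.data.Extraction(
                extraction_class="Cloud Infrastructure",
                extraction_text="30% rise in quarterly revenue",
                attributes={"company": "Acme Cloud", "metric": "revenue growth", "value": "30%"},
            )
        ],
    ),
    lx.data.ExampleData(
        text="ChipWorks will invest $2 billion in a new semiconductor fabrication plant.",
        extractions=[
            lx.data.Extraction(
                extraction_class="Semiconductors",
                extraction_text="invest $2 billion in a new semiconductor fabrication plant",
                attributes={"company": "ChipWorks", "metric": "capital investment", "value": "$2 billion"},
            )
        ],
    ),
]
```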
We initialize a list called 'results', then loop through the 'filtered_articles' corpus and perform the extraction one article at a time. The LLM output is made available through a generator object. As seen earlier, this is a transient stream, so the value yielded by 'result_generator' is appended to the 'results' list immediately. The 'results' variable ends up as a list of annotated documents.
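Here is a sketch of that loop. The model id is an assumption (swap in whichever provider you have configured), and each article is passed as a single-item batch so the call yields its annotated document lazily.

```python
results = []

for article in filtered_articles:
    # Capture the generator's output immediately - it cannot be re-read later.
    result_generator = lx.extract(
        text_or_documents=[article["text"]],   # single-article batch from the filtered corpus
        prompt_description=prompt,
        examples=examples,
        model_id="gemini-2.5-flash",           # assumed model id; use the model tied to your API key
    )
    results.extend(result_generator)
```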
We iterate through the results in a for loop to write each annotated document to a JSONL file. Though this is an optional step, it is useful for auditing individual documents if required. It is worth mentioning that the official LangExtract documentation offers a utility to visualize these documents.
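One way to write that file, assuming the document and extraction attributes used later in this walkthrough ('extractions', 'extraction_class', 'extraction_text', 'attributes'); adjust the field access if your installed version exposes a dedicated io helper instead.

```python
import json

with open("tech_news_extractions.jsonl", "w") as f:
    for doc in results:
        record = {
            "document_id": getattr(doc, "document_id", None),
            "extractions": [
                {
                    "extraction_class": e.extraction_class,
                    "extraction_text": e.extraction_text,
                    "attributes": e.attributes,
                }
                for e in doc.extractions
            ],
        }
        f.write(json.dumps(record) + "\n")
```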
We loop through the 'results' list to gather every extraction from each annotated document, one at a time. An extraction is nothing but one or more attributes requested by us in the schema. All such extractions are stored in the 'all_extractions' list, a flattened list of the form [extraction_1, extraction_2, …, extraction_n].
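The flattening step is a simple nested loop over the annotated documents:

```python
all_extractions = []
for doc in results:
    for extraction in doc.extractions:   # each extraction carries a class, a text span, and attributes
        all_extractions.append(extraction)

print(f"Total extractions: {len(all_extractions)}")
```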
We get 55 extractions from the 15 articles that were gathered earlier.
The final step involves iterating through the 'all_extractions' list to gather each extraction. The Extraction object is a custom data structure within LangExtract. The attributes are gathered from each extraction object; in this case, attributes are dictionary objects that hold a metric name and its value. The attribute/metric names match the schema we initially requested as part of the prompt (refer to the 'attributes' dictionary provided in the 'examples' list via the 'data.Extraction' objects). The final results are made available in a dataframe, which can be used for further analysis.
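A sketch of that final step; the resulting columns follow whatever attribute names you described in the examples.

```python
rows = []
for extraction in all_extractions:
    row = {
        "extraction_class": extraction.extraction_class,
        "extraction_text": extraction.extraction_text,
    }
    row.update(extraction.attributes or {})   # metric name/value pairs from the schema
    rows.append(row)

df = pd.DataFrame(rows)
print(df.head())
```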
Below is the output showing the first five rows of the dataframe –

Best Practices for Using LangExtract Effectively
Few-shot Prompting
LangExtract is designed to work with a one-shot or few-shot prompting structure. Few-shot prompting requires you to give a prompt along with a few examples that illustrate the output you expect the LLM to yield. This prompting style is especially useful in complex, multidisciplinary domains like trade and export, where data and terminology in one sector can be vastly different from those in another. Here's an example: one news snippet reads, 'The value of Gold went up by X', and another reads, 'The value of a particular type of semiconductor went up by Y'. Though both snippets say 'value', they mean very different things. For precious metals like gold, the value is based on the market price per unit, whereas for semiconductors it could mean the market size or strategic worth. Providing domain-specific examples helps the LLM fetch the metrics with the nuance the domain demands. The more examples, the better: a broad example set helps both the LLM and LangExtract adapt to different writing styles (across articles) and avoid misses in extraction.
Multi-Extraction Pass
A multi-extraction pass is the act of having the LLM revisit the input dataset more than once to fill in details missing from the output of the first pass. LangExtract guides the LLM to revisit the input multiple times, adjusting the prompt on each run, and it merges the intermediate outputs from the first and subsequent runs into a single result. The number of passes is provided using the 'extraction_passes' parameter of the extract() function. A single pass works, but setting this to 2 or more tends to yield an output that is more complete and better aligned with the prompt and the schema and attributes you provided in your prompt description.
Parallelization
When you have large documents that could potentially consume the permissible number of tokens per request, it is ideal to go for a sequential extraction process. Sequential extraction can be enabled by setting max_workers = 1, which causes LangExtract to have the LLM process the documents one at a time. If speed is key, parallelization can be enabled by setting max_workers to 2 or more, which makes multiple threads available for the extraction process. Additionally, the time.sleep() function can be used during sequential execution to ensure that the LLM's request quotas are not exceeded.
Both parallelization and multi-extraction pass can be set as below –
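As a hedged sketch (parameter defaults and exact behavior can vary by library version, and the model id is an assumption):

```python
result_generator = lx.extract(
    text_or_documents=[article["text"] for article in filtered_articles],
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",  # assumed model id
    extraction_passes=2,          # revisit each document to recover facts missed in the first pass
    max_workers=4,                # >1 enables parallel processing; set to 1 to force sequential runs
)
results = list(result_generator)  # materialize the transient stream immediately
```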
Concluding Remarks
In this article, we learnt how to use LangExtract for structured extraction use cases. By now, it should be clear that placing an orchestrator such as LangExtract in front of your LLM can help with prompt tuning, data chunking, output parsing, and schema alignment. We also saw how LangExtract operates internally, adapting few-shot prompts to suit the chosen LLM and parsing the raw output from the LLM into a schema-aligned structure.