Extracting Financial Data from SEC 10-K Filings with LLMs
Finding structure in SEC filings
Every year, U.S. public companies file a comprehensive financial report called a 10-K. These reports are filed in HTML format, which can complicate automated parsing. Traditional approaches to extracting data from these HTML tables are challenging, time-consuming, and often imprecise. What should be a straightforward data extraction task becomes an endless game of whack-a-mole with edge cases.
Enter LLMs using structured generation.
This blog post demonstrates how we can use structured text generation to cut through the chaos and extract clean, consistent data directly from 10-K reports. We'll show you how to transform messy HTML tables into neat CSVs (easily read by Excel) that are primed for analysis.
Existing solutions
Several common approaches to parsing 10-K filings exist, but each has its limitations:
- Manual Extraction: Time-consuming and prone to errors.
- Custom Parsing Tools: Require frequent updates as companies alter their reporting formats.
- SEC's XBRL Format: Though machine-readable, the use of custom tags by companies hinders effective cross-company comparison.
Unfortunately, none of these methods provides a complete solution to the challenge.
A solution: structured generation
Fortunately, we can use structured generation to extract the information we need from the 10-K directly into tabular data. Feed some ugly text into our model, get a fresh CSV on the other side.
Put simply, we're going to go from this unpleasant mess:
Figure 1: NVIDIA's income statement
to this clean, tidy CSV:
   year  revenue  operating_income  net_income
0  2024    60922             32972       29760
1  2023    26974              4224        4368
2  2022    26914             10041        9752
Our goal: extracting key financial metrics
Let's focus on three essential numbers from a company's income statement: revenue, operating income, and net income.
Some definitions for you:
- Revenue: The total amount of money earned by a company from its primary business activities, typically from selling goods or services.
- Operating Income: The profit earned from a company's core business operations, calculated by subtracting operating expenses from revenue.
- Net Income: The company's total profit after all expenses, taxes, and other costs have been deducted from revenue.
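To make the definitions concrete, here is a small arithmetic check using the fiscal 2024 figures from the NVIDIA income statement that appears later in this post. It is only a sketch: the variable names are ours, and it assumes that cost of revenue counts among the costs subtracted on the way to operating income, which is how NVIDIA's statement lays it out.

```python
# Fiscal 2024 line items as they appear in NVIDIA's income statement
revenue = 60_922
cost_of_revenue = 16_621
total_operating_expenses = 11_329
other_income_net = 846
income_tax_expense = 4_058

# Revenue minus the direct cost of what was sold
gross_profit = revenue - cost_of_revenue

# Operating income: what remains after operating expenses
operating_income = gross_profit - total_operating_expenses

# Net income: operating income plus non-operating items, minus taxes
income_before_tax = operating_income + other_income_net
net_income = income_before_tax - income_tax_expense

print(gross_profit, operating_income, net_income)
```

The three printed values match the gross profit, operating income, and net income rows in the statement, which is the kind of internal consistency we can later use to sanity-check extracted numbers.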
Let's take a look at what these three metrics look like in earnings reports.
What the reports look like
Here are two examples of income statements.
The first is Microsoft. We can see revenue at the top, operating income in the middle, net income on the bottom, and three columns, one per reporting year.
Figure 2: Microsoft's income statement
Let's compare that to Alphabet's income statement.
Figure 3: Alphabet's income statement
Note a few significant issues that complicate simple parsing strategies:
- Inconsistent naming ("Total Revenue" vs. "Revenues")
- Order and field formatting are completely different
- Year column ordering varies
The takeaway: these are not the same documents, but they communicate the same meaning.
Thankfully, language models are great at understanding meaning without being too caught up in details like row formatting or differences in naming conventions.
The four steps to extract earnings information
We need to do four things to work with 10-K reports.
- Clean up the HTML 10-K into a simpler format
- Find the income statement
- Define the shape of the model's output
- Extract the data we want from the income statement
Preprocessing the text
The 10-K's native HTML format contains considerable markup that adds little information while eating up valuable context.
We could pass the HTML directly into our model, but this is costly, slow, and inefficient.
We do not like inefficiency at .txt.
When we work with language models, we want text that contains the most information with the fewest tokens. Markdown is a great example of an efficient format. HTML can be easily converted to markdown using a package like markdownify.
Take a look at data we've converted:
FORM 10-K ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the fiscal year ended January 28, 2024 OR TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 Commission file number: 0-23985 NVIDIA CORPORATION (Exact name of registrant as specified in its charter) Delaware
Here's the original HTML:
<div style="text-align:center"> <span style="color:#000000;font-family:'NVIDIA Sans',sans-serif;font-size:13pt;font-weight:700;line-height:120%"> FORM <ix:nonnumeric contextref="c-1" name="dei:DocumentType" id="f-1">10-K</ix:nonnumeric> </span> </div> <div style="text-align:center"> <table style="border-collapse:collapse;display:inline-table;margin-bottom:5pt;vertical-align:text-bottom;width:93.750%"> <tbody> <tr> <td style="width:1.0%"></td> <td style="width:2.650%"></td> <td style="width:0.1%"></td> <td style="width:1.0%"></td> <td style="width:95.150%"></td> <!-- HTML continues -->
The markdown version is much better: the same content, with far fewer tokens spent on markup.
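To get a feel for how much of a filing is markup, here is a minimal sketch that strips tags with Python's standard-library `html.parser`. This is a stand-in for illustration only, not markdownify: it keeps no table or heading structure, and the names `TextExtractor` and `strip_markup` are ours. Character counts are used as a rough proxy for token counts.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only text nodes, dropping tags, attributes, and comments."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def strip_markup(html):
    """Return the visible text of an HTML fragment as a single line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# A shortened version of the opening lines of the NVIDIA filing
html = (
    '<div style="text-align:center">'
    '<span style="color:#000000;font-size:13pt;font-weight:700">'
    'FORM <ix:nonnumeric contextref="c-1" name="dei:DocumentType" '
    'id="f-1">10-K</ix:nonnumeric>'
    "</span></div>"
)

text = strip_markup(html)
print(text)
print(f"{len(html)} characters of HTML -> {len(text)} characters of text")
```

In practice a markdown converter is the better choice, because it preserves the table layout the model needs to read the income statement.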
Finding the income statement
The markdown version of our HTML file is still a large document, in excess of 120k tokens. We could pass the entire document into a language model like Phi 3.5, but filling the entire window is slow and requires significant hardware. Further, providing too much information to your model can often confuse it.
Income statements in the 10-K may span multiple pages, occur in different locations across reports, or generally be difficult to find in an automated way.
What we're going to do is classify whether each page is "about" the income statement. Pages classified as income statement-related are used for extraction, and we discard all other pages.
With Outlines, this classifier is trivial:
import outlines

# Load our model
model = outlines.models.transformers("microsoft/Phi-3.5-mini-instruct")

# Classification function
yesno = outlines.generate.choice(model, ['Yes', 'Maybe', 'No'])

# Requesting a classification from the model
result = yesno(
    "Is the following document about an income statement? Document: {...}"
)

# Do something if it's a "Yes"
if result == 'Yes':
    ...
Our demo provides more sophisticated classification prompts and shows how to iterate through all the pages of the 10-K.
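The page-filtering loop might look something like the sketch below. It is not the demo's code: `select_income_statement_pages` is a hypothetical helper, the document is assumed to be already split into a list of page strings, and `keyword_classifier` is a crude stand-in so the example runs without a model. In real use you would pass the `yesno` generator as `classify`.

```python
def select_income_statement_pages(pages, classify, keep=("Yes",)):
    """Return the pages whose classification label is in `keep`.

    `classify` is any callable that takes a prompt string and returns
    "Yes", "Maybe", or "No".
    """
    selected = []
    for page in pages:
        prompt = (
            "Is the following document about an income statement? "
            "Document: " + page
        )
        if classify(prompt) in keep:
            selected.append(page)
    return selected


def keyword_classifier(prompt):
    """Stand-in for the model: counts income statement terms in the prompt."""
    text = prompt.lower()
    terms = ("revenue", "operating income", "net income")
    hits = sum(term in text for term in terms)
    if hits == len(terms):
        return "Yes"
    return "Maybe" if hits > 0 else "No"


pages = [
    "Risk factors: supply chain disruptions may affect our business.",
    "| Revenue | 60,922 |\n| Operating income | 32,972 |\n| Net income | 29,760 |",
    "Management discussion: revenue grew on data center demand.",
]

income_pages = select_income_statement_pages(pages, keyword_classifier)
print(len(income_pages))
```

Passing `keep=("Yes", "Maybe")` trades precision for recall, which may be worthwhile if the income statement spans several pages.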
After the classification step, we'll usually have a chunk of the 10-K that contains the income statement in markdown format:
| | | | | | | | | | | | | | | | | | |
|--------------------------------------------------------|---|---|--------------|--------|---|---|---|---|--------------|--------|---|---|---|---|--------------|--------|---|
| | | | Year Ended | | | | | | | | | | | | | | |
| | | | Jan 28, 2024 | | | | | | Jan 29, 2023 | | | | | | Jan 30, 2022 | | |
| Revenue | | | $ | 60,922 | | | | | $ | 26,974 | | | | | $ | 26,914 | |
| Cost of revenue | | | 16,621 | | | | | | 11,618 | | | | | | 9,439 | | |
| Gross profit | | | 44,301 | | | | | | 15,356 | | | | | | 17,475 | | |
| Operating expenses | | | | | | | | | | | | | | | | | |
| Research and development | | | 8,675 | | | | | | 7,339 | | | | | | 5,268 | | |
| Sales, general and administrative | | | 2,654 | | | | | | 2,440 | | | | | | 2,166 | | |
| Acquisition termination cost | | | | | | | | | 1,353 | | | | | | | | |
| Total operating expenses | | | 11,329 | | | | | | 11,132 | | | | | | 7,434 | | |
| Operating income | | | 32,972 | | | | | | 4,224 | | | | | | 10,041 | | |
| Interest income | | | 866 | | | | | | 267 | | | | | | 29 | | |
| Interest expense | | | (257) | | | | | | (262) | | | | | | (236) | | |
| Other, net | | | 237 | | | | | | (48) | | | | | | 107 | | |
| Other income (expense), net | | | 846 | | | | | | (43) | | | | | | (100) | | |
| Income before income tax | | | 33,818 | | | | | | 4,181 | | | | | | 9,941 | | |
| Income tax expense (benefit) | | | 4,058 | | | | | | (187) | | | | | | 189 | | |
| Net income | | | $ | 29,760 | | | | | $ | 4,368 | | | | | $ | 9,752 | |
| | | | | | | | | | | | | | | | | | |
| Net income per share: | | | | | | | | | | | | | | | | | |
| Basic | | | $ | 12\.05 | | | | | $ | 1.76 | | | | | $ | 3.91 | |
| Diluted | | | $ | 11\.93 | | | | | $ | 1.74 | | | | | $ | 3.85 | |
| | | | | | | | | | | | | | | | | | |
| Weighted average shares used in per share computation: | | | | | | | | | | | | | | | | | |
| Basic | | | 2,469 | | | | | | 2,487 | | | | | | 2,496 | | |
| Diluted | | | 2,494 | | | | | | 2,507 | | | | | | 2,535 | | |
The markdown table is easy enough for us to read, but it still has the same machine readability issues as before:
- Inconsistent names and line item locations
- Varying line item order
- Extraneous information and formatting
These challenges make it difficult for traditional rule-based systems to accurately and consistently extract the financial data we need.
This is where we can use structured generation to extract the numbers we want directly into a CSV.
Telling our model exactly what we want
Remember those Excel spreadsheets you've worked with? That's essentially what we're trying to create here – a neat table where each row represents a year of financial data, with columns for the values we want to extract.
We want our final output to look like this:
year  revenue  operating_income  net_income
2024    60922             32972       29760
2023    26974              4224        4368
2022    26914             10041        9752
Outlines allows you to specify a "structure" for your language model's output using regular expressions. In our case, this means writing a regular expression that describes the type of each column in the CSV.
Think of it like creating a template:
- Year should always be four digits (like 2024, not just 24)
- Numbers should be whole (no decimals)
- Every value needs to be separated by commas
- Numbers can be positive or negative
- No dollar signs or other special characters allowed
Using structured generation, we can enforce these rules using something called a "regular expression" – essentially a template that our model must follow. The regular expression looks pretty scary:
year,revenue,operating_income,net_income(\n(\\d{4}),((-?\\d+),?\\d+|(\\d+)),((-?\\d+),?\\d+|(\\d+)),((-?\\d+),?\\d+|(\\d+))){,3}\n\n
Raw regular expressions can be difficult to read, so here's a quick diagram to illustrate what each piece of it means:
Figure 4: CSV regular expression diagram
Don't worry too much about understanding the regular expression if you're not familiar with them. We've included a helper function in the demo that makes creating these patterns much easier. The important thing is that this pattern acts like a strict template, ensuring our model always produces clean, consistent data that's ready for analysis.
Structured generation with Outlines will only produce output that is consistent with this format. It cannot omit a comma, emit a two-digit year like 24 instead of 2024, and so on. Any structured output produced using this regular expression will always be parsable by standard tabular data tools like pandas, Excel, and databases.
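To show how such a pattern can be assembled and checked, here is a simplified sketch. `make_csv_regex` is a hypothetical helper, not the one shipped in the demo, and its pattern is deliberately stricter than the one shown above: it assumes plain column names, a four-digit year in the first column, and signed whole numbers everywhere else. We validate it with Python's `re` module; the regex engine used for generation may support a slightly different syntax.

```python
import re


def make_csv_regex(columns, max_rows=3):
    """Build a regex for a CSV with a header row and 1..max_rows data rows.

    The first column is a four-digit year; every other column is a
    whole number with an optional leading minus sign.
    """
    header = ",".join(columns)
    year = r"\d{4}"
    integer = r"-?\d+"
    row = r"\n" + ",".join([year] + [integer] * (len(columns) - 1))
    # Header, then the repeated row group, then a blank line to end the table
    return rf"{header}({row}){{1,{max_rows}}}\n\n"


columns = ["year", "revenue", "operating_income", "net_income"]
pattern = make_csv_regex(columns)

good = (
    "year,revenue,operating_income,net_income\n"
    "2024,60922,32972,29760\n"
    "2023,26974,4224,4368\n"
    "2022,26914,10041,9752\n\n"
)
two_digit_year = "year,revenue,operating_income,net_income\n24,60922,32972,29760\n\n"
dollar_sign = "year,revenue,operating_income,net_income\n2024,$60922,32972,29760\n\n"

for candidate in (good, two_digit_year, dollar_sign):
    print(bool(re.fullmatch(pattern, candidate)))
```

Only the first candidate matches. Testing the pattern against hand-written strings like these is a cheap way to catch template mistakes before spending any model time.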
Extracting our data
Now comes the fun part – extracting the data we want with our LLM. Let's walk through how we turn that messy financial statement into clean, usable numbers.
Equipped with our income statement and the expected output format, we can build an "extractor" function. This function will accept a prompt containing the income statement and some processing instructions, and return text in the format we specified.
# Make an extractor function
csv_extractor = outlines.generate.regex(
    model,
    csv_regex,
    sampler=outlines.samplers.greedy()
)
With our extractor ready, we can feed it the income statement. Calling csv_extractor asks our LLM to generate new tokens:
# Extract our table
csv_data = csv_extractor(
    extract_financial_data_prompt(
        columns_to_extract,
        income_statement
    )
)
We'll omit the full prompt for brevity, but it can be found here.
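For readers who want a starting point, a prompt builder could be as simple as the sketch below. This is a hypothetical illustration, not the prompt used in the demo: the function name `build_extraction_prompt` and the wording are ours, and an instruction-tuned model will usually also need its chat template applied.

```python
def build_extraction_prompt(columns, income_statement):
    """Combine extraction instructions with the income statement text."""
    column_list = ", ".join(columns)
    return (
        "Extract the following columns from the income statement below "
        f"and return them as CSV: {column_list}.\n"
        "Use one row per fiscal year, whole numbers only, and no "
        "currency symbols or thousands separators.\n\n"
        f"Income statement:\n{income_statement}\n"
    )


columns_to_extract = ["year", "revenue", "operating_income", "net_income"]
income_statement = "| Revenue | $ | 60,922 |\n| Net income | $ | 29,760 |"

prompt = build_extraction_prompt(columns_to_extract, income_statement)
print(prompt)
```

Repeating the formatting rules in the prompt may seem redundant given the regular expression, but it tends to help the model choose sensible values within the enforced format.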
And just like that, we get our clean, structured data for NVIDIA:
year,revenue,operating_income,net_income
2024,60922,32972,29760
2023,26974,4224,4368
2022,26914,10041,9752
If you're used to working with tabular data tools like pandas, you can convert this into a standard, programmable data frame with
from io import StringIO
import pandas as pd

# Load into a dataframe
df = pd.read_csv(StringIO(csv_data))

# Sort by year (oldest first) to match the table below
df = df.sort_values("year")
Which gives you a nice, organized table:
   year  revenue  operating_income  net_income
2  2022    26914             10041        9752
1  2023    26974              4224        4368
0  2024    60922             32972       29760
That's it! See how simple it was to convert these horrible HTML documents to a simple CSV?
Verifying our results
While structured generation is a powerful extraction approach, it still has the usual language model issues: the output is guaranteed to have the right format, but not the right values.
Before using the approach, try to follow some best practices:
- Test Thoroughly: We've tested this approach with reports from Microsoft, NVIDIA, and Alphabet, comparing the results against manually extracted data. You can find these tests in our repository.
- Know the Limitations: Getting reliable results requires careful prompt engineering and data cleaning. LLMs can invent information.
- Stay Updated: SEC filing formats and structures may change over time. Regularly review and update your extraction methods to ensure they remain effective and accurate.
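One inexpensive check, sketched below, is to confirm that every extracted number actually appears in the source income statement. This is our own illustration rather than part of the demo's test suite, and `find_unsupported_values` is a hypothetical name. It assumes the statement writes numbers with thousands separators and shows negatives in parentheses, so it compares magnitudes only.

```python
import csv
import re
from io import StringIO


def find_unsupported_values(csv_data, source_text):
    """Return (year, column, value) for values that never appear in the source."""
    unsupported = []
    for row in csv.DictReader(StringIO(csv_data.strip())):
        for column, value in row.items():
            if column == "year":
                continue
            # 60922 -> "60,922", the form used in the statement
            formatted = f"{abs(int(value)):,}"
            # Require a standalone number, not a fragment of a longer one
            pattern = rf"(?<![\d,]){re.escape(formatted)}(?!,?\d)"
            if not re.search(pattern, source_text):
                unsupported.append((row["year"], column, int(value)))
    return unsupported


statement = (
    "| Revenue | $ | 60,922 | $ | 26,974 |\n"
    "| Operating income | 32,972 | 4,224 |\n"
    "| Net income | $ | 29,760 | $ | 4,368 |\n"
)
good_csv = (
    "year,revenue,operating_income,net_income\n"
    "2024,60922,32972,29760\n"
    "2023,26974,4224,4368\n\n"
)
# Same data with two digits of net income transposed
bad_csv = "year,revenue,operating_income,net_income\n2024,60922,32972,29670\n\n"

print(find_unsupported_values(good_csv, statement))
print(find_unsupported_values(bad_csv, statement))
```

This only shows that a value exists somewhere in the source, not that it sits in the right row or year, so it complements rather than replaces comparison against manually extracted data.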
Conclusion
We've explored how structured generation can be used to extract financial data from SEC filings, specifically focusing on income statements from 10-K reports. Structured generation can transform chaotic HTML tables into clean, analysis-ready data.
The implications of this approach extend far beyond earnings reports. The same technique could be applied to:
- Extracting specific data from scientific papers
- Converting legal documents to structured contract terms
- Cleaning product catalogs for E-commerce sites
- Standardizing patient data from medical records
However, it's crucial to remember that while this method is powerful, it's not infallible. As with any AI-driven process, the results should be verified and validated, especially when dealing with critical data.
With structured generation, we're not just parsing documents; we're unlocking the potential of human-readable data at machine scale.
Getting started
Want to try this yourself? Check out the full demonstration repo to play with all the detailed code, or see our cookbook for a simple example.