Introduction

Large Language Models (LLMs) are powerful tools for analyzing and extracting insights from text. However, working with large, messy documents (PDFs, Word files, scanned reports) presents unique challenges.

In this blog, I will walk you through:

  • What is Unstructured?
  • Why I moved away from PyPDF
  • Parsing an 8-page document (Dota 2 heroes guide) using Unstructured
  • How chunking improves parsing quality even with the free version
  • Limitations of the free version
  • Hands-on code examples and outputs

What is Unstructured?

Unstructured is an open-source Python library designed to help you extract text cleanly from documents like PDFs, DOCX, HTML, images, and more.

It comes in two flavors:

  • Unstructured library (free): local processing, customizable, no API costs, good for simpler documents.
  • Unstructured API (paid): handles complex documents, offers AI-based OCR for scanned/handwritten files, and provides advanced chunking strategies (by title, by page, by similarity).

Why I Switched to Unstructured After PyPDF

I initially used PyPDFLoader (LangChain's wrapper for PyPDF) for document parsing. It worked well until documents became complex and I began to notice:

  • Weird characters: text full of unreadable symbols.
  • Extra spaces: letters spaced out weirdly.
  • Jumbled order: headings and paragraphs mixed up.

Especially with longer PDFs (100+ pages, images, and multiple columns), PyPDF struggled.

After struggling for days, I moved to Unstructured and immediately saw clean, structured, usable outputs!

Before We Proceed: Key Terms

📚 Key Definitions

Elements: When Unstructured parses a document, it outputs a list of Element objects, such as Title, NarrativeText, ListItem, etc.

Chunking: Chunking is splitting a large document into smaller parts ("chunks") to feed into LLMs more effectively. (Example: split by page, heading, or token size.)
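As a quick taste of what that looks like with the free library, here is a minimal sketch using Unstructured's built-in chunk_by_title helper. It assumes you already have a list of parsed elements (we produce one in STEP 3), and the parameter values are purely illustrative.

from unstructured.chunking.title import chunk_by_title

# elements is the list returned by partition_pdf() in STEP 3
chunks = chunk_by_title(
    elements,
    max_characters=1000,             # hard limit per chunk (illustrative value)
    combine_text_under_n_chars=200,  # merge tiny sections into their neighbours
)

for chunk in chunks:
    print(chunk.text[:100], "...")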

STEP 1: Setting up the Project

We are going to use Python. I have covered all the commands you need to quickly set up the project below. Since this blog focuses on Unstructured, I assume you have basic Python knowledge. Please note that I am using Linux; where necessary, use the equivalent commands for Mac/Windows.

mkdir unstructured-demo
cd unstructured-demo

python3 -m venv venv
source venv/bin/activate
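# On Windows, activate the environment with: venv\Scripts\activate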

# Install required packages
pip install unstructured "unstructured[pdf]" google-genai
💡 Package Information

Why "unstructured[pdf]"?

The basic unstructured installation supports only a handful of file types (such as HTML and JSON) and does not cover PDF. My document is a PDF, so we need unstructured[pdf]; this also lets us avoid pulling in the much larger unstructured[all-docs] extra.

What's google-genai?

Google AI Studio offers a free plan for using their API. We will be using the LLM to compare outputs later. What's the fun without an LLM, eh?

Google AI Studio Plan
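For reference, here is a minimal sketch of how we will call Gemini through google-genai later. The API key and the model name ("gemini-2.0-flash") are placeholders; swap in your own key and whichever model your free plan allows.

from google import genai

# Replace with your own API key from Google AI Studio
client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-2.0-flash",  # placeholder; any model available on the free plan works
    contents="Say hello in one sentence.",
)
print(response.text)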
⚠️ Troubleshooting Tip

I initially tried Python 3.13.2, but I ran into an error while trying to install unstructured[pdf].

ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (onnx)

I tried downgrading to Python 3.11 and then it worked.

Reference: https://github.com/onnx/onnx/issues/4376

STEP 2: Preparing the document for Parsing

I have prepared a document for this example (content taken directly from the official Dota 2 website). It is 8 pages long and covers two Dota 2 heroes (Dota 2 is a game, and one of my favourites). It is shared via the Drive link below, so you can use the same document or substitute your own.

https://docs.google.com/document/d/1-XjHYMiSa3ek1LXLz0SsUdkAV8qDz9Kl5gfjrUCzNiQ/edit?usp=sharing

Optional: Downloading the Drive Document Programmatically

💡 Note

You can skip this part; just download the file manually and place it in your working directory.

Create download_file.py:

import requests


def download_file_from_google_drive(file_id, destination):
    URL = f"https://drive.google.com/uc?id={file_id}"

    response = requests.get(URL)
    if response.status_code == 200:
        with open(destination, "wb") as f:
            f.write(response.content)
        print(f"File downloaded successfully to {destination}")
    else:
        print("Failed to download file.")


if __name__ == "__main__":
    file_id = "1jehJZbVojANGXRpZJKYcT2c05xIlugxO"
    destination = "dota2_heroes.pdf"

    download_file_from_google_drive(file_id, destination)

Run python download_file.py and then you will have a file called dota2_heroes.pdf inside your project.

💡 Google Drive Links

Google Drive viewer links won't download directly.

The document link (it is a PDF) is:

https://drive.google.com/file/d/1jehJZbVojANGXRpZJKYcT2c05xIlugxO/view?usp=sharing

I need to extract the file ID & create a direct download link:

https://drive.google.com/uc?id=1jehJZbVojANGXRpZJKYcT2c05xIlugxO

This way, I can programmatically download it.
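If you prefer not to do that by hand, here is a small helper that pulls the file ID out of a viewer link. It assumes the link follows the usual /file/d/<id>/view pattern.

import re

viewer_url = "https://drive.google.com/file/d/1jehJZbVojANGXRpZJKYcT2c05xIlugxO/view?usp=sharing"

# Grab the file ID from the /file/d/<id>/ part of the URL
match = re.search(r"/file/d/([^/]+)", viewer_url)
if match:
    file_id = match.group(1)
    print(f"https://drive.google.com/uc?id={file_id}")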

STEP 3: Parsing the Downloaded Document

Create parse_doc.py & run python parse_doc.py:

from unstructured.partition.pdf import partition_pdf

doc_path = "dota2_heroes.pdf"

# Parse the document
elements = partition_pdf(filename=doc_path)

# Print parsed elements
for element in elements:
    print(f"{element.category}: {element.text[:200]}...\n")

Now you can see a summary of each parsed element (up to the first 200 characters) printed in your terminal. Note how it has identified titles, list items, and narrative text, which is much cleaner than raw PyPDF output.

Terminal Output

How Unstructured identifies formatting & categorizes

Exploring Parsed Elements with Unstructured

When you run partition_pdf(), Unstructured doesn't just give you plain text; it gives you a list of Element objects, each carrying useful metadata and category information.

Each element has properties like category, text and page_number.

from unstructured.partition.pdf import partition_pdf

doc_path = "dota2_heroes.pdf"
elements = partition_pdf(filename=doc_path, strategy="fast")

for element in elements:
    print(f"Category: {element.category}")
    print(f"Page: {getattr(element.metadata, 'page_number', 'Unknown')}")
    print(f"Text: {element.text[:200]}...\n")  # Limit to first 200 characters

print(f"Total number of elements: {len(elements)}")

Parsed Elements Output


Why is This Useful?

✅ Benefits of Structured Parsing
  • You can filter only specific parts easily (example: get all Titles; see the sketch after this list)
  • You can analyze page by page
  • You can reconstruct a clean structured version of the PDF
  • Later, you can build your own custom chunks (like splitting by Title!)
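For instance, here is a minimal sketch of the first two points, reusing the elements list we parsed above; it keeps only the Title elements and then groups everything page by page.

from collections import defaultdict

# Keep only the Title elements
titles = [el for el in elements if el.category == "Title"]
for title in titles:
    print(title.text)

# Group elements page by page
pages = defaultdict(list)
for el in elements:
    pages[el.metadata.page_number].append(el)
print(f"Elements on page 1: {len(pages[1])}")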

Example

When you have metadata like page_number or a custom document index, you can filter chunks at retrieval time, without re-parsing or manually splitting documents.

For example, in a RAG (Retrieval Augmented Generation) system using LangChain's RetrievalQA, you can apply a filter like this:

from langchain.chains import RetrievalQA

# llm, vectorstore, prompt and blueprint_k are assumed to be defined earlier
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectorstore.as_retriever(
        search_type="mmr",
        search_kwargs={
            "k": blueprint_k,              # How many results to return
            "lambda_mult": 0.25,           # Balance relevance vs. diversity
            "filter": {"page_number": 1},  # Filter only relevant docs!
        },
    ),
    chain_type_kwargs={"prompt": prompt},
)
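For that filter to work, page_number has to be stored as metadata when the vectorstore is built. Here is a minimal sketch of that step, assuming elements comes from partition_pdf() and your vectorstore (e.g. Chroma) supports metadata filters:

from langchain.schema import Document

# Wrap each Unstructured element as a LangChain Document,
# carrying page_number along as filterable metadata
docs = [
    Document(
        page_content=el.text,
        metadata={"page_number": el.metadata.page_number},
    )
    for el in elements
    if el.text.strip()
]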

Limitations: Table Parsing with the Free Library

While the Unstructured library performs well for most structured documents, it currently struggles with complex tables.

The table

The table is not correctly parsed

As you can see, the row/column structure is lost. This happens because the free library doesn't yet support advanced table detection or OCR alignment. If you like, you can try the paid Unstructured API and see how it tackles this; they currently offer a free one-week trial.

Summary

In this blog, we:

  • Set up the project environment
  • Parsed a real-world PDF document
  • Discussed parsing issues and limitations
  • Got a clean, usable text output ready for LLMs

Next Step? 📚

While still relying on the free version of Unstructured, you can try manually grouping the parsed elements into chunks, following different strategies, and later feed them into a RAG system. In the next blog, we'll take this parsed data, build different chunking strategies (page-wise, overlapping pages, custom), and compare how each one impacts LLM answers!