pathikgandhi
Back to all posts
·5 min read

Introducing expo-pdf-text-extract: Native PDF Text Extraction for React Native & Expo

A free, open-source Expo module that extracts text from PDFs using native APIs. Works with Expo SDK 49+, full TypeScript support, processes documents in milliseconds.

TL;DR: A free, open-source Expo module that extracts text from PDFs using native APIs (iOS PDFKit + Android PDFBox). Works with Expo SDK 49+, full TypeScript support, processes documents in milliseconds. Install with npx expo install expo-pdf-text-extract → npmGitHub

If you’ve ever tried to extract text from a PDF in a React Native or Expo app, you know the frustration. You search npm, find a dozen packages that view PDFs, a few that convert them to images, and maybe one or two that claim to extract text but haven’t been updated since 2019.

I ran into this exact problem while building a React Native app that needed to parse PDF documents. After wasting hours evaluating options, I decided to build the solution I wished existed.

Meet expo-pdf-text-extract — a native Expo module that does exactly what the name says: extracts text from PDF files. Fast, simple, and open-source.

The Problem: PDF Text Extraction Shouldn’t Be This Hard

Here’s what the landscape looked like when I started:

PDF Viewer Libraries — Great for displaying PDFs, useless for getting the text out. WebView-based solutions can’t expose the underlying content to your JavaScript code.

JavaScript-Only Solutions — Libraries like pdf.js work, but they're slow. Parsing a multi-page document in JS means blocking your thread and watching your app stutter.

Commercial SDKs — PSPDFKit, PDFTron, and others offer excellent functionality. They also cost hundreds or thousands per month. Overkill when you just need the text.

OCR Libraries — If you’re extracting text from scanned documents (images), OCR makes sense. But most PDFs are digital — the text is already embedded. Running OCR on them is like taking a photo of your screen to copy text instead of pressing Ctrl+C.

What I needed was simple: call a function, pass a file path, get the text back. No subscriptions, no complexity, no performance penalties.

How expo-pdf-text-extract Works

Digital PDFs store text as structured data. When you open a PDF in any reader, it’s not rendering an image — it’s laying out actual text characters at specific coordinates. Both iOS and Android have built-in APIs to access this text layer.

expo-pdf-text-extract leverages these platform-native APIs:

  • iOS: Apple’s PDFKit framework (built into iOS, no external dependencies)
  • Android: Apache PDFBox library (the industry-standard open-source PDF library)

The native code extracts text directly from the PDF’s internal structure. No image processing, no machine learning, no network calls. A 10-page document typically processes in under 100 milliseconds.

This approach gives you 100% accuracy for digital PDFs because you’re reading the exact characters the document contains — not guessing what they might be.

Getting Started

Installation

npx expo install expo-pdf-text-extract

Note: This package requires Expo SDK 49+ and a development build. It won’t work in Expo Go because it includes native code.

If you haven’t created a development build before:

npx expo prebuild  
npx expo run:ios  # or run:android

Basic Usage

import { extractText, getPageCount, isAvailable } from 'expo-pdf-text-extract';
// Always check availability first  
if (!isAvailable()) {  
  console.log('PDF extraction not available on this platform');  
  return;  
}
// Extract all text from a PDF  
const handlePdfExtraction = async (filePath: string) => {  
  try {  
    const text = await extractText(filePath);  
    console.log('Extracted text:', text);  
      
    const pages = await getPageCount(filePath);  
    console.log(\`Document has ${pages} pages\`);  
  } catch (error) {  
    console.error('Extraction failed:', error);  
  }  
};

Extract Specific Pages

Need text from just the first page or a specific range? No problem:

import { extractTextFromPages } from 'expo-pdf-text-extract';
// Extract only pages 1-3 (1-indexed)  
const partialText = await extractTextFromPages(filePath, 1, 3);

Supported Path Formats

The package handles whatever path format you throw at it:

// Absolute paths  
await extractText('/var/mobile/.../document.pdf');
// file:// URIs (common from document pickers)  
await extractText('file:///path/to/document.pdf');
// content:// URIs (Android content providers)  
await extractText('content://com.android.providers/.../document.pdf');

Use expo-pdf-text-extract when:

  • Your PDFs are digitally created (invoices, policies, reports, receipts from most services)
  • You need fast, offline extraction
  • You want a simple, lightweight solution

Look elsewhere when:

  • Your PDFs are scanned images (you’ll need OCR)
  • You need to edit or annotate PDFs (consider a full-featured SDK)
  • You’re building for web only (pdf.js works well there)

What You Can Build With It

This package opens up a lot of possibilities for React Native apps:

  • Expense tracking — Extract data from invoices and receipts
  • Document management — Index and search PDF content locally
  • Form processing — Pull structured data from standardized PDF forms
  • Policy readers — Parse insurance or legal documents for key information
  • Receipt scanners — Read digital receipts from email attachments

I built this to solve my own problem — extracting and categorizing data from PDFs within my app. What used to be manual data entry now happens in milliseconds.

Try It Out

The package is fully open-source under the MIT license. You can:

If this solves a problem for you, I’d appreciate a ⭐ on GitHub. Found a bug or have a feature request? Open an issue — I’m actively maintaining this and genuinely want to hear what would make it more useful.

Sometimes the best tools are the simplest ones. Happy extracting!

Pathik Gandhi is a technology consultant and React Native developer. He builds things that solve problems he actually has, then shares them with the community. Find him on GitHub.

Also published on Medium