How to Perform OCR with Multi-Modal LLMs

Optical Character Recognition (OCR) technology has made significant strides recently, especially with the introduction of multi-modal Large Language Models (LLMs): models that can understand and interpret complex visual data alongside text.

In this tutorial, I'll explore a proof-of-concept OCR project that uses the vision capabilities of multi-modal LLMs (namely, OpenAI's GPT-4 with vision or Anthropic's Claude 3) in conjunction with AWS Rekognition to accurately and efficiently extract line items from grocery receipts. This project shows how text within images can be transformed into structured JSON quickly and reliably.

All the code mentioned in this blog post is available in the companion GitHub repository.

Prerequisites

Before we start, ensure you have the following prerequisites ready:

  • Node.js and npm installed on your machine

  • An Anthropic API key (for Claude 3) and/or an OpenAI API key (for GPT-4 with vision)

  • AWS credentials with access to Amazon Rekognition

  • git, for cloning the companion repository

Installation Steps

Clone the Repository

First, clone the repository to your local machine using git:

git clone git@github.com:stephenc222/example-ocr-with-multi-modal-llms.git

Navigate to the Project Directory

Change your current directory to the project folder with cd.
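By default, git clones the project into a folder named after the repository, so this should be:

cd example-ocr-with-multi-modal-llms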

Install Dependencies

Run npm install within the project directory to install all required dependencies, including the Anthropic AI SDK, OpenAI SDK, AWS SDK, and TypeScript.

Configure Environment Variables

Create a .env file at the root of your project and input your API keys for Anthropic AI and OpenAI, along with your AWS credentials. Use the example.env file as a guide.
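As a rough illustration of what that configuration makes available (the variable names below are assumptions; check example.env for the real ones), the application can read these values from process.env at startup, typically loaded by a package such as dotenv:

import "dotenv/config" // loads the variables defined in .env into process.env

// Hypothetical variable names, for illustration only
const anthropicApiKey = process.env.ANTHROPIC_API_KEY
const openaiApiKey = process.env.OPENAI_API_KEY
const awsRegion = process.env.AWS_REGION

// Fail fast if neither LLM provider is configured
if (!anthropicApiKey && !openaiApiKey) {
  throw new Error("Set an Anthropic or OpenAI API key in your .env file")
}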

Running the Project

To run the project, simply execute npm start. This command will trigger the script specified in package.json, starting the TypeScript Node.js application.

The application then processes a predefined image of a paper grocery store receipt to perform OCR, using Claude 3 (or GPT-4 with vision, if you don't have Claude 3 access) together with AWS Rekognition, and outputs the extracted line items from the receipt image to the console.

Understanding the Code

For the sake of brevity, this blog post won't cover the implementation details of the rekognitionService, llmService, and the other classes we are using, but you can find the full implementation of each class in the companion GitHub repository.

We will, however, go over the root of the project, the src/index.ts file. Here is the full src/index.ts:

import fs from "fs"
import path from "path"
import { fileURLToPath } from "url"
import { LLMService } from "./services/llm"
import { RekognitionService } from "./services/rekognition"
import { ImageUtil } from "./utils/image"
import { createAIService } from "./factories/aiService"
import { z } from "zod"
import { IAIService } from "./types"

// This is an ES module, so there is no __dirname; reconstruct it from import.meta.url
const __dirname = path.dirname(fileURLToPath(import.meta.url))

const rekognitionService = new RekognitionService()

// Use the factory function to create the aiService instance
const aiService: IAIService = createAIService()

const llmService = new LLMService(aiService)

/**
 * Crops the image to its detected text region, preprocesses it, and asks the
 * LLM to convert the text it reads into JSON matching the given schema.
 *
 * @param imageData base64-encoded image data
 * @param prompt instruction describing the image and the expected JSON structure
 * @param schema Zod schema the LLM's JSON response must satisfy
 * @returns the LLM response (including the parsed result), or undefined if an error occurred
 */
async function detectAndProcessImage<T>(
  imageData: string,
  prompt: string,
  schema: z.ZodType<T>
) {
  try {
    const imageBuffer = Buffer.from(imageData, "base64")
    const areaOfInterest = await rekognitionService.findTextAreaOfInterest(
      imageBuffer
    )
    const processedImage = await ImageUtil.extractAndProcessImage(
      areaOfInterest,
      imageBuffer
    )
    return llmService.imageToJSON<T>(processedImage, prompt, schema)
  } catch (error) {
    console.error("Error detecting labels:", error)
  }
}

async function main() {
  // Example usage
  const exampleImagePath = path.resolve(
    __dirname,
    "images",
    "small_test_receipt.jpg"
  )
  const exampleImageFileData = fs
    .readFileSync(exampleImagePath)
    .toString("base64")

  // example grocery store receipt prompt
  const exampleGroceryPrompt =
    "Please analyze this paper store receipt and return a JSON object " +
    'containing an array of line items. The array key is "items". Each line ' +
    "item should be an object with two properties: 'name' for the item's name " +
    "and 'price' for its price. Exclude categories and only include specific " +
    "item entries. Only respond with a raw JSON string, no markdown and do not " +
    "escape '\"'."
  // example schema, for our grocery store receipt:
  const ExampleGroceryStoreReceiptSchema = z.object({
    items: z.array(
      z.object({ name: z.string(), price: z.number() }).required()
    ),
  })

  // pass example image data, prompt, and schema to the detectAndProcessImage function
  const output = await detectAndProcessImage<
    z.infer<typeof ExampleGroceryStoreReceiptSchema>
  >(
    exampleImageFileData,
    exampleGroceryPrompt,
    ExampleGroceryStoreReceiptSchema
  )

  console.group("Output Details")
  console.log("Result:", output?.result)
  console.log("Message ID:", output?.id)
  console.log("Message Role:", output?.role)
  console.log("Message Usage:", output?.usage)
  console.groupEnd()
}

// main is async, so catch rejections on the returned promise
main().catch((error) => {
  console.error("Error in main:", error)
})

This OCR project hinges on several key entities that work in tandem to process images and extract text effectively. Here's an overview of these entities:

  • rekognitionService: This AWS Rekognition service client identifies the specific region of text within an image, such as a grocery receipt from HEB. It plays a critical role in isolating the text area of interest so it can be cropped for more accurate OCR.

  • llmService: Represents the cornerstone of this project, acting as an abstracted client for multi-modal Large Language Models (LLMs) like GPT-4 with vision and Claude 3. These models are capable of "reading" images, marking a significant evolution in AI capabilities.

  • llmService.imageToJSON: This function exemplifies the project's capability to transform the text found in images into structured JSON format. It ensures the data is validated against a predefined schema, enhancing the reliability and usability of the extracted information.

  • detectAndProcessImage: At the heart of the project, this function encapsulates the core logic, orchestrating the workflow from image processing to text extraction and JSON conversion.

  • ImageUtil: An abstraction for image processing utilities, this utility fine-tunes images by cropping and converting them to grayscale. This step is crucial for increasing the contrast, thereby improving the OCR accuracy by the multi-modal LLM.

  • main: This simple example function demonstrates reading a local image file and running it through the described OCR workflow. We provide a prompt that tells the LLM what kind of image it is receiving and what data structure we expect back, and the Zod schema we pass in enforces that structure, throwing an error if the returned data doesn't match (a rough sketch of this validation step follows this list).
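To make the validation step concrete, here is a minimal sketch of the kind of check imageToJSON performs before returning data. The helper name parseAndValidate is hypothetical; the real implementation lives in the companion repository:

import { z } from "zod"

// Hypothetical helper: the LLM is prompted to return raw JSON, which is first
// parsed and then checked against the caller's Zod schema.
function parseAndValidate<T>(rawText: string, schema: z.ZodType<T>): T {
  const parsed = JSON.parse(rawText) // throws if the model returned invalid JSON
  return schema.parse(parsed) // throws if the JSON doesn't match the schema
}

If the model wraps its answer in markdown or returns a field with the wrong type, this fails loudly instead of letting a malformed response propagate through the rest of the pipeline.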

Workflow Deep Dive

The project kicks off with the import of necessary Node.js modules and services, setting the stage for OCR processing. Here's a simplified walkthrough of the workflow:

  1. Image Preprocessing: Upon receiving an image, such as a grocery receipt, the rekognitionService identifies the text-containing region. ImageUtil then crops and processes this area, optimizing it for text recognition. Preprocessing is a traditional and essential step in OCR (and other computer vision tasks): it increases contrast and improves the accuracy of the multi-modal LLM's reading (a sketch of what this preprocessing might look like follows this list).

  2. Text Extraction and Conversion: The preprocessed image is passed to the llmService, where the multi-modal LLM reads the image and extracts text. This text is then structured into JSON format based on a predefined schema, ensuring that the output is organized and adheres to specific data requirements.

  3. Practical Application: To illustrate the project's utility, the main function demonstrates processing an example grocery receipt image. It showcases how the system can analyze the receipt and return a JSON object listing the purchased items along with their prices, structured as per the defined schema.
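To illustrate the preprocessing in step 1, here is a rough sketch of how a text region could be detected with Rekognition and then cropped and converted to grayscale. It assumes the sharp image library and hypothetical helper names (findTextArea, preprocess); the project's actual rekognitionService and ImageUtil implementations are in the companion repository:

import {
  RekognitionClient,
  DetectTextCommand,
} from "@aws-sdk/client-rekognition"
import sharp from "sharp"

const rekognition = new RekognitionClient({})

// Detect text and return a pixel bounding box around the first detected line.
// A real implementation would merge the boxes of all detected lines.
async function findTextArea(imageBuffer: Buffer) {
  const { TextDetections } = await rekognition.send(
    new DetectTextCommand({ Image: { Bytes: imageBuffer } })
  )
  const box = TextDetections?.[0]?.Geometry?.BoundingBox
  const { width = 0, height = 0 } = await sharp(imageBuffer).metadata()
  // Rekognition returns coordinates as ratios of the image dimensions
  return {
    left: Math.floor((box?.Left ?? 0) * width),
    top: Math.floor((box?.Top ?? 0) * height),
    width: Math.floor((box?.Width ?? 1) * width),
    height: Math.floor((box?.Height ?? 1) * height),
  }
}

// Crop to the region of interest and convert to grayscale to boost contrast
async function preprocess(imageBuffer: Buffer, area: sharp.Region): Promise<Buffer> {
  return sharp(imageBuffer).extract(area).grayscale().toBuffer()
}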

Limitations and Areas for Improvement

While the project showcases the innovative application of OCR with multi-modal Large Language Models, it's important to recognize its limitations and identify areas for improvement. Addressing these challenges can help enhance its applicability and efficiency in real-world scenarios. Here are some key limitations:

  • Simplicity of Test Images: The project currently focuses on processing simple grocery receipts, which are relatively small and straightforward images containing text. However, real-world applications often involve more complex, larger images, or documents containing multiple items or types of information. Adapting the project to handle such complexity would require significant enhancements in image processing and text extraction techniques.

  • Scalability to Larger Applications: As it stands, this project operates in a somewhat isolated environment. Integrating it into a larger application ecosystem, such as a REST API for web applications or a RabbitMQ Consumer for message-oriented middleware systems, would be necessary to fully leverage its capabilities in production. This integration would allow for more robust data processing pipelines and facilitate interaction with other services.

  • Batch Processing Capability: The current implementation is designed to process one image at a time. However, for practical applications, especially in commercial settings, the ability to handle batches of images simultaneously is crucial. Enhancing the project to include batch processing would significantly improve its efficiency and throughput, making it more suitable for high-volume scenarios (a rough sketch of such a batch wrapper follows this list).

  • Testing and Validation: Comprehensive testing is essential to ensure the reliability and accuracy of OCR projects. This encompasses not only unit tests for individual components but also integration tests that simulate real-world usage scenarios. Developing a robust testing framework that covers a wide range of image types, formats, and quality levels is necessary to validate the project's effectiveness across diverse conditions. Additionally, performance testing to assess the system's response times and resource utilization under load is critical for optimizing its deployment in production environments.
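As a starting point for batch processing, here is a rough sketch of a wrapper (the name processReceiptBatch is hypothetical) that could sit alongside the code in src/index.ts, reusing fs, z, and detectAndProcessImage, and processing images a few at a time so the LLM and Rekognition APIs aren't flooded with simultaneous requests:

// Hypothetical batch wrapper reusing fs, z, and detectAndProcessImage from src/index.ts
async function processReceiptBatch<T>(
  imagePaths: string[],
  prompt: string,
  schema: z.ZodType<T>,
  concurrency = 3
) {
  const results = []
  for (let i = 0; i < imagePaths.length; i += concurrency) {
    // Process a small chunk of images concurrently, then move to the next chunk
    const chunk = imagePaths.slice(i, i + concurrency)
    const chunkResults = await Promise.all(
      chunk.map((imagePath) => {
        const imageData = fs.readFileSync(imagePath).toString("base64")
        return detectAndProcessImage<T>(imageData, prompt, schema)
      })
    )
    results.push(...chunkResults)
  }
  return results
}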

Addressing these limitations requires a thoughtful approach to system design, focusing on scalability, flexibility, and robustness. By expanding the POC's capabilities to process more complex images, integrate with larger systems, handle batch operations, and undergo rigorous testing, it can evolve into a more powerful tool that meets the demands of various practical applications.

Conclusion

This POC project not only demonstrates the practical application of OCR with multi-modal LLMs but also underscores the potential of these technologies in transforming data extraction processes. The ability to accurately process and extract information from images opens up numerous possibilities across various sectors, including retail, finance, and administrative automation.