OCR with LLM and Ruby

Edu Depetris

- Aug 03, 2024
  • Ocr
  • Llm
  • Artificial Intelligence
  • Ruby
OCR stands for Optical Character Recognition, which is a way to extract text from an image.

The Tesseract OCR library is well known for this purpose, and with the RTesseract Ruby wrapper it is possible to extract text from an image.

While writing this article, I learned that LLMs support images as input, so I decided to give it a try. In this brief article, we will perform OCR using an LLM.

I’ll use the same setup introduced in the previous article.

Images need to be encoded in base64 before being sent to the LLM, and we will ask for text extraction in our prompt.
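As a quick illustration of that encoding step (using a few placeholder PNG header bytes instead of a real file), Ruby's stdlib Base64 module can build the data URL:

```ruby
require "base64"

# Placeholder bytes (the PNG magic header) standing in for real image data.
image_bytes = "\x89PNG\r\n\x1A\n".b

# strict_encode64 emits one unbroken line; plain encode64 inserts a newline
# every 60 characters, which would corrupt the data URL below.
encoded = Base64.strict_encode64(image_bytes)
data_url = "data:image/png;base64,#{encoded}"
# data_url => "data:image/png;base64,iVBORw0KGgo="
```

The same `data:image/png;base64,...` string is what we will embed in the prompt.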

Let’s write the necessary code.

For simplicity, let’s assume we have an image called article.png.

require "base64"

path = "documents/article.png"
image_content = File.binread(path)

# strict_encode64 avoids the newlines that encode64 adds, which can break data URLs.
encoded_base64_image = Base64.strict_encode64(image_content)

Now, let’s use the latest LLM from OpenAI:

llm = Langchain::LLM::OpenAI.new(
  api_key: ENV["OPENAI_API_KEY"],
  default_options: { chat_completion_model_name: "gpt-4o-mini" }
)

Next, we build our prompt and inject the image:

messages = [
  {
    role: "user",
    content: [
      {
        type: "text",
        text: "Extract the text from the image below and provide it without any additional explanations " \
              "or introductory phrases."
      },
      {
        type: "image_url",
        image_url: {
          url: "data:image/png;base64,#{encoded_base64_image}"
        }
      }
    ]
  }
]
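For reference, this messages array serializes to the same JSON shape the OpenAI chat completions endpoint expects. A sketch with hypothetical stand-ins (a few placeholder bytes instead of a real image, and a shortened prompt):

```ruby
require "json"
require "base64"

# Hypothetical stand-ins: placeholder bytes and a shortened prompt.
encoded = Base64.strict_encode64("\x89PNG".b)
messages = [
  {
    role: "user",
    content: [
      { type: "text", text: "Extract the text from the image below." },
      { type: "image_url", image_url: { url: "data:image/png;base64,#{encoded}" } }
    ]
  }
]

# The request body a client ultimately posts to the API.
payload = JSON.generate(model: "gpt-4o-mini", messages: messages)
```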

Let’s call the chat completion endpoint and check the result!

llm.chat(messages: messages).chat_completion

Here's the source code.

Happy Coding!