Tesseract Guide

How to Use Tesseract OCR

Tesseract OCR is a free, open-source tool that extracts text from images and converts it into machine-readable text, searchable PDFs and more. It supports more than 100 languages and 37 scripts, and it can be trained on new data, although training is a complex process. Tesseract can be difficult for a beginner who is new to the technicalities of text detection, so we have created this guide to show you how to use Tesseract OCR, once it is downloaded and installed, to extract text from your images.


How to Use Tesseract OCR to Extract Text from an Image?

There are two main ways to get the text out of your image with Tesseract OCR. The first uses a Python wrapper, which is best for applications. The second uses the command line, which is best for quick or one-time text extraction.

Method 1: Python Wrapper (pytesseract)

This is the standard method and usually the best one, because it lets you clean up the image easily before recognition.

Step 1: Install the Python Tool

You need the pytesseract package, a Python wrapper that provides a Python interface to the Tesseract executable, and the Pillow library to handle the image file. Pillow is the modern replacement for the old PIL (Python Imaging Library), but the import statement still uses from PIL import Image.

pip install pytesseract Pillow

Step 2: Write the Extraction Script

In a Python file (e.g., extract.py), use the image_to_string function. This is the simplest way to get the text output.

from PIL import Image
import pytesseract

# Provide the name/path of your image file
image_path = "my_document.png"

# Open the image using Pillow
img = Image.open(image_path)

# Use pytesseract to extract the text
text = pytesseract.image_to_string(img)

# Print the final result
print(text)

Step 3: Run the Script

Run your Python file from your terminal:

python extract.py

Quick Note: If you see a TesseractNotFoundError, it means the Python script can't find the Tesseract program, even though it is installed. In this case, you have to tell Python where the executable file is located by setting pytesseract.pytesseract.tesseract_cmd to its path:

import pytesseract

# Change this path to match the location where Tesseract is installed.
pytesseract.pytesseract.tesseract_cmd = r'C:/Program Files/Tesseract-OCR/tesseract.exe'
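
If you would rather not hard-code one path, the lookup could be automated. The following is only a minimal sketch: find_tesseract and configure_pytesseract are hypothetical helper names, and the fallback locations are assumptions about common install paths, not an official list. It checks the system PATH first and then tries each fallback in turn.

```python
import shutil
from pathlib import Path

# Assumed install locations, for illustration only; adjust to your system.
FALLBACK_PATHS = [
    r"C:/Program Files/Tesseract-OCR/tesseract.exe",
    "/usr/bin/tesseract",
    "/usr/local/bin/tesseract",
]


def find_tesseract(name="tesseract", fallbacks=FALLBACK_PATHS):
    """Return the path to the Tesseract executable, or None if not found."""
    # shutil.which searches the directories listed in the PATH variable.
    found = shutil.which(name)
    if found:
        return found
    # Otherwise try each fallback location in order.
    for candidate in fallbacks:
        if Path(candidate).is_file():
            return candidate
    return None


def configure_pytesseract():
    """Point pytesseract at the executable located by find_tesseract."""
    # Imported here so find_tesseract can be used without pytesseract installed.
    import pytesseract

    path = find_tesseract()
    if path is None:
        raise FileNotFoundError("Tesseract executable not found")
    pytesseract.pytesseract.tesseract_cmd = path
    return path
```

Calling configure_pytesseract() once at the start of your script, before image_to_string, should be enough in most setups.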

We cover more troubleshooting methods for Tesseract OCR in our Tesseract Troubleshooting guide. Check it out to find solutions for your Tesseract issues.
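
Because the Python method suits applications, a common next step is to process a whole folder instead of one file. Here is a rough sketch of how that might look, using the same image_to_string call as in Step 2. The helper names find_images and extract_folder, and the list of file extensions, are assumptions made for this example.

```python
from pathlib import Path

# File extensions treated as images here; an assumption, extend as needed.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def find_images(folder):
    """Return the image files in `folder`, sorted by path."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def extract_folder(folder):
    """Run OCR on every image in `folder`; return {file name: text}."""
    # Imported here so find_images works even without the OCR packages.
    from PIL import Image
    import pytesseract

    results = {}
    for path in find_images(folder):
        # The with-block closes each image file after it has been read.
        with Image.open(path) as img:
            results[path.name] = pytesseract.image_to_string(img)
    return results
```

For example, extract_folder("scans") would return a dictionary that maps each image file name in a folder called scans to its extracted text.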

Method 2: Command Line Interface (CLI)

This is the fastest method if you just want to convert a single image into a text file without writing any Python code.

Step 1: Open your terminal

Open the command prompt in Windows or the terminal in macOS or Linux.

Step 2: Use the basic command

Type the tesseract command, followed by the input image file and the name for the output file:

tesseract [Input Image Filename] [Output Filename]

For example, if your image is named invoice.jpg, you can run this command:

tesseract invoice.jpg invoice_output

Step 3: Check the results

Tesseract will automatically create a file named invoice_output.txt in the same directory, containing the extracted text.

Tip: If you want a searchable PDF instead of a plain text file, add the word "pdf" at the end of your command:

tesseract invoice.jpg invoice_output pdf

This will create a file named invoice_output.pdf.
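
If you later want to run these same commands from a script, for example to convert many files, they can be called through Python's subprocess module. This is only a sketch: build_command and run_tesseract are hypothetical helper names, and it assumes the tesseract program is on your PATH and that the only output options you need are the default text file and "pdf".

```python
import subprocess


def build_command(image_path, output_base, output_format=None):
    """Build the argument list for the tesseract command shown above."""
    command = ["tesseract", str(image_path), str(output_base)]
    if output_format:  # e.g. "pdf", as in the tip above
        command.append(output_format)
    return command


def run_tesseract(image_path, output_base, output_format=None):
    """Run tesseract and return the name of the file it should create."""
    # check=True raises CalledProcessError if tesseract reports a failure.
    subprocess.run(build_command(image_path, output_base, output_format), check=True)
    # Tesseract adds the extension itself: .txt by default, .pdf for "pdf".
    return f"{output_base}.{output_format or 'txt'}"
```

Passing the arguments as a list, instead of one long string, avoids quoting problems when file names contain spaces.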

Crucial Step

Your Tesseract results depend heavily on image quality. If your image is blurry, rotated, shadowed or noisy, the output will be less accurate than you expect.

For best results, make sure the image you provide is high quality and meets these standards:

Correct Polarity: Use dark text on a light background. Invert the colors if your text is white on black.

High Resolution: The image should be at least 300 DPI (dots per inch). Small images should be scaled up.

No Skew: The lines of text must be straight. Tilted, curved or skewed text reduces accuracy. You can use image tools to straighten the lines if they are not horizontal.

In Python, you can use libraries like Pillow or OpenCV to clean the image. For example, convert it to black and white, enhance the contrast and remove the noise before passing it to pytesseract.

Some Preprocessing Examples

from PIL import Image, ImageFilter
import pytesseract

# 1. Open the image
img = Image.open("document.png")

# 2. Convert to grayscale
img = img.convert('L')

# 3. Apply a median filter to reduce noise
img = img.filter(ImageFilter.MedianFilter())

# 4. Extract the text from the preprocessed image
text = pytesseract.image_to_string(img)
print(text)
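
The polarity and resolution standards listed earlier can be handled in a similar way. The sketch below is one possible approach, not a fixed recipe: prepare_for_ocr and its source_dpi parameter are hypothetical names, you are assumed to know the DPI of your scan, and the brightness threshold of 128 is a simple heuristic for guessing whether the background is dark.

```python
from PIL import Image, ImageOps, ImageStat

TARGET_DPI = 300  # resolution recommended in the standards above


def prepare_for_ocr(img, source_dpi=TARGET_DPI):
    """Return a grayscale, dark-on-light copy scaled up toward TARGET_DPI."""
    gray = img.convert("L")

    # Heuristic (an assumption, not a Tesseract rule): a mean brightness
    # below mid-gray suggests light text on a dark background, so invert.
    if ImageStat.Stat(gray).mean[0] < 128:
        gray = ImageOps.invert(gray)

    # Scale up only; shrinking a larger image would throw detail away.
    if source_dpi < TARGET_DPI:
        factor = TARGET_DPI / source_dpi
        new_size = (round(gray.width * factor), round(gray.height * factor))
        gray = gray.resize(new_size, Image.LANCZOS)
    return gray
```

You would then pass the result to pytesseract as before, for example pytesseract.image_to_string(prepare_for_ocr(img, source_dpi=150)) for a 150 DPI scan.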

In Short

Task               | Method       | Command/Code
Python extraction  | pytesseract  | pytesseract.image_to_string(img)
CLI extraction     | Command line | tesseract input.jpg output
Create PDF         | Command line | tesseract input.jpg output pdf
Set Tesseract path | pytesseract  | pytesseract.pytesseract.tesseract_cmd = r'path/to/tesseract.exe'
Preprocess image   | Pillow       | img.convert('L') for grayscale