[Python][tesseract] Extracting text from image files with optical character recognition (OCR)

This continues from the previous post.

For OCR-based text extraction, I wanted to compare Google Cloud Vision API with tesseract, so I tried both.

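In the end, Google Cloud Vision API appeared to be the more accurate of the two.
However, Vision API can incur charges depending on usage, and because the processing behind the API is not public it is hard to customize, so I stopped after simply trying both.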

Previous posts
What is tesseract
1. Environment setup
2. Sample code
Other packages I tried
1. Sample code
References (added later)

Previous posts

[Python][Google Cloud Vision API] Extracting text from image files with optical character recognition (OCR)
https://blog.integrityworks.co.jp/2019/11/02/python-get-paragraphs-from-png-file/

[Python] Converting a PDF file to PNG page by page
https://blog.integrityworks.co.jp/2019/11/01/python-pdf-change-to-png/

What is tesseract

An OCR engine that is available as open source.
https://tesseract-ocr.github.io/
https://github.com/tesseract-ocr/tesseract

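Even for use from Python, I did not install tesseract as a Python package; I installed it as Windows software.
Note: I tested on Windows 10 only. I have not checked on a Mac, but an installer seems to be available there too, so roughly the same steps should work.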

Environment setup

Download page
https://github.com/tesseract-ocr/tesseract/wiki
tesseract-ocr-w64-setup-v5.0.0-alpha.20191030.exe (64 bit)

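After installing, set the environment variable so that the tesseract executable can be found (typically by adding the install folder to PATH).
The following page is a useful guide to the installation steps.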
https://gammasoft.jp/blog/tesseract-ocr-install-on-windows/
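
Once the environment variable is set, it can help to confirm from Python that the executable is actually reachable. The following is a minimal sketch using only the standard library. The helper name `find_tesseract` is made up for this example, and it assumes the executable is called `tesseract` and is expected on PATH.

```python
import shutil
import subprocess


def find_tesseract(executable="tesseract"):
    """Return the first line of `<executable> --version`, or None if it is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    # Merge stderr into stdout, since some builds print the version to stderr.
    result = subprocess.run([path, "--version"],
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT,
                            universal_newlines=True)
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    version = find_tesseract()
    if version is None:
        print("tesseract was not found on PATH; check the environment variable")
    else:
        print(version)
```

If this prints the "not found" message, a newly set environment variable may not be visible yet; opening a new terminal usually picks it up.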

Sample code

I wrote this quickly, so the code is still quite rough.

import sys
import os.path
from PIL import Image
import pyocr
import pyocr.builders
import glob
import shutil
import cv2
from pdf2image import convert_from_path

import matplotlib.pyplot as plt

'''
tesseract prototype

The result differs depending on the Builder used for the analysis.
LineBoxBuilder detects text line by line, so it matches this use case best.
raw text : TextBuilder
words + boxes : WordBoxBuilder
lines + words + boxes : LineBoxBuilder


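Page segmentation modes (PSM)
Note: I could not find the official source for this list; `tesseract --help-psm` prints the modes supported by the installed version.
0 Orientation and script detection (OSD) only
1 Automatic page segmentation with OSD
2 Automatic page segmentation, but no OSD or OCR
3 Fully automatic page segmentation, but no OSD (default)
4 Assume a single column of text of variable sizes
5 Assume a single uniform block of vertically aligned text
6 Assume a single uniform block of text
7 Treat the image as a single text line
8 Treat the image as a single word
9 Treat the image as a single word in a circle
10 Treat the image as a single character
11 Sparse text: find as much text as possible in no particular order
12 Sparse text with OSD
13 Raw line: treat the image as a single text line, bypassing Tesseract-specific processing
'''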

output_dir = "output_tes"
output_image_dir = output_dir + "/output_img"
input_image_dir = "Intermediate_img"

# tesseract settings (tesseract itself must be installed and configured separately on Windows)
tools = pyocr.get_available_tools()
if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)

tool = tools[0]
#print("Will use tool '%s'" % (tool.get_name()))
langs = tool.get_available_languages()
#print("Available languages: %s" % ", ".join(langs))
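# Select the language by name rather than by list position, because the order
# depends on the installed language data.
# Assumption: Japanese ("jpn") is the target language.
lang = "jpn" if "jpn" in langs else langs[0]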
#print("Will use lang '%s'" % (lang))

# If the output directories already exist, clear them and recreate them
def output_setting():
    if os.path.exists(output_dir):
        shutil.rmtree(output_dir)
    if os.path.exists(output_image_dir):
        shutil.rmtree(output_image_dir)

    os.makedirs(output_dir)
    os.makedirs(output_image_dir)

##### 1. Using TextBuilder
#builder = pyocr.builders.TextBuilder(tesseract_layout=4, cuneiform_dotmatrix=True,
#                 cuneiform_fax=True, cuneiform_singlecolumn=True)
# txt = tool.image_to_string(Image.open('test.png'),
#                                 lang=lang,
#                                 builder=builder)
# with open("result2.txt", 'w', encoding='utf-8') as file_descriptor:
#     builder.write_file(file_descriptor, txt)

##### 2. Using WordBoxBuilder
# builderW = pyocr.builders.WordBoxBuilder(tesseract_layout=4)
# word_box = tool.image_to_string(Image.open('test.png'),
#                                 lang=lang,
#                                 builder=builderW)
#
# with open("resultw.txt", 'w', encoding='utf-8') as file_descriptor_w:
#     builderW.write_file(file_descriptor_w, word_box)

##### 3. Using LineBoxBuilder (this one looks best)
def analyse_image_to_line(image_file, index):
    builder_line = pyocr.builders.LineBoxBuilder()
    box_lines = tool.image_to_string(Image.open(image_file),
                                lang=lang,
                                builder=builder_line)
    # Save the result (hOCR, an HTML-based format, despite the .txt extension)
    name, ext = os.path.splitext(os.path.basename(image_file))
    # Drop the last two characters (presumably a per-page suffix) so that
    # pages of the same document are appended to one text file
    name_slice = name[:-2]
    save_res = output_dir + "/res_tes_line_" + name + ".txt"
    with open(save_res, 'w', encoding='utf-8') as file_descriptor_line:
        builder_line.write_file(file_descriptor_line, box_lines)

    save_text = output_dir + "/res_tes_text_" + name_slice + ".txt"
    with open(save_text, 'a', encoding='utf-8') as file_descriptor_text:
        for line in box_lines:
            # "\n" rather than a bare "\r", so the lines break in common editors
            file_descriptor_text.write(line.content + "\n")
        file_descriptor_text.write("-----------\n")

    out = cv2.imread(image_file)
    for line in box_lines:
        cv2.rectangle(out, line.position[0], line.position[1], (0, 255, 0), 5)
    # Alternative: plotting each box like this gives every object its own color,
    # which makes the boxes easier to tell apart
    #     x_plot = [line.position[0][0], line.position[1][0], line.position[1][0], line.position[0][0], line.position[0][0]]
    #     y_plot = [line.position[0][1], line.position[0][1], line.position[1][1], line.position[1][1], line.position[0][1]]
    #     plt.plot(x_plot, y_plot)
    #
    # plt.savefig("line_box.png")
    # plt.clf()

    output_img_name = output_image_dir + "/" + os.path.basename(image_file)
    cv2.imwrite(output_img_name, out)

output_setting()
image_list = glob.glob(input_image_dir + "/*.png")
for index, img_file_name in enumerate(image_list):
    analyse_image_to_line(img_file_name, index)
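
The page segmentation modes listed in the docstring above can also be tried directly with the tesseract command line, which makes it quick to compare modes on one image before wiring a choice into pyocr (where the builders take `tesseract_layout`, as in the commented-out code). Below is a small sketch that only builds the argument list. The helper name `build_tesseract_command` and the abbreviated description table are made up for this example, and the `--psm` spelling assumes tesseract 4 or later.

```python
# Abbreviated descriptions of a few page segmentation modes.
PSM_DESCRIPTIONS = {
    3: "fully automatic page segmentation, no OSD (default)",
    4: "single column of text of variable sizes",
    6: "single uniform block of text",
    7: "single text line",
    11: "sparse text",
}


def build_tesseract_command(image_path, lang="jpn", psm=6):
    """Build the argument list for a tesseract call that prints the text to stdout."""
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    # "stdout" as the output base makes tesseract print the result instead of writing a file.
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


if __name__ == "__main__":
    for psm in sorted(PSM_DESCRIPTIONS):
        cmd = build_tesseract_command("test.png", psm=psm)
        print(PSM_DESCRIPTIONS[psm], "->", " ".join(cmd))
```

The resulting list can be passed to `subprocess.run` as is; keeping it as a list avoids quoting problems with paths that contain spaces.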

Other packages I tried

Python also has a package like the following.
https://pypi.org/project/gpyocr/

Python wrapper for Tesseract OCR and Google Vision OCR to perform OCR on images and get a confidence value of the results.

Sample code

import gpyocr

text, conf = gpyocr.tesseract_ocr('test.png', lang='jpn', psm=6)
print(text)

aaa, confidence = gpyocr.google_vision_ocr('test.png', langs=['ja'])
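
Since both calls above return a text together with a confidence value, one possible use is to keep whichever result scored higher. The sketch below is only an illustration: the helper `pick_most_confident` and the sample values are made up, and it assumes the two engines report confidence on a comparable scale, which should be checked before relying on it.

```python
def pick_most_confident(results):
    """Return the (engine, text, confidence) entry with the highest confidence."""
    if not results:
        raise ValueError("results must not be empty")
    return max(results, key=lambda item: item[2])


if __name__ == "__main__":
    # Made-up values for illustration; real ones would come from
    # gpyocr.tesseract_ocr(...) and gpyocr.google_vision_ocr(...).
    sample = [
        ("tesseract", "sample text A", 0.81),
        ("google_vision", "sample text B", 0.93),
    ]
    engine, text, conf = pick_most_confident(sample)
    print(engine, conf)
```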

References (added later)

Character recognition with Python and Tesseract OCR (Qiita article, original title: PythonとTesseract OCRで文字認識)