[Python] Extracting tables with tabula gave me loads of empty tables and empty cells, so here is how I brute-forced my way around it

This is a story from when I was using a library to extract tables from PDFs.

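I was using tabula, but table extraction is hard to do in a general-purpose way, so you inevitably end up brute-forcing things until the specific tables you are targeting can be read.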

When I fetched the tables roughly as shown below, I sometimes got tables made up entirely of empty cells, as noted in the comment.

import tabula

# INPUT_P is the path to the input PDF (defined elsewhere)
dfs = tabula.read_pdf(INPUT_P,
        multiple_tables=True,
        area=[5, 5, 95, 95],
        relative_area=True,
        lattice=True,)
for index_table, df in enumerate(dfs):
    # As it stands, things like a 5x5 table of nothing but empty cells get written out as a df
    df.to_csv("output_" + str(index_table) + ".csv")

Note: the parameter values are arbitrary.

I called them "empty cells," but pandas treats them as missing values (NaN), so I decided whether a table is empty by checking whether every cell is NaN.

・After the fix

import tabula

# INPUT_P is the path to the input PDF (defined elsewhere)
dfs = tabula.read_pdf(INPUT_P,
        multiple_tables=True,
        area=[5, 5, 95, 95],
        relative_area=True,
        lattice=True,)
for index_table, df in enumerate(dfs):
    # Skip tables that have no cells or whose cells are all NaN
    if df.size < 1 or df.size == df.isnull().values.sum():
        continue
    df.to_csv("output_" + str(index_table) + ".csv")

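Because skipped tables still use up an index, the index_table numbers in the output file names end up with gaps. I will think about that separately.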
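One possible way to avoid the gaps, sketched here as an idea rather than something I have settled on, is to filter out the empty tables first and number only the ones that remain. The generator name non_empty_tables and the sample DataFrames are assumptions made up for this illustration.

```python
import numpy as np
import pandas as pd


def non_empty_tables(dfs):
    """Yield only the tables that contain at least one non-NaN cell."""
    # non_empty_tables is a hypothetical helper name used only in this sketch
    for df in dfs:
        if df.size < 1 or df.size == df.isnull().values.sum():
            continue
        yield df


# Hand-built stand-ins for the list returned by tabula.read_pdf
dfs = [
    pd.DataFrame({"a": [1, 2]}),
    pd.DataFrame(np.nan, index=range(2), columns=range(2)),  # all NaN, skipped
    pd.DataFrame({"b": ["x"]}),
]

names = []
for index_table, df in enumerate(non_empty_tables(dfs)):
    name = "output_" + str(index_table) + ".csv"
    names.append(name)
    # df.to_csv(name)  # uncomment to actually write the files

print(names)  # ['output_0.csv', 'output_1.csv']
```

Since enumerate only counts the tables that survive the filter, the file names come out consecutive. The trade-off is that a file number no longer tells you the table's position in the original extraction result.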