Monday, May 16, 2011

Reading PDF and converting to CSV using Ruby & the pdf-reader gem

I wanted to convert a PDF document into an XLS table, and after a couple of searches I was able to write some Ruby code that converted a Citibank PDF statement to a CSV file. It was a relief to learn how to read a PDF file, as long as it is not password protected.

require 'rubygems'
require 'pdf/reader'
class PageTextReceiver
  attr_accessor :content

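  def initialize
    @content=[]
    @kk = "false"   # "true" while we are inside the transaction table
    @i = 0          # number of cells collected for the current row
    @ptr_str = ""   # the CSV line being built
  end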

  def begin_page(arg=nil)
    puts ""
  end

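  def show_text(string, *params)
    # "Previous Balance" marks the end of the transaction table.
    if string.strip=="Previous Balance"
      @kk="false"
    end
    # A complete row has been collected: print it and start a new one.
    if @i==4
      puts @ptr_str
      @ptr_str = ""
      @i=0
    end
    # "Sale Date" is the first header cell, so the table starts here.
    if string.strip=="Sale Date" or @kk == "true"
      @kk="true"
      if @i==0
        # The first cell is the date; the statement omits the year.
        @ptr_str << string + "/2009,"
      else
        @ptr_str << string + ","
      end
      # An amount (two characters after the ".") or the last header cell ends the row.
      if (string.reverse.index(".")==2 or string=="Amount (in Rs)")
        @i=4
      else
        @i=@i+1
      end
    end
  end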

  # The ' operator: move to the next line, then show the given string.
  def move_to_next_line_and_show_text(string)
    @i=0
    show_text(string)
  end

  alias :super_show_text :show_text
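  # The " operator passes word spacing and character spacing before the string.
  def set_spacing_next_line_show_text(aw, ac, string)
    show_text(string)
  end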

  def show_text_with_positioning(*params)
    params=params.first
    params.each { |str| show_text(str) if str.kind_of?(String) }
  end
end

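receiver = PageTextReceiver.new
# Process 1.pdf through 45.pdf. The receiver prints each CSV row as it is
# completed; @content is never filled, so it is not printed here.
(1..45).each do |x|
  PDF::Reader.file("#{x}.pdf", receiver)
end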

The above code reads 45 PDF files, named 1.pdf to 45.pdf. Say it is saved as read_pdf.rb.
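
The row-building idea in the receiver can also be tried without any PDF at all. Below is a rough sketch, not the code I actually ran: fragments_to_rows is a made-up helper name, the sample fragments are invented for illustration, and it uses Ruby's standard csv library so that cells containing commas get quoted. It assumes, like the receiver, that a row ends at something that looks like an amount or at the "Amount (in Rs)" header cell. Unlike the receiver, it does not append the year to the header cell and it tests for two digits after the decimal point.

```ruby
require 'csv'

# Hypothetical helper (not part of pdf-reader): groups a flat list of text
# fragments into rows. A row ends at a fragment that looks like an amount
# (two digits after the decimal point) or at the header cell "Amount (in Rs)".
def fragments_to_rows(fragments, year: "2009")
  rows = []
  current = []
  fragments.each do |fragment|
    text = fragment.strip
    # The first cell of each data row is a date without a year, so add one.
    text = "#{text}/#{year}" if current.empty? && text != "Sale Date"
    current << text
    if text =~ /\.\d{2}\z/ || text == "Amount (in Rs)"
      rows << current
      current = []
    end
  end
  # Keep a trailing, incomplete row instead of silently dropping it.
  rows << current unless current.empty?
  rows
end

# Invented sample data, in the order a receiver might see the fragments.
sample = ["Sale Date", "Description", "Amount (in Rs)",
          "12/05", "BOOK STORE, MUMBAI", "1,250.00"]

csv_text = CSV.generate do |csv|
  fragments_to_rows(sample).each { |row| csv << row }
end
puts csv_text
```

Joining cells with a plain "," works until a description or an amount itself contains a comma; letting the csv library build each line avoids that problem.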

Below is the command to run the script and store its output in a file (which I think is the easiest way). Note that >> appends to a.csv, so delete the file before running it again:
ruby read_pdf.rb >> a.csv
I hope it helps you. If you know how to read a password-protected PDF through code, where I can supply the file's password, please let me know.
