Data Scraping from PDF files – Tabula

Scraping data from PDF files used to be more complicated than scraping data from web pages. Now, however, user-friendly tools make the task quite easy. One of the most popular open source tools for scraping data from PDF files is Tabula. You can download it for free from the web, and there are versions for Windows and Mac machines.


After you download Tabula, find it on your machine and open it. It will look like the image below: [screenshot of Tabula after opening]

Before you import your file into Tabula, save the PDF file on your computer. Then click “Browse”, locate the PDF file containing your data, and click “Import”. For example, locate and import a PDF file with data on “School Performance for the State of Texas.”


The next step is to find and select the table you want to export: click at its top left corner and drag the mouse to the bottom right corner until all of the data is inside the shaded selection area.


A window will then appear containing your data. Inspect the data to make sure it looks correct. If data is missing, you might have to slightly expand your selection. Once everything looks right, click the download button.


Now you can export this data as a CSV file, open it in Excel, and work with your data in a spreadsheet instead of a PDF file. NOTE: Tabula works only on text-based PDFs, not scanned documents!
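If you would rather double-check the exported table with a few lines of code than by eye, here is a minimal Python sketch. It assumes the table was saved in CSV format. The sample text, the column names and the find_suspect_rows helper are all made up for illustration and do not come from Tabula or from the Texas file. The idea is simply to flag rows that have fewer cells than the header or that contain blank cells, since those often point to a selection that was drawn too small.

```python
import csv
import io

# Hypothetical sample standing in for a CSV file exported from Tabula.
# The second data row is short on purpose, to mimic a clipped selection.
SAMPLE_EXPORT = """District,School,Rating
District A,School 1,Met Standard
District B,School 2
District C,School 3,Met Standard
"""


def find_suspect_rows(csv_text):
    """Return (line_number, row) pairs that look incomplete.

    A row is flagged when its cell count differs from the header row
    or when any of its cells is blank.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    expected_width = len(rows[0])
    suspects = []
    # Data rows start on line 2 because line 1 is the header.
    for line_number, row in enumerate(rows[1:], start=2):
        has_blank = any(cell.strip() == "" for cell in row)
        if len(row) != expected_width or has_blank:
            suspects.append((line_number, row))
    return suspects


suspect_rows = find_suspect_rows(SAMPLE_EXPORT)
for line_number, row in suspect_rows:
    print(f"Check line {line_number}: {row}")
```

To run this on a real export, read the file's text with open(..., newline="") and pass it to the helper in place of the sample string. If any rows are reported, go back to Tabula and widen the selection.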

Another popular tool for scraping data from PDF files is PDFTables. PDFTables has an API (Application Programming Interface), so programmers can integrate PDF data extraction into their own applications. Some coding is required to do this.
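To give a rough idea of what that coding might look like, here is a hedged Python sketch that only builds an upload request and does not send it. The endpoint URL, the key and format query parameters, and the form field name f are my assumptions about the PDFTables API, so please confirm them against the official PDFTables documentation before relying on them. The build_pdftables_request helper and the placeholder API key are hypothetical.

```python
import urllib.parse
import urllib.request
import uuid

# Assumed endpoint; confirm against the PDFTables documentation.
PDFTABLES_ENDPOINT = "https://pdftables.com/api"


def build_pdftables_request(pdf_bytes, api_key, output_format="csv"):
    """Build (but do not send) a multipart POST request carrying one PDF.

    The "key" and "format" parameters and the "f" field name are assumptions.
    """
    query = urllib.parse.urlencode({"key": api_key, "format": output_format})
    boundary = uuid.uuid4().hex
    # Hand-built multipart/form-data body with a single file part.
    body = b"".join([
        f"--{boundary}\r\n".encode("ascii"),
        b'Content-Disposition: form-data; name="f"; filename="input.pdf"\r\n',
        b"Content-Type: application/pdf\r\n\r\n",
        pdf_bytes,
        f"\r\n--{boundary}--\r\n".encode("ascii"),
    ])
    return urllib.request.Request(
        f"{PDFTABLES_ENDPOINT}?{query}",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


# Placeholder bytes and key; a real call would read the PDF from disk.
request = build_pdftables_request(b"%PDF-1.4 placeholder", api_key="YOUR_API_KEY")
print(request.get_method(), request.full_url)
```

Sending the request would be a matter of passing it to urllib.request.urlopen and saving the response body to a file. That step is left out here so the sketch runs without a network connection or a real API key.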
