Scraping data from PDF files used to be more complicated than scraping data from web pages. Now, however, user-friendly tools make the task quite easy. One of the most popular open-source tools for scraping data from PDF files is Tabula. You can download it for free from the web, and versions are available for both Windows and Mac.
After you download Tabula, find it on your machine and open it. It will look like the image below:
Before you import your file into Tabula, save the PDF file to your computer. Then click “Browse”, locate the PDF file that contains your data, and click “Import”. For example, locate and import a PDF file with data on “School Performance for the State of Texas.”
The next step is to find and select the table you want to export. Click at the top-left corner of the table and drag the mouse to the bottom-right corner until all of the data is inside the shaded selection area.
A window will then appear containing your data. Inspect the data to make sure it looks correct. If any data is missing, go back and slightly expand your selection. When the data looks right, click the download button.
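Now you can export the data to a file that opens in Excel (Tabula's export formats include CSV) and work with it in a spreadsheet instead of a PDF. NOTE: Tabula works only on text-based PDFs, not on scanned documents. A scanned page is an image, so it contains no text for Tabula to select.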
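If you would rather check the exported data with a short script than by eye, the sketch below shows one way to do it in Python. It assumes the table was exported as CSV; the function name find_incomplete_rows and the sample data are made up for illustration and are not part of Tabula. It flags rows that have a different number of cells than the header, or that contain an empty cell, which is a common sign that the selection area missed part of the table.

```python
import csv
import io


def find_incomplete_rows(csv_text):
    """Return 1-based numbers of data rows with missing or empty cells."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    problems = []
    for number, row in enumerate(data, start=1):
        wrong_width = len(row) != len(header)          # cell count differs from header
        has_blank = any(cell.strip() == "" for cell in row)  # an empty cell
        if wrong_width or has_blank:
            problems.append(number)
    return problems


# Made-up sample standing in for a Tabula export; real files will differ.
sample = "school,rating\nSchool A,B\nSchool B,\nSchool C\n"
print(find_incomplete_rows(sample))  # prints [2, 3]
```

To run the check on a real export, read the file first, for example with open(path, newline="", encoding="utf-8").read(), and pass the text to the function. Some tables legitimately contain empty cells, so treat flagged rows as places to look, not as errors.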
Another popular tool for scraping data from PDF files is PDFTables. PDFTables has an API (Application Programming Interface) that lets programmers integrate PDF data extraction into their own workflows. Some coding is required to use it.
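To give a sense of what "some coding" means here, the following is a rough Python sketch of calling a PDF conversion web API. Several details are assumptions that you should check against the current PDFTables documentation before relying on them: the endpoint URL, the key and format query parameters, and the form field name "f". The function names build_request_url and convert_pdf are hypothetical, and the upload step uses the third-party requests library.

```python
from urllib.parse import urlencode

# Assumed endpoint; confirm it in the PDFTables documentation before use.
PDFTABLES_ENDPOINT = "https://pdftables.com/api"


def build_request_url(api_key, output_format="csv"):
    """Build the upload URL carrying the API key and desired output format."""
    query = urlencode({"key": api_key, "format": output_format})
    return PDFTABLES_ENDPOINT + "?" + query


def convert_pdf(pdf_path, out_path, api_key, output_format="csv"):
    """Upload a PDF and save the converted table data to out_path."""
    import requests  # third-party; imported here so the URL helper works without it

    with open(pdf_path, "rb") as pdf_file:
        response = requests.post(
            build_request_url(api_key, output_format),
            files={"f": pdf_file},  # "f" is the assumed form field name
            timeout=60,
        )
    response.raise_for_status()  # stop on HTTP errors such as a rejected key
    with open(out_path, "wb") as out_file:
        out_file.write(response.content)


print(build_request_url("YOUR_API_KEY"))
```

With a real API key, a call such as convert_pdf("report.pdf", "report.csv", api_key) would upload the file and write the result next to it. Like Tabula, this approach saves retyping, but the output should still be inspected before it is used.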