Portia is an excellent visual tool for people who are not well versed in XPath or regular expressions but want to extract data from unstructured HTML. I haven’t seen anything like it available as open source. There are several other tools, but all of them require some programming knowledge.
Portia is built on top of Scrapy, the extremely popular open-source scraping tool. Scrapy itself, however, is difficult to learn for a non-technical person like me. Portia, by contrast, is very easy to use; the only technical part is installing it.
– Download the source code
– Install using pip (there is a readme file that takes you through the installation step by step)
cd slyd
pip install -r requirements.txt
requirements.txt lists all the dependencies, which should get installed automatically. In my case they didn’t: I had to install pip first, then Twisted, git and finally scrapely, all manually. Once this was done, the tool installed without any problem.
Installation is the part that takes most of the time.
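Before installing, it can save time to check which of the prerequisites mentioned above are already present. The following is a minimal Python sketch and not part of Portia: the helper name `missing_prerequisites` is hypothetical, and the module names (`twisted`, `scrapely`) and tool names (`pip`, `git`) are assumed from the packages I had to install by hand.

```python
import importlib.util
import shutil


def missing_prerequisites(modules=("twisted", "scrapely"), tools=("pip", "git")):
    """Return the names of Python modules and command-line tools that cannot be found."""
    missing = []
    for name in modules:
        # find_spec returns None when a top-level module is not importable
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    for name in tools:
        # shutil.which returns None when the executable is not on PATH
        if shutil.which(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_prerequisites()
    if missing:
        print("Install these first: " + ", ".join(missing))
    else:
        print("All listed prerequisites were found.")
```

Anything the script reports as missing can then be installed manually before running the requirements file again.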
– First, use the visual tool, called slyd. Run slyd using
cd slyd
twistd -n slyd
– Point your browser to http://localhost:9001/static/main.html. Visit the page you want to scrape and annotate it using the visual markup tool. This part is fantastic: you can point and click, tagging each field and telling slyd which data is to be associated with which field. You can even use regular expressions if you want.
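To give a feel for what a regular expression does to the text of an annotated field, here is a small sketch in plain Python. It does not use Portia's own code; the sample text, the pattern and the helper name `extract_price` are all made up for illustration.

```python
import re

# Hypothetical pattern: a number with optional thousands separators and decimals
PRICE_PATTERN = re.compile(r"(\d[\d,]*(?:\.\d+)?)")


def extract_price(field_text):
    """Return the first number-like substring in field_text, or None if there is none."""
    match = PRICE_PATTERN.search(field_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    # The raw field text contains more than the value we want to keep
    print(extract_price("Price: $1,299.00 (incl. tax)"))
```

The idea is the same in the visual tool: the annotation picks the element, and the regular expression trims its text down to the part you care about.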
– The project is created for you automatically (as you can see while you annotate). Now you need to run the project from the command line through the portiacrawl command.
Go to slyd/projects/projectname and run the following command (accepted formats are xml, jsonlines, csv, pickle and marshal)
portiacrawl project_path spidername -o outputfilename -t format
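If you run crawls often, the command line above can be assembled from a small script. This is only a sketch built around the command shown in this post: the function names, the project path `slyd/projects/myproject`, the spider name `myspider` and the output file `items.csv` are placeholders, and the format list is copied from the post, so check it against your installed version.

```python
import shlex
import subprocess

# Formats as listed in the post; an assumption, not verified against every version
FORMATS = ("xml", "jsonlines", "csv", "pickle", "marshal")


def build_portiacrawl_command(project_path, spider_name, output_file, output_format):
    """Return the portiacrawl invocation as an argument list."""
    if output_format not in FORMATS:
        raise ValueError("unsupported format: " + output_format)
    return [
        "portiacrawl", project_path, spider_name,
        "-o", output_file,
        "-t", output_format,
    ]


def run_crawl(command):
    """Run the crawl; requires portiacrawl to be installed and on PATH."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Placeholder project, spider and file names
    command = build_portiacrawl_command(
        "slyd/projects/myproject", "myspider", "items.csv", "csv"
    )
    # Print the command for inspection instead of running it
    print(shlex.join(command))
```

Passing the arguments as a list avoids quoting problems when a path contains spaces.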
All the scraped data will be written to the output file.
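Once the crawl finishes, the output file can be inspected with a few lines of Python. This sketch assumes the crawl was run with the JSON lines format, which writes one JSON object per line; the file name `items.jl` and the function name `load_items` are placeholders.

```python
import json


def load_items(path):
    """Read a JSON lines file and return its parsed records as a list."""
    items = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                items.append(json.loads(line))
    return items


if __name__ == "__main__":
    import os

    # Placeholder file name; use whatever you passed to -o
    if os.path.exists("items.jl"):
        records = load_items("items.jl")
        print(len(records), "items scraped")
```

For the csv format, Python's built-in csv module would be the natural choice instead.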