Scraping Pages Using Portia: Annotating Pages and Running Spiders

My last article covered installing Portia. Here is the second part, explaining how to use it.

Scraping web pages with Portia is a two-step process. First, you annotate a page, telling Portia where to look for your data and how to extract it. Second, you run the spider that actually extracts the data.

Step 1: Annotating Web Pages with Portia

Portia's annotation tool is a front end for Scrapely. You train it once, showing it the format of the data you want to extract, and it saves that as a template. Since elements on the page generally carry tags and ids, you can annotate pages without knowing any regular expressions.
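To see what that training step amounts to under the hood, here is a minimal sketch of using Scrapely directly from Python (the URLs and field values are hypothetical):

    from scrapely import Scraper

    s = Scraper()

    # Train on one example page by mapping field names to sample
    # values that actually appear on that page.
    s.train('http://example.com/products/1',
            {'name': 'Sample Product', 'price': '9.99'})

    # Scrape a similar page using the learned template.
    print(s.scrape('http://example.com/products/2'))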

The main page gives you a place to enter the website you wish to scrape, a side menu (pin it so it is visible all the time), and a home button you can hit to see your current projects. Each project is saved as new_project.

As soon as you enter the site and hit the “+Start” button, a new project is created. Browse to the page you wish to scrape and click the “Annotate” button. Click on each field you wish to scrape and tag it with a field name and a field type. You can create new fields on the fly, and there are several field types available: text, number, image, price, html, safe html, geopoint and URL.

Create a new field with the desired type.
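As an illustration of how those types map onto a scraped record, an item yielded by the spider might look roughly like this (a hypothetical sketch, not actual Portia output):

    # Hypothetical item, one key per annotated field.
    item = {
        'title': 'Sample Product',                 # text
        'price': '9.99',                           # price
        'image': 'http://example.com/img/1.jpg',   # image
        'url':   'http://example.com/products/1',  # URL
    }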

You can also refine a field's selection further based on the HTML elements. Clicking the gear icon in the annotation area of the side menu shows you all the ancestor and child elements, as you can see in the following image.

The sidebar has all the options you will need to define your spider and your scraping template.

  • Annotations shows you all the tagged fields, which you can refine further by clicking the gear icon.
  • Extracted items lets you change the types of fields and delete fields already defined.
  • Extractors lets you use powerful regular expressions to build customized tagging (see the sketch after this list).
  • Under Required fields you can define which fields must exist for a record to be scraped.
  • Hit the save button and your spider is saved. It is named after the website you are scraping.
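The regular expressions you supply to an extractor are ordinary patterns. As a plain-Python illustration of the kind of pattern you might attach to a price field (the field and pattern here are hypothetical, not Portia's API):

    import re

    # Hypothetical extractor-style pattern: keep only the numeric
    # part of a price field.
    raw_price = 'Price: $1,299.00'
    match = re.search(r'\$([\d,]+\.\d{2})', raw_price)
    if match:
        print(match.group(1).replace(',', ''))  # -> 1299.00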

Limitations of Annotating Visually
As you may have noticed, there is a tradeoff for all this ease of use. The scraping pattern is based on relative position, so you cannot, for example, scrape text out of a div that has an id of crumbs (xpath('//*[@id="crumbs"]/text()')). Instead you will be scraping something like "the 3rd div after the form" (/html/form/div/div/div). If the data you are scraping is not consistently ordered, this breaks. It would be nice to see an option to define an XPath, as Scrapy allows. Last week I tried Scrapy, and though there is hardcoding involved, it gives you much more powerful expressions to scrape with.
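For comparison, here is a minimal Scrapy spider sketch that targets that same element by its id with an explicit XPath (the spider name and URL are placeholders):

    import scrapy

    class CrumbsSpider(scrapy.Spider):
        # Hypothetical spider; the name and URL are placeholders.
        name = 'crumbs'
        start_urls = ['http://example.com/']

        def parse(self, response):
            # Select by id, independent of the element's position
            # in the page layout.
            for text in response.xpath('//*[@id="crumbs"]/text()').extract():
                yield {'crumb': text}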

Step 2: Running Your Spider to Crawl and Scrape the Data

In step 1 you created a spider; now it is time to run it.

Go to the slyd/data/projects directory, where you can see all your projects, and run:

    portiacrawl <project name> <spider name> -t csv -o <outputfile.csv>

Your spider starts crawling and dumps the data to outputfile.csv in CSV format.
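For example, assuming the default project name and a spider created from example.com (both placeholders; the project path may differ on your install):

    portiacrawl slyd/data/projects/new_project example.com -t csv -o outputfile.csv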