Portia is an excellent open-source visual web scraper. However, installing Portia is a daunting task for most non-Linux people. The folks at Scrapinghub have tried to make it very simple to install, but most of the time the install gets complicated by dependencies.
Here is what worked for me. I did it on an Ubuntu machine, but it should work on any other Debian-based distro where you have apt installed.
Part 1 : Installing Portia
Step 1. Install the dependencies
Run the following commands to get all the dependencies. Missing dependencies are what cause most of the trouble later.
sudo apt-get install build-essential python-dev
sudo apt-get install python-pip
sudo apt-get install python-scrapy
sudo apt-get install git
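Before moving on, it is worth a quick sanity check that the tools actually installed; printing their versions is enough:
python --version
pip --version
scrapy version
git --version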
Step 2. Install Portia
Download Portia from its GitHub repository (https://github.com/scrapinghub/portia), extract it, and run the following command from the slyd directory (or clone it with git, as shown at the end of this step).
sudo pip install -r ./requirements.txt
It will read from requirements.txt and probably tell you that quite a few requirements are already satisfied (since you installed some of them in step 1). It takes some time to finish and will probably give you some warnings.
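Since you installed git in step 1, you can also fetch the source by cloning the repository instead of downloading an archive. A minimal sketch, assuming the standard GitHub location and the slyd layout described above:
git clone https://github.com/scrapinghub/portia.git
cd portia/slyd
sudo pip install -r requirements.txt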
Step 3. Run Slyd
If everything went well, you should be able to start the Twisted web server with slyd. Go to the slyd directory and run
twistd -n slyd
If you get errors, some requirements are still missing. If not, you will see that an instance of the Twisted server has started.
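To double-check that the server is really listening, you can request the main page from a second terminal. This assumes slyd's default port, 9001, and that you have curl (installable via apt):
curl -I http://localhost:9001/static/main.html
A 200 OK response means slyd is up.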
Part 2 : Annotating a web page
1. Run the slyd server. Go to the slyd directory and run
twistd -n slyd
2. Point your browser to http://localhost:9001/static/main.html (slyd listens on port 9001 by default).
This should show you the main Slyd page.
3. Enter the URL of the page you want to start crawling from.
4. Annotate the page. Start clicking and tagging the fields.
5. Save the project. By default, projects are saved as newproject1, newproject2, etc., and inside each project you get a crawler named after the website.
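To see where your work ended up on disk, you can list the projects directory from the portia checkout:
ls slyd/data/projects
You should see a directory for each project you saved.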
Part 3 : Run your spider and dump data to CSV
Now that you have tagged a page, you can ask your spider to crawl, collect data, and dump it to a file of your choice.
Go to the slyd/data/projects directory. Here you can see all your projects. Run
portiacrawl <project name> <spider name> -t csv -o <outputfile.csv>
Your spider starts crawling and dumps the data to outputfile.csv in CSV format.
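For example, if your project was saved as newproject1 and its spider was named after example.com (both names are placeholders following the defaults above), the command would look like:
portiacrawl newproject1 example.com -t csv -o items.csv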