Nov 23, 2012

Web scraping with lxml

Clipped from: http://woz.thebigbluesky.us/a_curious_absurdity/2009/02/08/web-scraping-with-lxml/
python + lxml + xpath is a powerful combination for parsing web pages. See the w3school tutorial for xpath syntax. When writing xpath expressions in Python, wrap them in "" because they may contain ' '.


Python and lxml make an excellent framework for scraping data from web pages.  The lxml library makes this especially easy with its support for running Xpath queries against HTML.  Say for example I wanted to get the rating of a movie from IMDB:

>>> import urllib
>>> from lxml import etree
>>> imdb = etree.HTML( urllib.urlopen("http://www.imdb.com/title/tt0087332/").read() )
>>> imdb.xpath("//div[contains(@class,'rating')]//b/text()")
['7.7/10']

Basically, this code just downloads the page using urllib, loads it into a document using lxml's HTML parser, and uses an Xpath expression to grab the rating out of the page.
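Since IMDB later blocked urllib (see the update at the end of this post), here is a self-contained sketch of the same idea that runs against an inline HTML snippet instead of a live page; the markup mirrors the rating structure from the IMDB page:

```python
from lxml import etree

# Inline stand-in for the IMDB page, so the example runs without a network fetch.
html = """
<html><body>
  <div class="general rating">
    <div class="starbar static"></div>
    <div class="meta"><b>7.7/10</b></div>
  </div>
</body></html>
"""

# Parse the HTML and run the same Xpath query against it.
doc = etree.HTML(html)
print(doc.xpath("//div[contains(@class,'rating')]//b/text()"))
# ['7.7/10']
```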

The only slightly difficult part is coming up with the Xpath query.  Firebug can help with this step.  When you inspect a page with Firebug, you can right-click on any node in the DOM structure and get an Xpath expression for it:

[Image: Grabbing an Xpath expression for a node from Firebug]

Of course the expression that Firebug comes up with is usually far too specific:

/html/body/div/div[2]/layer/div[3]/div/div[3]/div[2]/div[4]/div/div[2]/div[2]/b

Web pages often change; furthermore, not all movie pages on IMDB are going to have the exact same HTML structure. This means that an exact query (like the one given by Firebug) will not work.

Let's take a look at the DOM structure again:

<div class="general rating">
  <div class="starbar static">...</div>
  <div class="meta">
    <b>7.7/10</b>...

That div with class="general rating" looks like a good starting point for the query; it is likely that the rating class is unique to the bit of data we are trying to get at. So search for any div that has 'rating' in its class:

//div[contains(@class,'rating')]

From there just ignore all of the fluff in-between and grab the actual rating:

//div[contains(@class,'rating')]//b/text()

Adding the /text() on the end spares us the step of extracting the string value from the node; lxml will just give us a string back.
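To see the difference, a small sketch (inline HTML used for illustration): without /text() the query returns Element objects, with it you get plain strings:

```python
from lxml import etree

# A minimal fragment with the same class/tag shape as the IMDB rating markup.
doc = etree.HTML('<div class="rating"><b>7.7/10</b></div>')

# Query ending at the node: returns lxml Element objects.
nodes = doc.xpath("//div[contains(@class,'rating')]//b")
# Query ending in /text(): returns the string values directly.
strings = doc.xpath("//div[contains(@class,'rating')]//b/text()")

print(nodes[0].text)   # '7.7/10' -- had to reach into the Element
print(strings[0])      # '7.7/10' -- already a string
```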

I found that extracting a large amount of data from a page can get very messy, with tons of queries lying around in the code. So I wrote a function that takes an HTML document and a dictionary of Xpath expressions, runs all of the expressions against the document, and returns a dictionary of the results. Store the dictionary in JSON format and you have a nice clean bit of configuration for extracting information from the web. Here is the JSON document for the IMDB example:

{
	rating: "//div[contains(@class,'rating')]//b/text()",
	cast: {
		each: "//table[@class='cast']/tr",
		actor: "td[@class='nm']//text()",
		character: "td[@class='char']//text()"
	}
}

And the result of running it against a page on IMDB:

{
	'rating': '7.7/10',
	'cast': [
		{'character': 'Dr. Peter Venkman', 'actor': 'Bill Murray'},
		{'character': 'Dr. Raymond Stantz', 'actor': 'Dan Aykroyd'},
		{'character': 'Dana Barrett', 'actor': 'Sigourney Weaver'},
		{'character': 'Dr. Egon Spengler', 'actor': 'Harold Ramis'},
	]
}

Basically, for every item in the dictionary, the Xpath query found in the value is run against the root of the HTML document and the result is stored in a dictionary under the same key as the query. If the query returns only one result, the value stored is just that result; if the query returns an array, you get an array back. When the value is a dictionary rather than an Xpath query, the query found under the 'each' key is used to select a group of nodes, and the rest of the queries in that dictionary are run with each node selected by 'each' as the root.
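The function itself isn't reproduced in the post, but the behavior described above can be sketched roughly like this (doc_extract and its helper are hypothetical names, not the original code; assumes lxml, with a plain dict standing in for the decoded JSON):

```python
from lxml import etree

def doc_extract(queries, html):
    """Run a dict of Xpath expressions against an HTML string.

    Hypothetical re-implementation of the behavior described in the
    post; the original docExtract is linked there.
    """
    return _extract(queries, etree.HTML(html))

def _extract(queries, node):
    result = {}
    for key, query in queries.items():
        if isinstance(query, dict):
            # 'each' selects a group of nodes; the remaining queries
            # run relative to each selected node.
            sub = dict((k, v) for k, v in query.items() if k != "each")
            result[key] = [_extract(sub, n) for n in node.xpath(query["each"])]
        else:
            values = node.xpath(query)
            # Single result -> bare value; multiple results -> list.
            result[key] = values[0] if len(values) == 1 else values
    return result

# Inline sample with the same shape as the IMDB markup discussed above.
html = """
<div class="general rating"><b>7.7/10</b></div>
<table class="cast">
  <tr><td class="nm">Bill Murray</td><td class="char">Dr. Peter Venkman</td></tr>
  <tr><td class="nm">Dan Aykroyd</td><td class="char">Dr. Raymond Stantz</td></tr>
</table>
"""

queries = {
    "rating": "//div[contains(@class,'rating')]//b/text()",
    "cast": {
        "each": "//table[@class='cast']/tr",
        "actor": "td[@class='nm']//text()",
        "character": "td[@class='char']//text()",
    },
}

print(doc_extract(queries, html))
```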

This function, and an example of its use can be found here.

UPDATE:
IMDB has blocked urllib from their site (if you try to fetch a page using urllib you get an error page in response). This basically breaks the entire example. Thanks to DJ for pointing out that it was broken. I do not have the time to rewrite the entire article but here is a different example:

text2 = """
{
	quote: "//div[@id='ft']//small/text()",
	storys: {
		each: "//div[contains(@class,'fhitem-story')]",
		title: "h3//a[@class='datitle']/text()",
		comments: "div//span[@class='commentcnt']/a/text()"
	}
}
"""
print docExtract( demjson.decode(text2), urllib.urlopen("http://www.slashdot.org").read() )

"""
{u'storys': [
	{u'comments': '18', u'title': 'Dutch Hotels Must Register As ISPs'},
	{u'comments': '42', u'title': 'When You Really, Really Want to Upgrade a Tiny Notebook'},
	{u'comments': '163', u'title': 'Canon Blocks Copy Jobs Using Banned Keywords'}],
 u'quote': 'Too much is not enough.'}
"""


This entry was posted on Sunday, February 8th, 2009 at 6:44 pm and is filed under Uncategorized.

