Scrape Web Content into Google Sheets

The way in which we consume information on the web has changed considerably with the advent of RSS and other types of feeds. We no longer have to periodically navigate to news sites to check on breaking news, for example, because we have a feed reader to do it or us.

Unfortunately, not all web content is made available in this manner. As an example, some investment products have variable prices that can’t be read directly via the GoogleFinance() function in the manner I discussed here.

Google sheets, however, offers the importXML() function, which grabs XML data from other locations. We can use the fact that HTML is a form of XML (not strictly true, but near enough) to grab relevant pieces of data from websites that don’t offer the information by any more convenient means.

As an example, let’s grab the current value of the Virgin Income investment fund. A quick web search locates a page offering the fund value:

fundprice

Click image to enlarge

Examining the page source determines that the element we want to scrape is a span element with class=”bid price-divide”. A quick check of the code reveals that this is the only element of that class.

Now we can use the importXML() function to grab the content:

=importXML("http://[URL of page]","//span[@class='bid price-divide']")

The second parameter passed to the function is known as an xpath query – for the purposes of this example, the syntax:

//span[@class='bid price-divide']

simply grabs the content of a span element with the given class attribute. You can use more complex xpath queries if you need to; there’s information here on how it all works.

Here’s the function in action:

fundprice2

Click image to enlarge

Of course, to use this figure in calculations you’d have to use Google Sheet’s other functions to remove the ‘pence’ symbol and format the result as a number.