Experimenting with YQL – jobs aggregator

1 Jul, 2009

As I blogged previously, Yahoo!’s YQL really caught my attention during the webDU conference. I think it is really cool that you can leverage your SQL and XPath knowledge to query just about any website you want.

I can think of at least one application of this technology. Imagine there is a company that you’d like to work for. You go to its site and, great, it has a jobs section, but you don’t see any position that matches your profile. So you check again a few weeks later, and again, and again. Now imagine there is more than one company that you want to keep watching – the manual process becomes very tedious. Of course, if these companies provided RSS feeds for their jobs sections, we could simply let our RSS aggregator do the job; however, many of these companies (if not most) don’t provide one.

And this is where I think YQL can be of help. You can simply create your own jobs aggregator for different sites using YQL.

In this example, I built a ColdFusion template to “watch” job openings at UNSW and Gruden. Well, actually the site that I would really like to aggregate is Whirlpool’s Job Board – but querying that site using YQL isn’t possible because its robots.txt denies crawlers (Simon, give us an RSS feed, please :) ).

Below is the code:

[coldfusion]
<cfsilent>
<cfset yahooURL = "http://query.yahooapis.com/v1/yql?">
<!--- TODO: need to work out oAuth - the public URL will have to do for the moment --->
<cfset yahooURL = "http://query.yahooapis.com/v1/public/yql?">

<cfset lCompanies = "UNSW,gruden">

<!--- START | Prepare the YQLs --->
<cfsavecontent variable="UNSWYQL">
	SELECT * FROM html
	WHERE
		url IN (
			"http://www.hr.unsw.edu.au/services/recruitment/newjobres.html"
			, "http://www.hr.unsw.edu.au/services/recruitment/newjobgen.html"
			, "http://www.hr.unsw.edu.au/services/recruitment/newjobaff.html"
		)
		AND xpath='//a[@class="iw_tst"]'
</cfsavecontent>

<!--- Cannot get data from whirlpool job board - crawl is restricted by robots.txt
<cfsavecontent variable="WHIRLPOOLYQL">
	SELECT * FROM html
	WHERE
		url = "http://whirlpool.net.au/jobs/?state=NSW"
		AND xpath='//div[@id="jobs"]/li[@class=""]'
</cfsavecontent>
--->

<cfsavecontent variable="grudenYQL">
	SELECT * FROM html
	WHERE
		url = "http://gruden.com/index.cfm/p/jobs"
		AND xpath='//div[@id="gspot"]/ul/li/a'
</cfsavecontent>
<!--- END | Prepare the YQLs --->

<!--- Run Query
	1. Loop to call the Yahoo! URL for each company
	2. Save the result in a variable called #company#Result, i.e. UNSWResult
--->
<cfloop list="#lCompanies#" index="company">
	<cfset theYQL = Evaluate( company & "YQL" )>
	<!--- URL-encode the statement so spaces, quotes and slashes survive the query string --->
	<cfset theURL = "#yahooURL#q=#URLEncodedFormat( Trim( theYQL ) )#">

	<cfhttp
		url 	= "#theURL#"
		result 	= "st#company#Result">
	</cfhttp>

	<cfset variables[ "#company#Result" ] = variables[ "st#company#Result" ].fileContent>
</cfloop>

<!--- Transform the data ( good place to use XSLT here ) --->
<!--- UNSW --->
<cfset UNSWXML = XMLParse( UNSWResult )>
<cfset arrUNSWResult = XMLSearch( UNSWXML, "//results/a" )>

<!--- Gruden --->
<cfset grudenXML = XMLParse( grudenResult )>
<cfset arrGrudenResult = XMLSearch( grudenXML, "//results/a" )>

</cfsilent>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Yahoo! Query Language Demo using the all powerful ColdFusion</title>
</head>

<body>
<cfoutput>
<h1>YQL - It’s like SELECT * FROM THE INTERNET DUDE!</h1>

<h2>UNSW</h2>
<p>
<ul>
<cfloop from="1" to="#ArrayLen( arrUNSWResult )#" index="index">
	<cfset jobTitle = arrUNSWResult[ index ][ "XMLText" ]>
	<cfset jobLink = arrUNSWResult[ index ][ "XMLAttributes" ][ "href" ]>
	<li><a href="http://www.hr.unsw.edu.au/#jobLink#">#jobTitle#</a></li>
</cfloop>
</ul>
</p>

<h2>Gruden</h2>
<p>
<ul>
<cfloop from="1" to="#ArrayLen( arrGrudenResult )#" index="index">
	<cfset jobTitle = arrGrudenResult[ index ][ "XMLText" ]>
	<cfset jobLink = arrGrudenResult[ index ][ "XMLAttributes" ][ "href" ]>
	<li><a href="http://gruden.com/index.cfm/p/jobs#jobLink#">#jobTitle#</a></li>
</cfloop>
</ul>
</p>

</cfoutput>
</body>
</html>

[/coldfusion]

And here is the snapshot of the result:
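
The comment in the template flags the transform step as a good place for XSLT. Here is a minimal sketch of that idea, assuming the YQL response keeps the results/a shape that the XMLSearch calls rely on. The sample XML, the jobsXSL stylesheet and the baseURL parameter are all made up for illustration; XmlTransform is the built-in ColdFusion function.

[coldfusion]
<!--- Hypothetical sample data standing in for a YQL response --->
<cfsavecontent variable="sampleYQLResult">
<query>
	<results>
		<a href="jobs/example-1.html">Example job one</a>
		<a href="jobs/example-2.html">Example job two</a>
	</results>
</query>
</cfsavecontent>

<!--- Hypothetical stylesheet: turns every results/a element into a list item --->
<cfsavecontent variable="jobsXSL">
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
	<xsl:output method="html" />
	<xsl:param name="baseURL" />
	<xsl:template match="/">
		<ul>
			<xsl:for-each select="//results/a">
				<li><a href="{$baseURL}{@href}"><xsl:value-of select="." /></a></li>
			</xsl:for-each>
		</ul>
	</xsl:template>
</xsl:stylesheet>
</cfsavecontent>

<!--- baseURL is an assumed parameter name; bracket notation keeps its case,
      which matters because XSLT parameter names are case-sensitive --->
<cfset xslParams = StructNew()>
<cfset xslParams[ "baseURL" ] = "http://www.example.com/">

<!--- Trim both strings so no whitespace precedes the root elements --->
<cfset jobsHTML = XmlTransform( XMLParse( Trim( sampleYQLResult ) ), Trim( jobsXSL ), xslParams )>
<cfoutput>#jobsHTML#</cfoutput>
[/coldfusion]

With something like this, one stylesheet could replace the two near-identical output loops, with only the base URL changing per company.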

For some reason the code doesn’t work on Railo though – I got a “The markup in the document preceding the root element must be well-formed.” error on the XMLParse bit. Oh well, that’s a thought for another day.
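
One way to narrow down an error like this is to look at what actually came back before handing it to XMLParse. The sketch below assumes, without having confirmed it, that the response body on Railo is not clean XML (an error page, say, or stray characters before the root element). The function name parseYQLResponse and the exception types are made up for illustration.

[coldfusion]
<!--- Hypothetical helper: validates a cfhttp result struct before parsing it --->
<cffunction name="parseYQLResponse" returntype="any" output="false">
	<cfargument name="httpResult" type="struct" required="true">

	<!--- Trim so leading whitespace cannot precede the root element --->
	<cfset var body = Trim( arguments.httpResult.fileContent )>

	<!--- statusCode looks like "200 OK"; anything else means the request itself failed --->
	<cfif Left( arguments.httpResult.statusCode, 3 ) NEQ "200">
		<cfthrow type="YQL.RequestFailed" message="YQL request failed: #arguments.httpResult.statusCode#">
	</cfif>

	<!--- Surface the start of the body instead of a bare parser error --->
	<cfif NOT IsXML( body )>
		<cfthrow type="YQL.InvalidXML" message="Response is not well-formed XML" detail="#Left( body, 200 )#">
	</cfif>

	<cfreturn XMLParse( body )>
</cffunction>
[/coldfusion]

In the template above, it could be called as parseYQLResponse( variables[ "stUNSWResult" ] ) in place of XMLParse( UNSWResult ), so a failed request or an HTML error page shows up with its status code and first few characters rather than as a parser error.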