Complex harvesting for content from public sources and email

VALA2014 CONCURRENT SESSION 2: It's All About the Data
Tuesday 4 February 2014, 12:00 - 12:30
Persistent URL: http://www.vala.org.au/vala2014-proceedings/vala2014-session-2-balnaves

Edmund Balnaves

Prosentient Systems, NSW

Please tag your comments, tweets, and blog posts about this session: #vala14 and #s6

VALA Peer Reviewed

Paper (PDF): VALA2014-Session-2-Balnaves-Paper (156.54 kB)

View the presentation VALA2014 Session 2 Balnaves on the VALA2014 GigTV channel

Abstract

This paper presents the results of a project to build a complex harvesting system for web and email sources, integrated with open source platforms, to improve discovery of information about, or relevant to, the organisation from public internet sources. The paper discusses methods of harvesting, drawing on a mix of RSS, Google API search and simple web parsing. It presents the results of automated metadata allocation and subsequent manual curation. The project highlights the need to use multiple web scanning techniques so as to be sufficiently exhaustive to catch relevant references, yet sufficiently specific to avoid an unduly large number of false-positive candidates for selection.
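As a rough illustration of the trade-off between exhaustive and specific harvesting described above, the following Python sketch filters items from an RSS feed by phrase. It is not the system described in the paper: the sample feed, the function name `harvest_candidates`, and the `required_phrases` and `excluded_terms` parameters are all assumptions made for this example, and a real harvester would fetch feeds over HTTP and combine several sources.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, embedded so the sketch runs without network access.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example news</title>
    <item>
      <title>Example Organisation opens new library</title>
      <link>http://example.org/a</link>
      <description>A new branch was announced today.</description>
    </item>
    <item>
      <title>Weather update</title>
      <link>http://example.org/b</link>
      <description>Rain is expected on Tuesday.</description>
    </item>
    <item>
      <title>Job vacancy at Example Organisation</title>
      <link>http://example.org/c</link>
      <description>Applications close next week.</description>
    </item>
  </channel>
</rss>"""


def harvest_candidates(rss_text, required_phrases, excluded_terms=()):
    """Return feed items that mention a required phrase and no excluded term.

    required_phrases widens the net (more phrases, more candidates);
    excluded_terms narrows it again to reduce false positives.
    """
    root = ET.fromstring(rss_text)
    candidates = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        description = item.findtext("description", default="")
        text = f"{title} {description}".lower()

        # Keep only items that mention at least one phrase of interest.
        matched = [p for p in required_phrases if p.lower() in text]
        if not matched:
            continue
        # Drop items containing a term known to produce false positives.
        if any(term.lower() in text for term in excluded_terms):
            continue
        candidates.append({"title": title, "link": link, "matched": matched})
    return candidates


if __name__ == "__main__":
    for hit in harvest_candidates(
        SAMPLE_RSS, ["Example Organisation"], excluded_terms=["vacancy"]
    ):
        print(hit["title"], hit["link"])
```

Candidates that pass a filter of this kind would still go on to manual curation, as the abstract notes. Adding phrases broadens coverage, while adding exclusions trims the candidate list.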

This work is licensed under a Creative Commons Attribution-NonCommercial License.

 
