Complex harvesting for content from public sources and email
VALA2014 CONCURRENT SESSION 2: It's All About the Data
Prosentient Systems, NSW
Please tag your comments, tweets, and blog posts about this session: #vala14 and #s6
View the presentation VALA2014 Session 2 Balnaves on the VALA2014 GigTV channel
This paper presents the results of a project to build a complex harvesting system for web and email sources, integrated with open source platforms, to improve the discovery of information about, or relevant to, the organisation from public internet sources. The paper discusses methods of harvesting that draw on a mix of RSS, Google API search and simple web parsing, and presents the results of automated metadata allocation and subsequent manual curation. The project highlights the need to use multiple web scanning techniques: harvesting must be exhaustive enough to catch relevant references, yet specific enough to avoid an unduly large number of false-positive candidates for selection.
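As a rough illustration of the trade-off between exhaustiveness and specificity, the following minimal sketch pools candidates from two of the harvesting methods mentioned above (an RSS feed and simple web parsing), removes duplicate links, and keeps only items whose title mentions a keyword. It is not the project's implementation: the function names, the sample data, the keyword "example org" and the title-only matching rule are all assumptions made for illustration, and the Google API search source is left out because it would need credentials and network access.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def items_from_rss(rss_xml):
    """Return (title, link) pairs from an RSS 2.0 document given as a string."""
    root = ET.fromstring(rss_xml)
    return [
        (item.findtext("title", default="").strip(),
         item.findtext("link", default="").strip())
        for item in root.iter("item")
    ]


class LinkParser(HTMLParser):
    """Collect (anchor text, href) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def items_from_html(html):
    """Return (anchor text, href) pairs found by simple web parsing."""
    parser = LinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


def select_candidates(sources, keywords):
    """Pool items from all sources, drop duplicate links, and keep only
    items whose title contains at least one keyword (case-insensitive)."""
    wanted = [k.lower() for k in keywords]
    seen = set()
    selected = []
    for items in sources:
        for title, link in items:
            if not link or link in seen:
                continue
            if any(k in title.lower() for k in wanted):
                seen.add(link)
                selected.append((title, link))
    return selected


# Hypothetical sample inputs; a real harvester would fetch these.
SAMPLE_RSS = """<rss version="2.0"><channel><title>News</title>
<item><title>Example Org wins award</title><link>http://example.org/a</link></item>
<item><title>Weather update</title><link>http://example.org/b</link></item>
</channel></rss>"""

SAMPLE_HTML = (
    '<p><a href="http://example.org/a">Example Org wins award</a> '
    '<a href="http://example.org/c">Harvesting at Example Org</a></p>'
)

candidates = select_candidates(
    [items_from_rss(SAMPLE_RSS), items_from_html(SAMPLE_HTML)],
    keywords=["example org"],
)
```

Using more than one source widens coverage (the second link appears only on the web page), while the keyword filter and link de-duplication keep the candidate list short enough for manual curation.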
This work is licensed under a Creative Commons Attribution-NonCommercial License.