Logo of Humboldt-Universität zu BerlinLogo of Humboldt-Universität zu Berlin
edoc-Server
Open-Access-Publikationsserver der Humboldt-Universität
de|en
Header image: facade of Humboldt-Universität zu Berlin
View Item 
  • edoc-Server Home
  • Tagungs- und Konferenzbände
  • Everything Changes, Everything Stays the Same? Understanding Information Spaces. Proceedings of the 15th International Symposium of Information Science (ISI 2017)
  • View Item
  • edoc-Server Home
  • Tagungs- und Konferenzbände
  • Everything Changes, Everything Stays the Same? Understanding Information Spaces. Proceedings of the 15th International Symposium of Information Science (ISI 2017)
  • View Item
JavaScript is disabled for your browser. Some features of this site may not work without it.
All of edoc-ServerCommunity & CollectionTitleAuthorSubjectThis CollectionTitleAuthorSubject
PublishLoginRegisterHelp
StatisticsView Usage Statistics
All of edoc-ServerCommunity & CollectionTitleAuthorSubjectThis CollectionTitleAuthorSubject
PublishLoginRegisterHelp
StatisticsView Usage Statistics
View Item 
  • edoc-Server Home
  • Tagungs- und Konferenzbände
  • Everything Changes, Everything Stays the Same? Understanding Information Spaces. Proceedings of the 15th International Symposium of Information Science (ISI 2017)
  • View Item
  • edoc-Server Home
  • Tagungs- und Konferenzbände
  • Everything Changes, Everything Stays the Same? Understanding Information Spaces. Proceedings of the 15th International Symposium of Information Science (ISI 2017)
  • View Item
2017-03-24Konferenzveröffentlichung DOI: 10.18452/1447
news-please
A Generic News Crawler and Extractor
Hamborg, Felix
Meuschke, Norman
Breitinger, Corinna
Gipp, Bela
Philosophische Fakultät
The amount of news published and read online has increased tremendously in recent years, making news data an interesting resource for many research disciplines, such as the social sciences and linguistics. However, large scale collection of news data is cumbersome due to a lack of generic tools for crawling and extracting such data. We present news-please, a generic, multilanguage, open-source crawler and extractor for news that works out-of-the-box for a large variety of news websites. Our system allows crawling arbitrary news websites and extracting the major elements of news articles on those websites, i.e., title, lead paragraph, main content, publication date, author, and main image. Compared to existing tools, news-please features full website extraction requiring only the root URL.
Files in this item
Thumbnail
hamborg.pdf — Adobe PDF — 244.2 Kb
MD5: b8b83c1f110212179ae8e733cb8bfb4f
Cite
BibTeX
EndNote
RIS
InCopyright
Details
DINI-Zertifikat 2019OpenAIRE validatedORCID Consortium
Imprint Policy Contact Data Privacy Statement
A service of University Library and Computer and Media Service
© Humboldt-Universität zu Berlin
 
DOI
10.18452/1447
Permanent URL
https://doi.org/10.18452/1447
HTML
<a href="https://doi.org/10.18452/1447">https://doi.org/10.18452/1447</a>