Subject-based retrieval of scientific documents, case study: Retrieval of Information Technology scientific articles

Azadeh Mohebi (Iranian Research Institute for Information Science and Technology (IranDoc), Tehran, Iran)
Mehri Sedighi (Iranian Research Institute for Information Science and Technology (IranDoc), Tehran, Iran)
Zahra Zargaran (Iranian Research Institute for Information Science and Technology (IranDoc), Tehran, Iran)

Library Review

ISSN: 0024-2535

Article publication date: 5 September 2017

Abstract

Purpose

The purpose of this paper is to introduce an approach for retrieving a set of scientific articles in the field of Information Technology (IT) from a scientific database such as Web of Science (WoS), so that scientometric indices can be applied to them and compared with those of other fields.

Design/methodology/approach

The authors propose a statistical classification-based approach for extracting IT-related articles. First, a probabilistic model of the subject IT is built using keyphrase extraction techniques. Then, IT-related articles are retrieved from all Iranian papers in WoS using a Bayesian classification scheme: based on the probabilistic IT model, each article in the database is assigned an IT membership probability, and the articles with the highest probabilities are retrieved.
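The scoring step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact model, so the naive Bayes independence assumption, the two-class setup (IT versus a background class), the smoothing floor, the prior, and all function and parameter names below are assumptions made for illustration only.

```python
import math

def it_membership_probability(text, it_model, background_model,
                              prior_it=0.5, floor=1e-6):
    """Posterior P(IT | document) under a naive Bayes assumption.

    it_model and background_model are hypothetical dictionaries mapping a
    keyphrase to its probability under the IT class and under a generic
    background class. Phrases missing from a model get a small floor value.
    """
    text = text.lower()
    log_it = math.log(prior_it)
    log_other = math.log(1.0 - prior_it)
    for phrase in set(it_model) | set(background_model):
        count = text.count(phrase)
        if count == 0:
            continue
        log_it += count * math.log(it_model.get(phrase, floor))
        log_other += count * math.log(background_model.get(phrase, floor))
    # Normalise in log space so long documents do not underflow.
    m = max(log_it, log_other)
    p_it = math.exp(log_it - m)
    p_other = math.exp(log_other - m)
    return p_it / (p_it + p_other)

def retrieve_top(articles, it_model, background_model, k):
    """Return the k articles with the highest IT membership probability."""
    scored = [(it_membership_probability(a, it_model, background_model), a)
              for a in articles]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

In this sketch, an article that contains no known keyphrase simply keeps the prior probability, so the choice of prior and of a retrieval cut-off `k` would need to be tuned for a real collection.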

Findings

The authors extracted a set of IT keyphrases, comprising 1,497 terms, through the keyphrase extraction process for the probabilistic model. They evaluated the proposed retrieval approach against two alternatives: the query-based approach, in which articles are retrieved from WoS using a set of queries composed of a limited number of IT keywords, and the research area-based approach, in which articles are retrieved using WoS categorizations and research areas. The evaluation and comparison results show that the proposed approach generates more accurate results while retrieving more IT-related articles.

Research limitations/implications

Although this research is limited to the IT subject, the approach can be generalized to any subject. However, for multidisciplinary topics such as IT, special attention should be given to the keyphrase extraction phase. In this research, a bigram model is used; however, it can be extended to trigrams.
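To make the bigram-versus-trigram point concrete, the following Python sketch counts candidate n-grams over a collection of texts, with `n` as a parameter. It only shows how candidate phrases of a given length could be generated; the abstract does not describe the authors' actual extraction or ranking technique, and the tokenization rule and the function name are assumptions.

```python
import re
from collections import Counter

def ngram_counts(texts, n=2):
    """Count word n-grams across a collection of texts.

    n=2 yields bigram candidates; n=3 extends the same idea to trigrams.
    Tokens are lowercase runs of letters and digits (an assumed, simple rule).
    """
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts
```

Switching from bigrams to trigrams then amounts to calling `ngram_counts(texts, n=3)`, at the cost of a larger and sparser candidate set.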

Originality/value

This paper introduces an integrated approach for retrieving IT-related documents from a collection of scientific documents. The approach has two main phases: building a model that represents the topic IT, and retrieving documents based on that model. The model is based on a set of keyphrases extracted from a collection of IT articles. The extraction technique does not rely on Term Frequency-Inverse Document Frequency, since almost all of the articles in the collection share the same set of keyphrases. In addition, a probabilistic membership score is defined to retrieve the IT articles from a collection of scientific articles.

Citation

Mohebi, A., Sedighi, M. and Zargaran, Z. (2017), "Subject-based retrieval of scientific documents, case study: Retrieval of Information Technology scientific articles", Library Review, Vol. 66 No. 6/7, pp. 549-569. https://doi.org/10.1108/LR-10-2016-0090

Publisher

Emerald Publishing Limited

Copyright © 2017, Emerald Publishing Limited
