An Example-Based Mapping Method for Text Categorization and Retrieval

Yiming Yang; Christopher G. Chute

doi:10.1145/183422.183424

Research output: Contribution to journal › Article › peer-review

269 Scopus citations

Abstract

A unified model for text categorization and text retrieval is introduced. We use a training set of manually categorized documents to learn word-category associations, and use these associations to predict the categories of arbitrary documents. Similarly, we use a training set of queries and their related documents to obtain empirical associations between query words and indexing terms of documents, and use these associations to predict the related documents of arbitrary queries. A Linear Least Squares Fit (LLSF) technique is employed to estimate the likelihood of these associations. Document collections from the MEDLINE database and Mayo patient records are used for studies on the effectiveness of our approach, and on how much the effectiveness depends on the choices of training data, indexing language, word-weighting scheme, and morphological canonicalization. Alternative methods are also tested on these data collections for comparison. It is evident that the LLSF approach uses the relevance information effectively within human decisions of categorization and retrieval, and achieves a semantic mapping of free texts to their representations in an indexing language. Such a semantic mapping leads to a significant improvement in categorization and retrieval, compared to alternative approaches.
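
As a rough illustration of the least-squares idea in the abstract, the sketch below fits a word-to-category mapping on a toy training set and uses it to rank categories for a new document. It is a minimal sketch, not the authors' implementation: the vocabulary, categories, matrices, and the helper names `fit_llsf` and `rank_categories` are all hypothetical, and the notation (a word-by-document matrix `A`, a category-by-document matrix `B`, and a mapping `F` minimizing the Frobenius norm of `F A - B`) is an assumption made for this example. The minimizer is computed with NumPy's pseudo-inverse.

```python
import numpy as np

# Hypothetical toy vocabulary and category set (not taken from the paper).
words = ["heart", "attack", "lung", "cancer"]
categories = ["cardiology", "oncology"]

# A: word-by-document matrix; each column is one training document.
# Columns: "heart attack", "heart", "lung cancer", "cancer".
A = np.array([
    [1, 1, 0, 0],  # heart
    [1, 0, 0, 0],  # attack
    [0, 0, 1, 0],  # lung
    [0, 0, 1, 1],  # cancer
], dtype=float)

# B: category-by-document matrix; each column holds the manually
# assigned categories of the corresponding training document.
B = np.array([
    [1, 1, 0, 0],  # cardiology
    [0, 0, 1, 1],  # oncology
], dtype=float)


def fit_llsf(A, B):
    """Return the minimum-norm F that minimizes ||F A - B||_F."""
    return B @ np.linalg.pinv(A)


def rank_categories(F, x, categories):
    """Score every category for word vector x and sort by descending score."""
    scores = F @ x
    order = np.argsort(-scores)
    return [(categories[i], float(scores[i])) for i in order]


F = fit_llsf(A, B)

# New document containing "lung" and "cancer" (same word order as `words`).
x = np.array([0, 0, 1, 1], dtype=float)
ranking = rank_categories(F, x, categories)
print(ranking)
```

Under the same assumptions, the retrieval case would replace the category matrix with a matrix of indexing terms for the documents related to each training query, so that a new query is mapped into the indexing language before matching.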

Original language	English (US)
Pages (from-to)	252-277
Number of pages	26
Journal	ACM Transactions on Information Systems (TOIS)
Volume	12
Issue number	3
DOIs	https://doi.org/10.1145/183422.183424
State	Published - Jan 7 1994
Externally published	Yes

Keywords

document categorization
query categorization
statistical learning of human decisions

ASJC Scopus subject areas

Information Systems
General Business, Management and Accounting
Computer Science Applications

Cite this

@article{29c04af6692c426a8ba555bdd481c388,

title = "An Example-Based Mapping Method for Text Categorization and Retrieval",

abstract = "A unified model for text categorization and text retrieval is introduced. We use a training set of manually categorized documents to learn word-category associations, and use these associations to predict the categories of arbitrary documents. Similarly, we use a training set of queries and their related documents to obtain empirical associations between query words and indexing terms of documents, and use these associations to predict the related documents of arbitrary queries. A Linear Least Squares Fit (LLSF) technique is employed to estimate the likelihood of these associations. Document collections from the MEDLINE database and Mayo patient records are used for studies on the effectiveness of our approach, and on how much the effectiveness depends on the choices of training data, indexing language, word-weighting scheme, and morphological canonicalization. Alternative methods are also tested on these data collections for comparison. It is evident that the LLSF approach uses the relevance information effectively within human decisions of categorization and retrieval, and achieves a semantic mapping of free texts to their representations in an indexing language. Such a semantic mapping leads to a significant improvement in categorization and retrieval, compared to alternative approaches.",

keywords = "document categorization, query categorization, statistical learning of human decisions",

author = "Yang, Yiming and Chute, {Christopher G.}",

year = "1994",

month = jan,

day = "7",

doi = "10.1145/183422.183424",

language = "English (US)",

volume = "12",

pages = "252--277",

journal = "ACM Transactions on Information Systems (TOIS)",

issn = "1046-8188",

publisher = "Association for Computing Machinery (ACM)",

number = "3",

}

TY - JOUR

T1 - An Example-Based Mapping Method for Text Categorization and Retrieval

AU - Yang, Yiming

AU - Chute, Christopher G.

PY - 1994/1/7

Y1 - 1994/1/7

N2 - A unified model for text categorization and text retrieval is introduced. We use a training set of manually categorized documents to learn word-category associations, and use these associations to predict the categories of arbitrary documents. Similarly, we use a training set of queries and their related documents to obtain empirical associations between query words and indexing terms of documents, and use these associations to predict the related documents of arbitrary queries. A Linear Least Squares Fit (LLSF) technique is employed to estimate the likelihood of these associations. Document collections from the MEDLINE database and Mayo patient records are used for studies on the effectiveness of our approach, and on how much the effectiveness depends on the choices of training data, indexing language, word-weighting scheme, and morphological canonicalization. Alternative methods are also tested on these data collections for comparison. It is evident that the LLSF approach uses the relevance information effectively within human decisions of categorization and retrieval, and achieves a semantic mapping of free texts to their representations in an indexing language. Such a semantic mapping leads to a significant improvement in categorization and retrieval, compared to alternative approaches.

AB - A unified model for text categorization and text retrieval is introduced. We use a training set of manually categorized documents to learn word-category associations, and use these associations to predict the categories of arbitrary documents. Similarly, we use a training set of queries and their related documents to obtain empirical associations between query words and indexing terms of documents, and use these associations to predict the related documents of arbitrary queries. A Linear Least Squares Fit (LLSF) technique is employed to estimate the likelihood of these associations. Document collections from the MEDLINE database and Mayo patient records are used for studies on the effectiveness of our approach, and on how much the effectiveness depends on the choices of training data, indexing language, word-weighting scheme, and morphological canonicalization. Alternative methods are also tested on these data collections for comparison. It is evident that the LLSF approach uses the relevance information effectively within human decisions of categorization and retrieval, and achieves a semantic mapping of free texts to their representations in an indexing language. Such a semantic mapping leads to a significant improvement in categorization and retrieval, compared to alternative approaches.

KW - document categorization

KW - query categorization

KW - statistical learning of human decisions

UR - http://www.scopus.com/inward/record.url?scp=0028461554&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0028461554&partnerID=8YFLogxK

U2 - 10.1145/183422.183424

DO - 10.1145/183422.183424

M3 - Article

AN - SCOPUS:0028461554

SN - 1046-8188

VL - 12

SP - 252

EP - 277

JO - ACM Transactions on Information Systems (TOIS)

JF - ACM Transactions on Information Systems (TOIS)

IS - 3

ER -
