Classifying compressed physics article abstracts by locality-sensitive hashing

Abstract

Data is often irreversibly converted into hash codes for storage and processing. Because hashing compresses text so that the original content is no longer readable, many conventional text classification techniques cannot be applied to the hashed text. In this study, we demonstrate that locality-sensitive hashing, a similarity search technique, can be used to accurately classify hashed abstracts from the arXiv repository. Such models can be used to recommend subject class labels for new or uncategorized physics article abstracts.
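
To make the idea concrete, the following is a minimal sketch of classification over hash codes alone. It assumes SimHash as the locality-sensitive scheme and a k-nearest-neighbour vote under Hamming distance; the abstract does not say which hash family or classifier the study uses, so both choices, along with the function names, the 64-bit code length, and the toy texts, are illustrative assumptions rather than the authors' method.

```python
import hashlib
from collections import Counter

BITS = 64  # code length; an assumption for this sketch


def simhash(text, bits=BITS):
    """Return a SimHash code; texts sharing many tokens tend to get nearby codes."""
    counts = [0] * bits
    for token in text.lower().split():
        # Stable per-token hash (Python's built-in hash() is salted per process).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # Bit i of the code is set when the tokens voted "1" more often than "0".
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming(a, b):
    """Number of bit positions in which two codes differ."""
    return bin(a ^ b).count("1")


def classify(code, labeled_codes, k=3):
    """Majority label among the k stored codes closest to `code`."""
    nearest = sorted(labeled_codes, key=lambda item: hamming(code, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy training set: only the codes and labels are kept, not the text.
train = [
    (simhash("quantum entanglement of photon pairs"), "quant-ph"),
    (simhash("galaxy cluster dark matter halo"), "astro-ph"),
]
query = simhash("quantum entanglement of photon pairs")
print(classify(query, train, k=1))
```

Only the integer codes and their labels are stored, so the classifier never sees readable text. A linear scan is used here for brevity; a larger collection would typically bucket codes by bit bands to avoid comparing against every stored code.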