Reuters-21578 Text Categorization Collection

Datasets for single-label text categorization

The datasets below are taken from Ana Cardoso-Cachopo's Home Page.

20 Newsgroups

Class                      # train docs   # test docs   Total # docs
alt.atheism                         480           319            799
comp.graphics                       584           389            973
comp.os.ms-windows.misc             572           394            966
comp.sys.ibm.pc.hardware            590           392            982
comp.sys.mac.hardware               578           385            963
comp.windows.x                      593           392            985
misc.forsale                        585           390            975
rec.autos                           594           395            989
rec.motorcycles                     598           398            996
rec.sport.baseball                  597           397            994
rec.sport.hockey                    600           399            999
sci.crypt                           595           396            991
sci.electronics                     591           393            984
sci.med                             594           396            990
sci.space                           593           394            987
soc.religion.christian              598           398            996
talk.politics.guns                  545           364            909
talk.politics.mideast               564           376            940
talk.politics.misc                  465           310            775
talk.religion.misc                  377           251            628
Total                             11293          7528          18821

R52 and R8 of Reuters 21578

Reuters 21578: number of documents by number of topics per document.

# Topics   # train docs   # test docs   # other   Total # docs
0                  1828           280      8103          10211
1                  6552          2581       361           9494
2                   890           309       135           1334
3                   191            64        55            310
4                    62            32        10            104
5                    39            14         8             61
6                    21             6         3             30
7                     7             4         0             11
8                     4             2         0              6
9                     4             2         0              6
10                    3             1         0              4
11                    0             1         1              2
12                    1             1         0              2
13                    0             0         0              0
14                    0             2         0              2
15                    0             0         0              0
16                    1             0         0              1

For R8 and R52, the distribution of documents per class is as follows:

R8
Class      # train docs   # test docs   Total # docs
acq                1596           696           2292
crude               253           121            374
earn               2840          1083           3923
grain                41            10             51
interest            190            81            271
money-fx            206            87            293
ship                108            36            144
trade               251            75            326
Total              5485          2189           7674

R52
Class             # train docs   # test docs   Total # docs
acq                       1596           696           2292
alum                        31            19             50
bop                         22             9             31
carcass                      6             5             11
cocoa                       46            15             61
coffee                      90            22            112
copper                      31            13             44
cotton                      15             9             24
cpi                         54            17             71
cpu                          3             1              4
crude                      253           121            374
dlr                          3             3              6
earn                      2840          1083           3923
fuel                         4             7             11
gas                         10             8             18
gnp                         58            15             73
gold                        70            20             90
grain                       41            10             51
heat                         6             4             10
housing                     15             2             17
income                       7             4             11
instal-debt                  5             1              6
interest                   190            81            271
ipi                         33            11             44
iron-steel                  26            12             38
jet                          2             1              3
jobs                        37            12             49
lead                         4             4              8
lei                         11             3             14
livestock                   13             5             18
lumber                       7             4             11
meal-feed                    6             1              7
money-fx                   206            87            293
money-supply               123            28            151
nat-gas                     24            12             36
nickel                       3             1              4
orange                      13             9             22
pet-chem                    13             6             19
platinum                     1             2              3
potato                       2             3              5
reserves                    37            12             49
retail                      19             1             20
rubber                      31             9             40
ship                       108            36            144
strategic-metal              9             6             15
sugar                       97            25            122
tea                          2             3              5
tin                         17            10             27
trade                      251            75            326
veg-oil                     19            11             30
wpi                         14             9             23
zinc                         8             5             13
Total                     6532          2568           9100

WebKB

Class     # train docs   # test docs   Total # docs
project            336           168            504
course             620           310            930
faculty            750           374           1124
student           1097           544           1641
Total             2803          1396           4199

The files

The following files are available for download:

20 Newsgroups
              Train                             Test
# documents   11293 docs                        7528 docs
all-terms     20ng-train-all-terms (15.91 Mb)   20ng-test-all-terms (10.31 Mb)
no-short      20ng-train-no-short (14.06 Mb)    20ng-test-no-short (9.12 Mb)
no-stop       20ng-train-no-stop (10.59 Mb)     20ng-test-no-stop (6.86 Mb)
stemmed       20ng-train-stemmed (9.46 Mb)      20ng-test-stemmed (6.13 Mb)

Reuters-21578 R8
              Train                          Test
# documents   5485 docs                      2189 docs
all-terms     r8-train-all-terms (3.20 Mb)   r8-test-all-terms (1.14 Mb)
no-short      r8-train-no-short (2.90 Mb)    r8-test-no-short (1.03 Mb)
no-stop       r8-train-no-stop (2.42 Mb)     r8-test-no-stop (0.86 Mb)
stemmed       r8-train-stemmed (2.13 Mb)     r8-test-stemmed (0.76 Mb)

Reuters-21578 R52
              Train                           Test
# documents   6532 docs                       2568 docs
all-terms     r52-train-all-terms (4.08 Mb)   r52-test-all-terms (1.45 Mb)
no-short      r52-train-no-short (3.71 Mb)    r52-test-no-short (1.32 Mb)
no-stop       r52-train-no-stop (3.08 Mb)     r52-test-no-stop (1.09 Mb)
stemmed       r52-train-stemmed (2.71 Mb)     r52-test-stemmed (0.96 Mb)

WebKB
              Train                           Test
# documents   2803 docs                       1396 docs
stemmed       webkb-train-stemmed (2.40 Mb)   webkb-test-stemmed (1.20 Mb)

File description

All of these are text files containing one document per line.

Each document is composed of its class and its terms.

Each document is represented by a "word" representing the document's class, a TAB character and then a sequence of "words" delimited by spaces, representing the terms contained in the document.
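The format above is easy to parse: split each line once on the TAB character, then split the remainder on whitespace. A minimal sketch (the sample lines below are illustrative, not taken from the actual files):

```python
def parse_line(line):
    """Split one document line into (class label, list of terms)."""
    # The class comes first, separated from the terms by a single TAB.
    label, _, text = line.rstrip("\n").partition("\t")
    return label, text.split()

# Two made-up lines in the class<TAB>terms format described above.
sample = [
    "earn\tcompany reports quarterly profit rise",
    "acq\tfirm agrees to acquire rival unit",
]
for line in sample:
    label, terms = parse_line(line)
    print(label, len(terms))  # -> "earn 5", then "acq 6"
```

To load a real file such as r8-train-stemmed, apply `parse_line` to every non-empty line of the file.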

Pre-processing

To obtain the present files from the original datasets, Ana applied the following pre-processing:

  1. all-terms Obtained from the original datasets by applying the following transformations:
    1. Substitute TAB, NEWLINE and RETURN characters with SPACE.
    2. Keep only letters (that is, turn punctuation, numbers, etc. into SPACES).
    3. Turn all letters to lowercase.
    4. Substitute multiple SPACES with a single SPACE.
    5. Add the title/subject of each document at the beginning of the document's text.
  2. no-short Obtained from the previous file by removing words that are less than 3 characters long. For example, this removes "he" but keeps "him".
  3. no-stop Obtained from the previous file by removing the 524 SMART stopwords. Some of them had already been removed, because they were shorter than 3 characters.
  4. stemmed Obtained from the previous file by applying Porter's Stemmer to the remaining words.
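The first three stages of this pipeline can be sketched as follows. This is a rough reconstruction, not the original script: the stopword set here is a tiny placeholder for the 524-word SMART list, the title-prepending step is omitted, and the stemming stage is left out because Porter stemming needs an external implementation.

```python
import re

# Placeholder for the 524 SMART stopwords used in the real pipeline.
STOPWORDS = {"the", "and", "for", "with"}

def all_terms(text):
    text = re.sub(r"[\t\n\r]", " ", text)    # 1. TAB/NEWLINE/RETURN -> SPACE
    text = re.sub(r"[^a-zA-Z]", " ", text)   # 2. keep only letters
    text = text.lower()                      # 3. lowercase
    return re.sub(r" +", " ", text).strip()  # 4. collapse multiple SPACES

def no_short(text):
    # Drop words shorter than 3 characters ("he" goes, "him" stays).
    return " ".join(w for w in text.split() if len(w) >= 3)

def no_stop(text):
    # Drop stopwords from the (placeholder) SMART list.
    return " ".join(w for w in text.split() if w not in STOPWORDS)

raw = "Oil prices rose 3% on Tuesday,\nthe biggest gain for weeks."
print(no_stop(no_short(all_terms(raw))))
# -> oil prices rose tuesday biggest gain weeks
```

Each stage consumes the output of the previous one, mirroring how each distributed file variant is derived from the one before it.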

Last updated April 2007.