| 20 Newsgroups class | # train docs | # test docs | Total # docs |
|---|---|---|---|
| alt.atheism | 480 | 319 | 799 |
| comp.graphics | 584 | 389 | 973 |
| comp.os.ms-windows.misc | 572 | 394 | 966 |
| comp.sys.ibm.pc.hardware | 590 | 392 | 982 |
| comp.sys.mac.hardware | 578 | 385 | 963 |
| comp.windows.x | 593 | 392 | 985 |
| misc.forsale | 585 | 390 | 975 |
| rec.autos | 594 | 395 | 989 |
| rec.motorcycles | 598 | 398 | 996 |
| rec.sport.baseball | 597 | 397 | 994 |
| rec.sport.hockey | 600 | 399 | 999 |
| sci.crypt | 595 | 396 | 991 |
| sci.electronics | 591 | 393 | 984 |
| sci.med | 594 | 396 | 990 |
| sci.space | 593 | 394 | 987 |
| soc.religion.christian | 598 | 398 | 996 |
| talk.politics.guns | 545 | 364 | 909 |
| talk.politics.mideast | 564 | 376 | 940 |
| talk.politics.misc | 465 | 310 | 775 |
| talk.religion.misc | 377 | 251 | 628 |
| Total | 11293 | 7528 | 18821 |

| Reuters-21578: # topics | # train docs | # test docs | # other | Total # docs |
|---|---|---|---|---|
| 0 | 1828 | 280 | 8103 | 10211 |
| 1 | 6552 | 2581 | 361 | 9494 |
| 2 | 890 | 309 | 135 | 1334 |
| 3 | 191 | 64 | 55 | 310 |
| 4 | 62 | 32 | 10 | 104 |
| 5 | 39 | 14 | 8 | 61 |
| 6 | 21 | 6 | 3 | 30 |
| 7 | 7 | 4 | 0 | 11 |
| 8 | 4 | 2 | 0 | 6 |
| 9 | 4 | 2 | 0 | 6 |
| 10 | 3 | 1 | 0 | 4 |
| 11 | 0 | 1 | 1 | 2 |
| 12 | 1 | 1 | 0 | 2 |
| 13 | 0 | 0 | 0 | 0 |
| 14 | 0 | 2 | 0 | 2 |
| 15 | 0 | 0 | 0 | 0 |
| 16 | 1 | 0 | 0 | 1 |

The distribution of documents per class for R8 and R52 is as follows:

| R8 class | # train docs | # test docs | Total # docs |
|---|---|---|---|
| acq | 1596 | 696 | 2292 |
| crude | 253 | 121 | 374 |
| earn | 2840 | 1083 | 3923 |
| grain | 41 | 10 | 51 |
| interest | 190 | 81 | 271 |
| money-fx | 206 | 87 | 293 |
| ship | 108 | 36 | 144 |
| trade | 251 | 75 | 326 |
| Total | 5485 | 2189 | 7674 |

| R52 class | # train docs | # test docs | Total # docs |
|---|---|---|---|
| acq | 1596 | 696 | 2292 |
| alum | 31 | 19 | 50 |
| bop | 22 | 9 | 31 |
| carcass | 6 | 5 | 11 |
| cocoa | 46 | 15 | 61 |
| coffee | 90 | 22 | 112 |
| copper | 31 | 13 | 44 |
| cotton | 15 | 9 | 24 |
| cpi | 54 | 17 | 71 |
| cpu | 3 | 1 | 4 |
| crude | 253 | 121 | 374 |
| dlr | 3 | 3 | 6 |
| earn | 2840 | 1083 | 3923 |
| fuel | 4 | 7 | 11 |
| gas | 10 | 8 | 18 |
| gnp | 58 | 15 | 73 |
| gold | 70 | 20 | 90 |
| grain | 41 | 10 | 51 |
| heat | 6 | 4 | 10 |
| housing | 15 | 2 | 17 |
| income | 7 | 4 | 11 |
| instal-debt | 5 | 1 | 6 |
| interest | 190 | 81 | 271 |
| ipi | 33 | 11 | 44 |
| iron-steel | 26 | 12 | 38 |
| jet | 2 | 1 | 3 |
| jobs | 37 | 12 | 49 |
| lead | 4 | 4 | 8 |
| lei | 11 | 3 | 14 |
| livestock | 13 | 5 | 18 |
| lumber | 7 | 4 | 11 |
| meal-feed | 6 | 1 | 7 |
| money-fx | 206 | 87 | 293 |
| money-supply | 123 | 28 | 151 |
| nat-gas | 24 | 12 | 36 |
| nickel | 3 | 1 | 4 |
| orange | 13 | 9 | 22 |
| pet-chem | 13 | 6 | 19 |
| platinum | 1 | 2 | 3 |
| potato | 2 | 3 | 5 |
| reserves | 37 | 12 | 49 |
| retail | 19 | 1 | 20 |
| rubber | 31 | 9 | 40 |
| ship | 108 | 36 | 144 |
| strategic-metal | 9 | 6 | 15 |
| sugar | 97 | 25 | 122 |
| tea | 2 | 3 | 5 |
| tin | 17 | 10 | 27 |
| trade | 251 | 75 | 326 |
| veg-oil | 19 | 11 | 30 |
| wpi | 14 | 9 | 23 |
| zinc | 8 | 5 | 13 |
| Total | 6532 | 2568 | 9100 |

| WebKB class | # train docs | # test docs | Total # docs |
|---|---|---|---|
| project | 336 | 168 | 504 |
| course | 620 | 310 | 930 |
| faculty | 750 | 374 | 1124 |
| student | 1097 | 544 | 1641 |
| Total | 2803 | 1396 | 4199 |

| 20 Newsgroups | Train | Test |
|---|---|---|
| # documents | 11293 docs | 7528 docs |
| all-terms | 20ng-train-all-terms 15.91 Mb | 20ng-test-all-terms 10.31 Mb |
| no-short | 20ng-train-no-short 14.06 Mb | 20ng-test-no-short 9.12 Mb |
| no-stop | 20ng-train-no-stop 10.59 Mb | 20ng-test-no-stop 6.86 Mb |
| stemmed | 20ng-train-stemmed 9.46 Mb | 20ng-test-stemmed 6.13 Mb |

| Reuters-21578 | R8 Train | R8 Test | R52 Train | R52 Test |
|---|---|---|---|---|
| # documents | 5485 docs | 2189 docs | 6532 docs | 2568 docs |
| all-terms | r8-train-all-terms 3.20 Mb | r8-test-all-terms 1.14 Mb | r52-train-all-terms 4.08 Mb | r52-test-all-terms 1.45 Mb |
| no-short | r8-train-no-short 2.90 Mb | r8-test-no-short 1.03 Mb | r52-train-no-short 3.71 Mb | r52-test-no-short 1.32 Mb |
| no-stop | r8-train-no-stop 2.42 Mb | r8-test-no-stop 0.86 Mb | r52-train-no-stop 3.08 Mb | r52-test-no-stop 1.09 Mb |
| stemmed | r8-train-stemmed 2.13 Mb | r8-test-stemmed 0.76 Mb | r52-train-stemmed 2.71 Mb | r52-test-stemmed 0.96 Mb |

| WebKB | Train | Test |
|---|---|---|
| # documents | 2803 docs | 1396 docs |
| stemmed | webkb-train-stemmed 2.40 Mb | webkb-test-stemmed 1.20 Mb |

All of these are text files containing one document per line.
Each document is composed of its class and its terms.
Each line starts with a single "word" giving the document's class, followed by a TAB character and then a sequence of "words" separated by spaces, which are the terms contained in the document.
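
The line format described above can be read with a few lines of Python. The following is only a minimal sketch: the function names `parse_documents` and `class_counts` are hypothetical, and the sample lines are made up for illustration rather than taken from the actual files. To read a real file, the lines of an opened file object can be passed in place of the sample list; the file's character encoding is not stated here, so it would have to be chosen as an assumption.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def parse_documents(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (class_label, terms) for each non-empty line.

    Assumes the format described above: class, TAB, space-separated terms.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines, e.g. a trailing newline at end of file
        # Split on the first TAB only; everything after it is the term list.
        label, _, text = line.partition("\t")
        yield label, text.split()


def class_counts(lines: Iterable[str]) -> Counter:
    """Count documents per class, as in the distribution tables."""
    return Counter(label for label, _ in parse_documents(lines))


# Made-up sample lines in the described format (not real dataset content).
sample = [
    "earn\tnet profit rise quarter\n",
    "acq\tcompany complete sale unit\n",
    "earn\tdividend payout unchanged\n",
]

docs = list(parse_documents(sample))
counts = class_counts(sample)
print(docs[0])
print(counts.most_common())
```

Running `class_counts` over a training file and comparing the result with the corresponding per-class table is a quick sanity check that a download is complete.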
Last updated April 2007.