Introduction: Here are some initial comparisons between an unclean OCR'd batch of ATN vols. 20.1-21.2 and a batch of the same volumes with cleaned OCR'd data. This preliminary analysis will help us determine whether we need to take the extra step of cleaning the data for each volume or whether we can simply use unclean OCR'd data for the topic modeling stages.
I. Textual Overview:
A. Summary
unclean data: This corpus has 5 documents with 85,052 total words and 11,350 unique word forms.
B. Document length (words)
unclean data
- Longest: V21.1.1960 (18073); V21.2.1960 test volume (17913)
- Shortest: V20.2.1959 (15618); V20.1.1959 (16376)
cleaned data
- Longest: V21.2.1960 (1) (16454); V20.1 1959 (15811)
- Shortest: V20.2 1959 (14750); V20.3 1959 (15565)
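To make the length comparison concrete, here is a minimal Python sketch that pairs the word counts listed above for the volumes that appear in both batches and reports how many words the cleaning step removed. It assumes that "V21.2.1960 test volume" and "V21.2.1960 (1)" are the unclean and cleaned versions of the same volume, and that the counts in parentheses are word totals; the dictionary keys and the `relative_reduction` helper are illustrative names, not part of any tool output.

```python
# Word counts copied from the Longest/Shortest lists above.
# Only volumes that appear in BOTH batches can be paired here.
# Assumption: "V21.2.1960 test volume" (unclean) and "V21.2.1960 (1)"
# (cleaned) refer to the same volume.
unclean = {"V20.1": 16376, "V20.2": 15618, "V21.2": 17913}
cleaned = {"V20.1": 15811, "V20.2": 14750, "V21.2": 16454}


def relative_reduction(before: int, after: int) -> float:
    """Return the fraction of words removed by cleaning."""
    return (before - after) / before


reductions = {
    vol: relative_reduction(unclean[vol], cleaned[vol]) for vol in unclean
}

for vol, r in sorted(reductions.items()):
    print(f"{vol}: {unclean[vol]} -> {cleaned[vol]} words ({r:.1%} fewer)")
```

A consistent, modest reduction across volumes would suggest that cleaning mostly strips OCR noise rather than substantive text, though three pairs are too few to draw a firm conclusion.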
C. Vocabulary density
unclean data
- Highest: V21.1.1960 (0.247); V20.1.1959 (0.242)
- Lowest: V21.2.1960 test volume (0.213); V20.2.1959 (0.228)
cleaned data
- Highest: V21.1 1960 (0.222); V20.3 1959 (0.214)
- Lowest: V21.2.1960 (1) (0.172); V20.2 1959 (0.182)
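The Highest/Lowest ratios above (0.247, 0.213, and so on) read as vocabulary density values. Assuming density is computed as unique word forms divided by total words (a type-token ratio), the sketch below applies that formula to the corpus-level totals reported in the summary for the unclean batch. The `tokenize` helper is a deliberately simple stand-in and will not match the tokenizer of the analysis tool exactly.

```python
import re


def vocabulary_density(unique_word_forms: int, total_words: int) -> float:
    """Type-token ratio: unique word forms divided by total words."""
    return unique_word_forms / total_words


def tokenize(text: str) -> list[str]:
    """Hypothetical tokenizer: lowercase runs of letters, digits, apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def density_from_text(text: str) -> float:
    """Compute vocabulary density directly from raw text."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Corpus-level totals for the unclean batch, taken from the summary above.
corpus_density = vocabulary_density(11_350, 85_052)
print(f"unclean corpus density: {corpus_density:.3f}")
```

The corpus-level figure comes out lower than any single volume's density, which is expected: as more text is added, repeated words accumulate faster than new word forms. OCR errors tend to inflate density because each misread token counts as a new word form, which is consistent with the unclean values being higher than the cleaned ones.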
D. Average words per sentence
unclean data
- Highest: V21.1.1960 (24.0); V20.1.1959 (23.3)
- Lowest: V20.2.1959 (14.5); V21.2.1960 test volume (15.0)
cleaned data
- Highest: V21.1 1960 (22.0); V20.1 1959 (21.9)
- Lowest: V20.2 1959 (13.7); V21.2.1960 (1) (13.8)
E. Most frequent words in the corpus
cleaned data: trail (1045); appalachian (472); good (393); club (376); blazes (345)
F. Distinctive words (compared to the rest of the corpus)
unclean data
- V20.1.1959: english (8), pepper (5), needle (5), design (5), regardless (4).
- V20.2.1959: recommended (42), yes (50), 1958 (112), markers (101), poor (22).
- V20.3.1959: knowles (8), gasteiger (7), attack (7), products (5), noticed (5).
- V21.1.1960: cordage (11), bark (18), dye (16), unaka (9), thread (9).
- V21.2.1960 test volume: recommended (54), 1960 (80), yes (66), renewal (34), markers (117).
cleaned data
- V20.1 1959: english (10), pepper (5), needle (5), detail (5), brushcutter (5).
- V20.2 1959: recommended (70), yes (57), markers (123), renewed (53), poor (25).
- V20.3 1959: knowles (6), attack (6), vol (5), gasteiger (5), frames (5).
- V21.1 1960: le (13), conte (13), unaka (9), cordage (8), clarendon (8).
- V21.2.1960 (1): 1960 (86), recommended (77), yes (74), renewal (38), markers (132).
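One way to quantify how much the distinctive-word lists differ between the two batches is a Jaccard overlap per volume: the number of shared words divided by the number of words in either list. The sketch below uses the word lists above, pairing the two sets of lists by volume number. It assumes that "V21.2.1960 test volume" and "V21.2.1960 (1)" are the same volume, and it ignores the counts in parentheses; the `jaccard` helper and dictionary names are illustrative.

```python
# Distinctive words per volume, copied from the lists above (counts omitted).
unclean = {
    "V20.1": {"english", "pepper", "needle", "design", "regardless"},
    "V20.2": {"recommended", "yes", "1958", "markers", "poor"},
    "V20.3": {"knowles", "gasteiger", "attack", "products", "noticed"},
    "V21.1": {"cordage", "bark", "dye", "unaka", "thread"},
    "V21.2": {"recommended", "1960", "yes", "renewal", "markers"},
}
cleaned = {
    "V20.1": {"english", "pepper", "needle", "detail", "brushcutter"},
    "V20.2": {"recommended", "yes", "markers", "renewed", "poor"},
    "V20.3": {"knowles", "attack", "vol", "gasteiger", "frames"},
    "V21.1": {"le", "conte", "unaka", "cordage", "clarendon"},
    "V21.2": {"1960", "recommended", "yes", "renewal", "markers"},
}


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)


overlap = {vol: jaccard(unclean[vol], cleaned[vol]) for vol in unclean}

for vol, score in sorted(overlap.items()):
    shared = sorted(unclean[vol] & cleaned[vol])
    print(f"{vol}: Jaccard {score:.2f}, shared: {', '.join(shared)}")
```

With only five words per list, a single swapped word moves the score noticeably, so these values are best read as a rough indicator of where cleaning changes the picture rather than as a precise measure.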
II. Trends:
unclean data
cleaned data
III. Cirrus:
unclean data
cleaned data
PART II: Vols. 17.2-21.2
Summary:
Distinctive words (compared to the rest of the corpus):
- V19.2 1958: yes (63), recommended (76), fair (48), secs (34), renewal (43).
- V17.2 1956: yes (64), secs (45), recommended (71), renewed (58), fair (33).
- V17.3 1956: wheel (18), counter (13), hawks (7), rentschler (10), shears (13).
- V19.1 1958: lettering (13), murder (7), stencils (6), stencil (6), kangchenjunga (6).
- V19.3 1958: million (21), resource (11), ownership (16), capacity (9), recreation (61).
- V20.1 1959: english (10), evergreens (12), pepper (5), needle (5), brushcutter (5).
- V20.2 1959: yes (57), 1959 (91), 1958 (125), markers (123), secs (37).
- V20.3 1959: knowles (6), attack (6), frames (5), hurd (9), wildflowers (4).
- V21.1 1960: conte (13), cordage (8), plants (25), le (13), gasoline (9).
- V21.2.1960 (1): 1960 (86), yes (74), 1959 (118), markers (132), recommended (77).
Trends:
With top 5 words: trail, appalachian, club, good, work
Cirrus -> Corpus Tools -> Topics
creation of lean-tos
membership information
condition of trails
mountain road completion
reblazing projects
relocation of trails due to fire
Appalachian guidebooks and maps
B.