Introduction: Here are some initial comparisons between a batch of ATN vols. 20.1-21.2 with unclean OCR'd data and a batch of the same volumes with clean OCR'd data. This preliminary analysis will help us determine whether we need to take the extra step of cleaning the data for each volume or whether we can use the unclean OCR'd data for the topic modeling stages.

I. Textual Overview:

A. Summary

unclean data: This corpus has 5 documents with 85,052 total words and 11,350 unique word forms.
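As a rough cross-check on these figures, the counts can be approximated outside of Voyant. The sketch below is only an illustration: the `summarize` function and the toy documents are hypothetical, and the tokenizer (lowercasing, then splitting on runs of letters and digits) is an assumption that may not match Voyant's own tokenization.

```python
import re
from collections import Counter


def summarize(texts):
    """Return (documents, total words, unique word forms) for a list of strings."""
    counts = Counter()
    for text in texts:
        # Assumed tokenizer: lowercase, then runs of letters/digits.
        counts.update(re.findall(r"[^\W_]+", text.lower()))
    return len(texts), sum(counts.values()), len(counts)


# Hypothetical toy batches; "Tbe" stands in for a typical OCR misreading.
unclean = ["The Trail Conference met", "Tbe trail was re-blazed"]
clean = ["The Trail Conference met", "The trail was re-blazed"]

for label, docs in (("unclean", unclean), ("cleaned", clean)):
    n_docs, total, unique = summarize(docs)
    print(f"{label}: {n_docs} documents, {total:,} total words, "
          f"{unique:,} unique word forms")
```

In this toy case the total word count is the same for both batches, but the misread word adds one extra unique word form to the unclean batch. Comparing the unique word form counts of the two real batches is one simple way to gauge how much noise the uncleaned OCR introduces.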

2. Trends

[Figure: Trends graph, unclean data]

[Figure: Trends graph, cleaned data]

3. Cirrus

[Figure: Cirrus word cloud, unclean data]

[Figure: Cirrus word cloud, cleaned data]

II. Vols. 17.2-21.2

Summary:

Distinctive words (compared to the rest of the corpus):

  1. 19.2 1958: yes (63), recommended (76), fair (48), secs (34), renewal (43).
  2. V17.2 1956: yes (64), secs (45), recommended (71), renewed (58), fair (33).
  3. V17.3 1956: wheel (18), counter (13), hawks (7), rentschler (10), shears (13).
  4. V19.1 1958: lettering (13), murder (7), stencils (6), stencil (6), kangchenjunga (6).
  5. V19.3 1958: million (21), resource (11), ownership (16), capacity (9), recreation (61).
  6. V20.1 1959: english (10), evergreens (12), pepper (5), needle (5), brushcutter (5).
  7. V20.2 1959: yes (57), 1959 (91), 1958 (125), markers (123), secs (37).
  8. V20.3 1959: knowles (6), attack (6), frames (5), hurd (9), wildflowers (4).
  9. V21.1 1960: conte (13), cordage (8), plants (25), le (13), gasoline (9).
  10. V21.2.1960 (1): 1960 (86), yes (74), 1959 (118), markers (132), recommended (77).
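To make the idea of "distinctive words" more concrete, here is a minimal sketch that ranks each document's words by a TF-IDF style score and reports the raw count in parentheses, as in the list above. This is an assumption about how such a ranking could be produced, not a description of Voyant's exact formula; the `distinctive_words` function, the tokenizer, and the toy documents are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase, then runs of letters/digits.
    return re.findall(r"[^\W_]+", text.lower())


def distinctive_words(docs, top_n=5):
    """docs maps title -> text; returns title -> [(word, raw count), ...]."""
    counts = {title: Counter(tokenize(text)) for title, text in docs.items()}
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter()
    for word_counts in counts.values():
        doc_freq.update(word_counts.keys())
    n_docs = len(docs)
    result = {}
    for title, word_counts in counts.items():
        total = sum(word_counts.values())
        # Relative frequency in this document times log inverse document frequency.
        scores = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in word_counts.items()
        }
        # Highest score first; ties broken alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[title] = [(w, word_counts[w]) for w in ranked[:top_n]]
    return result


# Hypothetical toy corpus, not the ATN data.
toy = {
    "A": "trail markers markers yes yes yes",
    "B": "trail wheel wheel counter",
    "C": "trail stencil stencil stencil murder",
}
for title, words in distinctive_words(toy, top_n=2).items():
    print(title, ", ".join(f"{w} ({n})" for w, n in words))
```

A word that appears in every document, such as "trail" in the toy corpus, scores zero and never surfaces as distinctive. Under this kind of scoring, OCR misreadings that occur in only one volume would tend to rank highly, which is one reason the unclean and cleaned batches may produce different lists.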

Trends:


With top 5 words: trail, Appalachian, club, good, work

Cirrus -> Corpus Tools -> Topics


creation of lean-tos

membership information

condition of trails

mountain road completion

reblazing projects

relocation of trails due to fire

Appalachian guidebooks and maps

B.
