Corpus CusyA - Marie de Sade, Dr. Olaf Hoffmann, Authors ‚A, B, C, D, E, F, H‘ (poetry books to read txt) 📗
- Author: Marie de Sade, Dr. Olaf Hoffmann, Authors ‚A, B, C, D, E, F, H‘
First, the length of each paragraph in characters is determined. Punctuation marks are counted and used to identify sentence boundaries, and the paragraph is split into sentences. Superfluous blanks are removed and the length of each sentence is determined.
Each sentence is then split into words, spaces and punctuation are removed, the word length is analysed, and the incidence of each word is determined.
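The steps described above can be sketched roughly as follows. This is only a minimal illustration, not the script actually used for the corpus: the function name analyse_paragraph, the sentence-ending characters '.', '!' and '?', and the rule that a word is a run of letters are all assumptions made for the example, since CusyA uses its own markers.

```python
import re
from collections import Counter

# Assumption: sentences end with '.', '!' or '?' followed by whitespace.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")
# Assumption: a word is a run of letters; digits and punctuation are dropped.
WORD = re.compile(r"[^\W\d_]+")


def analyse_paragraph(paragraph):
    """Return character, sentence and word statistics for one paragraph."""
    # Collapse excessive blanks before measuring sentence lengths.
    cleaned = re.sub(r"\s+", " ", paragraph).strip()
    sentences = [s for s in SENTENCE_BOUNDARY.split(cleaned) if s]
    words_per_sentence = []
    word_lengths = []
    incidence = Counter()
    for sentence in sentences:
        words = [w.lower() for w in WORD.findall(sentence)]
        words_per_sentence.append(len(words))
        word_lengths.extend(len(w) for w in words)
        incidence.update(words)
    return {
        "paragraph_length": len(paragraph),  # measured on the raw paragraph
        "sentence_lengths": [len(s) for s in sentences],
        "words_per_sentence": words_per_sentence,
        "word_lengths": word_lengths,
        "incidence": incidence,
    }


result = analyse_paragraph("The cat sat.  The cat ran!  Why?")
print(result["sentence_lengths"], result["incidence"].most_common(2))
```

For a syllabary, the word length would be counted in glyphs rather than in letters, but the structure of the procedure stays the same.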
CusyA is a syllabic script, not an alphabetic one. In addition, CusyA makes intensive use of markers for grammatical features. Some markers are independent structures, for example for paragraphs, verses and lines; others are components of words, which consequently consist of markers and a word core. The general analysis program used here does not break the markers down further, which means that it does not yet interpret the text at the level of content. Separating the markers could refine the analysis, but it would also narrow the point of view, which is to be avoided at this stage.
After the document has been analysed and reduced in this way, the resulting incidence lists are examined: mean value, standard deviation and skewness are calculated, and the distribution functions are visualised.
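As an illustration of these first statistical moments, a minimal sketch follows. It assumes the population forms of variance and skewness (division by n); the text does not state which estimator was actually used, and the function name moments is chosen only for this example.

```python
import math


def moments(values):
    """Return (mean, standard deviation, skewness) of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    # Assumption: population variance (divide by n, not n - 1).
    variance = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(variance)
    if std == 0:
        # All values identical: the distribution has no asymmetry.
        return mean, std, 0.0
    # Skewness as the third central moment divided by std cubed.
    skewness = sum((v - mean) ** 3 for v in values) / n / std ** 3
    return mean, std, skewness


# Hypothetical list of word lengths with a long tail to the right.
print(moments([1, 1, 1, 5]))
```

A positive skewness indicates a tail towards larger values, as is typical for word and sentence lengths.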
Quantitative and Comparative Text Analysis: The Examined Works of the Corpus
The corpus therefore contains several works that have not been decoded. Headings and author names can be assigned, with the exception of the work on numbers and basic arithmetic. In the following, the individual works and authors are referred to by single letters, purely to simplify discussion and designation. The order of the works in this corpus has no further meaning and is to be regarded as arbitrary. No contextual connection between the works is known, nor is there a given order.
Unfortunately, there are also no hints that would allow a chronological classification, so it remains completely unknown whether the works refer to each other in a way that would suggest a meaningful arrangement. The assignment of the code letters is therefore arbitrary.
The following tables give data for the complete corpus.
The texts examined each have a title page with information on the titles and the authors. To simplify the designation, placeholders in Latin script are used. The works are in order of arrangement:
In the following, the statistical data on the works are listed in tabular form. In addition to the entries for the individual works, the data for all works are also shown together.
To characterise the works of the corpus, it is useful first to get a rough overview of their extent. The following table gives the file size, the number of characters and a first estimate of the number of words (without the fine corrections that a deeper analysis of the content would allow).
Further data for characterisation are the numbers of chapters, paragraphs, sentences, words and glyphs. These are summarised in the following table.
It should be noted that B is a poetic work, which means that it contains paragraphs only on the title page. Therefore, in this work the verses were counted as paragraphs, which is not quite correct in content and semantics, but statistically similar enough to justify this procedure.
It is striking that the number of chapters consists of 1 (the title page) and a power of two.
After looking at the raw data, it is also helpful to take a closer look at the text structure and how many different words a work is composed of. Obviously, this is related to the total length of the work. However, the connection is not a simple proportionality, so that the relationship between different words and the total number of words can only be used to a limited extent to determine a characteristic feature of a text.
It is noticeable, however, that the ratio is relatively high in comparison with typical longer works in known languages.
For known languages, this ratio is usually below 0.1. The deviation here is probably also due to the intensive use of markers as prefixes: the same word cores are combined with them again and again, but are counted as new words in the statistics because of the different prefixes.
In work D, for example, all verb tenses occur, but in a particular systematic arrangement. Because of the tense prefixes, the same verb in a different tense is counted as a new word, so that verbs are hardly repeated at all across the different chapters.
However, the value for the ratio for all works together is clearly below that of the individual works, which at least supports the hypothesis that the same language is used in all works.
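The effect of marker prefixes on the ratio of different words to total words can be illustrated with a small sketch. The word cores, the marker prefixes and the hyphen that separates them are purely hypothetical and serve only to show the mechanism; they are not taken from CusyA.

```python
def type_token_ratio(words):
    """Ratio of different words (types) to the total number of words (tokens)."""
    return len(set(words)) / len(words)


# Hypothetical word cores and marker prefixes, joined by '-' for illustration.
cores = ["ka", "mo", "ti"]
markers = ["A-", "B-", "C-"]

# Every core occurs with every marker, and the whole list occurs twice.
with_markers = [m + c for c in cores for m in markers] * 2
# The same text with the markers stripped off again.
cores_only = [w.split("-", 1)[1] for w in with_markers]

print(type_token_ratio(with_markers))  # markers counted as part of the word
print(type_token_ratio(cores_only))    # word cores only

# Two works sharing the same vocabulary: the pooled ratio drops.
pooled = with_markers + with_markers
print(type_token_ratio(pooled))
```

In this toy example the ratio is three times higher when the markers are counted as part of the word, and pooling two works with a shared vocabulary lowers the ratio, in line with the observation for all works together.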
In computer science, the information content or information density of a work is a quantity that characterises it much better. It is measured in Shannon. Roughly speaking, this number indicates how many bits per symbol are sufficient to encode the work at the respective level of abstraction. For books, glyphs and words are suitable levels of abstraction.
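A minimal sketch of how such a value can be computed is given below, assuming the usual Shannon entropy of the relative incidences, H = -sum(p * log2(p)). The function name shannon_entropy is chosen for this example; the same function serves both levels of abstraction, depending on whether it is given glyphs or words.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Shannon entropy in bits per symbol of a non-empty sequence."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


text = "the cat sat on the mat and the cat ran"

# Glyph level: every character, including the blank, is one symbol.
print(shannon_entropy(text))
# Word level: every word is one symbol.
print(shannon_entropy(text.split()))
# Upper bound for 128 equally frequent symbols, for comparison.
print(shannon_entropy(range(128)))
```

The value is largest when all symbols are equally frequent and falls as soon as some symbols dominate or are rarely used.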
If the glyphs of an alphabetic script are considered, monolingual works usually use about one hundred different glyphs. Some of these are redundant or rarely used, so that instead of the maximum of just under seven Shannon for one hundred equally frequent glyphs, only four to five Shannon result. For comparison, the historical character set ASCII is encoded with 7 bits, corresponding to 128 characters. The slightly more extensive ISO encodings use 8 bits, corresponding to 256 characters.
In a multilingual work the value is of course higher. Likewise, the value for a syllabic script is significantly higher.
CusyA is a syllabary with more than 300 frequently used syllables and several other characters.
To be able to represent the characters of many different languages, Unicode in its UTF-8 encoding no longer uses a fixed number of bits per character, but between one and four bytes.
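The variable width of UTF-8 can be checked directly with a short sketch. The four sample characters are chosen only for illustration and have nothing to do with CusyA; the helper name utf8_length is likewise an assumption of this example.

```python
def utf8_length(character):
    """Number of bytes that UTF-8 needs for a single character."""
    return len(character.encode("utf-8"))


# Latin letter, accented letter, euro sign, musical G clef.
samples = ["A", "\u00e9", "\u20ac", "\U0001d11e"]
for character in samples:
    print(repr(character), utf8_length(character))
```

The number of bytes used for storage therefore says little about the information content of a text; that depends on how many different glyphs are actually used and how often.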
Typically a syllabary has more characters than an alphabetic script, up to several hundred. The glyph inventory of CusyA fits a syllabary well.
More interesting than the information content at the abstraction level of glyphs is the information content at the word level. Typically, a work in a natural language lies between ten and eleven Shannon. Values closer to eleven are to be expected for more complicated scientific works or for conspiracy theories, which also often use a scientific or pseudoscientific vocabulary.
A value clearly under ten Shannon probably indicates a rather simple language, on the one hand easily understandable, on the other hand perhaps not particularly demanding or impressive. This already shows that the information content alone is by no means a sign of quality, because it is quite possible that works with a relatively small information content are easy to read precisely because of this, and are therefore particularly suitable for relaxation. Perhaps they are also well suited to lulling the readership.
At the word level, too, multilingual works naturally have higher values.
With a Shannon value of over six for glyphs, CusyA proves to be a rich language or script with a relatively large number of syllables, most of which are actually used in the works, even relatively evenly. There are some limitations here primarily in the markers, some of which can be completely omitted in one work, others may occur regularly. This also explains the fact that the Shannon value of work D is even slightly higher than that of all works together.
On the one hand the large Shannon value is due to the fact that it is a syllabary, on the other hand certainly also to the fact that the language or writing represents grammar and structure briefly and concisely with markers, otherwise a larger value would probably be expected.
Since the Shannon values are quite similar at the glyph level, this supports the thesis that all works are written in the same language.
The situation is quite different at the word level. On the one hand, the Shannon values of the individual works vary considerably; on the other hand, at between 12.4 and 15.2 they are very high. The works therefore either represent very complex subject matter or are poetically very verbose or opulent, depending on how one wants to see it.
Even a superficial glance at the texts reveals the abundant use of adjectives and adverbs, and the manifold possibilities of object extensions may also contribute to the richness of the works. It may be presumed that the works are not simple entertainment literature similar to the so-called penny dreadfuls, but they contain much more information.
The variation of the Shannon values at word level also supports the thesis that these are different authors. The complexity could indicate that they are also outstanding works of CusyA culture or literature.
Since Z is a collection of loose sheets with small volume about numbers, basic arithmetic operations and formulae, the clear deviation from the other works is not surprising.
Quantitative and Comparative Text Analysis: Data and Results
The following tables show the results of the quantitative analysis of the CusyA works of the corpus. The first statistical moments for each characteristic quantity are given.
Because work B is a poem, it contains paragraphs only on the title page, so values related to paragraphs are not particularly meaningful for the work. Here, verses were generalised and counted as paragraphs.
Glyphs: Comparison Incidence
The following graph shows a comparison of the relative incidences of glyphs in the works of the language CusyA. The differences and characteristics of the different distributions are clearly visible.
In particular, the texts B and Z differ significantly from the other texts. Text B is poetry, that is, it consists of verses and strophic lines, so a special structure is plausible.
Work Z is the loose collection about numbers, so numerals naturally dominate there.
The differences between the other texts, all prose, but with a different structure at the level of chapters and subchapters, are much more subtle. The characteristic structure of a common language can already be seen in them. Similar glyph incidences thus support the hypothesis that the works are written in the same language.
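How similar two glyph distributions are can also be expressed as a single number. The following sketch uses the total variation distance between the relative incidences. This measure is an assumption introduced here for illustration and is not the method behind the graph; the sample strings are hypothetical and each character stands for one glyph.

```python
from collections import Counter


def relative_incidence(glyphs):
    """Map every glyph to its share of all glyphs in the sequence."""
    counts = Counter(glyphs)
    total = sum(counts.values())
    return {glyph: count / total for glyph, count in counts.items()}


def total_variation(p, q):
    """Distance between two distributions: 0 means identical, 1 means disjoint."""
    glyphs = set(p) | set(q)
    return 0.5 * sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in glyphs)


# Hypothetical works; every character stands for one glyph.
work_1 = relative_incidence("aabbccdd")
work_2 = relative_incidence("aabbccde")
work_3 = relative_incidence("11223344")

print(total_variation(work_1, work_2))  # similar glyph usage
print(total_variation(work_1, work_3))  # no glyph in common
```

A small distance between prose works and a large distance to a work dominated by numerals would correspond to what is visible in the graph.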
Different incidences of markers between individual works are plausible because, for example, different tenses are used. Poetic works also hardly use personal names.
Incidence of Word Lengths in Glyphs: Comparison of Distributions
The following graphic shows a comparison of the word lengths in glyphs in the works of the language CusyA. Here, too, the similarities, differences and characteristics of the different distributions are clearly recognisable.
Common to all is a compactness of the words, which indicates that compounds are avoided in CusyA. This is evidently observed even more consistently than in English, for example. Moreover, CusyA is a syllabary. Thus at most eleven syllables, usually fewer than six, including the markers, are sufficient to form a word.
The peak at a length of one syllable is due to single markers for paragraphs, verses, lines and question words.
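A distribution of word lengths in glyphs can be obtained along the following lines. In this sketch each word is represented as a tuple of syllable glyphs; the syllables themselves and the function name length_distribution are hypothetical placeholders, not actual CusyA data.

```python
from collections import Counter


def length_distribution(words):
    """Relative incidence of each word length, counted in glyphs."""
    lengths = Counter(len(word) for word in words)
    total = sum(lengths.values())
    return {n: lengths[n] / total for n in sorted(lengths)}


# Hypothetical words; each tuple element stands for one syllable glyph.
words = [
    ("ka",),              # a single marker counts as a word of length one
    ("mo", "ti"),
    ("ka",),
    ("su", "mo", "ti"),
]
print(length_distribution(words))
```

Single markers that stand alone contribute to the entry for length one, which is how the accumulation at one syllable arises.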
Incidence of Words per Sentence: Comparison of Distributions
The following graphic shows a comparison of the words per sentence in the works of the language CusyA. The similarities, differences and characteristics of the different distributions are clearly recognisable.
All in all, the texts show fairly even, harmonious distributions. Of course, the loose collection on numbers and mathematics Z differs considerably. Again, the poetic work B also