Raw data / API?

Archenior #655

Hi everyone,

I'm currently working on a little program that generates new words based on a corpus of words. The main problem is that I lack a corpus! I'm a big fan of Tolkien, and I wish I had access to a corpus of Quenya or Sindarin words, which is how I found your website.

The question is: is there any access to the "raw" data, or an API, that the public can use to build on your titanic work? All I need is a very big list of words to analyze.

For those interested, here is the link to the program (I hope that's permitted here). It uses a Markov chain to gather statistics: github.com
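For readers unfamiliar with the technique, here is a minimal sketch of a letter-level Markov chain word generator. It is not the linked program's code: the function names, the `order` parameter, the padding symbols, and the tiny sample word list are all assumptions made for illustration.

```python
import random
from collections import defaultdict

START, END = "^", "$"  # padding symbols (assumed not to occur in the corpus)

def build_model(words, order=2):
    """Map each `order`-letter context to the list of letters seen after it."""
    model = defaultdict(list)
    for word in words:
        padded = START * order + word.lower() + END
        for i in range(len(padded) - order):
            model[padded[i:i + order]].append(padded[i + order])
    return model

def generate_word(model, order=2, max_len=12, rng=random):
    """Walk the chain from the start context until END or max_len letters."""
    context, letters = START * order, []
    while len(letters) < max_len:
        nxt = rng.choice(model[context])  # frequent followers are picked more often
        if nxt == END:
            break
        letters.append(nxt)
        context = context[1:] + nxt
    return "".join(letters)

# Tiny hand-typed sample; a real run would load a full word list instead.
corpus = ["dagor", "galad", "aran", "adan", "amon", "barad"]
model = build_model(corpus)
rng = random.Random(0)
print([generate_word(model, rng=rng) for _ in range(5)])
```

A higher `order` makes the output stick closer to the corpus (and reproduce original words more often); a lower one gives more variety but less natural-looking words.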

Well anyway, thank you all for the good work you're doing here, it's amazing!

Thank you for reading :) ! Valar morghulis (oh no, wrong universe!)

Edit: Well, I managed to use some of your docs (www.eldamo.org) to get corpora. They are available on GitHub in the "source_file" directory, feel free to use them ;) !

Paul Strack #656

This isn’t an inherently bad idea, but there is more “depth” to Tolkien’s languages than in most conlangs. In particular, he considered the full phonological history of his languages from an imagined “Primitive Elvish” language, much the same way that modern European languages evolved from Proto-Indo-European.

Simple analysis of sound frequencies and letter combinations wouldn’t accurately reflect this depth. The bar for creating new words in Elvish is just “higher” than it is in most fictional languages.

EDIT: the XML data files for Eldamo have a very liberal Creative Commons license, so feel free to use them for experiments like this.
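As a rough sketch of how a word list could be pulled out of such an XML file with Python's standard library: the element name `word`, the attributes `l` (language) and `v` (word form), and the language codes used below are assumptions for illustration only, so check them against the actual data file before relying on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: element names, attribute names, and language codes
# are assumptions, not a verified copy of the Eldamo schema.
SAMPLE = """
<word-data>
  <word l="s" v="dagor"/>
  <word l="q" v="nac-"/>
  <word l="p" v="√NDAK"/>
</word-data>
"""

def words_for_language(xml_text, lang_code):
    """Return the `v` attribute of every <word> whose `l` matches lang_code."""
    root = ET.fromstring(xml_text.strip())
    return [w.get("v") for w in root.iter("word") if w.get("l") == lang_code]

print(words_for_language(SAMPLE, "s"))  # ['dagor']
```

For a file on disk, `ET.parse(path).getroot()` would take the place of `ET.fromstring`.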

Ríon Gondremborion #657

I'm not any sort of authority figure here, just a member of the community trying to offer some advice. If you're planning on using this program to generate new words for Quenya and Sindarin, I'd recommend keeping those words for your own use and not spreading them as "neologisms" for either Elvish language.

Generating "new" vocabulary for either language is not a matter of randomly coming up with a word that fits the phonology and assigning a meaning to it. It requires researching the Proto-Elvish roots that Tolkien created, adding suffixes, other affixes, and further changes to the exact root to produce an archaic word, then walking that word through the relevant sound changes to reach the target language. For example, the Quenya verb "nac-" and the Sindarin verb "(n)dag-" come from the same Proto-Elvish root, "√NDAK". This is also why many words of related meaning sound similar, such as Sindarin dagor, "battle". For more detail, check out √NDAK's entry on Eldamo.
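The "root, then ordered sound changes" idea can be sketched as a list of rewrite rules applied in sequence. The rules below are toy rules, chosen only so that the √NDAK example above comes out as "nac" and "dag"; they are not the actual phonological history documented on Eldamo, and the names `QUENYA_RULES`, `SINDARIN_RULES`, and `derive` are made up for this sketch.

```python
import re

# Toy rules (assumptions): each pair is (regex pattern, replacement),
# applied in order. Quenya "c" is just the spelling of the k sound.
QUENYA_RULES = [(r"^nd", "n"), (r"k", "c")]
SINDARIN_RULES = [(r"^nd", "d"), (r"(?<=[aeiou])k", "g")]

def derive(root, rules):
    """Apply each sound-change rule in order to a primitive root."""
    word = root.lower()
    for pattern, replacement in rules:
        word = re.sub(pattern, replacement, word)
    return word

print(derive("NDAK", QUENYA_RULES))    # nac
print(derive("NDAK", SINDARIN_RULES))  # dag
```

The order of the rules matters: a change that fires early can create or remove the environment a later rule depends on, which is exactly what a purely statistical generator has no way of knowing.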

It is this level of evolution and history in Tolkien's Elvish languages that separates them from many other conlangs, such as your example of High Valyrian. For conlangs like those, this method (I'm presuming it mulls over the corpus, generates a list of phonological rules, then applies those rules to piece together new words) would work well. It just doesn't fit Tolkien's languages, because these words would have no etymology.

Once again, I'm not putting down your work in any way. Your program would be of massive use to other conlangers; it just doesn't fit Elvish.

With best wishes,

Ríon Gondremborion

Archenior #662

Thank you for answering so thoroughly!

Well, no need to worry: my purpose is not to publish new Sindarin words at all. I just need huge corpora of words to test my program and see if the results "sound good". As a matter of fact, I'm also looking for corpora in non-constructed languages, such as French, English, Finnish, etc. The final goal is to create an algorithm capable of leaving someone unsure whether a word is an obscure word of their native language or a procedurally generated one.

There is no real application for this, except extending constructed languages such as High Valyrian, if what you say about it is true, but I'm just doing it because I love both words and algorithms :')

I hope no one felt that I was stating that "Elvish words are just random letters vaguely assembled", because that was obviously not what I had in mind!

Once again, great work and a great attitude in this community, I'll keep an eye on this place! Best regards, Archenior

Aldaleon #664

This sounds very interesting indeed! Can you share some of the application's output with us?

Thanks,
Aldaleon

Archenior #671

I can, here is an output based on Sindarin words: github.com But I've got to admit that the output does not have the "organic" feel of the corpus.

This output is composed of words between 5 and 10 letters long (100 words of each length), and among the generated words, only a few seem real :) (And they may be in the original corpus, because the program often recomposes original words.)
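One way to build such a list while leaving out words that are already in the corpus is sketched below. It is not the linked program's code: `collect_by_length`, its parameters, and the random stand-in generator are assumptions made for illustration, and any function that returns one candidate word per call could be plugged in as `generate`.

```python
import random

def collect_by_length(generate, corpus, lengths=range(5, 11),
                      per_length=100, max_tries=100_000):
    """Collect up to `per_length` unique generated words for each length,
    skipping any word that already exists in the corpus."""
    known = {w.lower() for w in corpus}
    buckets = {n: set() for n in lengths}
    for _ in range(max_tries):
        if all(len(b) >= per_length for b in buckets.values()):
            break
        word = generate()
        if word in known:
            continue  # a recomposed original word, not a new one
        bucket = buckets.get(len(word))
        if bucket is not None and len(bucket) < per_length:
            bucket.add(word)
    return {n: sorted(b) for n, b in buckets.items()}

# Stand-in generator (random letters), used only to make the sketch runnable.
rng = random.Random(1)
def toy_generate():
    return "".join(rng.choice("adegnor") for _ in range(rng.randint(4, 11)))

result = collect_by_length(toy_generate, corpus=["dagor"], per_length=3)
for length, words in result.items():
    print(length, words)
```

The `max_tries` cap keeps the loop from running forever when the generator cannot produce enough distinct words of some length.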

Archenior

Ríon Gondremborion #672

They do fit within known Sindarin phonology pretty darn well; I'd be curious to see how it would do with the more restrictive phonology of Quenya.

Aldaleon #673

I agree with Ríon! I am very impressed!

Archenior #674

Thank you for the positive comments :) Here is the Quenya output if you want to compare ^^ github.com

Paul Strack #675

The results are better than I expected, but I do see invalid words in both the Quenya and Sindarin lists. There are invalid Quenya initial clusters like mm- or nt-, and there are long vowels and diphthongs in final syllables and before consonant clusters where they should not occur. From the Sindarin list, the diphthong au appears in some polysyllables as well as final nn and ss, neither of which are possible.

Still, from eyeballing the results, probably at least 95% of the words on the list are phonetically possible.
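The clear-cut constraints mentioned above could be turned into a simple filter, sketched here. It covers only the rules named in this post (initial mm- and nt- in Quenya; final nn and ss and au in polysyllables in Sindarin). The function names and test strings are made up, and counting vowel groups is only a rough stand-in for counting syllables.

```python
import re

# Rough syllable heuristic (assumption): one run of vowels = one syllable.
VOWEL_GROUP = re.compile(r"[aeiouyáéíóúýâêîôûŷ]+")

def quenya_problems(word):
    """Flag the invalid initial clusters named in the post."""
    w = word.lower()
    return [f"initial {c}-" for c in ("mm", "nt") if w.startswith(c)]

def sindarin_problems(word):
    """Flag final nn/ss and the diphthong au in words of several syllables."""
    w = word.lower()
    problems = [f"final -{c}" for c in ("nn", "ss") if w.endswith(c)]
    if "au" in w and len(VOWEL_GROUP.findall(w)) > 1:
        problems.append("au in a polysyllable")
    return problems

def share_valid(words, check):
    """Fraction of words for which `check` reports no problem."""
    return sum(1 for w in words if not check(w)) / len(words)

print(quenya_problems("ntara"))     # ['initial nt-']
print(sindarin_problems("aurann"))  # ['final -nn', 'au in a polysyllable']
print(share_valid(["naur", "galass", "dagor", "aurann"], sindarin_problems))  # 0.5
```

Running `share_valid` over a generated list gives a number to set beside the eyeballed estimate, though only for the handful of rules encoded here.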