What makes a word sound distinctly French? Can we capture the essence of a language’s phonology and morphology to generate words that don’t exist but feel authentic?
I built a system that generates imaginary French words from the statistical patterns found in existing French vocabulary. The resulting words are entirely novel, yet they sound plausibly French.
The generation process follows a straightforward but powerful approach:
This method, based on N-gram modeling, captures the implicit rules of French word construction without any of them being programmed explicitly. The algorithm learns from authentic French words how often each letter follows a given sequence of preceding letters.
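As an illustration, here is a minimal sketch of a character-level N-gram generator in Python. It is one common way to build such a model and is not necessarily how the project's code is written: the function names `train` and `generate`, the `^` and `$` padding markers, and the `max_len` cutoff are all assumptions made for this example.

```python
import random
from collections import defaultdict

START, END = "^", "$"  # hypothetical padding markers, assumed absent from the words

def train(words, n):
    """Map each context of n-1 characters to the characters observed after it."""
    model = defaultdict(list)
    for word in words:
        padded = START * (n - 1) + word.lower() + END
        for i in range(len(padded) - n + 1):
            context = padded[i:i + n - 1]
            # Duplicates are kept on purpose: frequent followers get picked more often.
            model[context].append(padded[i + n - 1])
    return model

def generate(model, n, rng, max_len=20):
    """Sample one word character by character until END or max_len is reached."""
    context = START * (n - 1)
    letters = []
    while len(letters) < max_len:
        nxt = rng.choice(model[context])
        if nxt == END:
            break
        letters.append(nxt)
        context = (context + nxt)[1:]  # slide the window by one character
    return "".join(letters)

if __name__ == "__main__":
    # Tiny toy corpus; a real run would use a full French word list.
    corpus = ["maison", "raison", "saison", "chanson", "garçon"]
    model = train(corpus, 3)
    rng = random.Random(0)
    print([generate(model, 3, rng) for _ in range(5)])
```

Every sequence of N consecutive characters in a generated word also occurs somewhere in the training words, which is why the output keeps a French feel even when the whole word is new.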
The single parameter N significantly influences the balance between creativity and authenticity. As N increases, the word generation becomes more constrained by longer sequences of characters from the training corpus:
This reveals an interesting tension in the generation process: with lower N values, we get more creativity but less linguistic authenticity; with higher values, we gain authenticity but sacrifice novelty as the algorithm begins reproducing existing words.
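One way to make this tension measurable is to count, for each context of N-1 characters, how many distinct characters can follow it. The sketch below is an assumption-laden illustration rather than part of the project: `mean_branching` is a hypothetical helper, and the toy corpus stands in for a real word list. When the average approaches 1, almost every context has a single possible continuation, so the generator can do little more than replay training words.

```python
from collections import defaultdict

def mean_branching(words, n):
    """Average number of distinct next characters per context of length n-1."""
    followers = defaultdict(set)
    for word in words:
        padded = "^" * (n - 1) + word + "$"  # same hypothetical padding as a generator would use
        for i in range(len(padded) - n + 1):
            followers[padded[i:i + n - 1]].add(padded[i + n - 1])
    return sum(len(chars) for chars in followers.values()) / len(followers)

if __name__ == "__main__":
    corpus = ["maison", "raison", "saison", "chanson", "garçon"]
    for n in range(2, 7):
        print(n, round(mean_branching(corpus, n), 2))
```

On a real corpus the same downward trend is to be expected as N grows, although the exact values depend on the size and variety of the word list.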
You can explore larger collections of generated words here:
The generated words reveal interesting aspects of French morphology. Many end in common French suffixes (-tion, -ment, -eux, -er) and preserve phonological patterns typical of the language. The algorithm implicitly learns regularities in consonant clustering, vowel sequencing, and syllable structure that make French sound distinctive.
Some of my favorite generated words seem to suggest meanings based on their morphological resemblance to existing French vocabulary:
The code is readily adaptable to other languages, requiring only a dataset of words from the target language. The README provides detailed steps for this adaptation.
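The README remains the reference for the actual steps. Purely as an illustration of what preparing such a dataset might involve, the sketch below assumes a UTF-8 text file with one word per line; the `load_words` helper and the file name in the usage comment are hypothetical and not taken from the project.

```python
import unicodedata
from pathlib import Path

def load_words(path):
    """Read one word per line and return a sorted, de-duplicated, lowercase list."""
    words = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        # NFC merges decomposed accents (e + combining acute) into single characters.
        word = unicodedata.normalize("NFC", line.strip()).lower()
        # Keep letters plus hyphen and apostrophe; drop blanks, numbers, and the like.
        if word and all(ch.isalpha() or ch in "-'" for ch in word):
            words.add(word)
    return sorted(words)

# Hypothetical usage:
# words = load_words("italian_words.txt")
```

Normalizing accents and case up front matters for a character-level model, because otherwise visually identical letters are counted as different symbols.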
This approach to word generation has potential applications in:
Feel free to experiment with the code and generate your own linguistic inventions across different languages and parameter settings.