Cédric Colas

Charabia

What makes a word sound distinctly French? Can we capture the essence of a language’s phonology and morphology to generate words that don’t exist but feel authentic?

I built a system that generates imaginary French words with a character-level N-gram model. It uses the statistical patterns found in existing French vocabulary to produce new words that sound plausibly French.

The Algorithm Behind Word Creation

The generation process has three steps:

  1. First, I assembled a corpus of 142,362 French words drawn from books and movie subtitles, available from Lexique.org
  2. Next, for each context made of the last N characters, I counted how often each character comes next across the corpus
  3. Finally, I sampled from these transition statistics, one character at a time, to generate new character sequences that follow French patterns
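
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the repository: the function names, the `^` and `$` boundary symbols, and the tiny stand-in corpus are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

START = "^"  # hypothetical padding symbol marking the start of a word
END = "$"    # hypothetical symbol marking the end of a word


def build_transitions(words, n):
    """Count which character follows each context of n characters (n >= 1)."""
    transitions = defaultdict(Counter)
    for word in words:
        padded = START * n + word + END
        for i in range(n, len(padded)):
            context = padded[i - n:i]
            transitions[context][padded[i]] += 1
    return transitions


def generate_word(transitions, n, rng, max_len=30):
    """Sample one character at a time until END is drawn or max_len is reached."""
    context = START * n
    chars = []
    while len(chars) < max_len:
        counts = transitions[context]
        # Draw the next character in proportion to its observed count.
        next_char = rng.choices(list(counts), weights=list(counts.values()))[0]
        if next_char == END:
            break
        chars.append(next_char)
        context = (context + next_char)[-n:]
    return "".join(chars)


# Tiny stand-in corpus; the real one has 142,362 words.
corpus = ["maison", "raison", "saison", "maire", "paire", "saint", "main", "pain"]
model = build_transitions(corpus, n=2)
rng = random.Random(0)
samples = [generate_word(model, 2, rng) for _ in range(5)]
print(samples)
```

Padding each word with N start symbols lets the same table decide how words begin, and the end symbol lets the model decide when a word stops, so no separate length rule is needed.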

This method, based on N-gram modeling, captures the implicit rules of French word construction without explicitly programming them. The algorithm learns which letter combinations commonly follow others in authentic French words.
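
In symbols, one standard way to write the quantity such a model estimates is shown below. This is an assumption about the exact form: the post does not say whether any smoothing is applied, so the plain count ratio is used, with c_t the character at position t.

```latex
P(c_t \mid c_{t-N}, \dots, c_{t-1})
  = \frac{\operatorname{count}(c_{t-N} \dots c_{t-1}\, c_t)}
         {\sum_{c'} \operatorname{count}(c_{t-N} \dots c_{t-1}\, c')}
```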

Finding the Sweet Spot: The Parameter N

The single parameter N significantly influences the balance between creativity and authenticity. As N increases, the word generation becomes more constrained by longer sequences of characters from the training corpus:

  • At N=3, we begin finding words that have a French-like quality while remaining clearly invented: sairions, bouvirents, talpottes, musiant, plamusse
  • At N=4, the words become more convincingly French-sounding while still novel: paraignes, embalisme, racinations
  • At N=5, approximately half of the generated words actually exist in French dictionaries, while the rest are inventions that could easily pass for authentic: lamistes, paraphie, fallotaient, amarcoussiens

This reveals an interesting tension in the generation process: with lower N values, we get more creativity but less linguistic authenticity; with higher values, we gain authenticity but sacrifice novelty as the algorithm begins reproducing existing words.
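
One rough way to observe this trade-off is to measure, for several values of N, the share of generated words that already exist. The self-contained sketch below does this on a toy corpus. It uses membership in the training corpus as a stand-in for "exists in French dictionaries", which is an assumption, and every name in it is hypothetical.

```python
import random
from collections import Counter, defaultdict


def build_transitions(words, n):
    """Count which character follows each n-character context."""
    transitions = defaultdict(Counter)
    for word in words:
        padded = "^" * n + word + "$"  # "^" and "$" are assumed boundary symbols
        for i in range(n, len(padded)):
            transitions[padded[i - n:i]][padded[i]] += 1
    return transitions


def generate_word(transitions, n, rng, max_len=30):
    """Sample characters until the end symbol is drawn or max_len is reached."""
    context, chars = "^" * n, []
    while len(chars) < max_len:
        counts = transitions[context]
        nxt = rng.choices(list(counts), weights=list(counts.values()))[0]
        if nxt == "$":
            break
        chars.append(nxt)
        context = (context + nxt)[-n:]
    return "".join(chars)


def existing_fraction(corpus, n, num_samples, seed=0):
    """Fraction of generated words that already appear in the corpus."""
    rng = random.Random(seed)
    known = set(corpus)
    model = build_transitions(corpus, n)
    generated = [generate_word(model, n, rng) for _ in range(num_samples)]
    return sum(word in known for word in generated) / num_samples


# Toy corpus for illustration only.
corpus = ["maison", "raison", "saison", "maire", "paire", "saint", "main", "pain"]
for n in (1, 2, 3, 8):
    print(n, existing_fraction(corpus, n, num_samples=200))
```

Once N is at least as long as the longest word, each context pins down the whole prefix, so the model can only replay words it has seen and the fraction reaches 1.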

You can explore larger collections of generated words here:

Linguistic Patterns and Discoveries

The generated words reveal interesting aspects of French morphology. Many end in common French suffixes (-tion, -ment, -eux, -er) and maintain phonological patterns typical of the language. The algorithm implicitly learns rules about consonant clustering, vowel sequencing, and syllable structure that make French sound distinctive.
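
A quick way to check the suffix observation on a batch of generated words is to count word endings. The sketch below is illustrative only: the helper name is hypothetical, and the suffix list extends the four suffixes named above with a plural form and -isme by assumption.

```python
from collections import Counter


def suffix_counts(words, suffixes):
    """Count, for each suffix, how many words end with it."""
    counts = Counter()
    for word in words:
        for suffix in suffixes:
            if word.lower().endswith(suffix):
                counts[suffix] += 1
    return counts


# Generated words quoted in this post.
generated = ["sairions", "bouvirents", "talpottes", "musiant", "plamusse",
             "paraignes", "embalisme", "racinations", "Fignotter"]
# "tions" and "isme" are added by assumption; the rest come from the text.
suffixes = ["tion", "tions", "ment", "eux", "er", "isme"]
print(suffix_counts(generated, suffixes))
```

On a sample this small the counts say little; the same function applied to a few thousand generated words would give a more meaningful picture.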

Some of my favorite generated words seem to suggest meanings based on their morphological resemblance to existing French vocabulary:

  • Climatum: Sounds like a blend of climate and ultimatum
  • Fignotter: Suggests a verb meaning “to meticulously refine”
  • Craquille: Evokes something small and fragile that might crack
  • Glandalourdi: Has a wonderful rhythmic quality, perhaps describing someone sluggish
  • Couchottine: Sounds diminutive and cozy, perhaps a small sleeping place

Extensions and Applications

The code is readily adaptable to other languages, requiring only a dataset of words from the target language. The README provides detailed steps for this adaptation.
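
The README remains the reference for the actual steps. As a rough illustration of what preparing a word list for another language might involve, the sketch below assumes a plain-text file with one word per line. The file format, the file name `words_es.txt`, and the `load_words` helper are hypothetical, not taken from the repository.

```python
import tempfile
from pathlib import Path


def load_words(path):
    """Read one word per line, lowercase it, and keep purely alphabetic entries."""
    words = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        word = line.strip().lower()
        # isalpha() also drops hyphenated words and words with apostrophes.
        if word.isalpha():
            words.add(word)
    return sorted(words)


# Demo with a throwaway file standing in for a real word list.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "words_es.txt"
    sample.write_text("Casa\nperro\nperro\nniño\n3d\n\n", encoding="utf-8")
    words = load_words(sample)
print(words)
```

Deduplicating and lowercasing keeps the transition counts from being skewed by repeated entries or capitalized variants; whether to keep hyphens and apostrophes depends on the target language.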

This approach to word generation has potential applications in:

  • Creative naming for products, companies, or characters
  • Linguistic research on morphological patterns
  • Creative writing and constructed languages
  • Educational tools for understanding language structure

Feel free to experiment with the code and generate your own linguistic inventions across different languages and parameter settings.