Cédric Colas

Academic and personal website.

Charabia

Code

I built a piece of code to generate imaginary French words. The idea is simple:

  1. take a corpus of French words;
  2. compute the transition statistics conditioned on the last N characters;
  3. sample from that transition matrix.
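The three steps can be sketched in a few lines of Python. This is a minimal illustration rather than the actual code of the project: the function names, the `^` and `$` boundary markers, and the `max_len` cap are assumptions made for the example, and N is assumed to be at least 1.

```python
import random
from collections import Counter, defaultdict

# Hypothetical boundary markers, assumed never to appear in the corpus itself.
START, END = "^", "$"

def build_transitions(words, n):
    """For each context of n previous characters, count which character follows."""
    transitions = defaultdict(Counter)
    for word in words:
        padded = START * n + word + END
        for i in range(n, len(padded)):
            transitions[padded[i - n:i]][padded[i]] += 1
    return transitions

def sample_word(transitions, n, rng=random, max_len=30):
    """Draw characters one at a time until the end marker comes up (n >= 1)."""
    context, letters = START * n, []
    while len(letters) < max_len:
        counts = transitions[context]
        # Sample the next character in proportion to its observed count.
        nxt = rng.choices(list(counts), weights=list(counts.values()))[0]
        if nxt == END:
            break
        letters.append(nxt)
        context = (context + nxt)[-n:]  # keep only the last n characters
    return "".join(letters)

# Toy corpus standing in for the real word list.
toy_words = ["chat", "chaton", "chanter"]
toy_transitions = build_transitions(toy_words, 2)
print(sample_word(toy_transitions, 2))
```

With such a tiny corpus and N=2, the sampler can only reproduce the three input words; invented words appear once the corpus is large enough for contexts to be shared across many different words.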

I used a word list drawn from French books and movie subtitles, available here (142,362 words).

The only parameter is N. As N increases, the generation becomes more constrained; words start to sound more French, but a higher proportion of them end up being existing French words.
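One way to see this constraint numerically is to count how many distinct characters can follow each context, on average, for different values of N. The sketch below is only illustrative: `mean_branching` is a hypothetical helper, the `^` and `$` markers are assumed boundary symbols, and the five-word list is a toy stand-in for the real corpus.

```python
from collections import defaultdict

def mean_branching(words, n):
    """Average number of distinct next symbols per context of length n."""
    successors = defaultdict(set)
    for word in words:
        # "^" and "$" are assumed start/end markers, absent from the words.
        padded = "^" * n + word + "$"
        for i in range(n, len(padded)):
            successors[padded[i - n:i]].add(padded[i])
    return sum(len(s) for s in successors.values()) / len(successors)

# Toy stand-in for the real word list.
toy_corpus = ["chat", "chaton", "chanter", "chien", "cheval"]
for n in (1, 2, 3):
    print(n, round(mean_branching(toy_corpus, n), 2))
```

A lower average means fewer choices at each step, so the sampler follows existing words more closely.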

French-sounding words start to appear from N=3: sairions, bouvirents, talpottes, musiant, plamusse.

At N=5, about half of the words are actual French words, but the rest are invented: paraignes, lamistes, embalisme, racinations, paraphie, fallotaient, amarcoussiens.

Find larger lists of words for N=3, N=4, and N=5.

The code can be easily extended to other languages provided that you have a dataset of words from that language (the code’s readme lists the steps).