N-Gram Extractor

Extract character or word n-grams with frequencies.

Overview

Extract every n-gram (contiguous sequence of n items) from your text, counted and sorted by frequency. Switch between character n-grams and word n-grams, and pick n from 1 (unigrams) up through 5 or more. The output is a frequency table you can copy into a spreadsheet or feed into other analyses.

Linguists profiling a corpus, SEO specialists checking keyword cluster density, NLP researchers building feature vectors, and cryptanalysts hunting for repeating ciphertext patterns all reach for an n-gram extractor. It's a staple of text-mining workflows.

How it works

For character n-grams, the tool slides an n-character window across your input one character at a time and tallies each window. For word n-grams, it first tokenizes the input (splitting on whitespace and optionally stripping punctuation), then slides an n-word window across the tokens. A sequence of k items yields k - n + 1 windows, so six words give five bigrams. Common options include case folding, stop-word removal, a minimum frequency threshold, and word-boundary handling (whether character n-grams may cross spaces).
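The sliding-window procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation: the function names (char_ngrams, word_ngrams, frequency_table) and the parameters cross_spaces, lowercase, and min_count are hypothetical stand-ins for the options mentioned.

```python
from collections import Counter


def char_ngrams(text, n, cross_spaces=True):
    """Count character n-grams using a sliding window of n characters."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # If windows may not cross spaces, slide within each word separately.
    chunks = [text] if cross_spaces else text.split()
    counts = Counter()
    for chunk in chunks:
        for i in range(len(chunk) - n + 1):
            counts[chunk[i:i + n]] += 1
    return counts


def word_ngrams(text, n, lowercase=False):
    """Count word n-grams after splitting the text on whitespace."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if lowercase:
        text = text.lower()  # simple case folding
    tokens = text.split()
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        counts[" ".join(tokens[i:i + n])] += 1
    return counts


def frequency_table(counts, min_count=1):
    """Return (ngram, count) rows, most frequent first, ties alphabetical."""
    rows = [(gram, c) for gram, c in counts.items() if c >= min_count]
    return sorted(rows, key=lambda row: (-row[1], row[0]))
```

Under these assumptions, word_ngrams("the cat sat on the mat", 2) produces the five bigrams listed in the Examples section, and char_ngrams("HELLO", 3) produces HEL, ELL, and LLO. An input shorter than n simply yields an empty table.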

Examples

Input:    the cat sat on the mat
Word bigrams:
  the cat   1
  cat sat   1
  sat on    1
  on the    1
  the mat   1

Input:    HELLO
Character trigrams:
  HEL  1
  ELL  1
  LLO  1

FAQ

What's a typical n value?

Unigrams (n=1) and bigrams (n=2) are most common for word-level analysis. Character trigrams (n=3) are widely used in language detection and fuzzy matching.
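To show why character trigrams suit fuzzy matching, here is a small sketch that scores two strings by the overlap of their trigram sets (Jaccard similarity). This is one common approach, assumed here for illustration only; the helper names trigram_set and trigram_similarity are hypothetical and not part of the tool.

```python
def trigram_set(text):
    """Return the set of distinct character trigrams in text, case-folded."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}


def trigram_similarity(a, b):
    """Jaccard similarity of two trigram sets: shared trigrams / all trigrams."""
    ta, tb = trigram_set(a), trigram_set(b)
    union = ta | tb
    if not union:
        # Both strings are shorter than three characters: nothing to compare.
        return 0.0
    return len(ta & tb) / len(union)
```

Identical strings score 1.0, and strings with no trigram in common score 0.0. Near-misses such as misspellings usually land in between because most of their trigrams survive the edit.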

Should I strip stop words?

Depends on the task. Keep them for stylometry and fluency analysis; drop them for keyword extraction and topic modeling so common words don't dominate.
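As a rough sketch of the stop-word option, the snippet below filters tokens against a stop list before sliding the window. The tiny STOP_WORDS set and the function name content_word_ngrams are assumptions made for illustration; real stop lists are much longer, and the tool's own list may differ.

```python
from collections import Counter

# Tiny illustrative stop list (an assumption, not the tool's actual list).
STOP_WORDS = frozenset({"the", "a", "an", "on", "of", "in", "and", "to"})


def content_word_ngrams(text, n, stop_words=STOP_WORDS):
    """Count word n-grams after lowercasing and dropping stop words."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    return Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
```

Note the side effect of filtering first: for "the cat sat on the mat", the bigrams become "cat sat" and "sat mat", so words that were not adjacent in the original text end up paired. If that matters for your analysis, an alternative is to extract n-grams first and then discard any that contain a stop word.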

Does it normalize punctuation?

Only if you ask it to. By default punctuation is treated as part of the token, so "mat." and "mat" are counted separately; toggle the strip-punctuation option to focus on word content alone.
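The difference between the two punctuation modes can be sketched as follows. The tokenize function and its strip_punctuation flag are hypothetical names that mirror the option described here, and the sketch only handles ASCII punctuation (Python's string.punctuation), which is an assumption about scope.

```python
import string


def tokenize(text, strip_punctuation=False):
    """Split on whitespace; optionally trim punctuation from token edges."""
    tokens = text.split()
    if strip_punctuation:
        # Trim leading/trailing ASCII punctuation; inner marks such as the
        # apostrophe in "don't" are left alone.
        tokens = [t.strip(string.punctuation) for t in tokens]
        # Drop tokens that were punctuation only, e.g. a lone "-".
        tokens = [t for t in tokens if t]
    return tokens
```

With the flag off, "Hello, world!" tokenizes to "Hello," and "world!"; with it on, the tokens are "Hello" and "world".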
