fuzzify-pashto
A JavaScript library that creates regular expressions (regex) for fuzzy searching Pashto text (approximate string matching)

Live Demo
Text to Search:
پښتو ژبه د لرغونو آريایي ژبو څخه يوه خپلواکه ژبه ده چې له پخوا څخه په څو نومونو ياده شوې چې يو لړ نومونه ېې پښتو، پختو، پوختو، په هندي (پټاني)، په پاړسي او نړېوالې کچه د افغاني ژبې په نوم شهرت لري
Code:
import { fuzzifyPashto } from "fuzzify-pashto";
const fuzzyRegex = fuzzifyPashto("سخه", {
  es2018: true,
});
Generated Regex:
(?<![ء-ٰٟ-ۓە])[صسثڅ][ا|و|ی|ي|ع]?[ښخشخهحغ][ا|و|ی|ي|ع]?(?:[اهحہۀ]|هٔ)[ا|و|ی|ي|ع]?
Note: Regex may appear out of order due to browser display issues with RTL-LTR text
Problem
It can be difficult to search for words in Pashto texts or dictionaries because spellings vary and are hard to predict. This is because:
- Certain sounds in Pashto can be written with many different letters. This is especially true of Arabic and Farsi loanwords.
  - For instance, the ‘z’ sound can be spelled with any one of the following: ز ذ ض ظ.
- Certain sounds in Pashto can be difficult for non-native speakers to distinguish.
  - It may be hard for learners to hear the difference between pairs such as: ر/ړ څ/س ځ/ز
  - In addition, some dialects lose the difference in pronunciation between letters such as س and څ, or ښ and خ.
- Spelling often changes based on dialect, area, and level of education.
  - Some people may write “ګرځېدل” while others may write “گرزيدل”
  - While the proper dictionary form may be “څنګه”, a learner may encounter the same word written as “سنگہ”.
For all these reasons, it can be difficult to search for words based on their sound, or on a particular non-standard spelling.
Solution
Search strings can be converted to regular expressions that can be used for fuzzy searching so that, for example:
| Searching For | Will Match |
|---|---|
| گرزيدل | ګرځېدل |
| سنگہ | څنګه |
| انطزار | انتظار |
| د پاره | دپاره |
| مالوم | معلوم |
| زبا | ژبه |
| سڑے | سړی |
and vice versa.
- TODO: A search for “له پاره” will match the word “لپاره”
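The conversion can be illustrated with a minimal sketch. This is not the library's implementation: `toyFuzzify` and `letterGroups` are hypothetical names, and the three letter groups are simply copied from the regex output shown in this README. The real library handles many more letters, optional vowels, and word boundaries.

```javascript
// Groups of letters treated as interchangeable. Copied from the regex
// output shown in this README; the real library covers far more.
const letterGroups = ["صسثڅ", "رړڑڼ", "ګږکقگك"];

// Escape characters that have a special meaning in a regex.
const escapeRegex = (ch) => ch.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

// Hypothetical helper: swap each letter for a character class holding
// every letter in its group; other characters are matched literally.
function toyFuzzify(input) {
  const pattern = Array.from(input)
    .map((ch) => {
      const group = letterGroups.find((g) => g.includes(ch));
      return group ? `[${group}]` : escapeRegex(ch);
    })
    .join("");
  return new RegExp(pattern);
}

const toyRegex = toyFuzzify("سرک");
console.log(toyRegex.source);
console.log(toyRegex.test("څړګ")); // different letters, same groups
console.log(toyRegex.test("کتاب")); // unrelated word
```

Each letter of the search string becomes one character class, so any spelling that picks a different letter from the same group still matches.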
Usage
npm install --save fuzzify-pashto
import { fuzzifyPashto } from "fuzzify-pashto";
const fuzzyRegex = fuzzifyPashto("سرک");
console.log(fuzzyRegex);
// output: /(?:^|[^\u0621-\u065f\u0670-\u06d3\u06d5])?[صسثڅ]ع?[رړڑڼ]ع?[ګږکقگك]/gm
See the Live Demo for interactive usage examples.
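As a rough sketch of how the returned regex might be applied, the example below filters a word list. To stay self-contained it uses a regex literal copied from the output above instead of calling the library, and the word list is made up for illustration. Because the regex carries the `g` flag, `RegExp.prototype.test` would remember its position between calls, so the sketch uses `String.prototype.search`, which always starts from the beginning of the string.

```javascript
// Regex literal copied from the documented output of fuzzifyPashto("سرک").
const fuzzyRegex = /(?:^|[^\u0621-\u065f\u0670-\u06d3\u06d5])?[صسثڅ]ع?[رړڑڼ]ع?[ګږکقگك]/gm;

// Hypothetical word list, for illustration only.
const words = ["سړک", "کتاب", "سرک", "څړک"];

// search() ignores lastIndex, so it is safe to reuse a `g` regex here.
const hits = words.filter((word) => word.search(fuzzyRegex) !== -1);
console.log(hits);
```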
API
fuzzifyPashto.fuzzifyPashto(input, [options])
Takes a string of Pashto text (usually a word) as input, and returns a regular expression that can be used for fuzzy searching for approximate matches in Pashto text.
Options
options.matchStart
Chooses where in the text matches are allowed to start.
- "word" (default): Matches starting only at the beginning of a Pashto word (This is like using /\b.../, but with Pashto/Unicode functionality)
- "string": Matches only starting at the very beginning of the string/text (/^.../)
- "anywhere": Matches anywhere, from the beginning or middle of the words (/.../)
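The three modes can be pictured as three different prefixes in front of the generated pattern. The sketch below is an illustration, not the library's code: `core` is a toy two-letter pattern, and the character ranges are copied from the regex output shown under Usage and assumed here to mean "characters that can be part of a Pashto word". It also shows why a plain `\b` is not enough: in JavaScript, `\b` only recognises ASCII word characters.

```javascript
// Ranges copied from the regex output shown under Usage (an assumption
// about what counts as a Pashto word character).
const PASHTO_CHARS = "\\u0621-\\u065f\\u0670-\\u06d3\\u06d5";
// Toy stand-in for a generated pattern (hypothetical, two letters only).
const core = "[صسثڅ][رړڑڼ]";

const modes = {
  word: new RegExp(`(?:^|[^${PASHTO_CHARS}])${core}`),
  string: new RegExp(`^${core}`),
  anywhere: new RegExp(core),
};

// \b never finds a boundary next to a Pashto letter.
console.log(/\bس/.test("سړک")); // false

const midSentence = "دا سړک"; // the word starts after a space
const midWord = "لسړ"; // made-up string: the pattern sits inside a word

for (const [name, regex] of Object.entries(modes)) {
  console.log(name, regex.test(midSentence), regex.test(midWord));
}
```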
options.matchWholeWordOnly
- false (default): Will match the beginning or parts of words
- true: Will only match if the whole word is provided. This overrides options.matchStart = "anywhere" if set. (This is like using /\b...\b/, but with Pashto/Unicode functionality)
options.allowSpacesInWords
- false (default): Mid-word spaces in either the search input or the text will break matches
- true: Will match regardless of spaces, i.e. دپاره will match د پاره, and vice versa
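One way to picture space-tolerant matching is to drop spaces from the search input and then allow an optional space between every pair of letters. The helper below, `toySpaceTolerant`, is a hypothetical sketch under that assumption and says nothing about how the library implements the option. It also assumes the input contains only letters and spaces, so no regex escaping is done.

```javascript
// Hypothetical helper: remove spaces from the input, then allow an
// optional space between each pair of adjacent letters.
function toySpaceTolerant(input) {
  const letters = Array.from(input.replace(/\s+/g, ""));
  return new RegExp(letters.join(" ?"));
}

const joined = "دپاره";
const spaced = "د پاره";

console.log(toySpaceTolerant(joined).test(spaced));
console.log(toySpaceTolerant(spaced).test(joined));
```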
options.script
- "Pashto" (default): Use Pashto script (پښتو)
- "Latin": Use Latin script (phonetics) (puxto)
options.returnWholeWord
- false (default): Will return just the matching characters
- true: Will return the whole word attached to the matching characters
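The difference between the two settings can be sketched by padding a pattern with "any number of Pashto letters" on both sides. This is only an illustration: `core` is a toy pattern, and the character ranges are copied from the regex output shown under Usage and assumed to cover the characters of a Pashto word.

```javascript
// Ranges copied from the regex output shown under Usage (an assumption).
const PASHTO_CHARS = "\\u0621-\\u065f\\u0670-\\u06d3\\u06d5";
// Toy stand-in for a generated pattern (hypothetical).
const core = "[صسثڅ][رړڑڼ]";

// Matches only the characters covered by the pattern.
const partOnly = new RegExp(core);
// Extends the match to the rest of the surrounding word.
const wholeWord = new RegExp(`[${PASHTO_CHARS}]*${core}[${PASHTO_CHARS}]*`);

const text = "دا سړکونه";
console.log(text.match(partOnly)[0]); // the two matching letters
console.log(text.match(wholeWord)[0]); // the whole second word
```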
options.es2018
- false (default): Will not use lookbehind assertions in the regex, because they are not supported on every platform
- true: Only to be used in an environment where lookbehind assertions are supported, such as a recent version of Chrome or Node.js. This allows for cleaner matching of word beginnings, properly handling spaces and punctuation. Warning: Do not use this in unsupported environments. It will cause a syntax error.
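If the target environment is not known in advance, lookbehind support can be probed at runtime before setting the option. The helper `supportsLookbehind` below is a hypothetical sketch and not part of the library. It builds the test pattern from a string so that an unsupported engine throws a catchable error at runtime instead of failing when the script is parsed.

```javascript
// Hypothetical helper: returns true if the engine accepts lookbehind.
function supportsLookbehind() {
  try {
    // Built from a string so the syntax error, if any, is thrown here.
    new RegExp("(?<!a)b");
    return true;
  } catch (err) {
    return false;
  }
}

// The resulting object could be passed as the options argument, e.g.
// fuzzifyPashto("سخه", options).
const options = { es2018: supportsLookbehind() };
console.log(options);
```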
options.ignoreDiacritics
- false (default): Diacritics will make or break matches
- true: Diacritics will be ignored, and matches will work regardless of whether or not they are included
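A simple way to see what ignoring diacritics means is to strip the combining marks from a string before comparing it. The sketch below is an assumption about one possible approach, not a description of the library's behaviour. `stripDiacritics` is a hypothetical helper, and the range used is the Arabic-script combining marks U+064B to U+065F plus the superscript alef U+0670.

```javascript
// Arabic-script combining marks, plus superscript alef (U+0670).
const DIACRITICS = /[\u064b-\u065f\u0670]/g;

// Hypothetical helper: remove diacritics so marked and unmarked
// spellings compare equal.
function stripDiacritics(text) {
  return text.replace(DIACRITICS, "");
}

const withMarks = "\u06a9\u0650\u062a\u0627\u0628"; // کِتاب (with a kasra)
const plain = "\u06a9\u062a\u0627\u0628"; // کتاب

console.log(withMarks === plain);
console.log(stripDiacritics(withMarks) === plain);
```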