python - RegEx Tokenizer to split a text into words, digits and punctuation marks
What I want is to split a text into its ultimate elements.
For example:
from nltk.tokenize import *
txt = "A sample sentences with digits like 2.119,99 or 2,99 are awesome."
regexp_tokenize(txt, pattern=r'(?:(?!\d)\w)+|\S+')
['A', 'sample', 'sentences', 'with', 'digits', 'like', '2.119,99', 'or', '2,99', 'are', 'awesome', '.']
As you can see, this works fine. My problem is: what happens if a digit is at the end of the text?
txt = "today it's 07.may 2011. or 2.999." regexp_tokenize(txt, pattern='(?:(?!\d)\w)+|\s+') ['today', 'it', "'s", '07.may', '2011.', 'or', '2.999.']
The result should be: ['Today', 'it', "'s", '07.May', '2011', '.', 'Or', '2.999', '.']
What do I have to do to get the result above?
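For reference, here is a minimal sketch of why the trailing period sticks to the number, using plain re.findall (RegexpTokenizer with gaps=False is backed by re.findall, so the matching behaviour should be the same): any token that starts with a digit fails the first alternative and falls through to \S+, which greedily swallows the sentence-final period as well.

import re

txt = "Today it's 07.May 2011. Or 2.999."

# First alternative: runs of word characters that are not digits.
# Second alternative: any run of non-whitespace characters.
# A token beginning with a digit is only matched by \S+, so the final '.' is kept attached.
print(re.findall(r'(?:(?!\d)\w)+|\S+', txt))
# ['Today', 'it', "'s", '07.May', '2011.', 'Or', '2.999.']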
I created a pattern that tries to include periods and commas occurring inside words and numbers. Hope it helps:
txt = "today it's 07.may 2011. or 2.999." regexp_tokenize(txt, pattern=r'\w+([.,]\w+)*|\s+') ['today', 'it', "'s", '07.may', '2011', '.', 'or', '2.999', '.']