python - RegEx Tokenizer to split a text into words, digits and punctuation marks -


What I want is to split a text into its ultimate elements (words, digits and punctuation marks).

for example:

from nltk.tokenize import *
txt = "a sample sentences with digits like 2.119,99 or 2,99 are awesome."
regexp_tokenize(txt, pattern=r'(?:(?!\d)\w)+|\s+')
['a', 'sample', 'sentences', 'with', 'digits', 'like', '2.119,99', 'or', '2,99', 'are', 'awesome', '.']

As you can see, it works fine. The problem is: what happens if a digit is at the end of the text?

txt = "today it's 07.may 2011. or 2.999."
regexp_tokenize(txt, pattern=r'(?:(?!\d)\w)+|\s+')
['today', 'it', "'s", '07.may', '2011.', 'or', '2.999.']

The result should be: ['today', 'it', "'s", '07.may', '2011', '.', 'or', '2.999', '.']

How can I get the result above?

I created a pattern that tries to include periods and commas only when they occur inside words or numbers. I hope it helps:

txt = "today it's 07.may 2011. or 2.999."
regexp_tokenize(txt, pattern=r'\w+([.,]\w+)*|\s+')
['today', 'it', "'s", '07.may', '2011', '.', 'or', '2.999', '.']
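For reference, the same idea can be reproduced with the standard `re` module alone, so it runs without NLTK. This is a minimal sketch, not the original answer's exact setup: the `'\w+` alternative (for clitics like "'s") and the `[^\w\s]` alternative (for standalone punctuation) are my additions, since `\w+(?:[.,]\w+)*` by itself matches neither.

```python
import re

# Tokenizer pattern, extending the answer's idea:
#   \w+(?:[.,]\w+)*  - words/numbers, keeping '.' and ',' only when they
#                      sit between word characters (2.119,99 or 07.may)
#   '\w+             - clitics such as "'s" (my addition, an assumption)
#   [^\w\s]          - any remaining single punctuation mark (my addition)
PATTERN = r"\w+(?:[.,]\w+)*|'\w+|[^\w\s]"

def tokenize(text):
    """Split text into words, digit groups and punctuation marks."""
    return re.findall(PATTERN, text)

print(tokenize("today it's 07.may 2011. or 2.999."))
# ['today', 'it', "'s", '07.may', '2011', '.', 'or', '2.999', '.']
```

Note that only non-capturing groups `(?:...)` are used; with a capturing group like `([.,]\w+)`, `re.findall` would return the group's contents instead of the whole match.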
