I wrote an XML parser that handles ASCII files, but I now need it to read UTF-8 encoded files as well. I have the following regular expressions in Lex, but they don't match UTF-8 input, and I'm not sure what I'm doing wrong:
utf_8 [\x00-\xff]*
bom [\xEF\xBB\xBF]
followed by these rules:
bom { fprintf( stderr, "OMG I SAW A BOM"); return BOM;}
utf_8 { fprintf( stderr, "OMG I SAW A UTF CHAR", yytext[0] ); return UTF_8; }
I also have the following grammar rules:
program
: UTF8 '<' '?' ID attribute_list '?' '>'
root ...
where UTF8 is:
UTF8
: BOM {printf("i saw a bom\n");}
| UTF_8 {printf("i saw a utf\n");}
| {printf("i didn't see anything.'\n");}
;
It always prints "i didn't see anything". My parser works for ASCII files, that is, when I copy and paste the contents of the UTF-8 XML file into an empty document.
Any help would be appreciated.
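For reference, a few things in the patterns above may be worth checking. In Lex, a bracket expression such as [\xEF\xBB\xBF] matches a single byte taken from that set, whereas a UTF-8 byte order mark is the three bytes EF BB BF in sequence. A named definition is also normally referenced inside a rule as {bom}, since a bare bom matches the literal text "bom". The C sketch below is not the scanner itself. It only illustrates the byte-level checks involved, and the function names has_utf8_bom and utf8_seq_len are hypothetical, chosen for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if buf starts with the UTF-8 byte
 * order mark, i.e. the three bytes EF BB BF in that exact order,
 * and 0 otherwise. */
int has_utf8_bom(const unsigned char *buf, size_t len)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };
    assert(buf != NULL);
    return len >= 3 && memcmp(buf, bom, 3) == 0;
}

/* Hypothetical helper: returns the total length in bytes (1 to 4) of
 * the UTF-8 sequence introduced by lead byte b, or 0 if b cannot
 * start a sequence. */
int utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;                /* 0xxxxxxx: plain ASCII */
    if (b >= 0xC2 && b <= 0xDF) return 2;  /* 110xxxxx lead byte    */
    if (b >= 0xE0 && b <= 0xEF) return 3;  /* 1110xxxx lead byte    */
    if (b >= 0xF0 && b <= 0xF4) return 4;  /* 11110xxx lead byte    */
    return 0;  /* continuation byte (0x80-0xBF) or invalid lead byte */
}
```

If the same idea is carried over to the scanner, the BOM would be written as the sequence \xEF\xBB\xBF rather than as a bracket expression, and a multi-byte character would be a lead byte followed by the matching number of continuation bytes in the range \x80-\xBF. This is a sketch under those assumptions, not a drop-in fix.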