File Oddity
Today at work I was attempting to parse a file and discovered something odd happening. When I simply viewed the file with cat, I could see this:
<html><head><title>Status</title></head>
<table>
<tr><td>Failed</td><td>Backup Group</td></tr>
<tr><td>Success</td><td>Another Backup Group</td></tr>
</table>
</body>
</html>
Nothing odd there; the file looked normal. But when I tried this command:
$ grep -i failed status.html
$
Huh? No output, suggesting that there are no lines containing the word "failed". The same happened with awk and sed; indeed, I could not find any tool that could pull out the status. So the next step was to check what was odd about the file:
$ file status.html
status.html: HTML document text
Still nothing unusual. By now I was in full head-scratching mode. I opened the file in vim to see if I could spot anything strange, but found nothing. At this point I happened to switch from cat to more, and the result was the start of how I solved it.
$ more status.html
��<
$
Ah, so that was why grep and the others could not find anything in the file. I switched back to vim and checked the file encoding:
:set fileencoding
  fileencoding=ucs-2le
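For those who are unaware, this is UCS-2 (little endian), the fixed-width predecessor of UTF-16; for text like this, the two are byte-for-byte identical. Every character takes two bytes, so each ASCII letter is followed by a NUL byte, and the byte sequence "failed" that grep searches for never appears in the file. Terminals typically ignore NUL bytes, which is why the output of cat looked normal. So the issue was simply that we had UTF-16 characters. Now for the trick to get around it: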
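To see why a plain byte search misses the text, it can help to look at the raw bytes. The sketch below is only an illustration: it builds a small, hypothetical UTF-16LE sample with printf (it is not the real status.html), dumps it with od, and confirms that grep does not match. The variable names are made up for the example.

```shell
# Hypothetical sample file, standing in for status.html.
tmp=$(mktemp)

# Byte order mark (FF FE), then "Failed" and a newline in UTF-16LE:
# every ASCII byte is followed by a NUL byte.
printf '\377\376F\000a\000i\000l\000e\000d\000\n\000' > "$tmp"

# Dump the bytes as hex and squeeze out od's spacing.
hex=$(od -An -tx1 "$tmp" | tr -d ' \n')
echo "$hex"   # fffe4600610069006c00650064000a00

# grep looks for the contiguous bytes f-a-i-l-e-d, which are not there.
if grep -qi failed "$tmp"; then found=yes; else found=no; fi
echo "grep matched: $found"   # grep matched: no

rm -f "$tmp"
```

The leading ff fe is the byte order mark, which is most likely what more displayed as garbage, and the 00 after every letter is what keeps grep from matching.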
$ iconv -f UTF-16 -t UTF-8 status.html | grep -i failed
<tr><td>Failed</td><td>Backup Group</td></tr>
$
Tada. Once more a solution!
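If this comes up more than once, the check and the conversion could be wrapped in a small shell function. This is a rough sketch rather than a polished tool: grep_any is a hypothetical name, and it assumes the file starts with a UTF-16 byte order mark, as this one appeared to. A UTF-16 file without a byte order mark would fall through to plain grep.

```shell
# grep_any PATTERN FILE
# Hypothetical helper: case-insensitive grep that converts UTF-16 first.
grep_any() {
    ga_pattern=$1
    ga_file=$2

    # Read the first two bytes as hex to look for a UTF-16 byte order mark.
    ga_bom=$(od -An -tx1 -N2 "$ga_file" | tr -d ' \n')

    case $ga_bom in
        fffe|feff)
            # Byte order mark found: convert to UTF-8, then search.
            iconv -f UTF-16 -t UTF-8 "$ga_file" | grep -i -- "$ga_pattern"
            ;;
        *)
            # No byte order mark: assume grep can read the file directly.
            grep -i -- "$ga_pattern" "$ga_file"
            ;;
    esac
}
```

It would be called as grep_any failed status.html, and it leaves the original file untouched.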