Linux: Extract whole and specific text from MS Word file (*.docx)

#specific text:
unzip -p <filename.docx> word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g' | grep -oP "<regexp>" \
  | perl -MHTML::Entities -pe 'decode_entities($_);'

#for whole text:
unzip -p <filename.docx> word/document.xml | sed -e 's/<\/w:p>/\n/g; s/<[^>]\{1,\}>//g; s/[^[:print:]\n]\{1,\}//g' \
  | perl -MHTML::Entities -pe 'decode_entities($_);'

#whole text, cleaner:
unzip -p <filename.docx> word/document.xml | sed -e 's/<wp:align>\w*<\/wp:align>//g; s/<wp14:pctWidth>\w*<\/wp14:pctWidth>//g; s/<wp14:pctHeight>\w*<\/wp14:pctHeight>//g; s/<\/w:p>/\n/g; s/<[^>]\{1,\}>//g; s/[^[:print:]\n]\{1,\}//g; s/\n\n/\n/g'

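Use this combination of commands to extract specific text from an MS Word file (*.docx) with a regular expression, directly from the command line. unzip -p writes word/document.xml to standard output, sed strips the XML tags and non-printable characters, grep keeps only the parts that match, and perl converts HTML entities back to characters.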

<filename.docx>: the .docx file to read
<regexp>: regular expression for filtering the text (Perl-compatible syntax, because grep is run with -P)

Ex: unzip -p mydoc.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g' | grep -oP "name\:(.*)\s*"
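
A small sketch of how the grep stage behaves may help when writing <regexp>. Word typically stores the document body on one very long line, so a greedy pattern such as (.*) can run on to the end of that line. The sample text below is made up, and the \K and \S+ constructs are suggestions of this sketch rather than part of the original tip; they assume a GNU grep built with -P support.

```shell
# Made-up stand-in for the tag-stripped text of a document.
text='name: Alice dept: Sales'

# -o prints only the matched part of the line.
# -P enables Perl-style syntax; \K drops everything matched so far from
# the reported match, and \S+ stops at the first whitespace.
printf '%s\n' "$text" | grep -oP 'name:\s*\K\S+'
```

With this input the command prints only the value that follows "name:".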

For whole text:

Ex: unzip -p mydoc.docx word/document.xml | sed -e 's/<\/w:p>/\n/g; s/<[^>]\{1,\}>//g; s/[^[:print:]\n]\{1,\}//g'
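
For repeated use, the whole-text pipeline can be wrapped in shell functions. This is only a sketch: the names xml_to_text and docx_text are hypothetical, GNU sed is assumed (for \n in the replacement and inside the bracket expression), and the HTML entity decoding step is left out so that the sketch does not depend on the Perl HTML::Entities module.

```shell
# xml_to_text: read WordprocessingML on stdin, write plain text on stdout.
# (Hypothetical helper name; assumes GNU sed.)
xml_to_text() {
    # 1) end of paragraph -> newline, 2) drop all tags,
    # 3) drop non-printable characters except newlines.
    sed -e 's/<\/w:p>/\n/g; s/<[^>]\{1,\}>//g; s/[^[:print:]\n]\{1,\}//g'
}

# docx_text FILE: print the body text of FILE, one paragraph per line.
docx_text() {
    unzip -p "$1" word/document.xml | xml_to_text
}
```

Usage: docx_text mydoc.docx | grep -oP "<regexp>"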

*New: Added conversion of HTML entities to characters.

1 Response

I used this method from PHP, but I had to add "putenv('LANG=en_US.UTF-8');" on a line before the shell_exec call so that special characters in the result were displayed correctly.
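
One plausible explanation, offered here as an assumption rather than something the comment confirms, is that the [:print:] class in the sed expression depends on the locale: in the plain C locale the bytes of a UTF-8 accented character do not count as printable and are stripped. The sketch below shows the effect in shell; strip_nonprint is a hypothetical helper, and a glibc-based system is assumed.

```shell
# strip_nonprint LOCALE: run the tip's "drop non-printable characters"
# expression under the given locale (hypothetical helper name).
strip_nonprint() {
    LC_ALL="$1" sed -e 's/[^[:print:]]\{1,\}//g'
}

# \303\251 is the UTF-8 byte sequence for an accented "e".
printf 'caf\303\251\n' | strip_nonprint C            # accented character is dropped
printf 'caf\303\251\n' | strip_nonprint en_US.UTF-8  # kept, if that locale is installed
```

Setting LANG or LC_ALL to a UTF-8 locale before running the pipeline, as the PHP putenv call does, avoids the loss.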
