Linux: Extract whole and specific text from MS Word file (*.docx)

unzip -p <filename.docx> word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g' | grep -oP "<regexp>"

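Use this combination of commands to extract specific text from an MS Word file (*.docx) with a regular expression, directly from the command line. unzip -p prints word/document.xml (the main document body) to stdout, sed strips the XML tags and non-printable characters, and grep -oP prints only the parts that match.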

<filename.docx>: the .docx file to read
<regexp>: regular expression (Perl syntax, because of grep -P) used to filter the text

Ex: unzip -p mydoc.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g' | grep -oP "name:.*"
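
Word usually writes the whole body of word/document.xml on a single line, so once the tags are stripped a greedy pattern such as name:.* can run on to the end of the document. A minimal sketch of one way around that, assuming GNU sed (for \n in the replacement) and a grep built with -P support: turn every paragraph end into a newline before grepping, so each paragraph is matched on its own line. The function name docx_xml_grep is hypothetical, not part of the original tip.

```shell
# docx_xml_grep REGEXP
# Reads WordprocessingML (the content of word/document.xml) on stdin and
# prints only the text that matches REGEXP, one paragraph per line.
# Assumes GNU sed and GNU grep with PCRE (-P) support.
docx_xml_grep() {
  # 1. drop non-printable characters
  # 2. end of paragraph (</w:p>) becomes a newline
  # 3. remove all remaining XML tags
  sed -e 's/[^[:print:]]\{1,\}//g' \
      -e 's/<\/w:p>/\n/g' \
      -e 's/<[^>]\{1,\}>//g' |
    grep -oP -- "$1"
}
```

Ex: unzip -p mydoc.docx word/document.xml | docx_xml_grep "name:.*"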

For whole text:

Ex: unzip -p mydoc.docx word/document.xml | sed -e 's/<\/w:p>/\n/g; s/<[^>]\{1,\}>//g; s/[^[:print:]\n]\{1,\}//g'

*New: Added html entities to chars conversion.
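
The entity conversion is not visible in the commands as shown here. A minimal sketch of how it could be done with sed, assuming only the five predefined XML entities need decoding (numeric references such as &#233; are left untouched). The function name decode_xml_entities is hypothetical. It has to run after the tags are stripped, otherwise a decoded < or > would be removed as if it were markup, and &amp; is decoded last so that text like &amp;lt; ends up as &lt; and is not decoded twice.

```shell
# decode_xml_entities
# Reads text on stdin and replaces the five predefined XML entities
# with the characters they stand for.
decode_xml_entities() {
  # \& in the last replacement is a literal ampersand; a bare & would
  # mean "the matched text" to sed. &amp; is handled last on purpose.
  sed -e 's/&lt;/</g' \
      -e 's/&gt;/>/g' \
      -e 's/&quot;/"/g' \
      -e "s/&apos;/'/g" \
      -e 's/&amp;/\&/g'
}
```

Ex: unzip -p mydoc.docx word/document.xml | sed -e 's/<\/w:p>/\n/g; s/<[^>]\{1,\}>//g; s/[^[:print:]\n]\{1,\}//g' | decode_xml_entities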

1 Response

I used this method from PHP, but I had to put "putenv('LANG=en_US.UTF-8');" on a line before the shell_exec call so that special characters in the result are displayed correctly.

Marco Piñero 8 years ago
