Linux: Extract whole and specific text from MS Word file (*.docx)

unzip -p <filename.docx> | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g' | grep -oP "<regexp>"
Use this combination of commands to extract specific text from an MS Word file (*.docx) with a regular expression, directly from the command line.

<filename.docx>: the .docx file to read
<regexp>: regular expression for filtering text

Ex: unzip -p mydoc.docx | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g' | grep -oP "name:(.*)\s*"
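
One caveat with the example above: Word usually writes the body of word/document.xml on a single line, so once the tags are stripped `.*` tends to run on to the end of the text. A minimal sketch that breaks the text at paragraph ends before filtering, assuming GNU sed and GNU grep. The function names `docx_paragraphs` and `docx_field` and the `name` label are illustrative only, not part of the original tip:

```shell
# Hypothetical helpers, not part of the original one-liner.
# Assumes GNU sed (\n in the replacement) and GNU grep (-P, \K).

# Read WordprocessingML on stdin, print one paragraph per line.
docx_paragraphs() {
    sed -e 's/<\/w:p>/\n/g; s/<[^>]\{1,\}>//g'
}

# Print the value that follows "<label>:" in each paragraph that has one.
# $1 = path to the .docx file, $2 = label text (interpreted as a regex)
docx_field() {
    unzip -p "$1" word/document.xml | docx_paragraphs | grep -oP "$2:\s*\K.*"
}
```

Usage: `docx_field mydoc.docx name` prints only what follows `name:` in each matching paragraph, because `\K` drops the label from the reported match.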

For whole text:

Ex: unzip -p mydoc.docx word/document.xml | sed -e 's/<\/w:p>/\n/g; s/<[^>]\{1,\}>//g; s/[^[:print:]\n]\{1,\}//g'

*New: Added conversion of HTML entities to characters.
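
One way such a conversion could look, as a sketch that handles only the five predefined XML entities (numeric references such as `&#233;` are left untouched). The function name `decode_entities` is hypothetical and this is not the author's code:

```shell
# Hypothetical filter, not the original author's code.
# Decodes the five predefined XML entities. &amp; is handled last so that
# text such as "&amp;lt;" becomes "&lt;" rather than "<".
decode_entities() {
    sed -e 's/&lt;/</g; s/&gt;/>/g; s/&quot;/"/g' \
        -e "s/&apos;/'/g" \
        -e 's/&amp;/\&/g'
}
```

Usage: append it as the last stage, for example `unzip -p mydoc.docx word/document.xml | sed -e 's/<\/w:p>/\n/g; s/<[^>]\{1,\}>//g' | decode_entities`. Decoding has to come after the tags are stripped, otherwise a decoded `<` could be mistaken for the start of a tag.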

1 Response

I used this method from PHP, but I had to call "putenv('LANG=en_US.UTF-8');" on the line before shell_exec so that special characters in the result were displayed correctly.
