unzip -p <filename.docx> word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g' | grep -oP "<regexp>" \
| perl -MHTML::Entities -pe 'decode_entities($_);'
# For whole text:
unzip -p <filename.docx> word/document.xml | sed -e 's/<\/w:p>/\n/g; s/<[^>]\{1,\}>//g; s/[^[:print:]\n]\{1,\}//g' \
| perl -MHTML::Entities -pe 'decode_entities($_);'
Use this combination of commands to extract specific text from an MS Word file (*.docx) with a regular expression, directly from the command line.
<filename.docx>: the .docx file to read
<regexp>: regular expression for filtering text (Perl syntax, because grep is called with -P)
Ex: unzip -p mydoc.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g' | grep -oP "name:(.*)\s*"
For whole text:
Ex: unzip -p mydoc.docx word/document.xml | sed -e 's/<\/w:p>/\n/g; s/<[^>]\{1,\}>//g; s/[^[:print:]\n]\{1,\}//g'
*New: Added conversion of HTML entities (such as &amp;) to plain characters.
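The difference between the two sed programs can be seen on a small sample. The sketch below is only an illustration: the `sample` string is a made-up fragment shaped like word/document.xml, the function names `strip_flat` and `strip_by_paragraph` are hypothetical, and GNU sed is assumed (other sed implementations may not treat `\n` in the replacement as a newline). The entity-decoding perl step is left out to keep the example dependency-free.

```shell
#!/bin/sh
# Made-up sample: two paragraphs, as they might appear on one line of word/document.xml
sample='<w:p><w:r><w:t>name: Alice</w:t></w:r></w:p><w:p><w:r><w:t>city: Oslo</w:t></w:r></w:p>'

# sed program from the first command: strips tags, keeps everything on one line
strip_flat() {
  sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'
}

# sed program from the "whole text" command: each </w:p> becomes a line break first
strip_by_paragraph() {
  sed -e 's/<\/w:p>/\n/g; s/<[^>]\{1,\}>//g; s/[^[:print:]\n]\{1,\}//g'
}

# Flat output: a greedy pattern runs on into the next paragraph
printf '%s\n' "$sample" | strip_flat | grep -o 'name: .*'
# prints: name: Alicecity: Oslo

# Paragraph-per-line output: the same pattern stops at the end of the paragraph
printf '%s\n' "$sample" | strip_by_paragraph | grep -o 'name: .*'
# prints: name: Alice
```

If a pattern is meant to match within one paragraph, piping the "whole text" variant into grep is probably the safer choice, since the flat variant joins all paragraphs into a single line.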