unzip -p <filename.docx> word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g' | grep -oP "<regexp>" \
| perl -MHTML::Entities -pe 'decode_entities($_);'
#for whole text:
unzip -p <filename.docx> word/document.xml | sed -e 's/<\/w:p>/\n/g; s/<[^>]\{1,\}>//g; s/[^[:print:]\n]\{1,\}//g' \
| perl -MHTML::Entities -pe 'decode_entities($_);'
#whole text, cleaner:
unzip -p <filename.docx> word/document.xml | sed -e 's/<wp:align>\w*<\/wp:align>//g; s/<wp14:pctWidth>\w*<\/wp14:pctWidth>//g; s/<wp14:pctHeight>\w*<\/wp14:pctHeight>//g; s/<\/w:p>/\n/g; s/<[^>]\{1,\}>//g; s/[^[:print:]\n]\{1,\}//g; s/\n\n/\n/g'
Use this combination of commands to extract specific text from an MS Word file (*.docx) with a regular expression, directly from the command line.
<filename.docx>: the .docx file to read
<regexp>: regular expression used to filter the text
Ex: unzip -p mydoc.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g' | grep -oP "name:(.*)\s*"
For whole text:
Ex: unzip -p mydoc.docx word/document.xml | sed -e 's/<\/w:p>/\n/g; s/<[^>]\{1,\}>//g; s/[^[:print:]\n]\{1,\}//g'
*New: added conversion of HTML entities to characters.
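To see what the sed stage does without needing a real document, here is a minimal sketch that runs the same expressions on an inline sample. The function name docx_xml_to_text and the sample XML are made up for illustration, and the sketch assumes GNU sed, which treats \n in the replacement as a newline.

```shell
# Hypothetical helper: read WordprocessingML on stdin, write plain text,
# one paragraph per line. Assumes GNU sed (\n in the replacement is a newline).
docx_xml_to_text() {
  # 1) turn each paragraph end into a newline
  # 2) drop every remaining tag
  # 3) drop non-printable characters, keeping newlines
  sed -e 's/<\/w:p>/\n/g; s/<[^>]\{1,\}>//g; s/[^[:print:]\n]\{1,\}//g'
}

# Made-up sample standing in for the content of word/document.xml.
sample='<w:p><w:r><w:t>name: Alice</w:t></w:r></w:p><w:p><w:r><w:t>city: Oslo</w:t></w:r></w:p>'

# Filter the extracted text with a regular expression, as in the examples above.
printf '%s\n' "$sample" | docx_xml_to_text | grep -o 'name: [A-Za-z]*'
```

With a real file, the same function could be fed from unzip -p mydoc.docx word/document.xml in place of the printf line.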