Now that we have covered all the pieces needed for the extracttext function, let's see what it looks like as a whole:
function nodeToHtml( $node_content ) {
    // Check whether this node needs bold formatting
    preg_match( "'\<w:b\>\<\/w:b\>'", $node_content, $bold );
    // Check whether this node needs italic formatting
    preg_match( "'\<w:i\>\<\/w:i\>'", $node_content, $italic );
    // Get the font size. Part of this pattern was lost in the source;
    // capturing the w:val attribute of w:sz is an assumption.
    preg_match( "'\<w:sz w:val=\"(.*?)\"\>\<\/w:sz\>'", $node_content, $font_size );
    // Get the text
    preg_match( "'\<w:t(.*?)\>(.*?)\<\/w:t\>'", $node_content, $text );
    // We do not try to get the font name, since we will probably want to use our own
    $tag_name = 'span';
    $style = ' style="';
    // A run without a w:t element has no text
    $content = isset( $text[2] ) ? $text[2] : '';
    if ( count( $bold ) > 0 )
        $style .= 'font-weight: bold; ';
    if ( count( $italic ) > 0 )
        $style .= 'font-style: italic; ';
    // Note: Word stores w:sz in half-points, so using it as px is only approximate
    if ( count( $font_size ) > 0 )
        $style .= 'font-size: ' . $font_size[1] . 'px; ';
    $style .= '" ';
    return '<' . $tag_name . $style . '>' . $content . '</' . $tag_name . '>';
}

function extracttext( $filename ) {
    $ext = strtolower( pathinfo( $filename, PATHINFO_EXTENSION ) );
    if ( $ext == 'docx' )
        $dataFile = "word/document.xml";
    else
        $dataFile = "content.xml";
    $zip = new ZipArchive;
    if ( true === $zip->open( $filename ) ) {
        if ( ( $index = $zip->locateName( $dataFile ) ) !== false ) {
            $text = $zip->getFromIndex( $index );
            $zip->close();
            $xml = new DOMDocument();
            $xml->loadXML( $text );
            $ret = $xml->saveHTML();
            // The search string was lost in the source; marking the end of
            // each paragraph with a trailing space is an assumption.
            $ret = str_replace( "</w:p>", "</w:p> ", $ret );
            // Turn every run (w:r) into a styled span
            preg_match_all( "'<w:r(.*?)\<\/w:r\>'", $ret, $get, PREG_OFFSET_CAPTURE );
            foreach ( $get[0] as $node_key => $node )
                $ret = str_replace( $node[0], nodeToHtml( $node[0] ), $ret );
            // Split into paragraphs. The tag names in this pattern were lost
            // in the source; matching w:p elements is an assumption.
            preg_match_all( "'\<w:p[ >](.*?)\<\/w:p\> '", $ret, $p );
            $data = array();
            foreach ( $p[0] as $key => $paragraph ) {
                $data[ $key ] = '';
                // Keep only the spans of each paragraph
                preg_match_all( "'\<span(.*?)\>(.*?)\<\/span\>'", $paragraph, $spans );
                foreach ( $spans[0] as $span )
                    $data[ $key ] .= $span;
            }
            return $data;
        }
        $zip->close();
    }
    return 'File not found';
}
For some reason, though, Word often puts individual words (sometimes even single letters) into separate span tags. So, before returning the array of paragraphs, we can strip the redundant tags from it:
foreach ( $data as $key => $p ) {
    $data[$key] = '';
    preg_match_all( "'\<span(.*?)\>(.*?)\<\/span\>'", $p, $spans );
    $curr_format = '';
    $curr_text = '';
    foreach ( $spans[1] as $span_key => $span_format ) {
        if ( $span_format == $curr_format ) {
            // Same formatting as the previous span: keep collecting text
            $curr_text .= $spans[2][$span_key];
        } else {
            // Formatting changed: write out what has been collected so far
            if ( $curr_format != '' && $curr_text != '' )
                $data[$key] .= '<span '.$curr_format.'>'.$curr_text.'</span>';
            $curr_text = $spans[2][$span_key];
            $curr_format = $span_format;
        }
    }
    // Write out the last group, unless the paragraph had no text at all
    if ( $curr_text != '' )
        $data[$key] .= '<span '.$curr_format.'>'.$curr_text.'</span>';
}
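To try the merging step on its own, here is a minimal sketch that wraps the same idea in a standalone function and runs it on one hand-written paragraph. The function name mergeAdjacentSpans and the sample string are made up for this illustration and are not part of the code above; the sample assumes spans shaped like the ones nodeToHtml produces.

```php
<?php
// Hypothetical helper (not part of the original code): merges consecutive
// <span> elements of one paragraph when their attribute strings are identical.
function mergeAdjacentSpans( $paragraph ) {
    preg_match_all( '/<span(.*?)>(.*?)<\/span>/', $paragraph, $spans );
    $result = '';
    $curr_format = null;
    $curr_text = '';
    foreach ( $spans[1] as $i => $format ) {
        if ( $format === $curr_format ) {
            // Same formatting as the previous span: keep collecting text
            $curr_text .= $spans[2][$i];
        } else {
            // Formatting changed: flush the group collected so far
            if ( $curr_format !== null )
                $result .= '<span' . $curr_format . '>' . $curr_text . '</span>';
            $curr_format = $format;
            $curr_text = $spans[2][$i];
        }
    }
    // Flush the last group, if the paragraph contained any span at all
    if ( $curr_format !== null )
        $result .= '<span' . $curr_format . '>' . $curr_text . '</span>';
    return $result;
}

// Sample input: "Hello " is split into two spans with the same bold style,
// followed by one italic span
$paragraph = '<span style="font-weight: bold; " >Hel</span>'
           . '<span style="font-weight: bold; " >lo </span>'
           . '<span style="font-style: italic; " >world</span>';

// The two bold spans should come out as one, the italic span stays separate
echo mergeAdjacentSpans( $paragraph ), "\n";
```

Spans are compared by their full attribute string, so two spans only merge when their style attributes are written identically, character for character.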