
Community plumbing


IlohaBlog

PHP tip: How to decode HTML entities on a web page

Technologies: PHP 4.3.0+, UTF-8

HTML entities encode special characters and symbols, such as &euro; for €, or &copy; for ©. When building a PHP search engine or web page analysis tool, HTML entities within a page must be decoded into single characters to get clean, parsable text. PHP's standard html_entity_decode() function will do the job, but you must use a rich character encoding, such as UTF-8, and multibyte character strings. This tip shows how. This article is both an independent article and part of an article series on How to extract keywords from a web page.

Code

HTML's character reference syntax enables a web page to use special characters that aren't supported by the page's normal character encoding. There are three forms for an HTML character reference: Name. The named form of a character reference is called an HTML entity. To do text processing on a web page, you need to convert HTML entities and numeric character references into normal characters.

Using html_entity_decode
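As a rough illustration of the decoding step described in this excerpt, here is a minimal sketch. It is not the article's own code: the function name `decode_entities` and the sample string are hypothetical, and it assumes a current PHP version where html_entity_decode() accepts 'UTF-8' as its third argument.

```php
<?php
// Hypothetical helper, not taken from the original article.
// Decodes named entities and numeric character references into UTF-8.
function decode_entities($text)
{
    // ENT_QUOTES also decodes quote entities such as &quot; and &#039;.
    // 'UTF-8' lets the result hold characters outside Latin-1, such as the euro sign.
    return html_entity_decode($text, ENT_QUOTES, 'UTF-8');
}

// Illustrative input only.
echo decode_entities('Price: &euro;10, &copy; 2009'), "\n";
// prints: Price: €10, © 2009
```

Passing the encoding explicitly, rather than relying on the default, keeps the result the same across PHP versions whose default charset differs.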

PHP tip: How to strip HTML tags, scripts, and styles from a web page

Technologies: PHP 4.3+, UTF-8

The HTML tags on a web page must be stripped away to get clean text for a PHP search engine, keyword extractor, or some other page analysis tool. PHP's standard strip_tags() function will do part of the job, but you need to strip out styles, scripts, embedded objects, and other unwanted page code first. This tip shows how. This article is both an independent article and part of an article series on How to extract keywords from a web page.

Code

PHP's handy strip_tags() function removes HTML tags that look like <word...>, <word.../>, or </word>. To fix these problems, you need to process certain tags first before using strip_tags(). Remove HTML tag pairs and enclosed content for styles, scripts, embedded objects, etc. Once this is done, call strip_tags() to remove the remaining tags. Below is sample code to do this. Downloads: strip_html_tags.zip.

/**
 * Remove HTML tags, including invisible text such as style and
 * script code, and embedded objects.

Example
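The two-step approach described in this excerpt (remove tag pairs with their enclosed content, then call strip_tags()) might look roughly like the sketch below. It is not the code from strip_html_tags.zip: the function name `strip_html_tags_sketch` and the list of tags are assumptions chosen for illustration, and regular expressions are only an approximation of real HTML parsing.

```php
<?php
// Hypothetical sketch, not the article's downloadable code.
function strip_html_tags_sketch($html)
{
    // Elements whose content is not visible page text (assumed list).
    $invisible = array('head', 'style', 'script', 'object', 'embed', 'applet', 'noscript');

    foreach ($invisible as $tag) {
        // Match the opening tag, everything inside it, and the closing tag.
        // i = case-insensitive, s = let "." match newlines.
        $pattern = '@<' . $tag . '\b[^>]*>.*?</' . $tag . '\s*>@is';
        // Replace with a space so neighbouring words do not run together.
        $html = preg_replace($pattern, ' ', $html);
    }

    // Remove whatever tags remain, keeping their text content.
    return trim(strip_tags($html));
}

// Illustrative input only.
$page = '<html><head><title>T</title><style>p{color:red}</style></head>'
      . '<body><p>Hello <b>world</b></p><script>alert("x");</script></body></html>';
echo strip_html_tags_sketch($page), "\n";
// prints: Hello world
```

Removing the pairs first matters because strip_tags() on its own drops only the tags and would leave the CSS and JavaScript between them in the output text.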

Gollum, the Wikipedia Browser

Dino Open Laboratory (ディノオープンラボラトリ): technical notes by employees of Dino Inc.
