
html_entity_decode()

You might wonder why trim(html_entity_decode('&nbsp;')); doesn't reduce the string to an empty string. That's because the '&nbsp;' entity is not ASCII code 32 (which is stripped by trim()) but character code 160 (0xA0) in the ISO 8859-1 character set (the default encoding before PHP 5.4).

You can use str_replace() to replace character #160 with a space:

<?php
$a = html_entity_decode('>&nbsp;<');
echo 'before ' . $a . PHP_EOL;
$a = str_replace("\xA0", ' ', $a);
echo ' after ' . $a . PHP_EOL;

If you are working with UTF-8 encoded strings, you should replace \xC2\xA0 instead:

$a = html_entity_decode('>&nbsp;<', ENT_QUOTES, 'UTF-8');
echo 'before ' . $a . PHP_EOL;
$a = str_replace("\xC2\xA0", ' ', $a);
echo ' after ' . $a . PHP_EOL;

I've been struggling a lot with the data I retrieve from a contenteditable element; all rtrim and preg_replace attempts failed. I also tried filtering it with JavaScript before sending it with $.ajax(), which failed too. So now I do str_replace("&nbsp;", ' ', $value) and then preg_replace('/\s+$/','',$value). It works, though it's not too elegant. If someone has suggestions, please tell me.

php - Does html_entity_decode replaces &nbsp; also? If not how to repl...

php html string whitespace html-entities

html_entity_decode('&#x30a8;', 0, 'UTF-8');

This works too. However the json_decode() solution is a lot faster (around 50 times).

This solution works for me. Thanks for the answer.

Unicode character in PHP string - Stack Overflow

php unicode

html_entity_decode

This is more correct, because if we just replace the entities with an empty string we get an incorrect result: all non-breaking spaces are collapsed.

strip_tags
preg_replace

It might be nice to replace the '+' with {2,8} or something similar. This limits the chance of replacing entire sentences when an unencoded '&' is present.

$Content = preg_replace("/&#?[a-z0-9]{2,8};/i","",$Content);

Thanks, added your comment and an alternative version to the answer.

but why would one want to remove those characters?

Those character entities are not valid in RSS/Atom/XML, so you can do two things: remove them, or replace them with their numeric equivalents.

A possible case for having to remove them is when stripping HTML for sending it as an alternate, plain-text body along in an email.
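For the plain-text e-mail case, decoding the entities is often preferable to deleting them, because the characters they stand for survive. The following is only a sketch under that assumption; the helper name html_to_plain_text() is made up for illustration and the input is assumed to be UTF-8.

```php
<?php
// Hypothetical helper: reduce an HTML fragment to plain text for an e-mail body.
function html_to_plain_text(string $html): string
{
    // Drop the markup first, then turn the remaining entities into characters.
    $text = strip_tags($html);
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // &nbsp; decodes to U+00A0 (bytes C2 A0 in UTF-8); use a normal space instead.
    $text = str_replace("\xC2\xA0", ' ', $text);

    return trim($text);
}

echo html_to_plain_text('<p>Tom &amp; Jerry&nbsp;&copy; 2013</p>'), PHP_EOL;
```

If the entities really must be removed rather than decoded, the preg_replace() call above does that instead.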

php - How to remove html special chars? - Stack Overflow

php html-encode

iconv('UTF-8', 'windows-1252', html_entity_decode($str));

html_entity_decode() decodes the HTML entities, but for some reason you must then convert the result from UTF-8 to windows-1252 with iconv(). I suppose this is an FPDF secret, because in the normal browser view it is displayed correctly.

Use it in any PDF output you want to make, for example: $pdf->Write(5, iconv('UTF-8', 'windows-1252', html_entity_decode($str)));

php - html_entity_decode in FPDF(using tFPDF extention) - Stack Overfl...

php pdf fpdf html-encode

I started wondering what behavior these constants have when I saw these constants at the htmlspecialchars page. The documentation was rubbish, so I started digging in the source code of PHP.

Basically, these constants affect whether certain entities are encoded or not (or decoded for html_entity_decode). The most obvious effect is whether the apostrophe (') is encoded to &#039; (for ENT_HTML401) or &apos; (for others). Similarly, it determines whether &apos; is decoded or not when using html_entity_decode. (&#039; is always decoded.)

All usages can be found in ext/standard/html.c and its header file. From ext/standard/html.h:

#define ENT_HTML_DOC_HTML401            0
#define ENT_HTML_DOC_XML1                       16
#define ENT_HTML_DOC_XHTML                      32
#define ENT_HTML_DOC_HTML5                      (16|32)

(replace ENT_HTML_DOC_ by ENT_ to get their PHP constant names)
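A quick way to see the apostrophe difference for yourself is a small script like the one below. It is just a sketch: the variable names are arbitrary, and ENT_QUOTES is added so that single quotes are handled at all.

```php
<?php
// The document-type flags as exposed to PHP (values taken from html.h above).
var_dump(ENT_HTML401, ENT_XML1, ENT_XHTML, ENT_HTML5);

// Encoding: the apostrophe becomes a numeric entity for HTML 4.01 and a named one for HTML5.
$enc401 = htmlspecialchars("'", ENT_QUOTES | ENT_HTML401, 'UTF-8');
$enc5   = htmlspecialchars("'", ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Decoding: HTML 4.01 has no &apos; entity, so it is left alone there.
$dec401 = html_entity_decode('&apos;', ENT_QUOTES | ENT_HTML401, 'UTF-8');
$dec5   = html_entity_decode('&apos;', ENT_QUOTES | ENT_HTML5, 'UTF-8');

var_dump($enc401, $enc5, $dec401, $dec5);
```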

I started looking for all occurrences of these constants, and can share the following on the behaviour of the ENT_* constants:

  • It affects which numeric entities will be decoded or not. For example, gets decoded to an unreadable/invalid character for ENT_HTML401, and ENT_XHTML and ENT_XML1. For ENT_HTML5 however, this is considered an invalid character and hence it stays . (C function unicode_cp_is_allowed)
  • With ENT_SUBSTITUTE enabled, invalid code unit sequences for a specified character set are replaced with . (does not depend on document type!)
  • With ENT_DISALLOWED enabled, code points that are disallowed for the specified document type are replaced with the same replacement character.
  • With ENT_IGNORE, the same invalid code unit sequences from ENT_SUBSTITUTE are removed and no replacement is done (depends on choice of "document type", e.g. ENT_HTML5)
ENT_HTML5
  • ENT_XHTML shares the entity map with ENT_HTML401. The only difference is that &apos; will be converted to an apostrophe with ENT_XHTML while ENT_HTML401 does not convert it.
  • ENT_HTML401 and ENT_XHTML use exactly the same entity map (minus the difference from the previous point). ENT_HTML5 uses its own map. Others (currently ENT_XML1) have a very limited decoding map (&lt;, &gt;, &amp;, &quot;, and their numeric equivalents). (see C function unescape_inverse_map)
  • Note for the previous point: when only a few entities must be escaped (think of htmlspecialchars), all entity maps will use the same one as ENT_XML1, except for ENT_HTML401. That one will not use &apos;, but &#039;.

That covers almost everything. I am not going to list all entity differences, instead I would like to point at https://github.com/php/php-src/tree/php-5.4.11/ext/standard/html_tables for some text files that contain the mappings for each type.

When using htmlspecialchars with ENT_COMPAT (default) or ENT_NOQUOTES, it does not matter which one you pick (see below). I saw some answers here on SO that boil down to this:

<input value="<?php echo htmlspecialchars($str, ENT_HTML5);?>" >

This is insecure. It overrides the default value ENT_HTML401 | ENT_COMPAT, with the result that HTML5 entities are used, but also that quotes are not escaped anymore! In addition, it is redundant code: the entities that have to be encoded by htmlspecialchars are the same for ENT_HTML401, ENT_HTML5, etc.

Just use ENT_COMPAT or ENT_QUOTES instead. The latter also works when you use apostrophes for attributes (value='foo'). If you only have two arguments for htmlspecialchars, do not include the argument at all since it is the default (ENT_HTML401 is 0, remember?).
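To make the quoting problem concrete, here is a small sketch comparing the two calls. The sample string and variable names are made up; the point is only that a document-type flag on its own carries no quote flag.

```php
<?php
$value = 'say "hi" & it\'s <b>';

// ENT_HTML5 alone has no quote flag, so it behaves like ENT_NOQUOTES:
// the double quotes pass through and could break out of value="...".
$unsafe = htmlspecialchars($value, ENT_HTML5, 'UTF-8');

// Combining a quote flag with the document type escapes both kinds of quotes.
$safe = htmlspecialchars($value, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $unsafe, PHP_EOL; // say "hi" &amp; it's &lt;b&gt;
echo $safe, PHP_EOL;   // say &quot;hi&quot; &amp; it&apos;s &lt;b&gt;
```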

When you want to print something on the page (between tags, not in attributes), it does not matter at all which one you pick, as they all have the same effect. It is even sufficient to use ENT_NOQUOTES | ENT_HTML401, which equals the numeric value 0.

See also below, about ENT_SUBSTITUTE and ENT_DISALLOWED.

If your text editor or database is so crappy that you cannot include non-US-ASCII characters (e.g. UTF-8), you can use htmlentities. Otherwise, save some bytes and use htmlspecialchars instead (see above).

Whether you need to use ENT_HTML401, ENT_HTML5 or something else depends on how your page is served. When you have a HTML5 page (<!doctype html>), use ENT_HTML5. XHTML or XML? Use the corresponding ENT_XHTML or ENT_XML1. With no doctype or plain ol' HTML4, use ENT_HTML401 (which is the default when omitted).

By default, byte sequences that are invalid for the given character set are removed. To have a in place of an invalid byte sequence, specify ENT_SUBSTITUTE. (note that is shown for non-UTF-8 charsets). When you specify ENT_IGNORE though, these characters are not shown even if you specified ENT_SUBSTITUTE.

Invalid characters for a document type are substituted by the same replacement character (or its entity) above when ENT_DISALLOWED is specified. This happens regardless of having ENT_IGNORE set (which has nothing to do with invalid chars for doctypes).
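The handling of invalid byte sequences can be checked with a short script like this one. It is a sketch rather than a reference: the input string is made up, and as far as I can tell the call without either flag returns an empty string in current PHP versions.

```php
<?php
$bad = "a\xFFb"; // the byte 0xFF can never occur in valid UTF-8

// No recovery flag: the whole result is discarded.
$default = htmlspecialchars($bad, ENT_COMPAT, 'UTF-8');

// ENT_SUBSTITUTE: the invalid byte becomes U+FFFD (EF BF BD in UTF-8).
$substituted = htmlspecialchars($bad, ENT_COMPAT | ENT_SUBSTITUTE, 'UTF-8');

// ENT_IGNORE: the invalid byte is silently dropped.
$ignored = htmlspecialchars($bad, ENT_COMPAT | ENT_IGNORE, 'UTF-8');

var_dump($default, bin2hex($substituted), $ignored);
```

Silently dropping bytes can change the meaning of the text, which is why ENT_SUBSTITUTE is usually the safer choice of the two.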

Please note that although the documentation discourages the use of ENT_IGNORE because of its security implications (php.net/manual/en/function.htmlspecialchars.php), the other constants are only available starting from PHP 5.4.0, whereas ENT_IGNORE already exists in PHP 5.3.0.

php - What do the ENT_HTML5, ENT_HTML401, ... modifiers on html_entity...

php html html-entities htmlspecialchars

&mdash; maps to a UTF-8 character (the em dash), so you need to specify UTF-8 as the character encoding:

$converted = html_entity_decode($string, ENT_COMPAT, 'UTF-8');
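A minimal sketch of the difference the third argument makes; the sample string is invented for illustration. With a target charset that has no em dash, the entity cannot be represented and is left as it is.

```php
<?php
$string = 'Hello &mdash; world';

// UTF-8 can represent U+2014, so the entity becomes the bytes E2 80 94.
$utf8 = html_entity_decode($string, ENT_COMPAT, 'UTF-8');

// ISO-8859-1 has no em dash, so the entity stays untouched.
$latin1 = html_entity_decode($string, ENT_COMPAT, 'ISO-8859-1');

echo bin2hex(html_entity_decode('&mdash;', ENT_COMPAT, 'UTF-8')), PHP_EOL; // e28094
echo $latin1, PHP_EOL; // Hello &mdash; world
```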

I still get the entity when I view source on that one...?

@mootymoots: I tested it, I got the raw character instead of the entity. Wonder what else could be causing it... the HTML document's encoding perhaps?

it's converted on the page - but not in the source...? Looking in chrome

Just to add, the PHP is sending it via json_encode to Apple, it's not actually needed to be viewed in browser, it's just helping me debug. It comes through as the entity on the device.

Scratch that comment, you're using APNS. So that means your alert view is displaying the entity as well, right?

html entities - html_entity_decode problem in PHP? - Stack Overflow

php html-entities html-encode

My version using regular expressions:

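$string = '<code> &lt;div&gt; blabla &lt;/div&gt; </code>';
// The /e modifier was removed in PHP 7, so a callback does the work instead.
$new_string = preg_replace_callback(
    '/(.*?)(<.*?>|$)/s',
    function ($m) {
        // $m[1] is the text before a tag (decode it), $m[2] is the tag itself (encode it).
        return html_entity_decode($m[1]) . htmlentities($m[2]);
    },
    $string
);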

thank you. I don't know which method should I go with, yours or adlawson's :)

@Alex You are welcome! Both methods have their own side effects. You need to test which works best in your case.

php - Inverse htmlentities / html_entity_decode - Stack Overflow

php string html-entities html-encode

function html_entity_decode(s) {
  // The textarea is never attached to the document, so there is nothing to remove afterwards.
  var t = document.createElement('textarea');
  t.innerHTML = s;
  return t.value;
}

I have no document object and this should not rely on it, since I'm using JSM XUL.

You can use the code (phpjs.org/functions/htmlentities:425) which relies on a table lookup. You need to reverse the lookup to decode the entities.

Decode HTML entities in JavaScript? - Stack Overflow

javascript html html-entities

iconv('UTF-8', 'windows-1252', html_entity_decode($str));

unicode - FPDF utf-8 encoding (HOW-TO) - Stack Overflow

unicode utf-8 character-encoding fpdf

html_entity_decode('&#1576;&#1575;&#1582;', ENT_QUOTES, 'UTF-8');

When you go from an entity such as &#1576; to the character it represents, that's called decoding. Doing the opposite is called encoding.

As for replacing only characters from &#1563; to &#1785;, maybe try something like this.

<?php

// Random set of entities, two are outside the 1563 - 1785 range.
$entities = '&#1563;&#1564;&#60;&#1604;&#241;&#1784;&#1785;';

// Matches entities from 1500 to 1799, not perfect, I know.
preg_match_all('/&#1[5-7][0-9]{2};/', $entities, $matches);

$entityRegex = array(); // Will hold the entity code regular expression.
$decodedCharacters = array(); // Will hold the decoded characters.

foreach ($matches[0] as $entity)
{
    // Convert the entity to human-readable character.
    $unicodeCharacter = html_entity_decode($entity, ENT_QUOTES, 'UTF-8');

    array_push($entityRegex, "/$entity/");
    array_push($decodedCharacters, $unicodeCharacter);
}

// Replace all of the matched entities with the human-readable character.
$replaced = preg_replace($entityRegex, $decodedCharacters, $entities);

?>

That's as close as I can get to solving this. Hopefully, this helps a little. It's 5:00am where I am now, so I'm off to sleep! :)

but the problem is there are other html characters in the same html string that I do not want to decode. How can I skip them?

Ooh. That's tricky. The first thing that comes to mind is to use a regular expression... Does anyone else have a better idea?
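One way to skip the other entities is to let a callback decide per match. The sketch below assumes the wanted entities are decimal numeric ones and that the range limits are known; decode_entity_range() is a made-up name, not an existing PHP function.

```php
<?php
// Hypothetical helper: decode only decimal numeric entities whose code point
// lies between $min and $max (inclusive); everything else is left untouched.
function decode_entity_range(string $html, int $min, int $max): string
{
    return preg_replace_callback(
        '/&#(\d+);/',
        function (array $m) use ($min, $max): string {
            $code = (int) $m[1];
            if ($code < $min || $code > $max) {
                return $m[0]; // outside the range: keep the entity as it is
            }
            return html_entity_decode($m[0], ENT_QUOTES, 'UTF-8');
        },
        $html
    );
}

$entities = '&#1563;&#60;&#1604;&#241;&amp;';
echo decode_entity_range($entities, 1563, 1785), PHP_EOL;
```

Named entities such as &amp; never match the pattern, so they pass through unchanged as well.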

php - How to Convert Html Codes to Relevant Unicode Characters - Stack...

php unicode

html_entity_decode($string, ENT_QUOTES, 'UTF-8')

That resolves the error but brings up & errors now. If I change & to &amp; it fixes the error, but how can I decode everything properly?

@Lyon: Maybe you should show us what you're actually doing.

echo '<?xml version="1.0" encoding="UTF-8" ?>'."\n";

php output xml produces parse error "’" - Stack Overflow

php xml xml-parsing

Don't use UTF-8 encoding. Standard FPDF fonts use ISO-8859-1 or Windows-1252. It is possible to perform a conversion to ISO-8859-1 with utf8_decode():

$str = utf8_decode($str);

But some characters such as Euro won't be translated correctly. If the iconv extension is available, the right way to do it is the following:

$str = iconv('UTF-8', 'windows-1252', $str);

So, as emfi suggests, a combination of iconv() and html_entity_decode() PHP functions is the solution to your question:

$str = iconv('UTF-8', 'windows-1252', html_entity_decode("&copy;"));
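Wrapped up as a helper, the combination could look like the sketch below. The function name to_fpdf_text() is invented for illustration, and the code assumes the iconv extension is loaded and that every decoded character exists in Windows-1252.

```php
<?php
// Hypothetical helper: prepare entity-encoded text for FPDF's standard fonts.
function to_fpdf_text(string $html): string
{
    // Decode to UTF-8 first so every entity has a well-defined character.
    $utf8 = html_entity_decode($html, ENT_QUOTES, 'UTF-8');

    // Then convert to the single-byte encoding the standard fonts expect.
    $converted = iconv('UTF-8', 'windows-1252', $utf8);
    if ($converted === false) {
        throw new RuntimeException('Text contains characters outside Windows-1252');
    }
    return $converted;
}

// The copyright sign is byte A9 and the Euro sign is byte 80 in Windows-1252.
echo bin2hex(to_fpdf_text('&copy; 5 &euro;')), PHP_EOL; // a920352080
```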

php - html_entity_decode in FPDF(using tFPDF extention) - Stack Overfl...

php pdf fpdf html-encode

You can use html_entity_decode($string, ENT_QUOTES, 'UTF-8').

php - struggling with special characters (html_entity_decode, iconv, a...

php mysql character-encoding

You may want to take a look at htmlentities() and html_entity_decode().

$orig = "I'll \"walk\" the <b>dog</b> now";

$a = htmlentities($orig);

$b = html_entity_decode($a);

echo $a; // I'll &quot;walk&quot; the &lt;b&gt;dog&lt;/b&gt; now

echo $b; // I'll "walk" the <b>dog</b> now

This html_entity_decode($a); is doing the trick.

php - How to remove html special chars? - Stack Overflow

php html-encode

Carefully read the Notes, maybe that's the issue you are facing:

You might wonder why trim(html_entity_decode('&nbsp;')); doesn't reduce the string to an empty string. That's because the '&nbsp;' entity is not ASCII code 32 (which is stripped by trim()) but character code 160 (0xA0) in the ISO 8859-1 character set (the default encoding before PHP 5.4).

php - Does html_entity_decode replaces &nbsp; also? If not how to repl...

php html string whitespace html-entities

html_entity_decode does convert &nbsp; to a space, just not a "simple" one (ASCII 32) but a non-breaking space (character 160), as this is the definition of &nbsp;.

If you need to convert it to ASCII 32, you still need a str_replace() or, depending on your situation, a preg_replace('/\s+/', ' ', $string) to convert all kinds of whitespace to simple spaces.
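For UTF-8 input, the two steps can be combined roughly as in the sketch below. normalize_spaces() is a made-up helper name; the character class names U+00A0 explicitly so the result does not depend on how \s is interpreted.

```php
<?php
// Hypothetical helper: decode entities, then collapse all runs of whitespace
// (including non-breaking spaces) into single ordinary spaces.
function normalize_spaces(string $html): string
{
    $text = html_entity_decode($html, ENT_QUOTES, 'UTF-8');

    // /u makes PCRE treat the subject as UTF-8; \x{00A0} is the no-break space.
    $text = preg_replace('/[\s\x{00A0}]+/u', ' ', $text) ?? $text;

    return trim($text);
}

var_dump(normalize_spaces('&nbsp; foo&nbsp;&nbsp;bar &nbsp;')); // string(7) "foo bar"

// For comparison: trim() alone leaves the decoded &nbsp; in place.
var_dump(bin2hex(trim(html_entity_decode('&nbsp;', ENT_QUOTES, 'UTF-8')))); // string(4) "c2a0"
```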

php - Does html_entity_decode replaces &nbsp; also? If not how to repl...

php html string whitespace html-entities

There isn't an existing function, but have a look at this. So far I've only tested it on your example, but this function should work on all htmlentities

function html_entity_invert($string) {
    $matches = $store = array();
    preg_match_all('/(&(#?\w){2,6};)/', $string, $matches, PREG_SET_ORDER);

    foreach ($matches as $i => $match) {
        $key = '__STORED_ENTITY_' . $i . '__';
        $store[$key] = html_entity_decode($match[0]);
        $string = str_replace($match[0], $key, $string);
    }

    return str_replace(array_keys($store), $store, htmlentities($string));
}

You can do the same inversion in one step using very similar code to mine.

I'm not sure how you would achieve the same result. The closest I've come is using return preg_replace('/(&(#?\w){2,6};)([^&;]*)/', html_entity_decode("$1") . htmlentities("$2"), $string);, but it doesn't work. It's much more simple to tackle each problem separately than with a single regex.

@adlawson Why do you think regex is much more difficult? By the way you use regex too :) The common problem about regex is readability. But in this case regex is short and even faster. For instance all your code can be rewritten as this: return preg_replace('/(.*?)(&(#?\w){2,6};|$)/se', 'htmlentities("$1").html_entity_decode("$2")', $string);

By the way, &(#?\w){2,6}; is not very good for matching an HTML entity, because it will match &ab#cd; but it will not match entities whose names are longer than six characters. I think &#?\w+; or something similar would be better.
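Since the /e modifier no longer exists in PHP 7 and later, the one-step variant from the comments can be written with preg_replace_callback() instead. This is only a sketch: html_entity_invert_cb() is a made-up name, and it uses the looser &#?\w+; pattern suggested above.

```php
<?php
// Hypothetical one-step inversion: encode the plain text, decode the entities.
function html_entity_invert_cb(string $string): string
{
    return preg_replace_callback(
        '/(.*?)(&#?\w+;|$)/s',
        function (array $m): string {
            // $m[1] is the text before an entity, $m[2] is the entity (or empty at the end).
            return htmlentities($m[1], ENT_QUOTES, 'UTF-8')
                . html_entity_decode($m[2], ENT_QUOTES, 'UTF-8');
        },
        $string
    );
}

echo html_entity_invert_cb('<code> &lt;div&gt; </code>'), PHP_EOL;
// &lt;code&gt; <div> &lt;/code&gt;
```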

php - Inverse htmlentities / html_entity_decode - Stack Overflow

php string html-entities html-encode

Injecting untrusted HTML into the page is dangerous as explained in How to decode HTML entities using jQuery?.

One alternative is to use a JavaScript-only implementation of PHP's html_entity_decode (from http://phpjs.org/functions/html_entity_decode:424). The example would then be something like:

var varTitle = html_entity_decode("Chris&apos; corner");

Actually, the current version of html_entity_decode doesn't handle &apos;.

javascript - HTML Entity Decode - Stack Overflow

javascript jquery html