[geeklog-devel] COM_makeClickableLinks

Sami Barakat furiousdog at gmail.com
Wed Jul 30 09:17:09 EDT 2008


Hi,

I think I've got it now, although it's not a complete solution:

function COM_makeClickableLinks( $text )
{
    $text = preg_replace(
'/([^"]?)(((ht|f)tps?):(\/\/)|(www\.))+((?=([^\s]+)&nbsp;))?(\8|[a-z0-9%&_\-\+,;=:@~#\/.\?\[\]]+)/is',
'\\1<a href="http://\\6\\9">\\6\\9</a>', $text );
    return $text;
}

It seems to work well with the following strings:

normal link http://www.url.com
normal link with early quote http://www.url.com/folder"stuff
link with &nbsp; and quotes "http://www.url.com&nbsp;"
www.url.com/ps&nbsp;
complicated link www.sub.url.com/folder/index.php?id=foo&user=bar&nbsp;

It still fails, however, on these strings:

link with two &nbsp; www.url.com/ps&nbsp;&nbsp;
link with early quote and &nbsp; "http://www.url.com/folder"stuff&nbsp;

The results for the two failed strings are:

link with two &nbsp; <a
href="http://www.url.com/ps&nbsp;">www.url.com/ps&nbsp;</a>&nbsp;
link with early quote and &nbsp; "<a
href="http://www.url.com/folder"stuff">www.url.com/folder"stuff</a>&nbsp;

The second string could probably be fixed by replacing this part of
the regular expression, '[^\s]+', with this:
'[a-z0-9%&_\-\+,;=:@~#\/.\?\[\]]+'
But regular expressions are really more helpful when validating
strings or trying to find substrings in complicated strings; they are
not really made to exclude parts of a string. So it might be more
effective and less complicated to run through the expression twice:
the first time matching urls with &nbsp; on the end, and the second
time without.
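
A minimal sketch of one alternative to running the expression twice,
under stated assumptions: match each url greedily in a single pass and
then strip any trailing &nbsp; entities in a callback, so the regex
never has to exclude anything. The function name
makeClickableLinksSketch is hypothetical and not part of Geeklog, the
link text is the full url rather than \\6, and the closure assumes
PHP 5.3 or later.

```php
<?php
// Hypothetical helper, not part of Geeklog: link URLs in one greedy pass,
// then move any trailing "&nbsp;" entities outside the generated anchor.
function makeClickableLinksSketch($text)
{
    // Group 1: optional scheme (http, https, ftp, ftps).
    // Group 2: the rest of the URL, using the character class from the thread.
    // A bare "www." host is accepted through the lookahead in the second branch.
    $pattern = '/(?:((?:ht|f)tps?):\/\/|(?=www\.))'
             . '([a-z0-9%&_\-\+,;=:@~#\/.\?\[\]]+)/i';

    return preg_replace_callback($pattern, function ($m) {
        $scheme   = ($m[1] !== '') ? strtolower($m[1]) : 'http';
        $rest     = $m[2];
        $trailing = '';

        // The class allows "&", ";" and letters, so "&nbsp;" is swallowed by
        // the greedy match; peel each trailing entity off again here.
        while (strtolower((string) substr($rest, -6)) === '&nbsp;') {
            $rest      = (string) substr($rest, 0, -6);
            $trailing .= '&nbsp;';
        }

        if ($rest === '') {
            return $m[0]; // nothing but entities after the scheme: leave as is
        }

        $url = $scheme . '://' . $rest;
        return '<a href="' . $url . '">' . $url . '</a>' . $trailing;
    }, $text);
}

echo makeClickableLinksSketch('link with two &nbsp; www.url.com/ps&nbsp;&nbsp;'), "\n";
```

For the "link with two &nbsp;" string, this puts the anchor around
www.url.com/ps and leaves both entities after the closing tag. The
text is assumed to be HTML already, so the url is not escaped again.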

Hope this helps
Sami

2008/7/29 Sami Barakat <furiousdog at gmail.com>:
> Hey,
>
> I have tried looking into this and I have come up with a partial
> solution. From my understanding the problem is when a url has a &nbsp;
> at the end which is getting parsed along with the url. I ask because I
> think Gmail has filtered out some of them. Anyway the following regex
>
> ([^"]?)(((ht|f)tps?):(\/\/)|www\.)([a-z0-9%&_\-\+,;=:@~#\/.\?\[\]]+)(?<![&nbsp;])
>
> Seems to work fairly well. Here is the test code that I am using.
>
> echo '<pre>';
> $string = "normal link http://www.url.com PASS\n";
> echo htmlentities(COM_makeClickableLinks($string));
> $string = "link with &nbsp; and quotes \"http://www.url.com&nbsp;\" PASS\n";
> echo htmlentities(COM_makeClickableLinks($string));
> $string = "complicated link \"www.sub.url.com/folder/index.php?id=foo&user=bar&nbsp;\" PASS\n";
> echo htmlentities(COM_makeClickableLinks($string));
> $string = "problem link \"www.url.com/words&nbsp;\" FAIL\n";
> echo htmlentities(COM_makeClickableLinks($string));
> echo '</pre>';
>
> This produces
>
> normal link <a href="http://www.url.com">www.url.com</a> PASS
> link with &nbsp; and quotes "<a
> href="http://www.url.com">www.url.com</a>&nbsp;" PASS
> complicated link "<a
> href="http://sub.url.com/folder/index.php?id=foo&user=bar">sub.url.com/folder/index.php?id=foo&user=bar</a>&nbsp;"
> PASS
> problem link "<a href="http://url.com/word">url.com/word</a>s&nbsp;" FAIL
>
> As you can see, the first 3 work. The problem occurs when a url ends
> with any of the characters '&', 'n', 'b', 's', 'p' or ';'.
>
> So www.url.com/ps would return <a href="http://url.com/">url.com/</a>ps
>
> This is due to the last bit of the regex, "(?<![&nbsp;])", which is a
> character class and so rejects each of those characters individually.
> I tried just doing (?<!&nbsp;) but it does not work at all because the
> previous statement is being too greedy. There is also an issue with
> the www. being removed, but that's not too much of a problem at the
> moment.
>
> Also the COM_makeClickableLinks function can be simplified by removing
> the str_replace statement, resulting in simply this:
>
> function COM_makeClickableLinks( $text )
> {
>    $text = preg_replace(
> '/([^"]?)(((ht|f)tps?):(\/\/)|www\.)([a-z0-9%&_\-\+,;=:@~#\/.\?\[\]]+)(?<![&nbsp;])/is',
> '\\1<a href="http://\\6">\\6</a>', $text );
>    return $text;
> }
>
>
> In the original regex I was unsure why the "(\/|[+0-9a-z])" part was
> included. I don't think it's necessary so I took it out; maybe there was
> a particular case that required it which I'm overlooking.
>
> Anyhow, I will have another crack at it later on. It really is a tough
> one, but this is as far as I've got so far.
>
> Sami
>
> 2008/7/28 Michael Jervis <mjervis at gmail.com>:
>> All (especially Sami!),
>>
>> There is a bug in the subject function. If it finds
>> "http://www.url.com" we end up with &nbsp<a
>> href=";http://www.url.com&nbsp">;http://www.url.com&nbsp</a>;
>>
>> Which isn't good.
>>
>> The original regexp in COM_MakeClickableLinks is:
>>
>> /([^"]?)((((ht|f)tps?):(\/\/)|www\.)[a-z0-9%&_\-\+,;=:@~#\/.\?\[\]]+(\/|[+0-9a-z]))/is
>>
>> I think the first match ([^"]?) is spurious, it matches anything other
>> than  " before a link. So bhttp://www.foo.com" matches, but
>> "http://www.foo.com doesn't.
>>
>> So that gives:
>> /((((ht|f)tps?):(\/\/)|www\.)[a-z0-9%&_\-\+,;=:@~#\/.\?\[\]]+(\/|[+0-9a-z]))/is
>>
>> Resulting in:
>>  <a href="http:///www.url.com&nbsp">http://www.url.com&nbsp</a>
>>
>> So, need to add an "ignore trailing &nbsp;" bit to the clause. Closest
>> I can get is:
>> ((((ht|f)tps?):(\/\/)|www\.)[a-z0-9%&_\-\+,;=:@~#\/.\?\[\]]+(\/|[+0-9a-z]))(?=&nbsp;)
>>
>> Which results in:
>>  <a href="http:///www.url.com">http://www.url.com</a>
>> However, unless there were quotes round the link, it won't match! So
>> "http://www.foo.com" matches and is correctly processed, but
>> http://www.foo.com is not matched.
>>
>> My head is now hurt. Any suggestions?
>>
>> --
>> Michael Jervis
>> mjervis at gmail.com
>> 504B03041400000008008F846431E3543A820800000006000000060000007765
>> 62676F642B4F4D4ACF4F0100504B010214001400000008008F846431E3543A82
>> 0800000006000000060000000000000000002000000000000000776562676F64
>> 504B05060000000001000100340000002C0000000000
>>
>


