[geeklog-devel] COM_makeClickableLinks

Michael Jervis mjervis at gmail.com
Sun Aug 3 12:30:56 EDT 2008


Cheers Sami,

It's looking a lot like we need to use a regexp to locate/identify the
URLs, and then PHP to parse them and turn them into links.

That gives us an opportunity to shorten very long URLs in the anchor
text (not in the href attribute), perhaps via TinyURL's API rather than
by truncation.
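
A rough sketch of that split, to make the idea concrete. Everything
named here (makeClickableLinksSketch, $maxLen) is hypothetical and not
the current COM_makeClickableLinks code. It assumes PHP 5.3+ for the
closure, assumes $text is already HTML-escaped, and uses plain
truncation as a stand-in for a URL-shortening service:

```php
<?php
// Hypothetical sketch; not the actual COM_makeClickableLinks().
function makeClickableLinksSketch($text, $maxLen = 40)
{
    // Step 1: the regexp only locates candidates (scheme:// or www. prefix).
    // The character class is the one used in this thread; '!' is the delimiter.
    $pattern = '!\b(?:(?:ht|f)tps?://|www\.)[a-z0-9%&_\-+,;=:@~#/.?\[\]]+!i';

    return preg_replace_callback($pattern, function ($m) use ($maxLen) {
        $url  = $m[0];
        $tail = '';

        // Step 2: plain PHP cuts the match at the first "&nbsp;" and hands
        // the remainder back to the text untouched.
        $pos = strpos($url, '&nbsp;');
        if ($pos !== false) {
            $tail = substr($url, $pos);
            $url  = substr($url, 0, $pos);
        }

        // Bare "www." links need a scheme in the href.
        $href = (stripos($url, 'www.') === 0) ? 'http://' . $url : $url;

        // Step 3: shorten only the visible text, never the href.
        // Truncation is an assumption here, standing in for a shortener.
        $label = $url;
        if (strlen($label) > $maxLen) {
            $label = substr($label, 0, $maxLen - 3) . '...';
        }

        return '<a href="' . $href . '">' . $label . '</a>' . $tail;
    }, $text);
}

echo makeClickableLinksSketch('link "www.url.com/ps&nbsp;" end'), "\n";
// prints: link "<a href="http://www.url.com/ps">www.url.com/ps</a>&nbsp;" end
```

This does not skip URLs that already sit inside an <a> tag, and text
glued on directly after the &nbsp; is left as it is, so treat it as a
starting point only.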

On Wed, Jul 30, 2008 at 14:17, Sami Barakat <furiousdog at gmail.com> wrote:
> Hi,
>
> I think I've got it now, although it's not a complete solution.
>
> function COM_makeClickableLinks( $text )
> {
>    $text = preg_replace(
> '/([^"]?)(((ht|f)tps?):(\/\/)|(www\.))+((?=([^\s]+)&nbsp;))?(\8|[a-z0-9%&_\-\+,;=:@~#\/.\?\[\]]+)/is',
> '\\1<a href="http://\\6\\9">\\6\\9</a>', $text );
>    return $text;
> }
>
> It seems to work well with the following strings:
>
> normal link http://www.url.com
> normal link with early quote http://www.url.com/folder"stuff
> link with &nbsp; and quotes "http://www.url.com&nbsp;"
> www.url.com/ps&nbsp;
> complicated link www.sub.url.com/folder/index.php?id=foo&user=bar&nbsp;
>
> it still fails however on these strings
>
> link with two &nbsp; www.url.com/ps&nbsp;&nbsp;
> link with early quote and &nbsp; "http://www.url.com/folder"stuff&nbsp;
>
> The results of the two failed strings are
>
> link with two &nbsp; <a
> href="http://www.url.com/ps&nbsp;">www.url.com/ps&nbsp;</a>&nbsp;
> link with early quote and &nbsp; "<a
> href="http://www.url.com/folder"stuff">www.url.com/folder"stuff</a>&nbsp;
>
> The second string could probably be fixed by replacing this part of
> the regular expression '[^\s]+' with this
> '[a-z0-9%&_\-\+,;=:@~#\/.\?\[\]]+'
> But really, regular expressions are more helpful for validating
> strings or finding substrings in complicated strings; they are
> not really made to exclude parts of a string. So it might be more
> effective and less complicated to run through the expression twice:
> the first time matching URLs with &nbsp; on the end and the second
> time without.
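
For what it's worth, the two-pass idea could look roughly like this.
This is only a sketch under my own assumptions: linkifyTwoPass and the
helper closure are made-up names, it needs PHP 5.3+ for closures, and
the second pass has to step over the anchors the first pass produced,
otherwise they would get linked twice:

```php
<?php
// Hypothetical two-pass sketch; function and variable names are made up.
function linkifyTwoPass($text)
{
    $scheme = '(?:(?:ht|f)tps?://|www\.)';
    $chars  = '[a-z0-9%&_\-+,;=:@~#/.?\[\]]';
    $wrap   = function ($url) {
        // Bare "www." links need a scheme in the href.
        $href = (stripos($url, 'www.') === 0) ? 'http://' . $url : $url;
        return '<a href="' . $href . '">' . $url . '</a>';
    };

    // Pass 1: only URLs that run straight into "&nbsp;". The lazy +? stops
    // at the first point where the lookahead sees the entity.
    $text = preg_replace_callback(
        '!\b' . $scheme . $chars . '+?(?=&nbsp;)!i',
        function ($m) use ($wrap) { return $wrap($m[0]); },
        $text
    );

    // Pass 2: everything else. Anchors produced by pass 1 are matched by the
    // first alternative and returned unchanged so they are not linked twice.
    return preg_replace_callback(
        '!<a\b[^>]*>.*?</a>|\b' . $scheme . $chars . '+!is',
        function ($m) use ($wrap) {
            return (stripos($m[0], '<a') === 0) ? $m[0] : $wrap($m[0]);
        },
        $text
    );
}

echo linkifyTwoPass('a www.url.com/ps&nbsp;&nbsp; b http://www.url.com c'), "\n";
// prints: a <a href="http://www.url.com/ps">www.url.com/ps</a>&nbsp;&nbsp; b <a href="http://www.url.com">http://www.url.com</a> c
```
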
>
> Hope this helps
> Sami
>
> 2008/7/29 Sami Barakat <furiousdog at gmail.com>:
>> Hey,
>>
>> I have tried looking into this and I have come up with a partial
>> solution. From my understanding, the problem is when a URL has a &nbsp;
>> at the end, which gets parsed along with the URL. I ask because I
>> think Gmail has filtered out some of them. Anyway, the following regex
>>
>> ([^"]?)(((ht|f)tps?):(\/\/)|www\.)([a-z0-9%&_\-\+,;=:@~#\/.\?\[\]]+)(?<![&nbsp;])
>>
>> Seems to work fairly well. Here is the test code that I am using.
>>
>> echo '<pre>';
>> $string = "normal link http://www.url.com PASS\n";
>> echo htmlentities(COM_makeClickableLinks($string));
>> $string = "link with &nbsp; and quotes \"http://www.url.com&nbsp;\" PASS\n";
>> echo htmlentities(COM_makeClickableLinks($string));
>> $string = "complicated link \"www.sub.url.com/folder/index.php?id=foo&user=bar&nbsp;\" PASS\n";
>> echo htmlentities(COM_makeClickableLinks($string));
>> $string = "problem link \"www.url.com/words&nbsp;\" FAIL\n";
>> echo htmlentities(COM_makeClickableLinks($string));
>> echo '</pre>';
>>
>> This produces
>>
>> normal link <a href="http://www.url.com">www.url.com</a> PASS
>> link with &nbsp; and quotes "<a
>> href="http://www.url.com">www.url.com</a>&nbsp;" PASS
>> complicated link "<a
>> href="http://sub.url.com/folder/index.php?id=foo&user=bar">sub.url.com/folder/index.php?id=foo&user=bar</a>&nbsp;"
>> PASS
>> problem link "<a href="http://url.com/word">url.com/word</a>s&nbsp;" FAIL
>>
>> As you can see, the first 3 work; the problem occurs when a URL ends
>> with any of the characters '&', 'n', 'b', 's', 'p' or ';'.
>>
>> So www.url.com/ps would return <a href="http://url.com/">url.com/</a>ps
>>
>> This is due to the last bit of the regex, "(?<![&nbsp;])": the square
>> brackets make it a character class, so it rejects each of those
>> characters individually. I tried just doing (?<!&nbsp;), but it does
>> not work at all because the previous statement is being too greedy.
>> There is also an issue with the www. being removed, but that's not
>> too much of a problem at the moment.
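
A small demo of the two behaviours described above, using a cut-down
character class so the patterns stay readable ($chars and the $m1..$m3
variables are just for illustration). The third pattern is one possible
alternative that is not from this thread: a lazy quantifier plus a
lookahead, so the match stops before the entity instead of backtracking
out of it.

```php
<?php
// Simplified character class for the demo; names here are illustrative only.
$chars = '[a-z0-9&;/.]';

// 1. [&nbsp;] is a character class: the lookbehind rejects a match ending in
//    any one of & n b s p ; so the greedy run backtracks past "ps".
preg_match('!' . $chars . '+(?<![&nbsp;])!', 'url.com/ps', $m1);

// 2. Without the brackets the lookbehind tests the literal entity, but the
//    greedy run only has to give back the final ";" to satisfy it.
preg_match('!' . $chars . '+(?<!&nbsp;)!', 'url.com/ps&nbsp;', $m2);

// 3. Possible alternative (an assumption, not from the thread): a lazy run
//    that stops as soon as the lookahead sees "&nbsp;", a character outside
//    the class, or the end of the string.
preg_match('!' . $chars . '+?(?=&nbsp;|[^a-z0-9&;/.]|$)!', 'url.com/ps&nbsp;', $m3);

echo $m1[0], "\n"; // prints: url.com/
echo $m2[0], "\n"; // prints: url.com/ps&nbsp
echo $m3[0], "\n"; // prints: url.com/ps
```
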
>>
>> Also the COM_makeClickableLinks function can be simplified by removing
>> the str_replace statement, resulting in simply this
>>
>> function COM_makeClickableLinks( $text )
>> {
>>    $text = preg_replace(
>> '/([^"]?)(((ht|f)tps?):(\/\/)|www\.)([a-z0-9%&_\-\+,;=:@~#\/.\?\[\]]+)(?<![&nbsp;])/is',
>> '\\1<a href="http://\\6">\\6</a>', $text );
>>    return $text;
>> }
>>
>>
>> In the original regex I was unsure why the "(\/|[+0-9a-z])" part was
>> included. I don't think it's necessary so I took it out; maybe there was
>> a particular case that required it which I'm overlooking.
>>
>> Anyhow, I will have another crack at it later on. It really is a tough
>> one, but this is as far as I've got so far.
>>
>> Sami
>>
>> 2008/7/28 Michael Jervis <mjervis at gmail.com>:
>>> All (especially Sami!),
>>>
>>> There is a bug in the subject function. If it finds
>>> "http://www.url.com" we end up with &nbsp<a
>>> href=";http://www.url.com&nbsp">;http://www.url.com&nbsp</a>;
>>>
>>> Which isn't good.
>>>
>>> The original regexp in COM_MakeClickableLinks is:
>>>
>>> /([^"]?)((((ht|f)tps?):(\/\/)|www\.)[a-z0-9%&_\-\+,;=:@~#\/.\?\[\]]+(\/|[+0-9a-z]))/is
>>>
>>> I think the first match ([^"]?) is spurious, it matches anything other
>>> than  " before a link. So bhttp://www.foo.com" matches, but
>>> "http://www.foo.com doesn't.
>>>
>>> So that gives:
>>> /((((ht|f)tps?):(\/\/)|www\.)[a-z0-9%&_\-\+,;=:@~#\/.\?\[\]]+(\/|[+0-9a-z]))/is
>>>
>>> Resulting in:
>>>  <a href="http:///www.url.com&nbsp">http://www.url.com&nbsp</a>
>>>
>>> So, need to add an "ignore trailing &nbsp;" bit to the clause. Closest
>>> I can get is:
>>> ((((ht|f)tps?):(\/\/)|www\.)[a-z0-9%&_\-\+,;=:@~#\/.\?\[\]]+(\/|[+0-9a-z]))(?=&nbsp;)
>>>
>>> Which results in:
>>>  <a href="http:///www.url.com">http://www.url.com</a>
>>>
>>> However, unless there were quotes round the link, it won't match! So
>>> "http://www.foo.com" matches and is correctly processed, but
>>> http://www.foo.com is not matched.
>>>
>>> My head is now hurt. Any suggestions?
>>>
>>> --
>>> Michael Jervis
>>> mjervis at gmail.com
>>> 504B03041400000008008F846431E3543A820800000006000000060000007765
>>> 62676F642B4F4D4ACF4F0100504B010214001400000008008F846431E3543A82
>>> 0800000006000000060000000000000000002000000000000000776562676F64
>>> 504B05060000000001000100340000002C0000000000
>>> _______________________________________________
>>> geeklog-devel mailing list
>>> geeklog-devel at lists.geeklog.net
>>> http://eight.pairlist.net/mailman/listinfo/geeklog-devel
>>>
>>



-- 
Michael Jervis
mjervis at gmail.com


