[geeklog-devel] COM_makeClickableLinks

Sun Aug 10 09:30:26 EDT 2008

Hi Michael,

I don't know if you are still working on this, but I came across a web
site with a good regular expression for just this situation... but it's
a bit longer than what we have been working with so far... wait for
it...

^((ht|f)tp(s?)\:\/\/|~/|/)?([\w]+:\w+@)?([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((/?\w+/)+|/?)(\w+\.[\w]{3,4})?((\?\w+=\w+)?(&\w+=\w+)*)?

Yup, that's it. Although it looks complicated, it's actually very well
made and can handle a wide variety of URLs. More info can be found at
the site where I found it:
http://geekswithblogs.net/casualjim/archive/2005/12/01/61722.aspx

That regex won't work straight away; it had to be modified a little to
account for the problem we had originally (a trailing &nbsp; being
pulled into the link). So here is the modified version:

((ht|f)tp(s?)\:\/\/|~\/|\/)?([\w]+:\w+@)?(([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((\/?\w+\/)+|\/?)([\w\-%]+(\.[\w]{3,4})?)?((\?|&|&amp;)[\w\-%]+=[\w\-%]+)*)

It works with all the test URLs we have been using and then some. I
did a little benchmarking, and this one takes around half a second
longer in total when executed 10,000 times. So although it looks nasty,
half a second over 10,000 runs is nothing to complain about.
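
A rough way to repeat that kind of comparison is sketched below. The
helper name example_timeRegex, the sample text and the replacement used
for the old pattern are assumptions made up for illustration, and the
timings will differ from machine to machine, so this is not the exact
benchmark behind the figure above.

```php
<?php
// Hypothetical helper: time $runs calls of preg_replace() with one pattern.
function example_timeRegex($regex, $replacement, $text, $runs = 10000)
{
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        preg_replace($regex, $replacement, $text);
    }
    return microtime(true) - $start; // elapsed seconds as a float
}

// Original Geeklog pattern, as quoted further down in this thread.
$old = '/([^"]?)((((ht|f)tps?):(\/\/)|www\.)[a-z0-9%&_\-\+,;=:@~#\/.\?\[\]]+(\/|[+0-9a-z]))/is';

// The long modified pattern from this mail (query separators written as
// "&" or "&amp;", which is an assumption).
$new = '/((ht|f)tp(s?)\:\/\/|~\/|\/)?([\w]+:\w+@)?(([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((\/?\w+\/)+|\/?)([\w\-%]+(\.[\w]{3,4})?)?((\?|&|&amp;)[\w\-%]+=[\w\-%]+)*)/is';

// Made-up sample text containing two of the test URLs.
$sample = 'normal link http://www.url.com and www.sub.url.com/folder/index.php?id=foo&user=bar';

// The replacement for the old pattern is an assumption; only the timing matters here.
$tOld = example_timeRegex($old, '\\1<a href="\\2">\\2</a>', $sample);
$tNew = example_timeRegex($new, '<a href="http://\\5">\\5</a>', $sample);

printf("old: %.3fs  new: %.3fs  difference: %.3fs\n", $tOld, $tNew, $tNew - $tOld);
```

Timing both patterns against the same text in the same script keeps the
comparison fair; a longer or more varied sample would probably give a
more realistic difference.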

Here is the complete function

function COM_makeClickableLinks( $text )
{
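    // Group 1 = scheme, group 5 = URL without the scheme, group 6 = domain.
    $regex = '/((ht|f)tp(s?)\:\/\/|~\/|\/)?([\w]+:\w+@)?(([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((\/?\w+\/)+|\/?)([\w\-%]+(\.[\w]{3,4})?)?((\?|&|&amp;)[\w\-%]+=[\w\-%]+)*)/is';

    // Wrap each match in an anchor; the href always gets an http:// prefix.
    $text = preg_replace( $regex, '<a href="http://\\5">\\5</a>', $text );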
    return $text;
}
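
A quick way to check the function against the test strings used earlier
in this thread is sketched below. The function is repeated so the
snippet runs on its own; the &nbsp; entities are written out literally,
and the "&amp;" alternative in the query-string part of the pattern is
an assumption.

```php
<?php
// Copy of the function from this mail so the snippet is self-contained.
function COM_makeClickableLinks($text)
{
    $regex = '/((ht|f)tp(s?)\:\/\/|~\/|\/)?([\w]+:\w+@)?(([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((\/?\w+\/)+|\/?)([\w\-%]+(\.[\w]{3,4})?)?((\?|&|&amp;)[\w\-%]+=[\w\-%]+)*)/is';

    return preg_replace($regex, '<a href="http://\\5">\\5</a>', $text);
}

// Test strings taken from earlier in the thread.
$tests = array(
    'normal link http://www.url.com',
    'link with &nbsp; and quotes "http://www.url.com&nbsp;"',
    'www.url.com/ps&nbsp;',
    'complicated link www.sub.url.com/folder/index.php?id=foo&user=bar&nbsp;',
    'link with two &nbsp; www.url.com/ps&nbsp;&nbsp;',
);

echo "<pre>\n";
foreach ($tests as $test) {
    // htmlentities() makes the generated markup visible in a browser.
    echo htmlentities(COM_makeClickableLinks($test)), "\n";
}
echo "</pre>\n";
```

In each case the trailing &nbsp; should stay outside the anchor, because
"&nbsp" is not followed by "=" and so never qualifies as a query-string
pair.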

A simple way to shorten the URLs would be to print just the domain in
the anchor, so this link
    http://www.sub.url.com/folder?user=5&this=that
would be converted to
    <a href="http://www.sub.url.com/folder?user=5&this=that">www.sub.url.com</a>
Maybe even include that little arrow icon that Wikipedia uses when
linking off site.

This can be done by changing the $replacement argument of preg_replace from
    '<a href="\\1\\5">\\5</a>'
to
    '<a href="\\1\\5">\\6</a>'

\\1 = the http:// part of the url (or ftp:// or https:// etc.)
\\5 = the full url (without the http:// part)
\\6 = the domain name (www.url.com or url.com or www.sub.url.com etc.)
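
Along the lines of Michael's idea (quoted below) of finding the URLs
with a regexp and building the link in PHP, here is a minimal sketch
using preg_replace_callback. The function name example_makeDomainLinks
is hypothetical and not part of Geeklog. The function above hard-codes
http:// in the href while the replacement strings here use \\1, so the
sketch keeps \\1 when it is a real scheme and falls back to http://
otherwise. The "&amp;" alternative in the pattern is an assumption.

```php
<?php
// Hypothetical helper, not part of Geeklog: link each URL but show only
// its domain (group 6) as the anchor text.
function example_makeDomainLinks($text)
{
    $regex = '/((ht|f)tp(s?)\:\/\/|~\/|\/)?([\w]+:\w+@)?(([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((\/?\w+\/)+|\/?)([\w\-%]+(\.[\w]{3,4})?)?((\?|&|&amp;)[\w\-%]+=[\w\-%]+)*)/is';

    return preg_replace_callback($regex, function ($m) {
        // $m[1] is empty for URLs written without a scheme, or holds "~/"
        // or "/"; only keep it when it really ends in "://".
        $scheme = (substr($m[1], -3) === '://') ? $m[1] : 'http://';

        // $m[5] = URL without the scheme, $m[6] = domain name.
        return '<a href="' . $scheme . $m[5] . '">' . $m[6] . '</a>';
    }, $text);
}

echo example_makeDomainLinks('http://www.sub.url.com/folder?user=5&this=that'), "\n";
echo example_makeDomainLinks('ftp://ftp.url.com/pub/'), "\n";
echo example_makeDomainLinks('www.url.com'), "\n";
```

Doing the replacement in a callback also leaves room for other ways of
shortening the anchor text later without touching the pattern.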

enjoy...

Sami

2008/8/3 Michael Jervis <mjervis at gmail.com>:
> Cheers Sami,
>
> It's looking a lot like we need to use regexp to locate/identify the
> urls, but then php to parse them and turn them into links.
>
> That gives an opportunity to shorten very long urls (within the anchor
> not the href attribute (perhaps via tinyurl's API rather than
> truncation))
>
> On Wed, Jul 30, 2008 at 14:17, Sami Barakat <furiousdog at gmail.com> wrote:
>> Hi,
>>
>> I think I've got it now, although it's not a complete solution
>>
>> function COM_makeClickableLinks( $text )
>> {
>>    $text = preg_replace(
>> '/([^"]?)(((ht|f)tps?):(\/\/)|(www\.))+((?=([^\s]+)&nbsp;))?(\8|[a-z0-9%&_\-\+,;=:@~#\/.\?\[\]]+)/is',
>> '\\1<a href="http://\\6\\9">\\6\\9</a>', $text );
>>    return $text;
>> }
>>
>> It seems to work well with the following strings:
>>
>> normal link http://www.url.com
>> normal link with early quote http://www.url.com/folder"stuff
>> link with &nbsp; and quotes "http://www.url.com&nbsp;"
>> www.url.com/ps&nbsp;
>> complicated link www.sub.url.com/folder/index.php?id=foo&user=bar&nbsp;
>>
>> It still fails, however, on these strings:
>>
>> link with two &nbsp; www.url.com/ps&nbsp;&nbsp;
>> link with early quote and &nbsp; "http://www.url.com/folder"stuff&nbsp;
>>
>> The results for the two failed strings are:
>>
>> link with two &nbsp; <a
>> href="http://www.url.com/ps&nbsp;">www.url.com/ps&nbsp;</a>&nbsp;
>> link with early quote and &nbsp; "<a
>> href="http://www.url.com/folder"stuff">www.url.com/folder"stuff</a>&nbsp;
>>
>> The second string could probably be fixed by replacing this part of
>> the regular expression, '[^\s]+', with this:
>> '[a-z0-9%&_\-\+,;=:@~#\/.\?\[\]]+'
>> But really, regular expressions are more helpful for validating
>> strings or finding substrings in complicated strings; they are
>> not really made to exclude parts of a string. So it might be more
>> effective and less complicated to make two passes: the first
>> matching URLs with &nbsp; on the end, and the second matching URLs
>> without it.
>>
>> Hope this helps
>> Sami
>>
>> 2008/7/29 Sami Barakat <furiousdog at gmail.com>:
>>> Hey,
>>>
>>> I have tried looking into this and I have come up with a partial
>>> solution. From my understanding, the problem is that a URL with a
>>> &nbsp; at the end gets the &nbsp; parsed along with the URL. I ask
>>> because I think Gmail has filtered out some of them. Anyway, the
>>> following regex
>>>
>>> ([^"]?)(((ht|f)tps?):(\/\/)|www\.)([a-z0-9%&_\-\+,;=:@~#\/.\?\[\]]+)(?<![&nbsp;])
>>>
>>> Seems to work fairly well. Here is the test code that I am using.
>>>
>>> echo '<pre>';
>>> $string = "normal link http://www.url.com PASS\n";
>>> echo htmlentities(COM_makeClickableLinks($string));
>>> $string = "link with &nbsp; and quotes \"http://www.url.com&nbsp;\" PASS\n";
>>> echo htmlentities(COM_makeClickableLinks($string));
>>> $string = "complicated link \"www.sub.url.com/folder/index.php?id=foo&user=bar&nbsp;\" PASS\n";
>>> echo htmlentities(COM_makeClickableLinks($string));
>>> $string = "problem link \"www.url.com/words&nbsp;\" FAIL\n";
>>> echo htmlentities(COM_makeClickableLinks($string));
>>> echo '</pre>';
>>>
>>> This produces
>>>
>>> normal link <a href="http://www.url.com">www.url.com</a> PASS
>>> link with &nbsp; and quotes "<a
>>> href="http://www.url.com">www.url.com</a>&nbsp;" PASS
>>> complicated link "<a
>>> href="http://sub.url.com/folder/index.php?id=foo&user=bar">sub.url.com/folder/index.php?id=foo&user=bar</a>&nbsp;"
>>> PASS
>>> problem link "<a href="http://url.com/word">url.com/word</a>s&nbsp;" FAIL
>>>
>>> As you can see, the first 3 work; the problem occurs when a URL ends
>>> with any of the characters '&', 'n', 'b', 's', 'p' or ';'.
>>>
>>> So www.url.com/ps would return <a href="http://url.com/">url.com/</a>ps
>>>
>>> This is due to the last bit of the regex, "(?<![&nbsp;])", which is a
>>> character class rather than the literal string &nbsp;. I tried just
>>> doing (?<!&nbsp;), but it does not work at all because the previous
>>> part of the expression is too greedy. There is also an issue with
>>> the www. being removed, but that's not too much of a problem at the
>>> moment.
>>>
>>> Also, the COM_makeClickableLinks function can be simplified by removing
>>> the str_replace statement, resulting in simply this:
>>>
>>> function COM_makeClickableLinks( $text )
>>> {
>>>    $text = preg_replace(
>>> '/([^"]?)(((ht|f)tps?):(\/\/)|www\.)([a-z0-9%&_\-\+,;=:@~#\/.\?\[\]]+)(?<![&nbsp;])/is',
>>> '\\1<a href="http://\\6">\\6</a>', $text );
>>>    return $text;
>>> }
>>>
>>>
>>> In the original regex I was unsure why the "(\/|[+0-9a-z])" part was
>>> included. I don't think it's necessary so I took it out; maybe there
>>> was a particular case that required it which I'm overlooking.
>>>
>>> Anyhow, I will have another crack at it later on. It really is a tough
>>> one, but this is as far as I've got so far.
>>>
>>> Sami
>>>
>>> 2008/7/28 Michael Jervis <mjervis at gmail.com>:
>>>> All (especially Sami!),
>>>>
>>>> There is a bug in the subject function. If it finds
>>>> "&nbsp;http://www.url.com&nbsp;" we end up with &nbsp<a
>>>> href=";http://www.url.com&nbsp">;http://www.url.com&nbsp</a>;
>>>>
>>>> Which isn't good.
>>>>
>>>> The original regexp in COM_MakeClickableLinks is:
>>>>
>>>> /([^"]?)((((ht|f)tps?):(\/\/)|www\.)[a-z0-9%&_\-\+,;=:@~#\/.\?\[\]]+(\/|[+0-9a-z]))/is
>>>>
>>>> I think the first match ([^"]?) is spurious, it matches anything other
>>>> than  " before a link. So bhttp://www.foo.com" matches, but
>>>> "http://www.foo.com doesn't.
>>>>
>>>> So that gives:
>>>> /((((ht|f)tps?):(\/\/)|www\.)[a-z0-9%&_\-\+,;=:@~#\/.\?\[\]]+(\/|[+0-9a-z]))/is
>>>>
>>>> Resulting in:
>>>>  <a href="http:///www.url.com&nbsp">http://www.url.com&nbsp</a>
>>>>
>>>> So, need to add an "ignore trailing &nbsp;" bit to the clause. Closest
>>>> I can get is:
>>>> ((((ht|f)tps?):(\/\/)|www\.)[a-z0-9%&_\-\+,;=:@~#\/.\?\[\]]+(\/|[+0-9a-z]))(?=&nbsp;)
>>>>
>>>> Which results in:
>>>>  <a href="http:///www.url.com">http://www.url.com</a>&nbsp;
>>>>
>>>> However, unless there were quotes round the link, it won't match! So
>>>> "http://www.foo.com" matches and is correctly processed, but
>>>> http://www.foo.com is not matched.
>>>>
>>>> My head is now hurt. Any suggestions?
>>>>
>>>> --
>>>> Michael Jervis
>>>> mjervis at gmail.com
>
>
>
> --
> Michael Jervis
> mjervis at gmail.com
> 504B03041400000008008F846431E3543A820800000006000000060000007765
> 62676F642B4F4D4ACF4F0100504B010214001400000008008F846431E3543A82
> 0800000006000000060000000000000000002000000000000000776562676F64
> 504B05060000000001000100340000002C0000000000
> _______________________________________________
> geeklog-devel mailing list
> geeklog-devel at lists.geeklog.net
> http://eight.pairlist.net/mailman/listinfo/geeklog-devel
>