[geeklog-devel] COM_makeClickableLinks

Sami Barakat furiousdog at gmail.com
Sun Aug 10 09:30:26 EDT 2008


Hi Michael,

I don't know if you are still working on this, but I came across a web
site with a good regular expression for just this situation. It's
a bit longer than what we have been working with so far... wait for
it...

^((ht|f)tp(s?)\:\/\/|~/|/)?([\w]+:\w+@)?([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((/?\w+/)+|/?)(\w+\.[\w]{3,4})?((\?\w+=\w+)?(&\w+=\w+)*)?

Yup, that's it. Although it looks complicated, it's actually very well
made and can handle a wide variety of URLs. More info can be found at
the site where I found it:
http://geekswithblogs.net/casualjim/archive/2005/12/01/61722.aspx

That regex won't work straight away; it had to be modified a little to
account for the problem we had originally. So here is the
modified version:

((ht|f)tp(s?)\:\/\/|~\/|\/)?([\w]+:\w+@)?(([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((\/?\w+\/)+|\/?)([\w\-%]+(\.[\w]{3,4})?)?((\?|&|&)[\w\-%]+=[\w\-%]+)*)

It works with all the test URLs we have been using and then some. I
did a little benchmarking and this one works out to be around half a
second slower over 10,000 executions, so although it looks nasty,
half a second is nothing to complain about.
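
For anyone wanting to repeat the timing, here is a rough sketch of how such a comparison could be run. The helper name benchRegex, the sample text, and the use of the original regexp from Michael's first message as the baseline are all assumptions; this is not necessarily the benchmark that produced the half-second figure.

```php
<?php
// Hypothetical helper: time $iterations runs of preg_replace for one pattern.
function benchRegex(string $regex, string $replacement, string $subject, int $iterations = 10000): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        preg_replace($regex, $replacement, $subject);
    }
    // Elapsed wall-clock time in seconds
    return microtime(true) - $start;
}

// Assumed sample text; real posts would be longer and more varied.
$sample = 'normal link http://www.url.com and www.sub.url.com/folder/index.php?id=foo&amp;user=bar&nbsp;';

// Baseline: the original regexp quoted in Michael's first message.
$oldRegex = '/([^"]?)((((ht|f)tps?):(\/\/)|www\.)[a-z0-9%&_\-\+,;=:@~#\/.\?\[\]]+(\/|[+0-9a-z]))/is';
// Candidate: the modified regex from this message.
$newRegex = '/((ht|f)tp(s?)\:\/\/|~\/|\/)?([\w]+:\w+@)?(([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((\/?\w+\/)+|\/?)([\w\-%]+(\.[\w]{3,4})?)?((\?|&|&)[\w\-%]+=[\w\-%]+)*)/is';

printf("old: %.3f s\n", benchRegex($oldRegex, '\\1<a href="\\2">\\2</a>', $sample));
printf("new: %.3f s\n", benchRegex($newRegex, '<a href="http://\\5">\\5</a>', $sample));
```

The absolute numbers will depend on the machine and the PHP version, so only the difference between the two lines is meaningful.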

Here is the complete function

function COM_makeClickableLinks( $text )
{
    // Groups: \1 = scheme, \5 = URL without the scheme, \6 = domain name
    $regex = '/((ht|f)tp(s?)\:\/\/|~\/|\/)?([\w]+:\w+@)?(([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((\/?\w+\/)+|\/?)([\w\-%]+(\.[\w]{3,4})?)?((\?|&|&)[\w\-%]+=[\w\-%]+)*)/is';

    // Wrap each match in an anchor; the href always uses http://
    $text = preg_replace( $regex, '<a href="http://\\5">\\5</a>', $text );
    return $text;
}
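
A quick way to check the function against some of the test strings from earlier in the thread might look like the sketch below. The function body is copied from this message so the snippet runs on its own; the choice of test strings is an assumption, and the list is not exhaustive.

```php
<?php
// Copied from the message so that this sketch is self-contained.
function COM_makeClickableLinks($text)
{
    $regex = '/((ht|f)tp(s?)\:\/\/|~\/|\/)?([\w]+:\w+@)?(([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((\/?\w+\/)+|\/?)([\w\-%]+(\.[\w]{3,4})?)?((\?|&|&)[\w\-%]+=[\w\-%]+)*)/is';

    return preg_replace($regex, '<a href="http://\\5">\\5</a>', $text);
}

// A few of the strings discussed in the thread, including the two that
// the July 30 version still failed on.
$tests = array(
    'normal link http://www.url.com',
    'link with &nbsp; and quotes "http://www.url.com&nbsp;"',
    'link with two &nbsp; www.url.com/ps&nbsp;&nbsp;',
    'link with early quote and &nbsp; "http://www.url.com/folder"stuff&nbsp;',
);

echo "<pre>\n";
foreach ($tests as $string) {
    // htmlentities() makes the generated markup visible in a browser
    echo htmlentities(COM_makeClickableLinks($string)), "\n";
}
echo "</pre>\n";
```

Because the scheme is optional in this regex, any dotted word such as a file name will also be turned into a link, so it is worth adding a few non-URL strings to the list as well.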

A simple way to shorten the URLs would be to print just the domain in
the anchor, so this link
http://www.sub.url.com/folder?user=5&this=that
would be converted to
<a href="http://www.sub.url.com/folder?user=5&this=that">www.sub.url.com</a>
Maybe even include that little arrow icon that Wikipedia uses when
linking off site.

This can be done by changing the $replacement argument of preg_replace from
'<a href="\\1\\5">\\5</a>'
to
'<a href="\\1\\5">\\6</a>'

\\1 = the http:// part of the url (or ftp:// or https:// etc.)
\\5 = the full url (without the http:// part)
\\6 = the domain name (www.url.com or url.com or www.sub.url.com etc.)
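
As a rough illustration of the group numbering above, the same idea can be sketched with preg_replace_callback. The name shortLinks is hypothetical (it is not a Geeklog function), and falling back to http:// when no scheme was matched is an assumption made here so that bare www. links still get an absolute href.

```php
<?php
// Sketch only: shortLinks() is a hypothetical name, not part of Geeklog.
function shortLinks(string $text): string
{
    $regex = '/((ht|f)tp(s?)\:\/\/|~\/|\/)?([\w]+:\w+@)?(([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((\/?\w+\/)+|\/?)([\w\-%]+(\.[\w]{3,4})?)?((\?|&|&)[\w\-%]+=[\w\-%]+)*)/is';

    return preg_replace_callback($regex, function (array $m): string {
        $scheme = $m[1];   // "http://", "ftp://", "~/", "/" or "" when absent
        $prefix = '';
        if (strpos($scheme, '://') === false) {
            // Assumption: no explicit scheme means http://;
            // a leading "/" or "~/" is kept as plain text before the link.
            $prefix = $scheme;
            $scheme = 'http://';
        }
        // $m[5] = full URL without the scheme, $m[6] = domain name only
        return $prefix . '<a href="' . $scheme . $m[5] . '">' . $m[6] . '</a>';
    }, $text);
}

echo shortLinks('http://www.sub.url.com/folder?user=5&this=that'), "\n";
```

A callback is used instead of a plain replacement string only so that the missing-scheme case can be handled; with a fixed replacement string, '\\1\\5' would produce a relative href for links written without http://.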

enjoy...

Sami

2008/8/3 Michael Jervis <mjervis at gmail.com>:

> Cheers Sami,

>

> It's looking a lot like we need to use regexp to locate/identify the

> urls, but then php to parse them and turn them into links.

>

> That gives an opportunity to shorten very long urls (within the anchor

> not the href attribute (perhaps via tinyurl's API rather than

> truncation))

>

> On Wed, Jul 30, 2008 at 14:17, Sami Barakat <furiousdog at gmail.com> wrote:

>> Hi,

>>

>> I think I've got it now, although it's not a complete solution

>>

>> function COM_makeClickableLinks( $text )

>> {

>> $text = preg_replace(

>> '/([^"]?)(((ht|f)tps?):(\/\/)|(www\.))+((?=([^\s]+)&nbsp;))?(\8|[a-z0-9%&_\-\+,;=:@~#\/.\?\[\]]+)/is',

>> '\\1<a href="http://\\6\\9">\\6\\9</a>', $text );

>> return $text;

>> }

>>

>> It seems to work well with the following strings:

>>

>> normal link http://www.url.com

>> normal link with early quote http://www.url.com/folder"stuff

>> link with &nbsp; and quotes "http://www.url.com&nbsp;"

>> www.url.com/ps&nbsp;

>> complicated link www.sub.url.com/folder/index.php?id=foo&amp;user=bar&nbsp;

>>

>> it still fails however on these strings

>>

>> link with two &nbsp; www.url.com/ps&nbsp;&nbsp;

>> link with early quote and &nbsp; "http://www.url.com/folder"stuff&nbsp;

>>

>> The results of the two failed strings are

>>

>> link with two &nbsp; <a

>> href="http://www.url.com/ps&nbsp;">www.url.com/ps&nbsp;</a>&nbsp;

>> link with early quote and &nbsp; "<a

>> href="http://www.url.com/folder"stuff">www.url.com/folder"stuff</a>&nbsp;

>>

>> The second string could probably be fixed by replacing this part of

>> the regular expression '[^\s]+' with this

>> '[a-z0-9%&_\-\+,;=:@~#\/.\?\[\]]+'

>> But really regular expressions are more helpful when validating

>> strings or trying to find substrings in complicated strings, they are

>> not really made to exclude parts of a string. So it might be more

>> effective and less complicated to run through the expression twice.

>> The first time matching urls with &nbsp; on the end and the second

>> time without.

>>

>> Hope this helps

>> Sami

>>

>> 2008/7/29 Sami Barakat <furiousdog at gmail.com>:

>>> Hey,

>>>

>>> I have tried looking into this and I have come up with a partial

>>> solution. From my understanding the problem is when a url has a &nbsp;

>>> at the end which is getting parsed along with the url. I ask because I

>>> think Gmail has filtered out some of them. Anyway the following regex

>>>

>>> ([^"]?)(((ht|f)tps?):(\/\/)|www\.)([a-z0-9%&_\-\+,;=:@~#\/.\?\[\]]+)(?<![&nbsp;])

>>>

>>> Seems to work fairly well. Here is the test code that I am using.

>>>

>>> echo '<pre>';

>>> $string = "normal link http://www.url.com PASS\n";

>>> echo htmlentities(COM_makeClickableLinks($string));

>>> $string = "link with &nbsp; and quotes \"http://www.url.com&nbsp;\" PASS\n";

>>> echo htmlentities(COM_makeClickableLinks($string));

>>> $string = "complicated link

>>> \"www.sub.url.com/folder/index.php?id=foo&amp;user=bar&nbsp;\"

>>> PASS\n";

>>> echo htmlentities(COM_makeClickableLinks($string));

>>> $string = "problem link \"www.url.com/words&nbsp;\" FAIL\n";

>>> echo htmlentities(COM_makeClickableLinks($string));

>>> echo '</pre>';

>>>

>>> This produces

>>>

>>> normal link <a href="http://www.url.com">www.url.com</a> PASS

>>> link with &nbsp; and quotes "<a

>>> href="http://www.url.com">www.url.com</a>&nbsp;" PASS

>>> complicated link "<a

>>> href="http://sub.url.com/folder/index.php?id=foo&amp;user=bar">sub.url.com/folder/index.php?id=foo&amp;user=bar</a>&nbsp;"

>>> PASS

>>> problem link "<a href="http://url.com/word">url.com/word</a>s&nbsp;" FAIL

>>>

>>> As you can see the first 3 work, the problem occurs when a url ends

>>> with any of the characters: '&' or 'n' or 'b' or 's' or 'p' or ';'

>>>

>>> So www.url.com/ps would return <a href="http://url.com/">url.com/</a>ps

>>>

>>> This is due to the last bit of the regex, "(?<![&nbsp;])", which is a

>>> character class rather than the literal string. I tried just doing

>>> (?<!&nbsp;) but it does not work at all because the

>>> previous statement is being too greedy. There is also an issue with

>>> the www. being removed, but that's not too much of a problem at the

>>> moment.

>>>

>>> Also the COM_makeClickableLinks function can be simplified by removing

>>> the str_replace statement, resulting in simply this

>>>

>>> function COM_makeClickableLinks( $text )

>>> {

>>> $text = preg_replace(

>>> '/([^"]?)(((ht|f)tps?):(\/\/)|www\.)([a-z0-9%&_\-\+,;=:@~#\/.\?\[\]]+)(?<![&nbsp;])/is',

>>> '\\1<a href="http://\\6">\\6</a>', $text );

>>> return $text;

>>> }

>>>

>>>

>>> in the original regex I was unsure why the "(\/|[+0-9a-z])" part was

>>> included. I don't think it's necessary so I took it out, maybe there was

>>> a particular case that required it which I'm overlooking.

>>>

>>> Anyhow I will have another crack at it later on, it really is a tough

>>> one, but this is as far as I've got so far.

>>>

>>> Sami

>>>

>>> 2008/7/28 Michael Jervis <mjervis at gmail.com>:

>>>> All (especially Sami!),

>>>>

>>>> There is a bug in the subject function. If it finds

>>>> "http://www.url.com" we end up with &nbsp<a

>>>> href=";http://www.url.com&nbsp">;http://www.url.com&nbsp</a>;

>>>>

>>>> Which isn't good.

>>>>

>>>> The original regexp in COM_MakeClickableLinks is:

>>>>

>>>> /([^"]?)((((ht|f)tps?):(\/\/)|www\.)[a-z0-9%&_\-\+,;=:@~#\/.\?\[\]]+(\/|[+0-9a-z]))/is

>>>>

>>>> I think the first match ([^"]?) is spurious, it matches anything other

>>>> than " before a link. So bhttp://www.foo.com" matches, but

>>>> "http://www.foo.com doesn't.

>>>>

>>>> So that gives:

>>>> /((((ht|f)tps?):(\/\/)|www\.)[a-z0-9%&_\-\+,;=:@~#\/.\?\[\]]+(\/|[+0-9a-z]))/is

>>>>

>>>> Resulting in:

>>>> &nbsp;<a href="http:///www.url.com&nbsp">http://www.url.com&nbsp</a>

>>>>

>>>> So, need to add an "ignore trailing &nbsp;" bit to the clause. Closest

>>>> I can get is:

>>>> ((((ht|f)tps?):(\/\/)|www\.)[a-z0-9%&_\-\+,;=:@~#\/.\?\[\]]+(\/|[+0-9a-z]))(?=&nbsp;)

>>>>

>>>> Which results in:

>>>> &nbsp;<a href="http:///www.url.com">http://www.url.com</a>&nbsp;

>>>>

>>>> However, unless there were quotes round the link, it won't match! So

>>>> "http://www.foo.com" matches and is correctly processed, but

>>>> http://www.foo.com is not matched.

>>>>

>>>> My head is now hurt. Any suggestions?

>>>>

>>>> --

>>>> Michael Jervis

>>>> mjervis at gmail.com

>>>> 504B03041400000008008F846431E3543A820800000006000000060000007765

>>>> 62676F642B4F4D4ACF4F0100504B010214001400000008008F846431E3543A82

>>>> 0800000006000000060000000000000000002000000000000000776562676F64

>>>> 504B05060000000001000100340000002C0000000000

>>>> _______________________________________________

>>>> geeklog-devel mailing list

>>>> geeklog-devel at lists.geeklog.net

>>>> http://eight.pairlist.net/mailman/listinfo/geeklog-devel

>>>>

>>>


>>

>

>

>

> --

> Michael Jervis

> mjervis at gmail.com


>



