[geeklog-devel] COM_makeClickableLinks
Sami Barakat
furiousdog at gmail.com
Sun Aug 10 09:30:26 EDT 2008
Hi Michael,
I don't know if you are still working on this, but I came across a web
site with a good regular expression for exactly this situation. It's
a bit longer than what we have been working with so far. Wait for
it...
^((ht|f)tp(s?)\:\/\/|~/|/)?([\w]+:\w+@)?([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((/?\w+/)+|/?)(\w+\.[\w]{3,4})?((\?\w+=\w+)?(&\w+=\w+)*)?
Yup, that's it. Although it looks complicated, it's actually very well
made and can handle a wide variety of URLs. More info can be found at
the site where I found it:
http://geekswithblogs.net/casualjim/archive/2005/12/01/61722.aspx
That regex won't work straight away; it had to be modified a little to
account for the problem we had originally. Here is the
modified version:
((ht|f)tp(s?)\:\/\/|~\/|\/)?([\w]+:\w+@)?(([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((\/?\w+\/)+|\/?)([\w\-%]+(\.[\w]{3,4})?)?((\?|&|&amp;)[\w\-%]+=[\w\-%]+)*)
It works with all the test URLs we have been using, and then some. I
did a little benchmarking, and this one takes around half a second
longer over 10,000 executions, so although it looks nasty, half a
second is nothing to complain about.
Here is the complete function:
function COM_makeClickableLinks( $text )
{
    // \\5 captures the URL without its scheme, so the href is always
    // given an http:// prefix, even for ftp:// or https:// links.
    $regex = '/((ht|f)tp(s?)\:\/\/|~\/|\/)?([\w]+:\w+@)?(([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((\/?\w+\/)+|\/?)([\w\-%]+(\.[\w]{3,4})?)?((\?|&|&amp;)[\w\-%]+=[\w\-%]+)*)/is';
    $text = preg_replace( $regex, '<a href="http://\\5">\\5</a>', $text );
    return $text;
}
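As a quick check, here is a small usage sketch. To keep it self-contained the
function is repeated under the made-up name demo_makeClickableLinks, and the
"&amp;" alternative in the pattern is an assumption about what the archive
decoded. It shows a plain link, and a link followed by &nbsp;, where the
&nbsp; should stay outside the anchor.

```php
<?php
// Illustrative copy of the function above under a hypothetical name.
function demo_makeClickableLinks($text)
{
    $regex = '/((ht|f)tp(s?)\:\/\/|~\/|\/)?([\w]+:\w+@)?(([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((\/?\w+\/)+|\/?)([\w\-%]+(\.[\w]{3,4})?)?((\?|&|&amp;)[\w\-%]+=[\w\-%]+)*)/is';
    // Group 5 is the URL without its scheme; the href always gets http://.
    return preg_replace($regex, '<a href="http://\\5">\\5</a>', $text);
}

// A plain link with a scheme.
$plain = demo_makeClickableLinks('normal link http://www.url.com');

// A www link followed by &nbsp;, the case that caused trouble earlier.
$nbsp = demo_makeClickableLinks('problem link www.url.com/ps&nbsp;');

echo $plain, "\n";
// normal link <a href="http://www.url.com">www.url.com</a>
echo $nbsp, "\n";
// problem link <a href="http://www.url.com/ps">www.url.com/ps</a>&nbsp;
```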
A simple way to shorten the URLs would be to print just the domain as
the anchor text,
so this link
http://www.sub.url.com/folder?user=5&this=that
would be converted to
<a href="http://www.sub.url.com/folder?user=5&this=that">www.sub.url.com</a>
Maybe even include that little arrow icon that Wikipedia uses when
linking off site.
This can be done by changing the $replacement argument of preg_replace() from
'<a href="\\1\\5">\\5</a>'
to
'<a href="\\1\\5">\\6</a>'
\\1 = the http:// part of the url (or ftp:// or https:// etc.)
\\5 = the full url (without the http:// part)
\\6 = the domain name (www.url.com or url.com or www.sub.url.com etc.)
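Along the lines of Michael's suggestion (regex to find the URLs, PHP to build
the links), the domain-only anchor could also be sketched with
preg_replace_callback. This is only an illustration: the function name
demo_shortenLinks is made up, the "&amp;" alternative in the pattern is an
assumption, and falling back to http:// when no scheme was matched is my own
choice rather than something settled in this thread.

```php
<?php
// Hypothetical sketch: link to the full URL but show only the domain.
function demo_shortenLinks($text)
{
    $regex = '/((ht|f)tp(s?)\:\/\/|~\/|\/)?([\w]+:\w+@)?(([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((\/?\w+\/)+|\/?)([\w\-%]+(\.[\w]{3,4})?)?((\?|&|&amp;)[\w\-%]+=[\w\-%]+)*)/is';

    return preg_replace_callback($regex, function ($m) {
        // $m[1] = scheme (http://, https://, ftp://), or '~/', '/', or empty.
        // $m[5] = full URL without the scheme, $m[6] = domain name.
        $scheme = (strpos($m[1], '://') !== false) ? $m[1] : 'http://';
        return '<a href="' . $scheme . $m[5] . '">' . $m[6] . '</a>';
    }, $text);
}

$short = demo_shortenLinks('http://www.sub.url.com/folder?user=5&this=that');
$ftp   = demo_shortenLinks('ftp://ftp.url.com/pub');

echo $short, "\n";
// <a href="http://www.sub.url.com/folder?user=5&this=that">www.sub.url.com</a>
echo $ftp, "\n";
// <a href="ftp://ftp.url.com/pub">ftp.url.com</a>
```

Doing the replacement in a callback also leaves room for other ways of
shortening the anchor text later, without touching the regex.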
enjoy...
Sami
2008/8/3 Michael Jervis <mjervis at gmail.com>:
> Cheers Sami,
>
> It's looking a lot like we need to use regexp to locate/identify the
> urls, but then php to parse them and turn them into links.
>
> That gives an opportunity to shorten very long urls (within the anchor
> not the href attribute (perhaps via tinyurl's API rather than
> truncation))
>
> On Wed, Jul 30, 2008 at 14:17, Sami Barakat <furiousdog at gmail.com> wrote:
>> Hi,
>>
>> I think I've got it now, although it's not a complete solution:
>>
>> function COM_makeClickableLinks( $text )
>> {
>> $text = preg_replace(
>> '/([^"]?)(((ht|f)tps?):(\/\/)|(www\.))+((?=([^\s]+)&nbsp;))?(\8|[a-z0-9%&_\-\+,;=:@~#\/.\?\[\]]+)/is',
>> '\\1<a href="http://\\6\\9">\\6\\9</a>', $text );
>> return $text;
>> }
>>
>> It seems to work well with the following strings:
>>
>> normal link http://www.url.com
>> normal link with early quote http://www.url.com/folder"stuff
>> link with &nbsp; and quotes "http://www.url.com&nbsp;"
>> www.url.com/ps
>> complicated link www.sub.url.com/folder/index.php?id=foo&user=bar
>>
>> It still fails, however, on these strings:
>>
>> link with two &nbsp; www.url.com/ps&nbsp;&nbsp;
>> link with early quote and &nbsp; "http://www.url.com/folder"stuff&nbsp;
>>
>> The results for the two failed strings are:
>>
>> link with two &nbsp; <a
>> href="http://www.url.com/ps&nbsp;">www.url.com/ps&nbsp;</a>&nbsp;
>> link with early quote and &nbsp; "<a
>> href="http://www.url.com/folder"stuff">www.url.com/folder"stuff</a>&nbsp;
>>
>> The second string could probably be fixed by replacing this part of
>> the regular expression '[^\s]+' with this
>> '[a-z0-9%&_\-\+,;=:@~#\/.\?\[\]]+'
>> But really, regular expressions are most helpful for validating
>> strings or finding substrings in complicated strings; they are
>> not really made to exclude parts of a string. So it might be more
>> effective and less complicated to run through the expression twice:
>> the first time matching URLs with &nbsp; on the end, and the second
>> time without.
>>
>> Hope this helps
>> Sami
>>
>> 2008/7/29 Sami Barakat <furiousdog at gmail.com>:
>>> Hey,
>>>
>>> I have tried looking into this and have come up with a partial
>>> solution. From my understanding, the problem is that when a URL has a
>>> &nbsp; at the end, it gets parsed along with the URL. I ask because I
>>> think Gmail has filtered out some of them. Anyway, the following regex
>>>
>>> ([^"]?)(((ht|f)tps?):(\/\/)|www\.)([a-z0-9%&_\-\+,;=:@~#\/.\?\[\]]+)(?<![&nbsp;])
>>>
>>> Seems to work fairly well. Here is the test code that I am using.
>>>
>>> echo '<pre>';
>>> $string = "normal link http://www.url.com PASS\n";
>>> echo htmlentities(COM_makeClickableLinks($string));
>>> $string = "link with &nbsp; and quotes \"http://www.url.com&nbsp;\" PASS\n";
>>> echo htmlentities(COM_makeClickableLinks($string));
>>> $string = "complicated link \"www.sub.url.com/folder/index.php?id=foo&user=bar&nbsp;\" PASS\n";
>>> echo htmlentities(COM_makeClickableLinks($string));
>>> $string = "problem link \"www.url.com/words&nbsp;\" FAIL\n";
>>> echo htmlentities(COM_makeClickableLinks($string));
>>> echo '</pre>';
>>>
>>> This produces
>>>
>>> normal link <a href="http://www.url.com">www.url.com</a> PASS
>>> link with &nbsp; and quotes "<a
>>> href="http://www.url.com">www.url.com</a>&nbsp;" PASS
>>> complicated link "<a
>>> href="http://sub.url.com/folder/index.php?id=foo&user=bar">sub.url.com/folder/index.php?id=foo&user=bar</a>&nbsp;"
>>> PASS
>>> problem link "<a href="http://url.com/word">url.com/word</a>s&nbsp;" FAIL
>>>
>>> As you can see, the first 3 work; the problem occurs when a URL ends
>>> with any of the characters '&', 'n', 'b', 's', 'p' or ';'.
>>>
>>> So www.url.com/ps&nbsp; would return <a href="http://url.com/">url.com/</a>ps&nbsp;
>>>
>>> This is due to the last bit of the regex, "(?<![&nbsp;])". I tried
>>> just doing (?<!&nbsp;), but it does not work at all because the
>>> previous statement is being too greedy. There is also an issue with
>>> the www. being removed, but that's not too much of a problem at the
>>> moment.
>>>
>>> Also, the COM_makeClickableLinks function can be simplified by removing
>>> the str_replace statement, resulting in simply this:
>>>
>>> function COM_makeClickableLinks( $text )
>>> {
>>> $text = preg_replace(
>>> '/([^"]?)(((ht|f)tps?):(\/\/)|www\.)([a-z0-9%&_\-\+,;=:@~#\/.\?\[\]]+)(?<![&nbsp;])/is',
>>> '\\1<a href="http://\\6">\\6</a>', $text );
>>> return $text;
>>> }
>>>
>>>
>>> In the original regex I was unsure why the "(\/|[+0-9a-z])" part was
>>> included. I don't think it's necessary so I took it out; maybe there was
>>> a particular case that required it which I'm overlooking.
>>>
>>> Anyhow, I will have another crack at it later on. It really is a tough
>>> one, but this is as far as I've got so far.
>>>
>>> Sami
>>>
>>> 2008/7/28 Michael Jervis <mjervis at gmail.com>:
>>>> All (especially Sami!),
>>>>
>>>> There is a bug in the subject function. If it finds
>>>> "http://www.url.com" we end up with  <a
>>>> href=";http://www.url.com ">;http://www.url.com </a>;
>>>>
>>>> Which isn't good.
>>>>
>>>> The original regexp in COM_MakeClickableLinks is:
>>>>
>>>> /([^"]?)((((ht|f)tps?):(\/\/)|www\.)[a-z0-9%&_\-\+,;=:@~#\/.\?\[\]]+(\/|[+0-9a-z]))/is
>>>>
>>>> I think the first match ([^"]?) is spurious, it matches anything other
>>>> than " before a link. So bhttp://www.foo.com" matches, but
>>>> "http://www.foo.com doesn't.
>>>>
>>>> So that gives:
>>>> /((((ht|f)tps?):(\/\/)|www\.)[a-z0-9%&_\-\+,;=:@~#\/.\?\[\]]+(\/|[+0-9a-z]))/is
>>>>
>>>> Resulting in:
>>>> <a href="http:///www.url.com ">http://www.url.com </a>
>>>>
>>>> So, need to add an "ignore trailing &nbsp;" bit to the clause. Closest
>>>> I can get is:
>>>> ((((ht|f)tps?):(\/\/)|www\.)[a-z0-9%&_\-\+,;=:@~#\/.\?\[\]]+(\/|[+0-9a-z]))(?=&nbsp;)
>>>>
>>>> Which results in:
>>>> <a href="http:///www.url.com">http://www.url.com</a>
>>>>
>>>> However, unless there were quotes round the link, it won't match! So
>>>> "http://www.foo.com" matches and is correctly processed, but
>>>> http://www.foo.com is not matched.
>>>>
>>>> My head is now hurt. Any suggestions?
>>>>
>>>> --
>>>> Michael Jervis
>>>> mjervis at gmail.com
>>>> 504B03041400000008008F846431E3543A820800000006000000060000007765
>>>> 62676F642B4F4D4ACF4F0100504B010214001400000008008F846431E3543A82
>>>> 0800000006000000060000000000000000002000000000000000776562676F64
>>>> 504B05060000000001000100340000002C0000000000
>>>> _______________________________________________
>>>> geeklog-devel mailing list
>>>> geeklog-devel at lists.geeklog.net
>>>> http://eight.pairlist.net/mailman/listinfo/geeklog-devel
>>>>
>>>
>>
>
>
>
> --
> Michael Jervis
> mjervis at gmail.com
>