I've been playing around with some ColdFusion code to highlight Twitter hashtags in Tweets so that I can link them using ReReplaceNoCase.
I started off with a simple RegEx that has been published all over the Internet, but after seeing several Tweets that had been incorrectly altered, I decided to experiment to see if I could improve on it.
The original RegEx I was using looked like this (the doubled ## is ColdFusion's escape for a literal # inside a string):
##([a-z0-9_\-]+)
The problem was that although it identified Twitter hashtags correctly, it also matched the #39 inside numeric HTML entities such as &#39; and wrapped them in the same anchor markup I was using to link the hashtags.
Running the RegEx on the following string
@NewMediaDev #RegEx This is very cool &#39;
results in:
@NewMediaDev <a href="http://twitter.com/search?q=RegEx">#RegEx</a> This is very cool &<a href="http://twitter.com/search?q=39">#39</a>;
Clearly this is incorrect.
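To reproduce the problem, something like the following cfscript sketch should do. The variable names (tweet, linked) and the use of htmlEditFormat to display the generated markup are my own assumptions for illustration; only the pattern and the sample string come from the text above.

```cfm
<cfscript>
// Sample tweet from above. Inside a ColdFusion string, ## is a literal #.
tweet = "@NewMediaDev ##RegEx This is very cool &##39;";

// Original pattern: any # followed by letters, digits, _ or - becomes a link,
// which also catches the #39 inside the &#39; entity.
linked = ReReplaceNoCase(
    tweet,
    "##([a-z0-9_\-]+)",
    "<a href=""http://twitter.com/search?q=\1"">##\1</a>",
    "ALL"
);

// Escape the result so the generated markup is visible in the browser.
writeOutput(htmlEditFormat(linked));
</cfscript>
```

The entity ends up split by an anchor, exactly as in the broken output shown above.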
After a few hours of experimentation, I ended up with a new, improved (if somewhat longer) RegEx that works as it should, only turning #RegEx into a hashtag search link.
The new regular expression is:
##(([a-z_\-]+[0-9_\-]*[a-z0-9_\-]+)|([0-9_\-]+[a-z_\-]+[a-z0-9_\-]+))
Using this RegEx to replace the hashtags with linked equivalents on the same string returns the following:
@NewMediaDev <a href="http://twitter.com/search?q=RegEx">#RegEx</a> This is very cool &#39;
Marvelous!
UPDATE: After more experimenting, I realised that the RegEx I've outlined above misses any hashtag that is purely numerical, so I threw the question out to the Universe, and Mahcsig returned an answer that put me on the right track.
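A quick way to see the gap is to test the longer RegEx with REFindNoCase, which returns the position of the first match or 0 when nothing matches. This is only a sketch: the variable name improved and the sample hashtag #2010 are made up for illustration.

```cfm
<cfscript>
// The longer RegEx from above, stored in a (hypothetical) variable.
improved = "##(([a-z_\-]+[0-9_\-]*[a-z0-9_\-]+)|([0-9_\-]+[a-z_\-]+[a-z0-9_\-]+))";

// A hashtag containing letters is found (non-zero position).
writeOutput(REFindNoCase(improved, "##RegEx"));
writeOutput(" / ");

// A purely numerical hashtag is not: both alternatives require at least
// one letter, underscore or hyphen, so this returns 0.
writeOutput(REFindNoCase(improved, "##2010"));
</cfscript>
```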
Essentially, he suggested I use the RegEx (?:[^&]|^)##([a-z0-9_\-]+), which matches hashtags that don't have an & in front of them.
This was close but still not quite right: the match also included the character in front of the hashtag, whether a space, a comma or, in fact, any leading character other than &, so that character was lost in the replacement.
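The lost character is easy to see with a small cfscript sketch. The sample tweet and the replacement string here are assumptions of mine; only the pattern is the one suggested above.

```cfm
<cfscript>
// Hypothetical sample with a space in front of the hashtag.
tweet = "Loving ##RegEx today";

// The non-capturing group (?:...) still consumes the leading character,
// so that character is replaced along with the hashtag.
linked = ReReplaceNoCase(
    tweet,
    "(?:[^&]|^)##([a-z0-9_\-]+)",
    "<a href=""http://twitter.com/search?q=%23\1"">##\1</a>",
    "ALL"
);

// Escape the result so the generated markup is visible in the browser.
writeOutput(htmlEditFormat(linked));
</cfscript>
```

In the result the anchor sits directly against "Loving", because the space was part of the match and nothing puts it back.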
For the ReReplaceNoCase to work as I needed, I removed the ?: so that the leading character is captured as the first group and the hashtag text as the second. I was then able to use the back references to replace the hashtag with a link whilst keeping the leading character intact, like so:
ReReplaceNoCase(arguments.tweet, "([^&]|^)##([a-z0-9_\-]+)", "\1<a href=""http://twitter.com/search?q=%23\2"">##\2</a>", "ALL")
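For context, the call above could live in a small helper along these lines. The function name linkHashtags and the sample tweet are hypothetical; the pattern and replacement string are the ones from the line above. Treat it as a sketch rather than the exact function I use.

```cfm
<cfscript>
// Hypothetical helper wrapping the final ReReplaceNoCase call.
function linkHashtags(tweet) {
    // \1 puts back the character in front of the hashtag (or nothing at the
    // start of the string); \2 is the hashtag text without the #.
    return ReReplaceNoCase(
        arguments.tweet,
        "([^&]|^)##([a-z0-9_\-]+)",
        "\1<a href=""http://twitter.com/search?q=%23\2"">##\2</a>",
        "ALL"
    );
}

// Sample tweet with a hashtag and an HTML entity (## is a literal #).
sample = "@NewMediaDev ##RegEx This is very cool &##39;";

// Escape the result so the generated markup is visible in the browser.
writeOutput(htmlEditFormat(linkHashtags(sample)));
</cfscript>
```

Because the & in &#39; is excluded by [^&], the entity is left alone, while #RegEx is linked with its leading space preserved.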
