Cogitatio materialis est

How to customize markdown link syntax?

1st Nov 2013 Tags: #development #hacks #markdown #php

My blog has a lot of links to other sites, and it's good practice to add rel="nofollow" to links that point at untrusted sites. Since I write articles in markdown, I found a way to customize the markdown parser's behaviour so that it adds these extra attributes to html <a> tags.

Idea

The idea is dead simple: loop through all links and, if a link is external, add rel="nofollow" target="_blank". Which links are external? You can decide that for yourself. For example, change the syntax by adding ! before URLs and check for that ([1]:!http://my-site.com/article1), and so on. My variant is below.
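
The ! marker variant could be checked with a tiny helper like the one below. This is only a sketch: the function name `split_marked_url` and its return shape are made up for illustration and are not part of any markdown library.

```php
<?php
/**
 * Hypothetical helper: detect a leading "!" marker on a URL.
 * Returns array(clean URL, whether the marker was present).
 */
function split_marked_url(string $url): array
{
    if ($url !== '' && $url[0] === '!') {
        // drop the marker, keep the rest of the URL
        return array((string) substr($url, 1), true);
    }
    return array($url, false);
}

list($url, $is_external) = split_marked_url('!http://my-site.com/article1');
// $url is now 'http://my-site.com/article1', $is_external is true
```

The downside of this variant is that the articles stop being plain markdown, so other parsers would see the ! as part of the URL.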

A link is external if it is not relative/local :) So I do nothing if the href:

  • starts with / (relative links)
  • starts with # (anchor links)
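
Written as code, the rule above might look like this. It is a minimal sketch; the function name `is_external_href` is made up, and treating an empty href as "not external" is my assumption.

```php
<?php
/**
 * Sketch of the rule above: a link is local when its href starts
 * with "/" or "#"; everything else counts as external.
 */
function is_external_href(string $href): bool
{
    if ($href === '') {
        return false; // assumption: an empty href is left alone
    }
    return $href[0] !== '/' && $href[0] !== '#';
}
```

Keep in mind that under this rule hrefs like `page.html` or `mailto:...` count as external too, because they start with neither / nor #.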

I ended up with at least two ways to do that.

Method #1 (php only)

If you use a flat-file CMS like Pico or Phile, it's good practice to write a plugin. Since this is easy to implement in plain php, let's do it.

Code it

My resulting plugin for PhileCMS is on github, and here is some demo-code:

/**
 * Customize links in document.
 *
 * Use it if you want to add extra params to external links.
 * Now available: rel="nofollow" and target="_blank".
 *
 * Modifies <a> tags, if any are found.
 * Leaves a link untouched if its "href":
 * * starts with '/' or '#'
 **/
// ...
$content  = 'your page content';
$content_encoding = '<YOUR CONTENT ENCODING>'; // like 'UTF-8'

$dom = new DOMDocument();
// convert to neutral 'html-entities' encoding first
// (note: mbstring's 'HTML-ENTITIES' encoding is deprecated since PHP 8.2)
$rightEncodingHtml = mb_convert_encoding($content, 'HTML-ENTITIES', $content_encoding);
$dom->loadHTML( $rightEncodingHtml );

foreach ( $dom->getElementsByTagName("a") as $a_tag )
{
  // getAttribute() returns '' when "href" is missing,
  // so this also skips empty hrefs before $href_url[0] is read
  $href_url = $a_tag->getAttribute("href");
  if ($href_url === '')
    continue;

  $start_with_slash = ($href_url[0] == '/');
  $start_with_hash  = ($href_url[0] == '#');

  // local link: leave it untouched
  if ($start_with_slash || $start_with_hash)
    continue;

  $a_tag->setAttribute("target", "_blank");
  $a_tag->setAttribute("rel",    "nofollow");
}

// strip the DOCTYPE and the <html>/<body> wrappers added by saveHTML()
$tmp = str_replace(array('<html>', '</html>', '<body>', '</body>'), '', $dom->saveHTML());
$tmp = preg_replace('/^<!DOCTYPE.+?>/', '', $tmp);

// convert encoding back
$content = mb_convert_encoding($tmp, $content_encoding, 'HTML-ENTITIES');
// ...

Pros

+ no need to edit the cms/parser core

Cons

- The `saveHTML()` method completes your code into *correct* html, adding a DOCTYPE, closing all tags and so on. Is that the behaviour you want?
- If you use non-latin text, you have to encode it explicitly into html-entities and then back. If your articles contain html code examples escaped as html-entities, they get decoded back into plain html.
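
Both drawbacks can probably be softened without leaving the DOM approach. The sketch below is not what my plugin does: it swaps the string cleanup for `saveHTML($node)` on the children of `<body>`, and it swaps the html-entities round trip for the commonly used `<?xml encoding="UTF-8">` prefix. The function name is made up and UTF-8 input is assumed; check the output against your own content before relying on it.

```php
<?php
/**
 * Sketch: same link rewriting as above, but the result is read back
 * from the children of <body>, so there is no DOCTYPE/<html>/<body>
 * to strip afterwards. Assumes $html is a UTF-8 encoded fragment.
 */
function add_external_link_attrs(string $html): string
{
    $dom = new DOMDocument();

    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // The prefix is a common trick to make libxml read the input as UTF-8.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href'); // '' when the attribute is missing
        if ($href === '' || $href[0] === '/' || $href[0] === '#') {
            continue; // local link, leave it untouched
        }
        $a->setAttribute('target', '_blank');
        $a->setAttribute('rel', 'nofollow');
    }

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return $html; // nothing was parsed, return the input unchanged
    }

    // Serialize only what is inside <body>.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}
```

Since nothing is converted back from html-entities here, escaped code examples such as `&lt;b&gt;` should stay escaped. Unclosed tags are still closed by the parser, so that part of the first drawback remains.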

Method #2 (edit php-markdown parser)

If you cannot write a plugin, or the cons of method #1 are critical for you, let's go deeper and simpler: do it while the .md file is parsed and the html code is generated.

Code it

Take a look at the php-markdown library (if you use another markdown parser, the idea is the same).

There are two functions:

  • _doAnchors_reference_callback($matches)
  • _doAnchors_inline_callback($matches)

for reference-style links and inline links respectively.

NOTE: actually, there are 4 functions: 2 for php-markdown and 2 for php-markdownExtra. Edit the ones you are using.

Since I use php-markdownExtra, I edited its two functions.

The resulting commits look like this (on github):

---
 Michelf/Markdown.php | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/Michelf/Markdown.php b/Michelf/Markdown.php
index 088b7cd..1210fc0 100644
--- a/Michelf/Markdown.php
+++ b/Michelf/Markdown.php
@@ -2300,7 +2300,17 @@ protected function _doAnchors_reference_callback($matches) {
    }
    if (isset($this->ref_attr[$link_id]))
      $result .= $this->ref_attr[$link_id];
+
+   /* check $url, if it external(absolute) or local(relative), so
+    * do nothing, if $url is local:
+    * starts with '/' or '#'
+    **/
+   if ( $url[0] != '/' && $url[0] != '#' )
+   {
+     $result .= ' rel="nofollow" target="_blank"';
+   }
+
    $link_text = $this->runSpanGamut($link_text);
    $result .= ">$link_text</a>";
    $result = $this->hashPart($result);

@@ -2326,7 +2336,17 @@ protected function _doAnchors_inline_callback($matches) {
      $result .=  " title=\"$title\"";
    }
    $result .= $attr;
+
+   /* check $url, if it external(absolute) or local(relative), so
+    * do nothing, if $url is local:
+    * starts with '/' or '#'
+    **/
+   if ( $url[0] != '/' && $url[0] != '#' )
+   {
+     $result .= ' rel="nofollow" target="_blank"';
+   }
+
    $link_text = $this->runSpanGamut($link_text);
    $result .= ">$link_text</a>";
-- 

Pros

+ easy to implement
+ no trouble with encodings
+ easy to cache result, if you use some cache-engine for resulting html output

Cons

- need to edit parser core
- cannot use full URLs to your own site inside articles (only relative ones, `/content...`), because they would be treated as external
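
The last drawback could be worked around by also comparing the link's host with your own. Below is a rough sketch of such a check, not part of my commits: the function name `extra_link_attrs` and the `$ownHost` parameter are made up. It also keeps the check in one place instead of repeating it in both callbacks.

```php
<?php
/**
 * Sketch: build the extra attributes for a link, treating absolute
 * URLs that point at your own host as local.
 * Returns '' for local links, or the attributes with a leading space.
 */
function extra_link_attrs(string $url, string $ownHost): string
{
    // same rule as in the patch: '/' and '#' mean local
    if ($url === '' || $url[0] === '/' || $url[0] === '#') {
        return '';
    }

    // parse_url() gives null (or false) when there is no usable host
    $host = parse_url($url, PHP_URL_HOST);
    if (is_string($host) && strcasecmp($host, $ownHost) === 0) {
        return ''; // full URL to your own site
    }

    // the leading space keeps it separated from the previous attribute
    return ' rel="nofollow" target="_blank"';
}
```

Inside the callbacks it would be used as `$result .= extra_link_attrs($url, 'my-site.com');`. The host comparison is exact, so `www.my-site.com` and `my-site.com` count as different hosts in this sketch.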

Inspired by this answer on SO.