klips/dotnet/sitemap

ConsoleApp/
SiteMapLibrary/
.gitignore
README.md
sitemap.sln

README.md

Sitemap generator I created while learning some C#. An example of using the library is in ConsoleApp/Program.cs, and the files used for testing are in ConsoleApp/TestFiles/. ConsoleApp/TestFiles/sitemap.xml currently contains the sitemap for my website. If we run the console application with a different URL that targets this same file, the file is overwritten with the new sitemap; there is no need to delete or recreate files manually.

I plan to check for a robots.txt while generating sitemaps to prevent crawling pages that aren't useful. For now robots.txt is not used; the SiteMap.Crawl() function visits the URL provided to the SiteMap constructor. A regex is used to scan each visited page and match URLs with the same base domain, and the URLs found are queued for the crawler to visit. Each time we finish collecting URLs on a page, we move to the next URL in the queue and repeat the process. Once all URLs have been crawled, an XML sitemap is generated with the URLs sorted by their length.
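
The loop described above looks roughly like the sketch below; the class and member names here are illustrative placeholders of my own and are not taken from SiteMapLibrary.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

// Illustrative crawl loop only; not the actual SiteMapLibrary implementation
public class CrawlerSketch
{
    private readonly HttpClient _client = new HttpClient();
    private readonly Regex _urlPattern;
    private readonly Queue<string> _pending = new Queue<string>();
    private readonly HashSet<string> _visited = new HashSet<string>();

    public CrawlerSketch(string baseUrl, Regex urlPattern)
    {
        _urlPattern = urlPattern;
        _pending.Enqueue(baseUrl);
    }

    public async Task<List<string>> CrawlAsync()
    {
        while (_pending.Count > 0)
        {
            string url = _pending.Dequeue();
            if (!_visited.Add(url))
                continue;                              // Already visited this URL

            string page = await _client.GetStringAsync(url);
            foreach (Match match in _urlPattern.Matches(page))
            {
                if (!_visited.Contains(match.Value))
                    _pending.Enqueue(match.Value);     // Queue newly discovered URLs
            }
        }
        // Sort the collected URLs by length before writing the sitemap
        return _visited.OrderBy(u => u.Length).ToList();
    }
}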

I used sitemaps.org - XML Format to determine the proper formatting for the sitemap. For now, since the web application I used for testing does not return a Last-Modified HTTP header, the last modified time is set to the date the sitemap was generated. The priority fields are all set to 0.5, the default value indicated on sitemaps.org; this avoids confusing crawlers with a huge list of 'top-priority' pages to crawl. All changefreq fields of the sitemap are marked as daily.
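
For reference, an entry in the generated sitemap follows the sitemaps.org layout, roughly like this (the URL and date below are placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://knoats.com/</loc>
    <lastmod>2022-05-04</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>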

The primary motivation for this project was learning about unmanaged resources in C# and trying out the Dispose Pattern for myself. If anyone reading this finds a problem with the way I handled disposing of the HttpClient in the SiteMap class, feel free to let me know :) Creating an issue, opening a PR, or sending an email are all acceptable.
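
For context, a minimal sketch of the Dispose Pattern wrapped around an HttpClient looks something like the following; the class and field names are placeholders, not the actual SiteMap code.

using System;
using System.Net.Http;

// Minimal Dispose Pattern sketch; not copied from the SiteMap class
public class HttpResourceHolder : IDisposable
{
    private readonly HttpClient _client = new HttpClient();
    private bool _disposed;

    public void Dispose()
    {
        Dispose(true);
        GC.SuppressFinalize(this);      // No finalizer work left once Dispose has run
    }

    protected virtual void Dispose(bool disposing)
    {
        if (_disposed)
            return;

        if (disposing)
        {
            _client.Dispose();          // Release the managed HttpClient
        }

        _disposed = true;
    }
}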

Future plans

  • Parse robots.txt to avoid crawling pages that are not desired
  • Test the generator with an application that serves a Last-Modified date, and use it if available
  • Set priority in a more useful way, or allow some form of customization of the way this is handled.
  • Set changefreq in a more useful way, or allow some form of customization of the way this is handled.
  • Generate a regex pattern to match, if one is not provided

For now, general usage of this library is shown in the example below.

using SiteMapLibrary;

// Create an XmlManager to use for generating our sitemap; Provide a file path (and optional Xml settings; See ctor)
var mgr = new XmlManager("/home/kapper/Code/klips/dotnet/sitemap/ConsoleApp/TestFiles/sitemap.xml");
// If we want to output the sitemap to the console, instead of saving to a file
// var mgr = new XmlManager("Console.Out");

// Provide a base URL to start crawling, an XmlManager, and a Regex pattern to use for matching URLs while crawling
using SiteMap siteMap = new SiteMap("https://knoats.com", mgr,
  new("(http?s://knoats.com(?!.*/dist/|.*/settings/|.*/register/|.*/login/|.*/uploads/|.*/export/|.*/search?).*?(?=\"))"));
// Start crawling; When this returns, we have visited all found URLs and written them to our sitemap
await siteMap.Crawl();