Sitemap generator I created while learning some C#.

An example of using the library is in `ConsoleApp/Program.cs`; the files used for testing are in `ConsoleApp/TestFiles/`.

`ConsoleApp/TestFiles/sitemap.xml` currently contains the sitemap for my website.

If the console application is run with a different URL that targets this same file, the file is overwritten with the new sitemap. There is no need to delete or recreate files manually.

I plan to check for a `robots.txt` while generating sitemaps, to prevent crawling pages that aren't useful.

For now, `robots.txt` is not used; the `SiteMap.Crawl()` function visits the URL provided to the `SiteMap` constructor.

A regex is used to scan each visited page and match URLs with the same base domain; the URLs found are queued for the crawler to visit.

Each time we finish collecting URLs on a page, we move to the next URL in the queue and repeat the process.

Once all URLs have been crawled, an XML sitemap is generated with the URLs sorted by length.
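As a rough sketch of that loop (illustrative only; the variable names, queue handling, and `example.com` URL here are assumptions, not the actual `SiteMap` internals):

```C#
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Text.RegularExpressions;

// Hypothetical sketch of the crawl loop described above.
var visited = new HashSet<string>();
var queue = new Queue<string>();
queue.Enqueue("https://example.com"); // the base URL given to the SiteMap constructor

using var client = new HttpClient();
var linkPattern = new Regex("https?://example\\.com[^\"]*"); // same-base-domain matcher

while (queue.Count > 0)
{
    string url = queue.Dequeue();
    if (!visited.Add(url)) continue; // skip URLs already visited

    string html = await client.GetStringAsync(url);

    // Queue every same-domain URL found on the page for a later visit.
    foreach (Match match in linkPattern.Matches(html))
        if (!visited.Contains(match.Value))
            queue.Enqueue(match.Value);
}

// The sitemap is then written with the URLs sorted by length.
var ordered = visited.OrderBy(u => u.Length).ToList();
```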
I used [sitemaps.org - XML Format](https://www.sitemaps.org/protocol.html) to determine the proper formatting for the sitemap.

For now, since the web application I used for testing does not respond with a [Last-Modified](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Last-Modified) HTTP header, the last modified time is set to the date the sitemap was generated.

The `priority` fields are all set to the default value indicated on sitemaps.org, which is `0.5`. This avoids confusing crawlers with a huge list of 'top-priority' pages to crawl.

All `changefreq` fields of the sitemap are set to `daily`.
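For reference, an entry with those defaults could be written along these lines (a minimal sketch using `XmlWriter` directly; the library's own `XmlManager` presumably does something similar internally, and the element names and namespace come from the sitemaps.org protocol):

```C#
using System;
using System.Xml;

// Sketch: write one <url> entry with the defaults described above.
const string ns = "http://www.sitemaps.org/schemas/sitemap/0.9";

var settings = new XmlWriterSettings { Indent = true };
using var writer = XmlWriter.Create("sitemap.xml", settings);

writer.WriteStartElement("urlset", ns);
writer.WriteStartElement("url", ns);
writer.WriteElementString("loc", ns, "https://example.com/");
writer.WriteElementString("lastmod", ns, DateTime.UtcNow.ToString("yyyy-MM-dd")); // generation date
writer.WriteElementString("changefreq", ns, "daily");
writer.WriteElementString("priority", ns, "0.5");
writer.WriteEndElement(); // url
writer.WriteEndElement(); // urlset
```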
The primary motivation for this project was learning about unmanaged resources in C#, and trying out the [Dispose Pattern](https://docs.microsoft.com/en-us/dotnet/standard/garbage-collection/implementing-dispose?redirectedfrom=MSDN#implement-the-dispose-pattern) for myself.

If anyone reading this finds a problem with the way I handled disposing of the `HttpClient` in the `SiteMap` class, feel free to let me know :) An issue, a PR, or an email are all welcome.
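For context, the general shape of the pattern around an owned `HttpClient` looks like this (a sketch following the linked docs; the class and field names are placeholders, not the actual `SiteMap` source):

```C#
using System;
using System.Net.Http;

// Sketch of the Dispose Pattern around an owned HttpClient,
// following the linked Microsoft docs.
public class SiteMapSketch : IDisposable
{
    private readonly HttpClient _client = new();
    private bool _disposed;

    public void Dispose()
    {
        Dispose(disposing: true);
        GC.SuppressFinalize(this);
    }

    protected virtual void Dispose(bool disposing)
    {
        if (_disposed) return;
        if (disposing)
        {
            // Dispose managed state (managed objects).
            _client.Dispose();
        }
        _disposed = true;
    }
}
```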
### Future plans

* Parse `robots.txt` to avoid crawling pages that are not desired
* Test the generator with an application that serves a `Last-Modified` date, and use it if available
* Set `priority` in a more useful way, or allow some form of customization of how this is handled
* Set `changefreq` in a more useful way, or allow some form of customization of how this is handled
* Generate a regex pattern to match, if one is not provided (see the sketch below)
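For the last item, one possible approach (purely hypothetical; not implemented in the library) would be to derive a same-domain pattern from the base URL's host:

```C#
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not in the library): derive a default same-domain
// pattern from the base URL when the caller does not supply one.
static Regex DefaultPattern(string baseUrl)
{
    string host = Regex.Escape(new Uri(baseUrl).Host); // e.g. "knoats\.com"
    return new Regex($"https?://{host}.*?(?=\")");
}

var pattern = DefaultPattern("https://knoats.com");
```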
For now, the general use of this library is shown in the example below.

```C#
using SiteMapLibrary;

// Create an XmlManager to use for generating our sitemap; provide a file path (and optional XML settings; see ctor)
var mgr = new XmlManager("/home/kapper/Code/klips/dotnet/sitemap/ConsoleApp/TestFiles/sitemap.xml");
// If we want to output the sitemap to the console instead of saving to a file:
// var mgr = new XmlManager("Console.Out");

// Provide a base URL to start crawling, an XmlManager, and a Regex pattern to use for matching URLs while crawling
using SiteMap siteMap = new SiteMap("https://knoats.com", mgr,
    new("(https?://knoats.com(?!.*/dist/|.*/settings/|.*/register/|.*/login/|.*/uploads/|.*/export/|.*/search?).*?(?=\"))"));

// Start crawling; when this returns, we have visited all found URLs and written them to our sitemap
await siteMap.Crawl();
```