import ModalBase from "./ModalBase";
import './ModalDiscordBot.scss';

import downloader from "../../assets/projects/discord-bot/video-downloader.png";
import spamPreventionWEBP from "../../assets/projects/discord-bot/spam-prevention.webp";
import spamPreventionPNG from "../../assets/projects/discord-bot/spam-prevention.png";
import imageProperties from "../../assets/projects/discord-bot/image-properties.png";
import phash from "../../assets/projects/discord-bot/phash.png";
import phishing from "../../assets/projects/discord-bot/phishing-list2.jpg";

export default function ModalDiscordBot() {
  return (
    <ModalBase id="modal-discord-bot">
      <h1>Discord Bot</h1>
      <p>
        A bot for the Discord chat platform that provides community moderation, hyperlink safety, spam detection, translations,
        and video downloading.
      </p>
      <p>
        This is my biggest project to date.  It went through five major versions and totals 40,000 lines of code, with
        each file kept under 1,000 lines and well documented.
      </p>
      <p>
        Skills gained: version control handling, databases, APIs, perceptual hashing, AWS Lightsail, and tracking/deploying fixes.
      </p>

      {/* <a className="btn btn-secondary mb-5" href="#" target="_blank">Source code</a> */}

      <h2>Community Safety</h2>
      <p>The bot protects users from malicious URLs and malicious files, prevents spam, and keeps GPS data from leaking.</p>

      <h2>Spam prevention</h2>
      <p>
        At the time, there was one main type of spam that we kept seeing: get-rich-quick crypto market schemes targeted at investors.
      </p>
            
      <p>
        The wording of all of these appeared to follow a pattern, but that pattern needs to be expressed in a way a computer
        can understand.  To find this pattern, I collected as many spam messages as I could over several months:
      </p>

      
      <code id="scams" aria-hidden="false">
        <p>
          I&apos;ll <span className="teach">help</span> the <span className="person">first 20 people</span> 
          <span className="interest"> interested</span> to <span className='earn'>earn $50k</span> on the 
          <span className="earn-crypto"> crypto market</span> <span className="time">72hours</span> send me a 
          <span className="contact"> dm</span>!! ask me HOW ! <span className="contact-external">+1 (123) 456-7890</span>
      </p>

        <p>
          I will <span className="teach">teach</span> you how to <span className='earn'>earn $250k</span> or more. 
          If <span className="interest">interested</span> send me a <span className="contact">direct message</span>. 
          <span className="contact"> Contact</span> via <span className="contact">WhatsApp</span>
          <span className="contact-external"> +1 (123) 456-7890</span> or <span className="contact-external">https://t.me/lorem-ipsum1234</span>
      </p>


        <p>
          I&apos;ll <span className="teach">help</span> <span className="person">10 people</span> on how to 
          <span className="earn"> earn $47,499</span> in just <span className="time">72hours</span> from the 
          <span className="earn-crypto"> crypto market</span>. Send me a <span className="contact">friend request </span> 
          or <span className="contact"> drop a message</span> to know (<span className="h">How</span>)
        </p>

        <p>
          I&apos;ll <span className="teach">teach</span> <span className="person">anyone</span> how to trade
          and <span className="earn"> earn $40k-$60k</span> within <span className="time">72 hours</span> from
          the <span className="earn-crypto"> crypto market</span> but you will pay me 10% of your profit when you receive it. 
          Note only <span className="interest">interested</span> people should apply, <span className="contact">drop a message</span> let&apos;s
          get started by asking (<span className="l">HOW</span>)
        </p>

        <p>
          I&apos;ll <span className="teach">Teach</span> <span className="person">Anyone</span> on how 
          to <span className="earn">earn $30k, $50k, $100k, $200k</span> from the <span className="earn-crypto">Crypto market </span>
          in <span className="time"> 3 trading days</span>, but you’ll pay me a 10% commission after receiving your profits, 
          if <span className="interest">interested</span> send me a <span className="contact">direct message </span>
          or <span className="contact"> contact</span> me via <span className="contact">WhatsApp </span> 
          let&apos;s get started <span className="contact-external">+1 (123) 456-7890</span>
        </p>
      </code>


      <p>
        Once enough messages were collected, it became clear that the spam texts shared almost all the same words,
        with slight variations in word order, numbers, contact methods, and random capitalization.
      </p>
      <p>
        The original plan was to write a simple regular expression that could target these, but that requires words to
        appear in predictable positions.  All the variations made this impractical, so the plan was scrapped for a better one.
      </p>
      <p>
        Instead of matching whole sentences with specific word positions, we can match the common parts in each.
        These parts are:
      </p>
      <ol>
        <li>
          <strong>Contact method. </strong>
          <span>
            Line number, Telegram, WhatsApp, or Twitter.  Spammers push some off-platform contact method because the
            account they are spamming from is about to get deleted.  We place the most emphasis on this part, as it is
            unusual in most contexts.
          </span>
        </li>
        <li>
          <strong>Money to earn. </strong>
          <span>
            Somewhere in the text is a dollar amount, the word earn, and the phrase crypto market.  This is the hook
            to get people <span className="interest">interested</span>.
          </span>
        </li>
        <li>   
          <strong>Misc phrases. </strong>
          <span>10%, x people, how to earn, how to trade, how to trade or earn, x hours, x days, etc.</span>
        </li>
      </ol>

      <p>
        Putting these parts together creates a very rough semantic analysis engine.  The accuracy is not 100%, but
        detecting the majority of these spam texts is enough.  The engine can be fine-tuned by testing it
        on both spam and non-spam text and adjusting the weight of each keyword until spam is caught and
        non-spam is ignored.
      </p>
      <picture>
        <source srcSet={spamPreventionWEBP} type="image/webp" />
        <img src={spamPreventionPNG}
          alt="Knobs to select how weight is applied to each keyword, and the output threshold needed to be detected as spam."
          width="500"
          className="mx-auto d-block img-fluid mb-4"
        />
      </picture>
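      {/*
        Developer note: a minimal sketch of the weighted keyword scoring described above.  The patterns,
        weights, and threshold are illustrative assumptions, not the values the bot actually uses.

```javascript
// Hypothetical rules: each common part of the spam gets a pattern and a weight.
const RULES = [
  // Off-platform contact methods carry the most weight.
  { name: "contact", weight: 3, pattern: /\b(whatsapp|telegram)\b|t\.me\/\S+|\+\d[\d\s().-]{7,}\d/i },
  // The word "earn" followed by a dollar amount in the same sentence.
  { name: "money", weight: 2, pattern: /\bearn\b[^.!?]*\$\s?\d/i },
  { name: "crypto", weight: 2, pattern: /\bcrypto\s+market\b/i },
  // Misc phrases: x hours, x days, x people, 10%, how to earn, how to trade.
  { name: "misc", weight: 1, pattern: /\b\d+\s?(hours|days|people)\b|\b10%|\bhow to (earn|trade)\b/i },
];

// Hypothetical threshold: total weight needed to flag a message.
const SPAM_THRESHOLD = 5;

// Add up the weight of every rule that matches the text.
function scoreMessage(text) {
  let score = 0;
  for (const rule of RULES) {
    if (rule.pattern.test(text)) score += rule.weight;
  }
  return score;
}

function isSpam(text) {
  return scoreMessage(text) >= SPAM_THRESHOLD;
}
```

        Raising a weight or lowering SPAM_THRESHOLD catches more spam at the cost of more false positives,
        which is why the engine should be tuned against both spam and non-spam samples.
      */}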

      <h2>Unsupported image types and GPS data</h2>
      <p>
        Users can accidentally reveal their exact GPS location by uploading the wrong image types.  This behavior is unexpected and
        dangerous to the average user.
      </p>
      <p>
        Pictures taken by camera phones can store EXIF metadata, which contains information about how the photo was taken, including
        GPS data.  When an image is uploaded, Discord automatically strips out all this data.  However, this only happens for image
        file types that Discord recognizes.  Unknown image types are not handled as images but as generic files,
        and they are uploaded as-is.
      </p>
      <img className="img img-fluid mb-4 d-block mx-auto" style={{maxHeight: '500px'}} src={imageProperties}/>
      <p>
        To fix this, the app will do the following:
      </p>
      <ol>
        <li>Scan for these unhandled image types.</li>
        <li>Download the image and check metadata for GPS data.</li>
        <li>If GPS data is not found, stop here.</li>
        <li>Otherwise delete the image from the conversation as a precaution.</li>
        <li>Inform the user why the image was deleted.</li>
        <li>Give the user the option to opt out of automatic deletion.</li>
      </ol>
      <p>
        Steps 5 and 6 are important because it is also possible the user is uploading an image with GPS data that doesn&apos;t belong
        to them.  Maybe they downloaded an image of Times Square off Google Images and they reuploaded it.
      </p>
      <p>
        For cases such as these, we want to always explain what is going on and we can also allow the user to opt-out of the automatic
        deletion.  Making incorrect assumptions about what happened and automatically acting on it can be frustrating for the user.
      </p>
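      {/*
        Developer note: a rough sketch of the decision flow in the steps above.  The extension list and
        the helpers in `deps` (hasOptedOut, readGps, deleteMessage, notifyUser) are hypothetical names
        used for illustration; they stand in for whatever the bot really uses.

```javascript
// Hypothetical list of image extensions that would be uploaded as generic files.
const UNHANDLED_IMAGE_EXTENSIONS = new Set([".heic", ".tif", ".tiff"]);

// Step 1: decide from the file name whether this is an unhandled image type.
function isUnhandledImage(filename) {
  const dot = filename.lastIndexOf(".");
  if (dot === -1) return false;
  return UNHANDLED_IMAGE_EXTENSIONS.has(filename.slice(dot).toLowerCase());
}

// Steps 2 to 6: check for GPS data, delete, and explain what happened.
async function handleAttachment(attachment, deps) {
  if (!isUnhandledImage(attachment.filename)) return "ignored";
  // Respect users who opted out of automatic deletion earlier.
  if (await deps.hasOptedOut(attachment.userId)) return "opted-out";
  const gps = await deps.readGps(attachment.url);
  if (!gps) return "clean";
  await deps.deleteMessage(attachment.messageId);
  await deps.notifyUser(
    attachment.userId,
    "Your image was removed because it contained GPS data. You can opt out of automatic deletion."
  );
  return "deleted";
}
```

        Passing the helpers in through `deps` keeps the flow testable without a live Discord connection.
      */}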

      <h2>Link safety</h2>
      <div className="row">
        <div className="col-12 col-lg-5 order-lg-1">
          <img id="phishing-image" src={phishing} className="img-fluid img justify-content-center d-block mx-auto"
          alt="A list of websites written using code format."/>
        </div>
        <div className="col-12 col-lg-7">
          <p>
            To protect users from phishing websites, we found an API that provides a list of over 25,000 of them.
          </p>
          <p>
            When using an API, server etiquette dictates reducing the load on it where possible.  This API in particular tells you
            exactly how to do this.
          </p>
          <p>
            Call the API once to grab the whole list, save it locally, and compare each link locally. This is better
            than making a call for each link comparison.  You can also use a websocket to listen for small changes to the list and update
            it over time.  It&apos;s still a good idea to grab the whole list at least once every day to start clean, because a
            websocket connection can drop and miss updates.
          </p>
        </div>

      </div>
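      {/*
        Developer note: a minimal sketch of the local lookup described above.  The class name and the
        shape of the websocket update ({ type, domains }) are assumptions for illustration, not the
        real API's format.

```javascript
// Keeps the full list in memory so each link is checked without a network call.
class PhishingList {
  constructor() {
    this.domains = new Set();
  }

  // Replace everything, for example after the daily full download.
  replaceAll(domains) {
    this.domains = new Set(domains.map((d) => d.toLowerCase()));
  }

  // Apply a small change pushed over the websocket (assumed update shape).
  applyUpdate(update) {
    for (const domain of update.domains) {
      if (update.type === "add") this.domains.add(domain.toLowerCase());
      if (update.type === "delete") this.domains.delete(domain.toLowerCase());
    }
  }

  // Compare the link's hostname against the local list.
  isPhishing(link) {
    let hostname;
    try {
      hostname = new URL(link).hostname;
    } catch {
      return false; // not a parseable URL
    }
    return this.domains.has(hostname);
  }
}
```

        A Set keeps each lookup fast even with tens of thousands of entries, and replaceAll gives the
        daily clean start in one call.
      */}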
      
      
      <p>

      </p>

      <h2 style={{clear: "right"}}>Image banning</h2>
      <p>
        Unwanted images, whether sexually explicit, graphic, or otherwise, can be blocked with
        the <a rel='nofollow' href="https://cloud.google.com/vision?hl=en#demo">Google SafeSearch API</a> or with image hashing.
      </p>

      <h3>Google SafeSearch</h3>
      <p>
        SafeSearch grades images on a likelihood scale of 1 to 5 (from VERY_UNLIKELY to VERY_LIKELY), with a higher number meaning the image is more likely to be obscene.
      </p>
      <p>
        The downside is how often it produces false positives, even on images with a 5/5 score.  Images with colors that
        are close to skin tone are more likely to trigger them.  Over time the technology will get better.
      </p>

      <h3>Image Hashing</h3>
      <p>
        Image hashing compares the similarity of two images using perceptual hashing.  Start by downsizing the
        image to a very small resolution, which averages neighboring pixels together in the process.  The resolution needs
        to be the same for each image, so that the pixel locations line up.  Finally, compute the hash (a text value) by
        converting each pixel&apos;s RGB value to a character, with the pixel&apos;s x,y position deciding the character&apos;s position.
      </p>
      <img className="img img-fluid" src={phash}/>
      <p>
        You can now compare two images by checking how similar their hashes are with an algorithm such as Levenshtein distance, which counts
        the number of edits needed to turn one hash into the other.  A lower number means higher similarity. This check is resistant to
        resizing because both images are downsized to the same resolution.  Converting the image to greyscale first also makes it resistant to
        color filters.
      </p>
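      {/*
        Developer note: a small sketch of hashing and comparing, assuming the image has already been
        downsized and converted to a greyscale grid of 0-255 values.  The one-hex-character-per-pixel
        encoding and the maxEdits threshold are illustrative choices, not the bot's exact scheme.

```javascript
// Turn a downsized greyscale grid into a hex string:
// one character per pixel, in row-major order.
function hashPixels(grid) {
  return grid.flat().map((value) => Math.floor(value / 16).toString(16)).join("");
}

// Classic Levenshtein distance using two rolling rows.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // deletion
        curr[j - 1] + 1, // insertion
        prev[j - 1] + cost // substitution
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical threshold: how many edits still count as the same image.
function isSimilar(hashA, hashB, maxEdits = 2) {
  return levenshtein(hashA, hashB) <= maxEdits;
}
```

        Because both hashes come from grids of the same size, a small distance means the two images
        differ in only a few pixels.
      */}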
      <p>
        Image hashing fails on high-detail images that share a similar structure, such as driver&apos;s licenses, essays, and bar codes,
        because different images of this kind look nearly identical once downsized.  For this reason, each banned image needs to be carefully chosen.
      </p>
      <h2>YouTube Downloading</h2>
      <p>
        To download YouTube videos for offline use, a combination of yt-dlp, ffmpeg, an AWS bucket, and Cheerio was used.
      </p>
      <p>
        The process for downloading is:
      </p>
      <ol>
        <li>If needed, convert clips to YouTube video IDs.</li>
        <li>Get formats list with yt-dlp and determine the best ones.</li>
        <li>Attempt direct upload if possible.</li>
        <li>Otherwise convert it so that it will upload.</li>
      </ol>
      <img
        src={downloader} id="yt-chart" className="img-fluid img-thumbnail mx-auto d-block"
        alt="Flow chart of video downloader explaining how a download is done."
      />

      <h3>1. Convert clips</h3>
      <p>
        YouTube clips are made by trimming YouTube videos down to a start and stop time so short portions of videos
        can be shared.  They aren&apos;t handled the same way as regular YouTube videos, and can&apos;t be read
        by yt-dlp.  However, we can grab the original video from the clip by reading the link.
      </p>
      <p>
        Using the clip link, we first fetch the page and then parse the HTML with Cheerio.  The meta tags
        contain the original video ID, which can be used to build a regular link that yt-dlp can use.
      </p>
      <h3>2. Formats</h3>
      <p>
        After running yt-dlp, a formats list is given back with roughly 8 to 15 different file types.  We sort these formats
        in descending order with a score calculator that compares quality and file size, and break them down
        into three categories:
      </p>
      <ul>
        <li>Video only</li>
        <li>Audio only</li>
        <li>Combination</li>
      </ul>
      <p>
        We aim for a combination format that contains both video and audio, with a file size under the Discord upload limit.
        If one exists, the link can be passed directly to Discord for streaming, the total wait time is around 4 seconds, and we can stop
        here. If not, we have to run a conversion, which averages 3 minutes.
      </p>
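      {/*
        Developer note: a sketch of sorting the formats into the three categories and picking one for
        direct upload.  It assumes the vcodec, acodec, filesize, filesize_approx, and height fields from
        yt-dlp's JSON output; the height-based ranking and the uploadLimit parameter are stand-ins for
        the bot's real score calculator and limit.

```javascript
// yt-dlp marks a missing stream with the codec value "none".
function categorize(format) {
  const hasVideo = Boolean(format.vcodec) && format.vcodec !== "none";
  const hasAudio = Boolean(format.acodec) && format.acodec !== "none";
  if (hasVideo && hasAudio) return "combination";
  if (hasVideo) return "video";
  if (hasAudio) return "audio";
  return "unknown";
}

// Pick the tallest combination format whose known size fits under the limit.
// Returns null when nothing qualifies, meaning a conversion is needed.
function pickDirectUpload(formats, uploadLimit) {
  const candidates = formats.filter((format) => {
    const size = format.filesize ?? format.filesize_approx;
    return categorize(format) === "combination" && typeof size === "number" && size < uploadLimit;
  });
  candidates.sort((a, b) => (b.height ?? 0) - (a.height ?? 0));
  return candidates[0] ?? null;
}
```

        Formats without a reported size are skipped here, since there is no way to tell ahead of time
        whether they would fit.
      */}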
      <h3>2.5. Remote link</h3>
      <p>
        As a fallback option, we upload the video to a hosting server, and send a link back to the user.
        The video does not play in Discord&apos;s native video player; it is sent as a link instead.
        This option is used because uploading to the remote host is faster than converting the video, which gives the user a chance to save 2 minutes.
      </p>
      <h3>3. Download and conversion</h3>
      <p>
        We grab the highest quality format that is closest to the upload limit size, then reduce the quality just enough
        so that the file size is under that limit.  However, there are issues with determining the size.
      </p>
      <p>
        Video file size is hard to predict after conversion because the data is stored with a variable bit rate.
        The size can reach a certain maximum, but it is usually well under that. We can only guess the best
        settings based on past results.  Using a stream, we download the video and pass the contents into the conversion.
        If the total size exceeds the limit at any point during the stream, we cancel the process and try again with
        lower quality.  We repeat this up to three times.  If no attempt fits under the limit by the third try, we stop here,
        because the process may have been running for over 10 minutes.  This is rare, though, and the majority of videos
        only have to be converted once.
      </p>
      <p></p>

      <h2>Development cycle</h2>
      <p>
        The project went through five versions in total, and each version was a major rewrite of all commands.  The first versions
        were written quickly with the goal of getting something rough and working. Fixing small bugs and polishing the appearance were
        lower priority.  Once the bot was pushed to production, the small bugs became much more apparent.
      </p>
      <p>
        Later versions were spent both addressing the bugs and rewriting code to use better programming styles.
        This split the code into two versions that were maintained at the same time: the production version with the bugs,
        and an alpha version with the newer stuff.  Most of the code could be swapped between the two, but the slight variations
        made it somewhat confusing.
      </p>
      <hr/>
      <p>
        In the concurrent production version, a combination of long-term and band-aid fixes was applied.  Since
        this version was going to be phased out soon, it did not matter if the fixes lasted, but the fact that
        the code could be reused in the newer version made certain parts worth fixing the right way.
      </p>
      <p>
        In the concurrent alpha version, Promise chains were replaced with async/await.  Functional programming was
        replaced with classes and inheritance, which then circled back to functional programming. CommonJS was
        replaced with ESM.  Console was replaced with a custom logging system that could timestamp each error, log
        to file, and log to message.
      </p>
      <p>
        Meanwhile, Discord.js, the library that the bot was built on top of, was making breaking changes with each new
        release.  The biggest change was the deprecation of prefix-based commands in favor of slash commands,
        due to a security concern that any third-party bot could eavesdrop on conversations.  This
        forced a rewrite of the way all commands were handled.
      </p>
      <p>
        With all the writing and rewriting of code, it felt like much of this time was wasted, but speed is
        necessary in early development. Without speed the project may miss its launch deadlines, and focusing
        too much on maintainability may result in developing features that never get used.  Speed early on,
        followed by a slow shift to maintainability, fits most development cycles.
      </p>

    </ModalBase>
  )
}