How I made my own citation engine to spite the Chegg Citation Complex
When you want something done right, sometimes, you have to do it yourself.
Never settle for a sub-par product. Ever.
Boring tasks suck. Boring, frustrating tasks suck more.
That’s why I particularly hate visiting most citation websites when writing papers. As a student, I’m constantly referencing other people’s work. While this is very important, it’s often much more difficult than necessary. Over the years, I’ve found myself in the same situation countless times:
- Staring at a long list of links that need to become citations
- Dodging pop-up ads as I try to cite links in a sub-par citation engine
- Wasting 30 minutes to do something that should have taken 5 minutes
Even after the headache of sitting through pop-up ads to use the citation engine, I usually had to complete the citation manually because the engine couldn’t cite my link. For a long time, I was content paying Chegg the 25-minute tax on my time to use one of its free citation resources. However, on this particular day, I was feeling quite vindictive.
I decided to stop settling for a mediocre product. I made my own citation engine.
Charting the road map
I had a basic idea for how it would work: grab HTML from a link provided by the user to access the citation components. I made a few simplifications to make the task a little easier to do over the course of a few days.
- My citation engine would only work in one format — MLA. If I really liked what I had built, I could add more formats.
- My generator would only work on links. I didn’t want to get into the technical problem of reading PDFs and text files for key information. I also didn’t want to learn niche citation formats for other source types.
With these simplifications, I could start building my engine.
Testing my theory with BeautifulSoup4
I wanted my engine to be 10x better than any of the Chegg resources. I used BeautifulSoup to look for citation components in the source’s HTML. My goal was to pull the author’s name, article title, publication date, and publisher with my engine. Before coding, I needed to set a smaller goal for my citation engine to hit. I needed a source with a strict HTML format and consistent article descriptors. Most journalism satisfies both of these criteria, but my target needed to be even smaller.
I started with The New York Times because they are fairly tech-forward. Their online articles have been consistently formatted for at least the last 20 years (and this even extends to some archive articles). A quick inspection of their source code suggested they would be a great first target. I started by looking for citation tags manually in the browser.
Finding the right HTML tag to parse in BeautifulSoup was the hardest part. I looked for patterns between articles from different publication dates. Finally, I noticed that the tag containing the author’s name is formatted the same way across the website. This find was a great first step; however, I was wary of querying by class or id attributes because good websites change these attributes fairly frequently. I looked for attributes that were human-readable rather than incoherent strings. The latter were most likely auto-generated and are likely to change.
After some digging, I found a <span> tag that consistently had an itemprop attribute set to “name”. This would be my first pull. I ran a quick BeautifulSoup script for the <span> tag and found the author names, as expected. I did this with a few articles: some from a few days ago, some from a few years ago, and even some from the NYT archive. To my surprise, they all worked wonderfully. I cleaned the strings from these <span> tags and returned an author name.
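Here is a minimal sketch of what that first pull might look like. It is not my exact script: the sample markup, the class name, and the function name `extract_authors` are made up for illustration, and a real NYT page is much larger.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a real article page.
SAMPLE_HTML = """
<html><body>
  <p class="css-1a2b3c">By <span itemprop="name">  Jane Doe </span> and
  <span itemprop="name">John Smith</span></p>
</body></html>
"""


def extract_authors(html):
    """Return cleaned author names from <span itemprop="name"> tags."""
    soup = BeautifulSoup(html, "html.parser")
    # Query by the human-readable itemprop attribute, not by the
    # auto-generated-looking class, which is more likely to change.
    spans = soup.find_all("span", attrs={"itemprop": "name"})
    # Collapse stray whitespace inside each name.
    return [" ".join(span.get_text().split()) for span in spans]


print(extract_authors(SAMPLE_HTML))
```

Passing the attribute through `attrs` keeps the query independent of the class and id values I was wary of.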
I followed the same process for all the components of an MLA citation. All I needed to do was format these parts and arrange them appropriately. Within a few hours of setting out, I had an engine that could cite any NYT article!
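The formatting step can be sketched roughly like this. The function names (`format_mla`, `invert_name`, `mla_date`) and the sample values are hypothetical, and the punctuation follows a simplified reading of the MLA web-source pattern rather than every rule in the handbook.

```python
from datetime import date

# MLA month abbreviations (May, June, and July are not abbreviated).
MONTHS = ["Jan.", "Feb.", "Mar.", "Apr.", "May", "June",
          "July", "Aug.", "Sept.", "Oct.", "Nov.", "Dec."]


def mla_date(d):
    """Format a date as day, abbreviated month, year."""
    return f"{d.day} {MONTHS[d.month - 1]} {d.year}"


def invert_name(full_name):
    """Turn 'First Last' into 'Last, First'; leave single names alone."""
    parts = full_name.split()
    if len(parts) < 2:
        return full_name
    return f"{parts[-1]}, {' '.join(parts[:-1])}"


def format_mla(author, title, publisher, published, url, accessed):
    """Arrange the scraped components into one citation string."""
    short_url = url.split("://", 1)[-1]  # drop the http(s):// prefix
    return (f'{invert_name(author)}. "{title}." {publisher}, '
            f"{mla_date(published)}, {short_url}. "
            f"Accessed {mla_date(accessed)}.")


print(format_mla("Jane Doe", "A Hypothetical Headline", "The Example Times",
                 date(2020, 11, 27), "https://www.example.com/story",
                 date(2020, 11, 28)))
```

This sketch handles a single author only; multiple authors and missing fields would need their own branches.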
Expanding the citation engine
This was a great breakthrough, but it wasn’t enough. Not all websites are as clean as nytimes.com, and my engine needed to cite ANY link to truly be 10x better than what’s out there. I compared The New York Times HTML code with HTML from other publications. I looked at:
- Vox (strong online presence, tech-forward),
- The Chicago Tribune (large circulation, but not quite NYT-tier),
- and New Jersey’s very own Star Ledger (mix of local and national news).
I was stumped for a while, until I remembered that all of these websites share one motivation: they want their articles to be seen on the internet, and therefore they must format for SEO (search engine optimization). I looked in the <head> tag of these articles, and voilà! I found the citation tags. The only issue was that none of my sample publications shared a convention for formatting these tags.
Some articles were tagged with the itemprop attribute, some used the name attribute, and others used a property attribute. I just made a simple nested-if clause to catch a few cases; however, I’m sure that there’s a more clever way to do this.
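One possible alternative to the nested-if is to loop over a list of (attribute, value) pairs and return the first match. This is a sketch, not the code in my engine: the function name `find_meta_content` is made up, and the specific values ("author", "article:author") are assumptions about common conventions rather than what each publication actually uses.

```python
from bs4 import BeautifulSoup

# Assumed lookups, tried in order. Each site may use different values.
AUTHOR_LOOKUPS = [
    ("itemprop", "author"),
    ("name", "author"),
    ("property", "article:author"),
]


def find_meta_content(html, lookups):
    """Return the content of the first matching <meta> tag, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Search only the <head> when there is one; otherwise the whole page.
    head = soup.head if soup.head is not None else soup
    for attr, value in lookups:
        # attrs= is required here: a bare name= keyword would be read
        # by BeautifulSoup as the tag name, not the name attribute.
        tag = head.find("meta", attrs={attr: value})
        if tag is not None and tag.get("content"):
            return tag["content"].strip()
    return None


sample = '<html><head><meta name="author" content="Jane Doe"></head></html>'
print(find_meta_content(sample, AUTHOR_LOOKUPS))
```

Supporting a new convention then means adding one more pair to the list, and the same function can serve the title, date, and publisher with their own lookup lists.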
This was a quick way to grab the author name from most websites. I extended this method to the other parts of the MLA citation. Now, my engine could cite any journalistic publication. I even tried a few internet blogs and Medium articles. I was surprised to find that it worked across almost every website! The engine was well on its way to displacing Chegg in my life.
80% done — now for the final 20%
My engine still couldn’t do one important thing: cite scientific papers. As an engineering student, I mostly cite scientific sources, so this is important to me. I tried my engine with a random article from ScienceDirect and ran into errors. My script couldn’t fetch any of these articles; each request came back with a 403 status code (Forbidden, meaning the server refused the request). I peeked at the source code of the ScienceDirect article in my browser to see the <head> tags. Surprisingly, the citation components were perfectly arranged and human-readable. If I could find a way around the 403 code, I would have the perfect citation engine.
After some research, I found that a 403 error is sometimes returned when the website cannot determine the source of the request (thank you, Sandy Lin!). After some edits to the code that creates my BeautifulSoup object, I was able to consistently access the ScienceDirect page. I now had an engine that could cite almost any website.
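A common way to identify the source of a request is to send a browser-like User-Agent header. The sketch below assumes that was the missing piece and uses the standard library's urllib; the header string and the function names `build_request` and `fetch_html` are illustrative, not the exact edits I made.

```python
from urllib.request import Request, urlopen

# Assumed header value: any browser-like User-Agent string could go here.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; citation-sketch/0.1)",
}


def build_request(url):
    """Attach a User-Agent so the server can tell who is asking."""
    return Request(url, headers=BROWSER_HEADERS)


def fetch_html(url, timeout=10):
    """Download a page and decode it to text for BeautifulSoup."""
    with urlopen(build_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Usage (needs network access):
# html = fetch_html("https://www.example.com/some-article")
```

A header like this will not get past every site, since some use stricter bot checks.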
Clean-up and UI
During clean-up, I abstracted some parts of the project to share code between citations of publications, ScienceDirect, and random websites. With my code tidy enough, I created a basic UI with Flask so I could easily access my engine. Finally, I deployed the app with Heroku and shared it with some friends. I shared the GitHub for the project below, so hopefully you can find some inspiration for your own project. There are still some features to build out:
- APA and Chicago style citation
- Citations on uploaded files and photos
- Improved manual citations through the app
- And as always, the code could be cleaner
I believe in fixing the problems immediately around me. After my engine had circulated for a bit on Reddit and in some of my Facebook groups, a lot of people messaged me privately to say that they were frustrated by the same problem. In solving my own problem, I ended up helping other people solve theirs.
Products should remain focused on serving the user first. Chegg allows ad revenue to drive product design for its citation engines, and that decision hurt my experience on the site. Without many alternatives, I accepted a bad experience in exchange for access to their resources, until I was fed up. If you’re a product manager, please make sure you’re building something people want to use. If you’re not careful, they might design a competitor and solve the problem themselves!
My Engine: http://www.bettercitation.net
Sources (Yes, I generated them with my engine.)
- Lin, Sandy. [Python][Crawler]“HTTP Error 403: Forbidden”. [Python][Crawler]“HTTP Error 403: Forbidden” | by Sandy Lin | Medium. Medium. 27 Nov. 2020. medium.com/@speedforcerun/python-crawler-http-error-403-forbidden-1623ae9ba0f. Accessed 27 Nov. 2020.