Creating an Asylum Case Application

Jacksoumphonphakdy
5 min read · Mar 31, 2021

Through Lambda School, I was able to work with and gain experience alongside other Lambda students on the Human Rights First — Asylum project. This is a project in which students get together and work with stakeholders to create a real product. Students get one month to build on a project that past cohorts worked on, adding to it and improving it. Human Rights First is an independent advocacy and action organization that challenges America to live up to its ideals. My group was assigned the asylum case project, in which we built an application to assist immigration attorneys and refugee representatives in advocating for clients in asylum cases by identifying patterns in judicial decisions and predicting possible outcomes based on those patterns.

What was inherited and blockers

We inherited a GitHub repo that two previous cohort groups had already worked on. The task of the data science team was to take a PDF file and convert it into text so a scraper could pull keywords out of it. The inherited repo included a scraper function that pulls out various types of information, such as the judge's name, the outcome, the country of origin, and hearing locations. Along with the massive amount of code, we also inherited some of the blockers that plagued the previous cohorts. One task the previous cohort left behind was improving the accuracy of converting the PDF file to text. Another blocker was the length of time it took to upload a file, which grew with the number of pages in the PDF. The major blocker was deploying the application to the web: some of the dependencies we were using did not get along with the service we were deploying to.
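To give a sense of what a scraper like the inherited one does, here is a minimal sketch of regex-based field extraction. The field names match the ones described above, but the function, patterns, and sample text are my own illustration, not the project's actual code.

```python
import re

def scrape_fields(text):
    """Pull a few case fields out of decision text.
    Hypothetical patterns -- the project's real scraper is more involved."""
    patterns = {
        # A capitalized name following the word "Judge"
        "judge": r"Judge[:\s]+([A-Z][a-zA-Z'-]+(?: [A-Z][a-zA-Z'-]+)*)",
        # Country named in the standard "native and citizen of ..." phrasing
        "country_of_origin": r"native and citizen of ([A-Z][a-zA-Z ]+?)[,.]",
        # Common outcome verbs in immigration decisions
        "outcome": r"\b(granted|denied|dismissed|remanded)\b",
    }
    fields = {}
    for name, pattern in patterns.items():
        flags = re.IGNORECASE if name == "outcome" else 0
        match = re.search(pattern, text, flags)
        fields[name] = match.group(1) if match else None
    return fields

sample = ("Judge: Jane Doe. The respondent, a native and citizen of Guatemala, "
          "appeals the decision. The appeal is dismissed.")
print(scrape_fields(sample))
# {'judge': 'Jane Doe', 'country_of_origin': 'Guatemala', 'outcome': 'dismissed'}
```

This only works when the OCR'd text is clean enough for the patterns to match, which is exactly why OCR accuracy mattered so much to our team.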

Improving the OCR

The problem I was trying to solve was increasing the accuracy of converting a PDF file to text. The tool I used for this was an optical character recognition (OCR) program called pytesseract. Our pipeline converts the PDF file into images (using pdf2image), and pytesseract reads each image and returns its contents as text. I chose to explore pytesseract because it was highly rated when I searched for methods of converting PDF to text, and that got me intrigued. This let me approach the inherited task of improving the model's accuracy. One issue a previous cohort member pointed out was that the OCR wasn't perfect and could mistake some characters. For example, "i" might be returned as "!", or "m" might be returned as "nn". While doing my research, I noticed that the size of the image would greatly change what is returned. When I set the size to 800, the output was almost accurate, but it still had the issues the previous cohort brought up. After testing different sizes, I ended up not including the size parameter at all; I believe this meant the image was rendered 1:1 with the PDF file. I concluded that with the image blown up larger, the accuracy increases.

With size=800:

images1 = convert_from_bytes(open('dummy_case2.pdf', 'rb').read(), size=800)

"We review all other issues, including issues of law, discretion, 'or judgment de novo. 8 C.F.R. § 1003.1(4)3)(i."

Without the size parameter:

images = convert_from_bytes(open('dummy_case2.pdf', 'rb').read())

"We review all other issues, including issues of law, discretion, or judgment de novo. 8 C.F.R. § 1003.1(d)(3)(ii)."
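Putting the pieces together, a minimal sketch of the full OCR step could look like the following. The function name and page handling are my own; running it requires the third-party pdf2image and pytesseract packages, plus the poppler and tesseract binaries they wrap.

```python
def pdf_to_text(pdf_path):
    """Convert each PDF page to an image, then OCR it with pytesseract.
    Omitting pdf2image's `size` parameter keeps each page at its native
    resolution, which is what improved character accuracy for us."""
    # Imports kept local: both are third-party packages that wrap
    # the poppler (pdf2image) and tesseract (pytesseract) binaries.
    from pdf2image import convert_from_bytes
    import pytesseract

    with open(pdf_path, "rb") as f:
        pages = convert_from_bytes(f.read())  # no size= -> full resolution
    # OCR page by page and join everything into one document string
    return "\n".join(pytesseract.image_to_string(page) for page in pages)
```

The returned string is what the scraper then searches for judge names, outcomes, and other keywords.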

Team Solutions

The other blockers were divided among the other students. One blocker that needed to be addressed was the upload speed: we found that uploading a single page of a PDF file took around 5 seconds. After many discussions, the teams decided the backend team should look into the upload speed. The backend team decided to use a "bucket" to store the uploads. A bucket is a unit of cloud object storage (for example, an Amazon S3 bucket) in which uploaded files are kept as objects under named keys. This is their specialty, so I can't add much to it, but this method was able to speed up the uploading process.
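Since the backend's exact code isn't mine to show, here is a hedged sketch of what uploading a PDF to an S3-style bucket typically looks like with boto3. The bucket name, key scheme, and both function names are placeholders, not the project's real configuration.

```python
import datetime

def make_upload_key(filename):
    """Build a unique object key for an uploaded case file.
    Prefixing with a timestamp avoids collisions between uploads."""
    stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%S")
    return f"uploads/{stamp}-{filename}"

def upload_case_pdf(path, bucket="asylum-case-uploads"):
    """Stream a local PDF into a bucket instead of holding it in the app.
    boto3 is imported locally because it is a third-party AWS SDK."""
    import boto3
    s3 = boto3.client("s3")
    key = make_upload_key(path.rsplit("/", 1)[-1])
    with open(path, "rb") as f:
        s3.upload_fileobj(f, bucket, key)  # streaming, multipart-capable upload
    return key
```

Handing the file off to object storage keeps large multi-page PDFs from tying up the web application itself, which is one way a bucket can speed up the upload path.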

The hardest blocker was deploying to the web. Originally the application was to be deployed onto Heroku, but we ran into a couple of blockers because of certain dependencies like Poppler. Since this was other teammates' work, I again can't go into detail, but it was a cross-collaboration between the data science team and a couple of members from the backend team. After this blocker had plagued the product for months, with the help of other team members the product was finally deployed onto Amazon Web Services (AWS)!

This is a screenshot of the live application. You can upload a file, and it will be read and turned into text that gets scraped to pull out specific words.

Results

This gif shows the process of uploading a file and grabbing relevant words using AWS.

Our team was able to address a ton of blockers during the past month. We added documentation, improved the accuracy of the OCR, and deployed the application to the web. One future challenge will be further improving the accuracy of the scraper. This will be challenging because future teams will have to understand the code used to create the scraper and extend it in a style consistent with the existing code. Consistency matters: it wouldn't look professional if, halfway into the code, the style it is written in changed.

Takeaway

The takeaway from this project was getting firsthand experience of making a real-world product. It tests your time management: you can be working on an entirely different part of the product, but you have to shift gears if you get a request from the stakeholders. An example of this in our project was creating visualizations. Originally one member was on that task, but it became a task for every member. The feedback helped me a lot, as I can now get my point across to nonprogrammers more easily than I could before this project.

Some skills I developed were self-directed research and the habit of checking documentation. These allowed me to experiment with the OCR to see what I could and could not do.

The biggest takeaway and skill is learning to cooperate with team members. Toward the end of the project, cross-collaboration completely came in clutch in getting past the blockers we had. This wouldn't have happened if our team members had stayed quiet about struggling with certain tasks. Communicating every day was needed to get to the point we are at now.
