
Making my bookshelves clickable

Published on February 14, 2024 under the IndieWeb category.

You can make regions of an image clickable with a number of techniques, from overlaying an SVG that contains onclick JavaScript handlers all the way to using image maps. I love this idea. I started to think about how I could create an image of my bookshelves that you could click to learn more about each book I am reading. This would be more engaging than a traditional list of text.

I built a script that takes in an image of a bookshelf and makes each book clickable. When you click on a book, you are taken to the Google Books page associated with the book. You do not need to manually annotate any book or map each book to its title. This happens automatically. You can try a demo of a clickable bookshelf on GitHub.

Here is a video of me clicking through different books on my bookshelf:


The red border indicates the polygon whose contents are clickable.

In this blog post, I am going to discuss how I made this project. This post uses computer vision, but I will do my best to explain all jargon. You shouldn't need a computer vision background to enjoy this post. If any details do not make sense, email me at readers [at] jamesg [dot] blog. If you want to learn how to use this tool yourself, refer to the project GitHub repository setup instructions.

Without further ado, let's get started!

The problem and solution

The problem: I wanted to make an image of bookshelves clickable.

How could I go about addressing this problem? Here are the steps I had in mind when I started to work on this project:

  1. Isolate the region of each book in the image.
  2. Retrieve the title for each book using Optical Character Recognition (OCR).
  3. Retrieve the Google Books URL for each book.
  4. Map each URL to its respective region.
  5. Create an SVG that can be overlaid onto an image.

Let's talk through each of these steps.

Isolating book regions

We need to know where books are in an image before we can make them clickable. We could manually annotate each book; I made a tool for drawing polygons called PolygonZone that you can use to annotate regions by hand. But, I wanted to make a solution that is automatic. For that, I needed a computer vision model.

For this project, I decided to use a combination of two models: Grounding DINO and Segment Anything (SAM). The combination is called Grounded SAM.

If you don't have any experience in computer vision, stay with me!

Grounding DINO lets you identify objects in images. You can give Grounding DINO a text prompt (e.g. "book spine") and the model will try to identify all instances of that object in an image. Here is an example of the result from Grounding DINO when passing through an image of my bookshelf:

Book bounding boxes

There is a box around (most of) the books in the image.

This is great! We now know where each book is. But, each box is larger than the book it represents. This is because every book is angled in the image. We could use these boxes to make each book clickable, but some regions would overlap. This would be confusing and unintuitive.

We can use a segmentation model to identify the exact region of each book. This is where the Segment Anything Model (SAM) comes in. We can use SAM to retrieve masks for each book. You can convert masks into polygons to get the outline of an object.
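To make this step more concrete, here is a minimal sketch of how the Grounded SAM pipeline might be wired up with the autodistill-grounded-sam package, with masks converted to polygons using OpenCV. The package, prompt, and file names here are assumptions; the post does not show the exact code it uses.

```python
# A minimal sketch, assuming the autodistill-grounded-sam package
# (pip install autodistill-grounded-sam). Not necessarily the code the
# script actually uses.
import cv2
import numpy as np
from autodistill.detection import CaptionOntology
from autodistill_grounded_sam import GroundedSAM

# Prompt Grounding DINO with "book spine"; SAM then refines each box into a mask.
base_model = GroundedSAM(ontology=CaptionOntology({"book spine": "book spine"}))
detections = base_model.predict("bookshelf.jpeg")  # assumed image path

# Convert each boolean mask into a polygon (the outline of one book).
polygons = []
for mask in detections.mask:
    contours, _ = cv2.findContours(
        mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
    )
    if not contours:
        continue
    largest = max(contours, key=cv2.contourArea)
    polygons.append(largest.reshape(-1, 2))  # (x, y) points outlining the book

print(f"Found {len(polygons)} book outlines")
```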

Here is an example of the bookshelf processed with Grounding DINO then SAM:

Book segmentation masks

The purple regions are polygons. If you look closely, you can see boundaries between each book that are not in purple. This shows our model is segmenting individual books.

There are some highlighted regions that are not books. These regions do not contain text, so GPT will not be able to find data for them. To ensure only relevant polygons appear in the output, we plot only the polygons for which we can retrieve a title. In addition, a few books are not highlighted at all, which means the model we are using -- a combination of Grounding DINO and SAM -- could not isolate a region for them. This could be manually corrected with a polygon annotation tool, but that is not ideal. I need to think through what solution would be easiest for users.

The process of generating masks takes ~15 seconds on an M1 MacBook Air.

Retrieving book titles

We now know where our books are in an image. Next, we need to figure out the title and author of each book. This involves a few steps. First, we need to isolate each book. Then, we need to read the characters on each book. At minimum, we should get the title of a book. We may also get the author name, depending on whether the author name is on the spine. We can then use this information to search for a book on Google Books.

Reading characters in an image is a domain called Optical Character Recognition (OCR). There are many ways to do OCR, but for this project I chose to use GPT-4 with Vision, which has been accurate in many OCR tests I have run and seen run. GPT-4 with Vision allows you to ask questions about images. In this case, I could request the model identify the characters in each book image.

Before sending an image to GPT-4 with Vision, I isolated the region of each book. I then rotated the book to the left by 90 degrees so it would be horizontal instead of vertical. This should boost OCR performance. Here is an example of an image sent to GPT-4 with Vision:

Isolated book

In this image, one specific book is isolated. We can send this image to GPT-4 with Vision to retrieve the characters on the book.
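Here is a rough sketch of that isolation and rotation step, assuming the masks produced by the segmentation step above. The helper name and exact approach are my own, not necessarily how the script does it.

```python
# A sketch of isolating one book before OCR: black out everything outside
# the book's mask, crop to its bounding box, and rotate the crop 90 degrees
# so the spine text runs horizontally.
import cv2
import numpy as np

def isolate_book(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    # keep only the pixels inside the book's mask
    book_only = cv2.bitwise_and(image, image, mask=mask.astype(np.uint8))
    # crop to the mask's bounding box
    ys, xs = np.where(mask)
    crop = book_only[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    # rotate left 90 degrees so the spine reads horizontally
    return cv2.rotate(crop, cv2.ROTATE_90_COUNTERCLOCKWISE)

# usage, assuming `detections` from the segmentation sketch earlier:
# image = cv2.imread("bookshelf.jpeg")
# book_image = isolate_book(image, detections.mask[0])
# cv2.imwrite("book_0.jpeg", book_image)
```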

I used the following prompt with the image:

Read the text on the book spine. Only say the book cover title and author if you can find them. Say the book that is most prominent. Return the format [title] [author], with no punctuation.

Here is an example response:

The Poetry Pharmacy Forever William Sieghart
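Here is a hedged sketch of what that request might look like with the OpenAI Python client (openai >= 1.0). The prompt is the one above; the model name, file path, and surrounding code are assumptions about how the call could be made.

```python
# A sketch of the OCR request: send the rotated book crop plus the prompt
# to GPT-4 with Vision and read back the title/author string.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("book_0.jpeg", "rb") as f:  # assumed path to one isolated book image
    encoded = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    max_tokens=100,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "Read the text on the book spine. Only say the book cover "
                        "title and author if you can find them. Say the book that "
                        "is most prominent. Return the format [title] [author], "
                        "with no punctuation."
                    ),
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{encoded}"},
                },
            ],
        }
    ],
)

book_query = response.choices[0].message.content
print(book_query)  # e.g. "The Poetry Pharmacy Forever William Sieghart"
```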

With this information, we can look up the book with the Google Books API. The Google Books search API uses the following syntax:


https://www.googleapis.com/books/v1/volumes?q={book}

You can add any text in the {book} section above. In this script, we send the book name and, if available, the author name. I didn't separate them out. Including both pieces of information seemed to work well.

This API returns several pieces of information about a book. For my script, I gathered the:

  • Author name
  • Google Books URL
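Here is a minimal sketch of that lookup in Python. The endpoint is the one quoted above; the response field names (volumeInfo, authors, canonicalVolumeLink, infoLink) are my reading of the Google Books API and may differ from what the script relies on.

```python
# A sketch of the Google Books lookup: send the title/author string from
# GPT-4 with Vision and pull out the author names and listing URL.
import requests

def lookup_book(query: str) -> dict | None:
    response = requests.get(
        "https://www.googleapis.com/books/v1/volumes",
        params={"q": query},
        timeout=10,
    )
    response.raise_for_status()
    items = response.json().get("items", [])
    if not items:
        return None  # book not on Google Books

    volume = items[0]["volumeInfo"]
    return {
        "authors": volume.get("authors", []),
        "url": volume.get("canonicalVolumeLink") or volume.get("infoLink"),
    }

print(lookup_book("The Poetry Pharmacy Forever William Sieghart"))
```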

Here is the Google Books listing URL for the book matching The Poetry Pharmacy Forever William Sieghart:


https://play.google.com/store/books/details?id=vdOXDwAAQBAJ&source=gbs_api

This entire process -- calling the GPT-4 with Vision and Google Books APIs -- takes a few seconds per book.

Create a clickable SVG

All of the information collected with GPT-4 with Vision and the Google Books search API is associated with each book and region in the image. Each mask -- the form returned by Segment Anything -- is converted to a polygon so it can be used in an SVG that I can overlay on my image. Using these polygons, I can generate an HTML file with two components:

  1. The source image, and
  2. An SVG file that can be overlaid over the image.

The SVG can include JavaScript. For this project, I have an onclick handler that opens the Google Books URL associated with each book.
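Here is a rough sketch of how such a page might be assembled in Python: an img element with an absolutely positioned SVG on top, where each polygon carries an onclick handler that opens the book's URL. The markup and helper names are my own; the file the script actually generates may look different.

```python
# A sketch of generating the clickable overlay page.
def build_html(image_path: str, width: int, height: int, books: list[dict]) -> str:
    polygons = []
    for book in books:
        points = " ".join(f"{x},{y}" for x, y in book["points"])
        polygons.append(
            f'<polygon points="{points}" fill="transparent" stroke="red" '
            f"onclick=\"window.open('{book['url']}', '_blank')\" />"
        )

    return f"""<!DOCTYPE html>
<html>
<body style="position: relative; margin: 0;">
  <img src="{image_path}" width="{width}" height="{height}" />
  <svg viewBox="0 0 {width} {height}" width="{width}" height="{height}"
       style="position: absolute; top: 0; left: 0;">
    {''.join(polygons)}
  </svg>
</body>
</html>"""

# example input: polygon points (from the segmentation step) plus each
# book's Google Books URL (from the lookup step); values are placeholders
books = [
    {
        "points": [(120, 40), (190, 42), (188, 610), (118, 608)],
        "url": "https://play.google.com/store/books/details?id=vdOXDwAAQBAJ",
    }
]

with open("index.html", "w") as f:
    f.write(build_html("bookshelf.jpeg", 1920, 1080, books))
```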

I generate an HTML file with an SVG. In the HTML file, I embed my source image and overlay the SVG. The SVG uses polygons to represent each book region. Each polygon typically has dozens of points. The onclick handler redirects the user to the corresponding Google Books page when a book is clicked. Here is a screenshot of the resulting page (the books are not clickable because this is a screenshot):

Result

You can try the demo -- and click the books! -- on GitHub Pages.

Conclusion and Next Steps

My system to make clickable bookshelves is designed to be autonomous. You should be able to upload an arbitrary bookshelf image and generate clickable regions as above. With that said, there are limitations. If a book title is hard to read, the GPT-4 with Vision API may struggle to run OCR, so you will not be able to associate that region with a book URL. If a book is not on Google Books, you would need to use another URL. In one test, a book URL was entirely wrong because the book wasn't available on Google Books.

Of course, Google Books can be swapped with any source. If you have a blog, the source of URLs could be your blog. You could make each book clickable and take the user to your review of the book.

There are a few improvements I have in mind that I would like to make:

  1. Use EfficientSAM, a faster version of SAM.
  2. Make the polygons look nicer.
  3. Maybe add a manual correction system so if the system can't read the text of a book I can fix it.

If you have questions about this project, email me at readers [at] jamesg [dot] blog.

Tagged in IndieWeb.

