Unveiling the Power of AI Models Directly in Your Browser

Neo Fang


AI is everywhere. Products, technologies, and applications each cover a rich and varied set of scenarios, but the one ordinary users feel most directly is AIGC (AI-generated content): text-to-text, text-to-image, image-to-image, text-to-video, image-to-video, and so on. Practical AIGC depends on powerful hardware, large models with many parameters, precise and well-designed algorithms, and efficient, convenient applications; none of these can be missing.
Until now, AIGC has mostly run on cloud-side servers. Performance, models, and algorithms are not a problem there, but this approach requires heavy capital investment and has weaknesses in latency and privacy.

Therefore, making ordinary PCs and smartphones capable of running AIGC, even offline, may be the future.

Large models are also maturing. Alongside the pursuit of accuracy, they are becoming smaller and lighter, and most of them support the ONNX format. With ONNX, WebAssembly (WASM), TensorFlow.js, and similar technologies, front-end code can run some AI models directly in the browser. When I first started using ONNX I couldn't contain my excitement. This article introduces the related technology and then walks through a concrete practice and its application scenarios.

ONNX

ONNX Runtime is a cross-platform, high-performance accelerator for machine learning inference and training. With the tools provided by the ONNX ecosystem, you can take models trained in different deep learning frameworks, convert them to ONNX models, and then use the appropriate ONNX Runtime build to run them on different platforms. Through its extensible Execution Providers (EP) framework, ONNX Runtime works with different hardware acceleration libraries to execute ONNX models optimally on each hardware platform. This interface gives application developers the flexibility to deploy their ONNX models in different environments, in the cloud and at the edge, and to use each platform's compute capabilities to optimize execution.
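
To make the execution provider idea concrete, here is a minimal sketch of how a session could be created with an explicit provider list in the browser. The helper names `buildSessionOptions` and `createSession` and the `preferGpu` flag are made up for this example; `executionProviders` and `graphOptimizationLevel` are session options of the ONNX Runtime JavaScript API. The default of WASM only matches the build used later in this article.

```javascript
// Hypothetical helper: build session options with providers in priority order.
function buildSessionOptions(preferGpu) {
  // 'webgl' and 'wasm' are execution provider names used by onnxruntime-web.
  const executionProviders = preferGpu ? ['webgl', 'wasm'] : ['wasm'];
  return { executionProviders, graphOptimizationLevel: 'all' };
}

// `ort` is passed in so the helper works with the script-tag global
// as well as with an imported module.
async function createSession(ort, modelUrl, preferGpu = false) {
  return ort.InferenceSession.create(modelUrl, buildSessionOptions(preferGpu));
}
```

In a page that already loaded the runtime, this could be called as `await createSession(ort, modelUrl)`.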

https://onnxruntime.ai/docs/api/

Image

The ONNX Runtime JavaScript API covers all the main JavaScript environments: Node.js, the web, and React Native.

  1. https://www.npmjs.com/package/onnxruntime-node
  2. https://www.npmjs.com/package/onnxruntime-web
  3. https://www.npmjs.com/package/onnxruntime-react-native
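
As a rough illustration of how the three packages map to runtimes, the sketch below picks a package name from the global environment. The function `pickOrtPackage` is hypothetical and the detection heuristics are assumptions (the `navigator.product === 'ReactNative'` check is a common convention, not part of ONNX Runtime).

```javascript
// Hypothetical helper: choose the ONNX Runtime package for the current runtime.
// `g` defaults to the real global object and can be replaced for testing.
function pickOrtPackage(g = globalThis) {
  // React Native conventionally reports itself through navigator.product.
  if (g.navigator && g.navigator.product === 'ReactNative') {
    return 'onnxruntime-react-native';
  }
  // Node.js exposes its version under process.versions.node.
  if (g.process && g.process.versions && g.process.versions.node) {
    return 'onnxruntime-node';
  }
  // Otherwise assume a browser.
  return 'onnxruntime-web';
}
```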

SAM

Background

Backend engineer: I saw a demo of SAM running segmentation on the client side. Can you support that?
Me: Why do you need segmentation on the front end?
Backend engineer: Server-side segmentation is currently either very expensive or very slow, and SAM still fails sometimes.
Me: Since it can save money, I'll give it a try~

Coding

The original SAM does not provide its encoder in ONNX format; only the decoder is available as ONNX, and the demo in [5] shows how to call it. MobileSAM provides ONNX models for both the encoder and the decoder, so MobileSAM is used in this application. The actual segmentation quality still has to be confirmed by batch evaluation in follow-up projects.

● The comparison of ViT-based image encoders is summarized as follows:

Image

● Original SAM and MobileSAM have exactly the same prompt-guided mask decoder:

Image

Import ONNX Runtime Web by loading the WASM build with a script tag

<script src="https://cdnjs.cloudflare.com/ajax/libs/onnxruntime-web/1.14.0/ort.wasm.min.js"></script>

on-demand

Image

Can I use (browser support)

Image
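
Besides checking a compatibility table, support can also be detected at runtime before loading the WASM build. The following is a small sketch; `supportsWasm` is a hypothetical helper, and it relies only on the standard `WebAssembly.validate` function.

```javascript
// Hypothetical helper: return true if the environment can validate WASM modules.
// `g` defaults to the real global object and can be replaced for testing.
function supportsWasm(g = globalThis) {
  try {
    const wasm = g.WebAssembly;
    if (typeof wasm !== 'object' || typeof wasm.validate !== 'function') {
      return false;
    }
    // Smallest valid module: the "\0asm" magic bytes followed by version 1.
    return wasm.validate(Uint8Array.of(0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00));
  } catch (e) {
    return false;
  }
}
```

A page could call `supportsWasm()` first and fall back to server-side segmentation when it returns `false`.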

Import the MobileSAM ONNX models for encoding and decoding

const encoder = 'https://dev.g.alicdn.com/cyberAI/ONNX-SAM/1.0.0/mobilesam.encoder.onnx';
const decoder = 'https://dev.g.alicdn.com/cyberAI/ONNX-SAM/1.0.0/mobilesam.decoder.onnx';

Image encoding

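    // Shared state that the decoding step reads later.
    let imageImageData;
    let image_embeddings;

    // Load the image as a tensor and keep its ImageData for the original size.
    const resizedTensor = await ort.Tensor.fromImage(img, { resizedWidth: img.width, resizedHeight: img.height });
    const resizeImage = resizedTensor.toImageData();
    let imageDataTensor = await ort.Tensor.fromImage(resizeImage);
    imageImageData = imageDataTensor.toImageData();

    // Reorder the data from CHW to HWC and scale it by 255 (`tf` is TensorFlow.js).
    let tf_tensor = tf.tensor(imageDataTensor.data, imageDataTensor.dims);
    tf_tensor = tf_tensor.reshape([3, img.height, img.width]);
    tf_tensor = tf_tensor.transpose([1, 2, 0]).mul(255);
    imageDataTensor = new ort.Tensor(tf_tensor.dataSync(), tf_tensor.shape);

    // Set the thread count before the session is created.
    ort.env.wasm.numThreads = 1;

    const session = await ort.InferenceSession.create(encoder);
    console.log("Encoder Session", session);
    const feeds = { "input_image": imageDataTensor };
    const start = Date.now();
    try {
      const results = await session.run(feeds);
      image_embeddings = results.image_embeddings;
      console.log("Encoding took", Date.now() - start, "ms");
    } catch (error) {
      // Report the failure instead of swallowing it silently.
      console.error("Encoder inference failed", error);
    }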
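
The reshape, transpose, and multiply steps in the encoding code are the only place where TensorFlow.js is needed. As an illustration of what they compute, here is a plain JavaScript sketch of the same layout change. `chwToHwc` is a hypothetical helper, not part of ONNX Runtime or TensorFlow.js, and it assumes the input is a flat channel-first array.

```javascript
// Hypothetical helper: convert flat CHW data to flat HWC data and scale each value.
function chwToHwc(chw, channels, height, width, scale = 255) {
  const plane = height * width;
  if (chw.length !== channels * plane) {
    throw new Error('unexpected tensor size');
  }
  const hwc = new Float32Array(chw.length);
  for (let c = 0; c < channels; c++) {
    for (let p = 0; p < plane; p++) {
      // Pixel p of channel c moves to slot c of pixel p.
      hwc[p * channels + c] = chw[c * plane + p] * scale;
    }
  }
  return hwc;
}
```

With such a helper, the encoder input could be built as `new ort.Tensor('float32', chwToHwc(data, 3, h, w), [h, w, 3])`, which avoids loading a second library for one transpose.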

Image decoding

    const decodingSession = await ort.InferenceSession.create(decoder);
    // x/y coordinates of the mouse interaction, relative to the canvas
    const rect = canvas.getBoundingClientRect();
    const x = event.clientX - rect.left;
    const y = event.clientY - rect.top;
    // One clicked point plus one padding point at (0, 0).
    // SAM label convention: 1 = foreground, 0 = background, -1 = padding.
    const pointCoords = new ort.Tensor(new Float32Array([x, y, 0, 0]), [1, 2, 2]);
    const pointLabels = new ort.Tensor(new Float32Array([0, -1]), [1, 2]);
    // No previous mask: all-zero mask input with has_mask_input = 0
    const maskInput = new ort.Tensor(new Float32Array(256 * 256), [1, 1, 256, 256]);
    const hasMask = new ort.Tensor(new Float32Array([0]));
    const originImageSize = new ort.Tensor(new Float32Array([imageImageData.height, imageImageData.width]));
    ort.env.wasm.numThreads = 1;
    const decodingFeeds = {
      "image_embeddings": image_embeddings,
      "point_coords": pointCoords,
      "point_labels": pointLabels,
      "mask_input": maskInput,
      "has_mask_input": hasMask,
      "orig_im_size": originImageSize
    };
    let mask;
    try {
      // Run the decoder on the session created above.
      const results = await decodingSession.run(decodingFeeds);
      mask = results.masks;
    } catch (e) {
      // Report the failure instead of swallowing it silently.
      console.error("Decoder inference failed", e);
    }
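
The click handling in the decoding code assumes the canvas is displayed at the same size as the image. The sketch below shows one way to map a click to image pixels when the displayed size differs, and to build the point prompt arrays. Both function names are hypothetical. The label values follow the SAM ONNX export convention (1 for a foreground point, 0 for a background point, -1 for the padding point); any further coordinate scaling that a particular exported model expects is not covered here.

```javascript
// Hypothetical helper: map a click in CSS pixels to image pixel coordinates.
// `rect` is the object returned by canvas.getBoundingClientRect().
function clickToImageCoords(clientX, clientY, rect, imageWidth, imageHeight) {
  const x = ((clientX - rect.left) / rect.width) * imageWidth;
  const y = ((clientY - rect.top) / rect.height) * imageHeight;
  return { x, y };
}

// Hypothetical helper: one clicked point plus the padding point at (0, 0).
function buildPointPrompt(x, y, isForeground = true) {
  return {
    coords: new Float32Array([x, y, 0, 0]),               // tensor shape [1, 2, 2]
    labels: new Float32Array([isForeground ? 1 : 0, -1]), // tensor shape [1, 2]
  };
}
```

The resulting arrays can be wrapped as `new ort.Tensor(prompt.coords, [1, 2, 2])` and `new ort.Tensor(prompt.labels, [1, 2])`.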

Image segmentation result (the segmented area and the decoding time still need to be optimized; SAM's demo suggests that fully real-time segmentation is possible)

Image
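
To draw a result like this, the decoder output has to be turned into pixels. Below is a minimal sketch that assumes the mask values are logits where values above 0 belong to the object, which is how the original SAM thresholds its masks; whether a given export behaves the same way should be checked. `maskToRgba` and the overlay color are made up for this example.

```javascript
// Hypothetical helper: turn mask logits into RGBA pixels for a canvas overlay.
// Pixels outside the mask stay fully transparent.
function maskToRgba(logits, width, height, color = [0, 114, 189, 128]) {
  if (logits.length !== width * height) {
    throw new Error('mask size mismatch');
  }
  const rgba = new Uint8ClampedArray(width * height * 4);
  for (let i = 0; i < logits.length; i++) {
    if (logits[i] > 0) {
      rgba.set(color, i * 4);
    }
  }
  return rgba;
}
```

In the browser, the result could be drawn with `ctx.putImageData(new ImageData(maskToRgba(mask.data, w, h), w, h), 0, 0)`.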

Memory Usage Analysis

Image

The measured time covers image-to-tensor conversion -> model loading -> embedding -> decoding.
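
One simple way to break the total time down by stage is a small timer that records how long each step took since the previous mark. This is only a sketch; `createStageTimer` and the stage names are hypothetical, and the clock is injectable so the helper can be tested.

```javascript
// Hypothetical helper: record the duration of consecutive pipeline stages.
function createStageTimer(now = () => Date.now()) {
  const stages = [];
  let last = now();
  return {
    // Close the current stage and store its duration in milliseconds.
    mark(name) {
      const t = now();
      stages.push({ name, ms: t - last });
      last = t;
    },
    // Return a copy of the recorded stages.
    report() {
      return stages.slice();
    },
  };
}
```

Calling `timer.mark('tensor')`, `timer.mark('load')`, `timer.mark('embedding')`, and `timer.mark('decoding')` after the corresponding steps gives a per-stage breakdown.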

Transformers.js lets more AI models run in the browser!

The models provided by Transformers.js cover a wide range of domains, including natural language processing, computer vision, audio processing, and multimodal tasks. There are roughly 694 out-of-the-box ONNX models!

Image

DEMO

Sentiment analysis

Image
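
A demo like this can be sketched in a few lines. In the example below, `topLabel` and `classifySentiment` are hypothetical helpers; the pipeline factory is passed in as a parameter and is assumed to behave like the `pipeline` function of Transformers.js, whose `'sentiment-analysis'` task returns an array of `{ label, score }` objects.

```javascript
// Hypothetical helper: pick the label with the highest score.
function topLabel(results) {
  if (!Array.isArray(results) || results.length === 0) {
    return null;
  }
  return results.reduce((best, r) => (r.score > best.score ? r : best)).label;
}

// Hypothetical helper: `pipelineFactory` is assumed to be the `pipeline`
// function exported by Transformers.js.
async function classifySentiment(text, pipelineFactory) {
  const classifier = await pipelineFactory('sentiment-analysis');
  return topLabel(await classifier(text));
}
```

With the library loaded, the call would look like `await classifySentiment('I love this!', pipeline)`, where `pipeline` is imported from the Transformers.js package.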

i18n

Image

Application

Image
  1. Low-latency, real-time scenarios are well suited to on-device processing, such as real-time skeleton tracking, real-time image processing, and highly interactive scenes.
  2. Privacy-preserving scenarios, where user data must be protected.
  3. Scenarios with high availability requirements, where output must remain stable even offline.

Summary

With the development of AIGC and mobile chips, the concept of on-device intelligence has gradually come into view and has become an important direction for the future of AI. As a front-end engineer, I feel there is a lot that can be done~

Thanks for the read!

Writing has always been my passion, and it gives me the pleasure of helping and inspiring people. If you have any questions, feel free to reach out!

About Neo Fang

Neo is a front-end technology expert at Alibaba Cloud and a co-founder of Tarspace.

Copyright © 2025 Terpampas. All rights reserved.