How to Solve Text-Click CAPTCHAs

Python爬虫与数据挖掘 (Python Crawling and Data Mining)

· 2021-12-01

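About the author: Teacher Zhang has studied computing for more than ten years and has worked across information security, bioinformatics, accounting, graphic design, and editing and publishing. He loves digging into problems, earning certifications, and life in general.

Hi everyone, I'm Zhibin.

I have been writing a series on anti-scraping measures, but because of my own limited skills I never got around to covering text-click CAPTCHAs. This time I have invited an expert, Teacher Zhang, to share his method for solving them.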

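Basic approach

1. Get the image
2. Binarize the image
3. Detect the contours in the image
4. Extract the character inside each contour
5. Render the candidate characters as images
6. Compare the extracted characters with the candidate character images
7. Return the click result

This article only outlines the basic approach; adapt it to the specific type of click CAPTCHA you are dealing with.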

Implementation

0. Import dependencies

import base64
import json
import cv2
import numpy as np
import requests
from PIL import Image, ImageDraw, ImageFont

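1. Get the image

The first step is to get the image. Some CAPTCHA images are returned as raw binary; the image tested here is returned as a base64-encoded string, so it has to be decoded first.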

def getData(capsession: requests.Session):
    # Fetch the CAPTCHA payload; replace the placeholder with the real endpoint.
    resp = capsession.post("CAPTCHA fetch URL")
    return resp.json()["repData"]

def getImageFromBase64(b64):
    buffer = base64.b64decode(b64)
    nparr = np.frombuffer(buffer, np.uint8)
    image = cv2.imdecode(nparr, cv2.IMREAD_COLOR)
    return image
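
The endpoint used in this article is not shown, so the exact shape of the base64 field is unknown. Some services prepend a data-URI header such as `data:image/png;base64,` to the string. The sketch below, using only the standard library, shows one way to tolerate that prefix before decoding. The helper name `decode_image_field` and the dummy payload are assumptions for illustration, not part of the original code.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file


def decode_image_field(value: str) -> bytes:
    """Decode a base64 image field, tolerating an optional data-URI prefix."""
    # "data:image/png;base64,AAAA..." -> keep only the part after the comma.
    if value.startswith("data:"):
        value = value.split(",", 1)[1]
    return base64.b64decode(value)


# Stand-in payload: the PNG signature followed by dummy bytes (not a real image).
raw = PNG_SIGNATURE + b"dummy-image-bytes"
plain = base64.b64encode(raw).decode("ascii")
with_prefix = "data:image/png;base64," + plain

print(decode_image_field(plain) == raw)
print(decode_image_field(with_prefix)[:8] == PNG_SIGNATURE)
```

The resulting bytes can then be passed to `np.frombuffer` and `cv2.imdecode` exactly as in `getImageFromBase64`.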

2. Binarize the image

To make the character contours easier to detect, we binarize the image.

def normalizeImage(img):
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    _, img = cv2.threshold(img, 1, 255, cv2.THRESH_BINARY)
    img = cv2.bitwise_not(img)
    return img
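
To see what a threshold of 1 followed by inversion does, here is a small NumPy-only sketch of the same two steps on a toy grayscale array. It assumes, as the threshold suggests, that the characters in this particular CAPTCHA are drawn in (near-)pure black; other CAPTCHAs will likely need a different threshold. The function name `binarize_and_invert` is hypothetical.

```python
import numpy as np


def binarize_and_invert(gray: np.ndarray, thresh: int = 1) -> np.ndarray:
    """NumPy equivalent of THRESH_BINARY followed by a bitwise NOT."""
    # THRESH_BINARY: pixels strictly above `thresh` become 255, the rest 0.
    binary = np.where(gray > thresh, 255, 0).astype(np.uint8)
    # Bitwise NOT on uint8 maps 0 -> 255 and 255 -> 0.
    return np.bitwise_not(binary)


gray = np.array([[0, 1, 2],
                 [40, 128, 255]], dtype=np.uint8)
print(binarize_and_invert(gray))
# [[255 255   0]
#  [  0   0   0]]
```

Only the pixels with value 0 or 1 end up white, so the characters become white shapes on a black background, which is the form `cv2.findContours` expects.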

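3. Detect the contours in the image

When detecting contours, we filter them by aspect ratio and area to improve efficiency and accuracy, so that the contours we keep are as likely as possible to be usable for character recognition. The function below also merges contours that lie within 5 pixels of each other and keeps the four largest results.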

def findContour(img):
    contours, _ = cv2.findContours(img, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    def find_if_close(cnt1, cnt2):
        row1, row2 = cnt1.shape[0], cnt2.shape[0]
        for i in range(row1):
            for j in range(row2):
                dist = np.linalg.norm(cnt1[i] - cnt2[j])
                if abs(dist) < 5:
                    return True
                elif i == row1 - 1 and j == row2 - 1:
                    return False
    LENGTH = len(contours)
    status = np.zeros((LENGTH, 1))
    for i, cnt1 in enumerate(contours):
        x = i
        if i != LENGTH - 1:
            for j, cnt2 in enumerate(contours[i + 1:]):
                x = x + 1
                dist = find_if_close(cnt1, cnt2)
                if dist == True:
                    val = min(status[i], status[x])
                    status[x] = status[i] = val
                else:
                    if status[x] == status[i]:
                        status[x] = i + 1
    unified = []
    maximum = int(status.max()) + 1
    for i in range(maximum):
        pos = np.where(status == i)[0]
        if pos.size != 0:
            cont = np.vstack([contours[i] for i in pos])
            hull = cv2.convexHull(cont)
            unified.append(hull)
    cnt = list(filter(aspectRatio,unified))
    cnt.sort(key=cv2.contourArea,reverse=True)
    return cnt[:4]

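def aspectRatio(cnt):
    # Keep roughly square contours (width/height between 0.6 and 1.7)
    # whose area is large enough to be a character rather than noise.
    _, _, w, h = cv2.boundingRect(cnt)
    return (0.6 < w / h < 1.7) and (cv2.contourArea(cnt) > 200.0)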
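
The merging step in `findContour` joins the strokes of one character, which often come back as separate contours. As an alternative illustration of the same idea, here is a minimal union-find sketch that groups point sets whose closest points are less than 5 pixels apart. It is not the article's implementation; `min_distance` and `group_by_proximity` are hypothetical names, and the inputs are plain `(N, 2)` arrays (an OpenCV contour of shape `(N, 1, 2)` would need `reshape(-1, 2)` first).

```python
import numpy as np


def min_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Smallest Euclidean distance between any point of a and any point of b."""
    diff = a[:, None, :] - b[None, :, :]  # shape (len(a), len(b), 2)
    return float(np.sqrt((diff ** 2).sum(axis=2)).min())


def group_by_proximity(point_sets, max_gap=5.0):
    """Return one group label per point set; nearby sets share a label."""
    parent = list(range(len(point_sets)))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(point_sets)):
        for j in range(i + 1, len(point_sets)):
            if min_distance(point_sets[i], point_sets[j]) < max_gap:
                parent[find(j)] = find(i)
    return [find(i) for i in range(len(point_sets))]


# Three toy "contours": the first two are 3 px apart, the third is far away.
a = np.array([[0, 0], [10, 0]])
b = np.array([[13, 0], [20, 0]])
c = np.array([[100, 100], [110, 100]])
print(group_by_proximity([a, b, c]))  # [0, 0, 2]
```

Point sets that share a label can then be stacked with `np.vstack` and passed to `cv2.convexHull`, as the original function does.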

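4. Extract the character inside each contour

We crop and straighten the character inside each contour, and return the character images together with their coordinates.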

def extractCharContour(img, contour):
    mult = 1.2
    ret = []
    point = []
    for cnt in contour:
        rect = cv2.minAreaRect(cnt)
        box = cv2.boxPoints(rect)
        box = box.astype(np.intp)  # np.int0 was removed in NumPy 2.0
        W = rect[1][0]
        H = rect[1][1]
        Xs = [i[0] for i in box]
        Ys = [i[1] for i in box]
        x1 = min(Xs)
        x2 = max(Xs)
        y1 = min(Ys)
        y2 = max(Ys)
        rotated = False
        angle = rect[2]
        if angle < -45:
            angle += 90
            rotated = True
        center = (int((x1 + x2) / 2), int((y1 + y2) / 2))
        size = (int(mult * (x2 - x1)), int(mult * (y2 - y1)))
        try:
            M = cv2.getRotationMatrix2D((size[0] / 2, size[1] / 2), angle,
                                        1.0)
            cropped = cv2.getRectSubPix(img, size, center)
            cropped = cv2.warpAffine(cropped, M, size)
            croppedW = W if not rotated else H
            croppedH = H if not rotated else W
            croppedRotated = cv2.getRectSubPix(
                    cropped,
                    (int(croppedW * mult), int(croppedH * mult)),
                    (size[0] / 2, size[1] / 2),
            )
            im = cv2.resize(croppedRotated, (20, 20))
            kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]],
                                  np.float32)
            im = cv2.filter2D(im, -1, kernel=kernel)
            ret.append(im)
            point.append((rect[0][0], rect[0][1]))
        except cv2.error:
            # Skip contours that OpenCV cannot crop or resize.
            continue
    return ret, point
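
The `mult = 1.2` factor enlarges the crop window so that the corners of the character are not clipped when the patch is rotated upright. The following NumPy-only sketch isolates that arithmetic: given the four corner points of a rotated rectangle, it computes the center and the enlarged size of the axis-aligned window. The helper name `padded_crop_window` and the sample box are assumptions for illustration.

```python
import numpy as np


def padded_crop_window(box: np.ndarray, mult: float = 1.2):
    """Return (center, size) of the enlarged axis-aligned window around box."""
    xs, ys = box[:, 0], box[:, 1]
    x1, x2 = int(xs.min()), int(xs.max())
    y1, y2 = int(ys.min()), int(ys.max())
    center = ((x1 + x2) // 2, (y1 + y2) // 2)
    # Enlarge width and height by `mult` to leave a margin for rotation.
    size = (int(mult * (x2 - x1)), int(mult * (y2 - y1)))
    return center, size


# Corners of a rectangle rotated by 45 degrees (a diamond) around (50, 40).
box = np.array([[50, 20], [70, 40], [50, 60], [30, 40]])
print(padded_crop_window(box))  # ((50, 40), (48, 48))
```

In `extractCharContour`, these two values are what get passed to `cv2.getRectSubPix` before the rotation is applied.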

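5. Render the candidate characters as images

Each candidate character that the CAPTCHA asks us to click is rendered as an image.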

def genCharacter(ch, size):
    img = Image.new("L", size, 0)
    font = ImageFont.truetype("simsun.ttc", min(size))
    draw = ImageDraw.Draw(img)
    draw.text((0, 0), ch, font=font, fill=255)
    return np.asarray(img)

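6. Compare the extracted characters with the candidate character images

We compare the extracted character contours with the candidate character images to find where each character sits, then append the click coordinates in the order the CAPTCHA asks for.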

def compareCharImage(words, chars, point, word_list):
    # Score every (candidate word image, extracted character image) pair;
    # a smaller XOR sum means the two images differ in fewer pixels.
    scores = []
    for i, word in enumerate(words):
        for j, char in enumerate(chars):
            scores.append(((i, j), cv2.bitwise_xor(char, word).sum()))
    scores.sort(key=lambda x: x[1])
    word_set = set()
    char_set = set()
    answers = {}
    # Greedy assignment: take the best-scoring pairs first and skip any pair
    # whose word or character has already been matched.
    for (i, j), _ in scores:
        if i in word_set or j in char_set:
            continue
        word_set.add(i)
        char_set.add(j)
        answers[word_list[i]] = point[j]
    return [{
        "x": int(answers[word][0]),
        "y": int(answers[word][1])
    } for word in word_list]
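
To make the XOR scoring and greedy matching concrete, here is a NumPy-only sketch on toy 2x2 "glyphs". It counts differing pixels instead of summing the XOR values as the article does; for strictly 0/255 images the two rankings are the same. The names `xor_score` and `greedy_match` are hypothetical.

```python
import numpy as np


def xor_score(a: np.ndarray, b: np.ndarray) -> int:
    """Number of differing pixels between two 0/255 images of equal shape."""
    return int(np.count_nonzero(np.bitwise_xor(a, b)))


def greedy_match(templates, crops):
    """Map each template index to a crop index, best-scoring pairs first."""
    pairs = sorted(
        (xor_score(t, c), i, j)
        for i, t in enumerate(templates)
        for j, c in enumerate(crops)
    )
    used_t, used_c, result = set(), set(), {}
    for _, i, j in pairs:
        # Skip pairs whose template or crop has already been assigned.
        if i in used_t or j in used_c:
            continue
        used_t.add(i)
        used_c.add(j)
        result[i] = j
    return result


# Toy glyphs: a diagonal and a top row.
diag = np.array([[255, 0], [0, 255]], dtype=np.uint8)
top = np.array([[255, 255], [0, 0]], dtype=np.uint8)
# The crops arrive in a different order, and the diagonal has one noisy pixel.
noisy_diag = np.array([[255, 0], [255, 255]], dtype=np.uint8)

print(xor_score(diag, noisy_diag))  # 1
print(greedy_match([diag, top], [top, noisy_diag]))
```

Here template 0 (the diagonal) is matched to crop 1 and template 1 (the top row) to crop 0, despite the noisy pixel. Greedy matching is simple but not guaranteed to find the assignment with the lowest total score.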

7. Return the click result

Finally, send the click result back to the server for verification.

def checkCaptcha(captchaSession: requests.Session, data, point):
    # `encrypt` is site-specific and is not shown in this article.
    enc = encrypt(json.dumps(point).replace(" ", ""), data["secretKey"])
    resp = captchaSession.post(
        "CAPTCHA submit URL",
        json={
            "token": data["token"],
            "pointJson": enc,
        },
    )
    return resp.json()
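
The `.replace(" ", "")` call above strips the spaces that `json.dumps` inserts by default, presumably because the server checks the encrypted payload against a whitespace-free string (an assumption, since the server side is not shown). A small standard-library sketch of an alternative is to pass `separators=(",", ":")`, which gives the same compact output without touching spaces inside string values. The sample coordinates are made up.

```python
import json

# Click coordinates in the shape returned by compareCharImage (made-up values).
point = [{"x": 112, "y": 87}, {"x": 40, "y": 133}]

compact = json.dumps(point, separators=(",", ":"))
print(compact)  # [{"x":112,"y":87},{"x":40,"y":133}]

# For purely numeric payloads this matches the replace(" ", "") approach.
assert compact == json.dumps(point).replace(" ", "")

# The two differ once a string value contains a space.
labelled = {"word": "a b"}
assert json.dumps(labelled, separators=(",", ":")) == '{"word":"a b"}'
assert json.dumps(labelled).replace(" ", "") == '{"word":"ab"}'
```

For the integer coordinates used here both approaches give identical strings, so this is a robustness choice rather than a required change.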